Filtering records
The ParseDb.py tool provides a basic set of operations for manipulating Change-O database files from the command line, including removing or updating rows and columns.
Removing non-productive sequences
After building a Change-O database from either IMGT/HighV-QUEST or IgBLAST output, you may wish to subset your data to only productive sequences. This can be done in one of two roughly equivalent ways using the ParseDb.py tool:
ParseDb.py select -d HD13M_db-pass.tsv -f productive -u T
ParseDb.py split -d HD13M_db-pass.tsv -f productive
The first line above uses the select subcommand to output a single file labeled parse-select containing only records with the value T (-u T) in the productive column (-f productive).
Alternatively, the second line above uses the split subcommand to output multiple files, with each file containing the records for one of the values found in the productive column (-f productive). This will generate two files labeled productive-T and productive-F.
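The distinction between the two subcommands can be sketched in plain Python. This is a minimal illustration of the filtering logic on toy records, not the tool's implementation; the record values are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Toy Change-O style records (hypothetical data, for illustration only).
tsv = "sequence_id\tproductive\nseq1\tT\nseq2\tF\nseq3\tT\n"
rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))

# "select": keep only the rows whose productive field equals T.
selected = [r for r in rows if r["productive"] == "T"]

# "split": group rows by each distinct value found in the productive field;
# each group would be written to its own output file.
groups = defaultdict(list)
for r in rows:
    groups[r["productive"]].append(r)

print([r["sequence_id"] for r in selected])  # ['seq1', 'seq3']
print(sorted(groups))                        # ['F', 'T']
```

Both approaches recover the productive records; split simply retains the non-productive records in a separate file rather than discarding them.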
Removing disagreements between the C-region primers and the reference alignment
If you have data that includes both heavy and light chains in the same library,
the V-segment and J-segment alignments from IMGT/HighV-QUEST or IgBLAST may not
always agree with the isotype assignments from the C-region primers. In these cases,
you can filter out such reads with the select subcommand of ParseDb.py.
Example commands using an imaginary file db.tsv are provided below:
1 ParseDb.py select -d db.tsv -f v_call j_call c_call -u "IGH" \
2     --logic all --regex --outname heavy
3 ParseDb.py select -d db.tsv -f v_call j_call c_call -u "IG[LK]" \
4     --logic all --regex --outname light
These commands require that all of the v_call, j_call and c_call fields (-f v_call j_call c_call and --logic all) contain the string IGH (lines 1-2) or one of IGK or IGL (lines 3-4). The --regex argument allows for partial matching and interpretation of regular expressions. The output of these two commands is two files, one containing only heavy chains (heavy_parse-select.tsv) and one containing only light chains (light_parse-select.tsv).
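The combination of --logic all and --regex can be sketched as follows. This is an illustrative sketch of the matching semantics on hypothetical records, not the tool's code.

```python
import re

# Hypothetical records mimicking the v_call/j_call/c_call columns.
records = [
    {"id": "h1", "v_call": "IGHV1-2*02",  "j_call": "IGHJ4*02", "c_call": "IGHM"},
    {"id": "l1", "v_call": "IGKV3-20*01", "j_call": "IGKJ1*01", "c_call": "IGKC"},
    # A heavy-chain V/J alignment paired with a light-chain C-region call:
    # the kind of disagreement these commands are meant to remove.
    {"id": "x1", "v_call": "IGHV3-23*01", "j_call": "IGHJ6*02", "c_call": "IGKC"},
]
fields = ["v_call", "j_call", "c_call"]

def select_all(rows, pattern):
    """Keep rows where EVERY listed field matches the regex (--logic all)."""
    rx = re.compile(pattern)
    return [r for r in rows if all(rx.search(r[f]) for f in fields)]

heavy = select_all(records, "IGH")     # h1 only; x1 fails on c_call
light = select_all(records, "IG[LK]")  # l1 only; x1 fails on v_call/j_call
```

Note that the discordant record x1 is excluded from both outputs, which is exactly the filtering behavior described above.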
Exporting records to FASTA files
You may want to use external tools, or tools from pRESTO, on your Change-O result files. The ConvertDb.py tool provides two options for exporting data from tab-delimited files to FASTA format.
Standard FASTA
The fasta subcommand allows you to export sequences and annotations to FASTA formatted files in the pRESTO annotation scheme:
ConvertDb.py fasta -d HD13M_db-pass.tsv --if sequence_id \
--sf sequence_alignment --mf v_call duplicate_count
Here the column containing the sequence identifier is specified by --if sequence_id, the nucleotide sequence column is specified by --sf sequence_alignment, and additional annotations to be added to the sequence header are specified by --mf v_call duplicate_count.
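As a rough sketch, the conversion builds a pRESTO-style header from the identifier plus pipe-delimited key=value annotations. The record values below are hypothetical, and the exact header formatting is an assumption for illustration, not a specification of ConvertDb.py's output.

```python
# Hypothetical Change-O record (one row of the tab-delimited file).
record = {
    "sequence_id": "seq1",
    "sequence_alignment": "NNACGTACGT",
    "v_call": "IGHV1-2*02",
    "duplicate_count": "3",
}
meta = ["v_call", "duplicate_count"]  # the --mf fields

# Header: identifier followed by pipe-delimited key=value annotations
# (assumed pRESTO-style layout).
header = record["sequence_id"] + "".join(
    "|%s=%s" % (f, record[f]) for f in meta
)
fasta = ">%s\n%s\n" % (header, record["sequence_alignment"])
print(fasta, end="")
```

Each row of the database becomes one such FASTA entry, so downstream pRESTO tools can read the annotations back out of the header.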
BASELINe FASTA
The baseline subcommand generates a FASTA derivative format required by the
BASELINe web tool. Generating these
files is similar to building standard FASTA files, but requires a few more options.
An example command using an imaginary file db.tsv is provided below:
ConvertDb.py baseline -d db.tsv --if sequence_id \
--sf sequence_alignment --mf v_call duplicate_count \
--cf clone_id --gf germline_alignment_d_mask
The additional arguments required by the baseline subcommand are the clonal grouping (--cf clone_id) and germline sequence (--gf germline_alignment_d_mask) columns added by the DefineClones and CreateGermlines tasks, respectively.
Note
The baseline subcommand requires the clone_id column to be sorted. DefineClones.py generates a sorted clone_id column by default. However, if you have altered the order of the clone_id column at some point, you can re-sort the clonal assignments using the sort subcommand of ParseDb.py. An example command using an imaginary file db.tsv is provided below:
ParseDb.py sort -d db.tsv -f clone_id
This will sort records by the value in the clone_id column (-f clone_id).