Algorithms to reconstruct peptide sequences in Python
Based on work carried out in the ComBi team of the LS2N (Laboratoire des Sciences du Numérique de Nantes) and at INRAe Nantes on mass spectrometry. We seek to solve a problem of reconstructing peptide sequences from partial, and sometimes erroneous, data provided as input in the form of sequences that we call baitModels.
baitModels are sequences composed of three types of elements:
- characters representing amino acids
- numerical values in square brackets representing masses
- characters in square brackets
The square brackets indicate that there has been a modification
(insertion, deletion or substitution) of one or more amino acid(s).
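The three element types can be told apart with a small tokenizer. The sketch below is only an illustration: the function name `tokenize`, the token labels and the example string are hypothetical and are not taken from the repository, whose own parser may work differently.

```python
import re

# Hypothetical tokenizer, for illustration only.
TOKEN = re.compile(r"\[([^\]]*)\]|([A-Z])")
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def tokenize(bait_model):
    """Split a baitModel string into (kind, value) tokens."""
    tokens = []
    for match in TOKEN.finditer(bait_model):
        bracketed, residue = match.groups()
        if residue is not None:
            # A bare character is an amino acid.
            tokens.append(("amino_acid", residue))
        elif NUMBER.fullmatch(bracketed):
            # A number in square brackets is a mass.
            tokens.append(("mass", float(bracketed)))
        else:
            # Anything else in square brackets is kept as characters.
            tokens.append(("bracketed_chars", bracketed))
    return tokens


example = "AC[128.09]DE[K]F"  # made-up baitModel
for token in tokenize(example):
    print(token)
```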
Several baitModels can represent, more or less accurately, the same
amino acid sequence (i.e. a peptide, which we call a bait). The goal of
the algorithms (linearBaitFusion.py and alignBaitFusion.py) is to
reconstruct the peptide from the baitModels provided, using (or
ignoring) the data carried by the baitModels.
The alignBaitFusion.py method uses MSA (Multiple Sequence Alignment)
algorithms (ClustalW or MUSCLE) to align the amino acids after
replacing the masses in square brackets with dashes ('-'). The
baitModels are then "fused" using an election system per column of
amino acids.
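As a rough illustration of these two steps, the sketch below replaces bracketed masses with dashes and then runs a plain majority vote per column. It does not call ClustalW or MUSCLE: the `aligned` list stands in for the output of an alignment tool. The function names, the tie-breaking rule (first amino acid seen wins) and the choice to ignore dashes when voting are assumptions, and bracketed characters are not handled here.

```python
import re
from collections import Counter

# Matches a numerical mass in square brackets, e.g. "[101.05]".
MASS = re.compile(r"\[-?\d+(?:\.\d+)?\]")


def masses_to_dashes(bait_model):
    """Replace every bracketed mass with a single dash."""
    return MASS.sub("-", bait_model)


def fuse_columns(aligned):
    """Majority vote per column over equal-length aligned sequences."""
    fused = []
    for column in zip(*aligned):
        # Assumption: dashes carry no vote.
        votes = Counter(c for c in column if c != "-")
        fused.append(votes.most_common(1)[0][0] if votes else "-")
    return "".join(fused)


print(masses_to_dashes("PEP[101.05]IDE"))

# Stand-in for sequences already aligned by an MSA tool.
aligned = ["PEPT-DE", "PEP-IDE", "PEPTIDE"]
print(fuse_columns(aligned))
```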
The linearBaitFusion.py method uses one cursor (see illustration above)
per baitModel to choose amino acids, and then uses an election system
to elect a candidate amino acid among the chosen amino acids.
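A weighted election of this kind can be sketched as follows. The weights 4, 1 and 0 are the defaults of the valid, probation and invalid parameters of linearBaitFusion.py; everything else (the function name `elect`, the shape of the input, returning None when no candidate gets a vote) is an assumption made for the example, and how a baitModel gets its status is not covered.

```python
from collections import defaultdict

# Default weights of the valid / probation / invalid parameters.
WEIGHTS = {"valid": 4, "probation": 1, "invalid": 0}


def elect(candidates, weights=WEIGHTS):
    """Elect one amino acid.

    candidates: (amino_acid, status) pairs, one per cursor, where status
    is the state of the baitModel the amino acid comes from.
    """
    tally = defaultdict(int)
    for amino_acid, status in candidates:
        tally[amino_acid] += weights[status]
    if not tally or max(tally.values()) == 0:
        return None  # assumption: no winner when nobody has any vote
    return max(tally, key=tally.get)


cursors = [("K", "valid"), ("Q", "probation"), ("Q", "probation"), ("Q", "invalid")]
print(elect(cursors))  # K gets 4 votes, Q gets 2
```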
The following parameters are available for both methods:
- verbose: Boolean, False by default. Activates verbose mode.
- fulltable: Boolean, True by default. Prints the tables with details.
- trace: Floating-point number. Used to ignore all masses within the specified amount (in Da).
- sensitivity: Floating-point number. An amount in Da used in conjunction with tolerance.
- tolerance: Floating-point number. When looking up a mass in the mass table, look up all entries within the specified amount (in Da), with a step of sensitivity Da.
- simplification: Boolean, True by default. Pre-processing that simplifies the baitModels. When a baitModel has a mass that is not found in the mass table, that mass, the next mass and the amino acids between the two masses are replaced by the sum of the masses of these elements.
- ignoreDuplicateBM: Boolean, False by default. Ignores redundant baitModels when running the methods.
- onlythisbait: String. Run the method on this bait only.
- minNumBaits: Integer. Only load baits and their baitModels if there are at least the specified number of baitModels.
- maxNumBaits: Integer. Only load baits and their baitModels if there are at most the specified number of baitModels.
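The interplay between tolerance and sensitivity can be pictured as probing the mass table at regular steps around the queried mass. The sketch below is one possible reading, not the repository's code: the table is a toy dictionary keyed by masses rounded to two decimals, and the function name `lookup`, the `decimals` argument and the values 0.02 and 0.01 are arbitrary choices for the example.

```python
def lookup(mass, mass_table, tolerance=0.02, sensitivity=0.01, decimals=2):
    """Collect entries whose mass lies within `tolerance` of `mass`.

    The table is probed every `sensitivity` Da, from mass - tolerance
    to mass + tolerance.
    """
    steps = round(tolerance / sensitivity)
    hits = []
    for k in range(-steps, steps + 1):
        probe = round(mass + k * sensitivity, decimals)
        for entry in mass_table.get(probe, []):
            if entry not in hits:
                hits.append(entry)
    return hits


# Toy table: residue masses rounded to 2 decimals (N and GG share a mass).
toy_table = {57.02: ["G"], 71.04: ["A"], 114.04: ["N", "GG"]}

print(lookup(114.05, toy_table))
print(lookup(60.0, toy_table))
```

Probing at fixed steps keeps each lookup a dictionary access, at the cost of requiring the table keys and the probes to be rounded the same way.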
The linearBaitFusion.py method has the following additional parameters:
- valid, probation, invalid: Integers (4, 1 and 0 by default). Weights used when electing a candidate (an amino acid coming from a valid baitModel should have more votes than an amino acid coming from a probation baitModel).
- secondpass: Boolean, True by default. If the linearBaitFusion.py method fails to provide a complete sequence by iterating through the baitModels from left to right, it tries to generate a sequence by iterating through the baitModels from right to left.
- concatenation: Boolean, True by default. If secondpass is set to True and the right-to-left pass also fails to provide a complete sequence, the method concatenates both incomplete sequences (the one obtained from left to right and the one obtained from right to left). If the mass of the resulting sequence is greater than the average mass of the baitModels by a certain margin, amino acids from the right-to-left sequence are deleted until the mass of the concatenated sequence is equal to the average mass of the baitModels.
- simplifyBothWays: Boolean, True by default. If simplification is set to True, the simplified baitModels are also used in the second pass of the method.
- replaceDashByMass: Boolean, True by default. Post-processing that replaces the gaps of incomplete sequences by amino acids if the gaps correspond to the mass of amino acids, by a sequence of amino acids in curly brackets if the gaps correspond to the mass of a single sequence of amino acids, or by the mass of the gaps otherwise.
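The trimming performed by concatenation can be sketched as below. Several details are assumptions made for the example and are not stated above: amino acids are removed from the start of the right-to-left sequence (where the two passes meet), the loop stops as soon as the mass no longer exceeds the target by more than `margin`, and the value of `margin` is arbitrary. Only a few monoisotopic residue masses are listed.

```python
# Monoisotopic residue masses (Da) for a few amino acids.
RESIDUE_MASS = {
    "G": 57.02146,
    "A": 71.03711,
    "S": 87.03203,
    "P": 97.05276,
    "V": 99.06841,
}


def sequence_mass(sequence):
    """Sum of the residue masses of a sequence."""
    return sum(RESIDUE_MASS[aa] for aa in sequence)


def concatenate(left, right, target_mass, margin=0.5):
    """Join two partial sequences, trimming the right one if too heavy."""
    while right and sequence_mass(left + right) > target_mass + margin:
        right = right[1:]  # assumption: trim where the two passes meet
    return left + right


target = sequence_mass("GASPV")  # stands in for the average baitModel mass
print(concatenate("GAS", "SPV", target))
```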
The alignBaitFusion.py method has the following additional parameters:
- useMuscle: Boolean, False by default. Use MUSCLE instead of ClustalW (the default is ClustalW).
- clustalopt: String. Options to provide to ClustalW ("-QUICKTREE -MATRIX=ID -GAPOPEN=5 -GAPEXT=1 -NOHGAP -NOWEIGHTS -CLUSTERING=UPGMA" by default).
- muscleopt: String. Options to provide to MUSCLE (empty by default).
Use the command:
python3 <method> <path_to_stats_file> <path_to_mass_table>
Where:
- <method> is either linearBaitFusion.py or alignBaitFusion.py
- <path_to_stats_file> is the path to the stats file
- <path_to_mass_table> is the path to the mass table
For example:
python3 linearBaitFusion.py instances/stats/stats_bait10000.txt mass\ tables/mass_table_8.csv
The methods output four plots and a CSV file. For each bait, the CSV file contains the output sequence of the method, the length of the bait, the longest stretch between the bait and the output sequence, and the number of amino acids that are not covered by the longest stretch.
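One way to read the last two columns is sketched below, assuming that the "longest stretch" is the longest contiguous run of amino acids shared by the bait and the output sequence. This interpretation, the function name `longest_stretch` and the example sequences are assumptions; the scripts may compute the value differently.

```python
from difflib import SequenceMatcher


def longest_stretch(bait, output):
    """Longest contiguous run shared by the bait and the output sequence."""
    matcher = SequenceMatcher(None, bait, output, autojunk=False)
    match = matcher.find_longest_match(0, len(bait), 0, len(output))
    return bait[match.a:match.a + match.size]


bait = "PEPTIDEK"    # made-up bait
output = "PEPTLDEK"  # made-up output sequence
stretch = longest_stretch(bait, output)
uncovered = len(bait) - len(stretch)
print(stretch, uncovered)
```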