OpenMS
ProteinQuantifier

Compute peptide and protein abundances from annotated feature/consensus maps or from identification results.

potential predecessor tools → ProteinQuantifier → potential successor tools

  • Potential predecessor tools: IDMapper; FeatureLinkerUnlabeled (or another feature grouping tool)
  • Potential successor tools: external tools, e.g. for statistical analysis

Reference:
Weisser et al.: An automated pipeline for high-throughput label-free quantitative proteomics (J. Proteome Res., 2013, PMID: 23391308).

Input: featureXML or consensusXML

Quantification is based on the intensity values of the features in the input files. Feature intensities are first accumulated to peptide abundances, according to the peptide identifications annotated to the features/feature groups. Then, abundances of the peptides of a protein are aggregated to compute the protein abundance.

The peptide-to-protein step uses the N (e.g. 3) most abundant proteotypic peptides per protein to compute the protein abundance. This generalizes the "top 3 approach" (but only for relative quantification) described in:
Silva et al.: Absolute quantification of proteins by LCMSE: a virtue of parallel MS acquisition (Mol. Cell. Proteomics, 2006, PMID: 16219938).
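
The two aggregation steps described above can be illustrated with a minimal Python sketch. This is not the tool's implementation: the toy feature list, the function name quantify and the choice of the median as the aggregation function are assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import median

# Hypothetical toy input: (peptide sequence, protein accession, feature intensity).
features = [
    ("PEPTIDEA", "P1", 100.0),
    ("PEPTIDEA", "P1", 50.0),
    ("PEPTIDEB", "P1", 400.0),
    ("PEPTIDEC", "P1", 200.0),
    ("PEPTIDED", "P1", 10.0),
]

def quantify(features, top=3, aggregate=median):
    """Return (peptide abundances, protein abundances) for the toy input."""
    # Step 1: accumulate feature intensities into peptide abundances.
    peptide_abundance = defaultdict(float)
    peptide_protein = {}
    for sequence, accession, intensity in features:
        peptide_abundance[sequence] += intensity
        peptide_protein[sequence] = accession

    # Step 2: collect the peptide abundances of each protein.
    per_protein = defaultdict(list)
    for sequence, abundance in peptide_abundance.items():
        per_protein[peptide_protein[sequence]].append(abundance)

    # Step 3: aggregate the `top` most abundant peptides (top=0 means all).
    protein_abundance = {}
    for accession, abundances in per_protein.items():
        ranked = sorted(abundances, reverse=True)
        chosen = ranked[:top] if top > 0 else ranked
        protein_abundance[accession] = aggregate(chosen)
    return dict(peptide_abundance), protein_abundance

peptides, proteins = quantify(features)
all_summed = quantify(features, top=0, aggregate=sum)[1]
```

With the toy data, the three most abundant peptides of P1 have abundances 400, 200 and 150, so the "top 3" median is 200, while summing all four peptides gives 760.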

Only features/feature groups with unambiguous peptide annotation are used for peptide quantification. It is possible to resolve ambiguities before applying ProteinQuantifier using one of several equivalent mechanisms in OpenMS: IDConflictResolver, ConsensusID (algorithm best), or FileFilter (option id:keep_best_score_id).

Similarly, only proteotypic peptides (i.e. those matching to exactly one protein) are used for protein quantification by default. Peptide/protein IDs from multiple identification runs can be handled, but will not be differentiated (i.e. protein accessions for a peptide will be accumulated over all identification runs). See section "Optional input: Protein inference/grouping results" below for exceptions to this.

Peptides with the same sequence, but with different modifications are quantified separately on the peptide level, but treated as one peptide for the protein quantification (i.e. the contributions of differently-modified variants of the same peptide are accumulated).
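
As a rough illustration of this treatment of modifications, the following Python sketch keeps modified variants separate on the peptide level and merges them for the protein level. It assumes that modifications are written in parentheses inside the sequence string; the helper name unmodified and the toy abundances are hypothetical.

```python
import re
from collections import defaultdict

def unmodified(sequence):
    """Strip parenthesized modification names, e.g. 'M(Oxidation)' -> 'M'."""
    # Assumption: modifications appear as '(Name)' directly after a residue.
    return re.sub(r"\([^)]*\)", "", sequence)

# Peptide level: differently modified variants are separate entries.
peptide_level = {
    "PEPTM(Oxidation)IDE": 30.0,
    "PEPTMIDE": 70.0,
    "SAMPLER": 20.0,
}

# Protein level: variants of the same sequence contribute as one peptide.
merged = defaultdict(float)
for sequence, abundance in peptide_level.items():
    merged[unmodified(sequence)] += abundance
merged = dict(merged)
```

Here the two variants of PEPTMIDE appear as two lines in the peptide output, but enter protein quantification as a single peptide with abundance 100.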

Input: idXML

Quantification based on identification results uses spectral counting, i.e. the abundance of each peptide is the number of times that peptide was identified from an MS2 spectrum (considering only the best hit per spectrum). Different identification runs in the input are treated as different samples; this makes it possible to quantify several related samples at once by merging the corresponding idXML files with IDMerger. Output format and applicable parameters depend on the number of runs: for a single run they are the same as for featureXML input, for multiple runs the same as for consensusXML input.

The notes above regarding quantification on the protein level and the treatment of modifications also apply to idXML input. In particular, this means that the settings top 0 and aggregate sum should be used to get the "classical" spectral counting quantification on the protein level (where all identifications of all peptides of a protein are summed up).
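
The following Python sketch illustrates spectral counting with the "classical" protein-level settings (all peptides, summed). The data layout, the function names and the assumption that a higher score is better are illustrative only and do not reflect the idXML format or the tool's code.

```python
from collections import Counter, defaultdict

# Hypothetical input: run name -> spectra; each spectrum is a list of
# (peptide, score) hits. Assumption: a higher score means a better hit.
runs = {
    "run1": [[("AAK", 0.9), ("CCK", 0.4)], [("AAK", 0.8)], [("DDK", 0.7)]],
    "run2": [[("DDK", 0.95)], [("AAK", 0.6), ("DDK", 0.5)]],
}
peptide_to_protein = {"AAK": "P1", "DDK": "P1", "CCK": "P2"}

def peptide_counts(spectra):
    """Count identifications per peptide, using only the best hit per spectrum."""
    counts = Counter()
    for hits in spectra:
        best_peptide, _ = max(hits, key=lambda hit: hit[1])
        counts[best_peptide] += 1
    return counts

def protein_counts(counts, peptide_to_protein):
    """Sum the counts of all peptides of each protein (like top 0, aggregate sum)."""
    totals = defaultdict(int)
    for peptide, n in counts.items():
        totals[peptide_to_protein[peptide]] += n
    return dict(totals)

# Each identification run is treated as a separate sample.
per_run = {
    run: protein_counts(peptide_counts(spectra), peptide_to_protein)
    for run, spectra in runs.items()
}
```

In this toy example CCK is never the best hit of a spectrum, so P2 receives no count at all.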

Optional input: Protein inference/grouping results

By default, only proteotypic peptides (i.e. those matching to exactly one protein) are used for protein quantification. However, this limitation can be overcome: protein inference results for the whole sample set can be supplied with the protein_groups option (or included in a featureXML input). In that case, the peptide-to-protein references from that file are used (rather than those from the file given by in), and groups of indistinguishable proteins will be quantified. Each reported protein quantity then refers to the total for the respective group.
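
One plausible reading of this rule, sketched in Python: without inference results a peptide is usable only if it maps to exactly one protein; with groups it is usable if all of its accessions fall into the same group. The function name and the group representation are hypothetical, and the tool's actual logic may differ in detail.

```python
def quantification_target(accessions, groups=None):
    """Return the protein (or group) label a peptide counts towards, or None."""
    accessions = set(accessions)
    if not accessions:
        return None
    if groups is None:
        # Default: only proteotypic peptides are used.
        return next(iter(accessions)) if len(accessions) == 1 else None
    # With inference results: all accessions must belong to one group.
    for group in groups:
        if accessions <= set(group):
            # Accessions of a group are joined with "/" as in the output table.
            return "/".join(sorted(group))
    return None

# Hypothetical groups of indistinguishable proteins.
groups = [["P1", "P2"], ["P3"]]

shared_default = quantification_target(["P1", "P2"])
shared_grouped = quantification_target(["P1", "P2"], groups)
across_groups = quantification_target(["P1", "P3"], groups)
```

A peptide shared by P1 and P2 is skipped by default, but contributes to the group "P1/P2" once the two proteins are reported as indistinguishable.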

In order for everything to work correctly, it is important that the protein inference results come from the same identifications that were used to annotate the quantitative data. We suggest using the OpenMS tool ProteinInference.

More information below the parameter specification.

Note
Currently mzIdentML (mzid) is not directly supported as an input/output format of this tool. Convert mzid files to/from idXML using IDFileConverter if necessary.

The command line parameters of this tool are:

INI file documentation of this tool:

Output format

The output files produced by this tool have a table format, with columns as described below:

Protein output (one protein/set of indistinguishable proteins per line):

  • protein: Protein accession(s) (as in the annotations in the input file; separated by "/" if more than one).
  • n_proteins: Number of indistinguishable proteins quantified (usually "1").
  • protein_score: Protein score, e.g. ProteinProphet probability (if available).
  • n_peptides: Number of proteotypic peptides observed for this protein (or group of indistinguishable proteins) across all samples. Note that not all of these peptides necessarily contribute to the protein abundance (depending on parameter top).
  • abundance: Computed protein abundance. For consensusXML input, there will be one column per sample ("abundance_1", "abundance_2", etc.).

Peptide output (one peptide or - if best_charge_and_fraction is set - one charge state and fraction of a peptide per line):

  • peptide: Peptide sequence. Only peptides that occur in unambiguous annotations of features are reported.
  • protein: Protein accession(s) for the peptide (separated by "/" if more than one).
  • n_proteins: Number of proteins this peptide maps to. (Same as the number of accessions in the previous column.)
  • charge: Charge state quantified in this line. "0" (for "all charges") unless best_charge_and_fraction was set.
  • abundance: Computed abundance for this peptide. If the charge in the preceding column is 0, this is the total abundance of the peptide over all charge states; otherwise, it is only the abundance observed for the indicated charge (in this case, there may be more than one line for the peptide sequence). Again, for consensusXML input, there will be one column per sample ("abundance_1", "abundance_2", etc.). Also for consensusXML, the reported values are already normalized if consensus:normalize was set.

Protein quantification examples

While quantification on the peptide level is fairly straightforward, a number of options influence quantification on the protein level, especially for consensusXML input. The three parameters top:N, top:include_all and consensus:fix_peptides determine which peptides are used to quantify proteins in different samples.

As an example, consider a protein with four proteotypic peptides. Each peptide is detected in a subset of three samples, as indicated in the table below. The peptides are ranked by abundance (1: highest, 4: lowest; assuming for simplicity that the order is the same in all samples).

          | sample 1 | sample 2 | sample 3
peptide 1 | X        | -        | X
peptide 2 | X        | X        | -
peptide 3 | X        | X        | X
peptide 4 | X        | X        | -

Different parameter combinations lead to different quantification scenarios, as shown here:

The first three columns are parameters ("*": no effect in this case). The sample columns list the peptides used for quantification ("(...)": not quantified here because ...).

top | include_all | consensus:fix_peptides | sample 1   | sample 2            | sample 3            | explanation
0   | *           | no                     | 1, 2, 3, 4 | 2, 3, 4             | 1, 3                | all peptides
1   | *           | no                     | 1          | 2                   | 1                   | single most abundant peptide
2   | *           | no                     | 1, 2       | 2, 3                | 1, 3                | two most abundant peptides
3   | no          | no                     | 1, 2, 3    | 2, 3, 4             | (too few peptides)  | three most abundant peptides
3   | yes         | no                     | 1, 2, 3    | 2, 3, 4             | 1, 3                | three or fewer most abundant peptides
4   | no          | *                      | 1, 2, 3, 4 | (too few peptides)  | (too few peptides)  | four most abundant peptides
4   | yes         | *                      | 1, 2, 3, 4 | 2, 3, 4             | 1, 3                | four or fewer most abundant peptides
0   | *           | yes                    | 3          | 3                   | 3                   | all peptides present in every sample
1   | *           | yes                    | 3          | 3                   | 3                   | single peptide present in most samples
2   | no          | yes                    | 1, 3       | (peptide 1 missing) | 1, 3                | two peptides present in most samples
2   | yes         | yes                    | 1, 3       | 3                   | 1, 3                | two or fewer peptides present in most samples
3   | no          | yes                    | 1, 2, 3    | (peptide 1 missing) | (peptide 2 missing) | three peptides present in most samples
3   | yes         | yes                    | 1, 2, 3    | 2, 3                | 1, 3                | three or fewer peptides present in most samples
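
The scenarios above can be reproduced with a short Python sketch. It encodes only the example protein (peptide number equals abundance rank) and a selection rule inferred from the table: with fixed peptides, candidates are ranked by the number of samples they occur in, with ties broken by abundance. The names PRESENCE and peptides_used are hypothetical, and the code is not taken from the tool.

```python
# Peptide number -> samples in which it was detected (see the first table).
# The peptide number doubles as its abundance rank (1 = most abundant).
PRESENCE = {1: {1, 3}, 2: {1, 2}, 3: {1, 2, 3}, 4: {1, 2}}
SAMPLES = (1, 2, 3)

def peptides_used(top, include_all, fix_peptides):
    """Return sample -> list of peptides used, or None if not quantified."""
    if fix_peptides:
        if top == 0:
            # Only peptides present in every sample.
            candidates = [p for p in PRESENCE if len(PRESENCE[p]) == len(SAMPLES)]
        else:
            # Peptides present in most samples; ties broken by abundance rank.
            by_coverage = sorted(PRESENCE, key=lambda p: (-len(PRESENCE[p]), p))
            candidates = by_coverage[:top]
    else:
        candidates = list(PRESENCE)

    result = {}
    for sample in SAMPLES:
        present = sorted(p for p in candidates if sample in PRESENCE[p])
        if top > 0:
            if len(present) < top and not include_all:
                result[sample] = None  # too few peptides / a fixed peptide is missing
                continue
            present = present[:top]
        result[sample] = present
    return result

row_top3 = peptides_used(top=3, include_all=False, fix_peptides=False)
row_fixed2 = peptides_used(top=2, include_all=False, fix_peptides=True)
```

For example, row_top3 leaves sample 3 unquantified because only two peptides were detected there, and row_fixed2 leaves sample 2 unquantified because peptide 1 is missing.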

Further considerations for parameter selection

With best_charge_and_fraction and aggregate, there is a trade-off between comparability of protein abundances within a sample and of abundances for the same protein across different samples.
Setting best_charge_and_fraction may increase reproducibility between samples, but will distort the proportions of protein abundances within a sample. The reason is that ionization properties vary between peptides, but should remain constant across samples. Filtering by charge state can help to reduce the impact of feature detection differences between samples.
For aggregate, (intensity-weighted) mean/median and sum differ qualitatively in how missing peptide abundances affect the result (relevant only if include_all is set or top is 0). The (intensity-weighted) mean and the median ignore missing cases and average only the values that are present. If low-abundance peptides are not detected in some samples, the computed protein abundances for those samples may thus be overestimated. sum implicitly treats missing values as zero, so this problem does not occur and comparability across samples is preserved. However, with sum the total number of peptides ("summands") available for a protein may affect the abundance computed for it (depending on top), so abundances of different proteins within a sample may no longer be proportional.
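
A small numeric illustration of this difference, using made-up peptide abundances for one protein in two samples, where the least abundant peptide was not detected in the second sample:

```python
from statistics import mean

# Hypothetical peptide abundances of the same protein in two samples.
sample_a = [100.0, 50.0, 10.0]
sample_b = [100.0, 50.0]  # the low-abundance peptide is missing here

# The mean averages only the values present, so the missing low value
# raises the result for sample B.
mean_a, mean_b = mean(sample_a), mean(sample_b)

# The sum treats the missing peptide as contributing zero.
sum_a, sum_b = sum(sample_a), sum(sample_b)
```

With the mean, sample B (75) appears to contain more of the protein than sample A (about 53.3), although nothing except the detection of one peptide differs; with the sum, sample B (150) is slightly below sample A (160).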