Tandem MS is a technique for compound identification in untargeted metabolomics experiments. Because of a lack of reference spectra, most molecules cannot be identified, and many spectra cannot be used. We present MS2LDA, an unsupervised method (inspired by text-mining) that extracts common patterns of mass fragments and neutral losses—Mass2Motifs—from collections of fragmentation spectra. Structurally characterized Mass2Motifs can be used to annotate molecules for which no reference spectra exist and expose biochemical relationships between molecules. For four beer extracts, without training data, we show that, with 30 structurally characterized Mass2Motifs, we can annotate approximately three times as many molecules as with library matching. These Mass2Motifs were validated in reference spectra from Global Natural Products Social Molecular Networking (GNPS) and MassBank.
PNAS Publication: Topic modeling for untargeted substructure exploration in metabolomics. The data and code for the paper can be found at http://dx.doi.org/10.5525/gla.researchdata.313.
An improved version of the MS2LDA code, which allows topics (i.e. Mass2Motifs) to be inferred across multiple document collections (i.e. fragmentation files) at once, can be found at http://github.com/sdrogers/lda. The remaining pipeline code, which processes and loads fragmentation data, can also be found there. The code for this website itself, alongside various visualisation modules, can be found at http://github.com/sdrogers/ms2ldaviz.
How does it work?
In MS2LDA, discrete fragment and neutral loss features are extracted from fragmentation spectra. Features that tend to co-occur across spectra are then grouped using the Latent Dirichlet Allocation (LDA) model. The figure below shows the analogy between LDA for text and MS2LDA for fragment and neutral loss features. In text, LDA finds topics that can be interpreted as ‘football-related’, ‘business-related’ and ‘environment-related’. MS2LDA finds sets of co-occurring mass fragments or losses (Mass2Motifs) that can be interpreted as ‘Asparagine-related’, ‘Hexose-related’ and ‘Adenine-related’.
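To make the text analogy concrete, the sketch below shows one way a fragmentation spectrum could be turned into a bag of "words" before any topic model is applied. It is not the MS2LDA pipeline code: the function names, the 0.01 Da bin width, the minimum loss of 10 Da, the scaling of intensities to counts out of 100, and the two example spectra are all assumptions made for illustration.

```python
from collections import Counter

BIN_WIDTH = 0.01  # assumed m/z bin width in Da (illustrative only)
MIN_LOSS = 10.0   # assumed smallest neutral loss worth keeping, in Da


def bin_mz(value, width=BIN_WIDTH):
    """Snap an m/z value to the nearest multiple of the bin width."""
    return round(round(value / width) * width, 4)


def spectrum_to_words(precursor_mz, peaks, max_count=100):
    """Convert one spectrum into a bag of fragment and loss 'words'.

    peaks is a list of (mz, intensity) tuples. Intensities are scaled so
    the base peak gets max_count, and the scaled value is the word count.
    """
    words = Counter()
    if not peaks:
        return words
    top = max(intensity for _, intensity in peaks)
    for mz, intensity in peaks:
        count = int(round(max_count * intensity / top))
        if count == 0:
            continue
        # Fragment feature: the binned m/z of the peak itself.
        words[f"fragment_{bin_mz(mz):.2f}"] += count
        # Neutral loss feature: precursor m/z minus fragment m/z.
        loss = precursor_mz - mz
        if loss >= MIN_LOSS:
            words[f"loss_{bin_mz(loss):.2f}"] += count
    return words


def build_matrix(spectra):
    """Build a document-term count matrix (one row per spectrum)."""
    bags = [spectrum_to_words(precursor, peaks) for precursor, peaks in spectra]
    vocab = sorted({word for bag in bags for word in bag})
    matrix = [[bag.get(word, 0) for word in vocab] for bag in bags]
    return vocab, matrix


# Two made-up spectra: (precursor m/z, [(fragment m/z, intensity), ...]).
spectra = [
    (325.1129, [(163.0601, 1000.0), (145.0495, 250.0)]),
    (343.1235, [(181.0707, 500.0)]),
]
vocab, matrix = build_matrix(spectra)

for word, column in zip(vocab, zip(*matrix)):
    print(word, list(column))
```

In this toy input the two spectra share no fragment, yet both contain the word `loss_162.05`. Shared features of this kind are what a topic model can pick up as a motif. The resulting count matrix has the same shape as a document-term matrix for text, so it could in principle be passed to any LDA implementation.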