As the cost of DNA sequencing continues to fall, an increasing amount of information on human genetic variation is being produced that could help advance precision medicine. However, information about such mutations is typically first made available in the scientific literature and only later manually curated into more standardized genomic databases. This curation process is expensive and time-consuming, and many variants are never fully curated, if they are curated at all. Detecting mutations in the literature is the first key step towards automating this process. However, most current methods have focused on identifying mutations that follow existing nomenclatures. In this work, we show that a large number of mutations are missed by this standard approach. Furthermore, we implement the first mutation annotator to cover an extended mutation landscape, and we show that its F1 performance is comparable to that of human annotation (F1 78.29 for manual annotation vs F1 79.56 for automatic annotation).

Learning Objective 1: Formulate an approach to extract genetic variation from the scientific literature, which is relevant to understanding the genetic causes of some diseases and potential resistance to treatment in some patients.


Antonio Jimeno Yepes (Presenter)
IBM Research Australia

Andrew MacKinlay, IBM Research Australia
Natalie Gunn, IBM Research Australia
Christine Schieber, IBM Research Australia
Noel Faux, IBM Research Australia
Matthew Downton, IBM Research Australia
Benjamin Goudey, IBM Research Australia
Richard Martin, IBM
