Lung cancer is the second most common cancer and by far the leading cause of cancer-related death in both males and females, accounting for 1 in 4 cancer deaths in the U.S. Accurate identification of lung cancer related information is essential for epidemiological studies and is therefore critical for improving cancer outcomes. Epidemiologists rely on electronic medical records (EMRs), which hold rich longitudinal data on large populations, but manual extraction from large volumes of text is time consuming and labor intensive. Considerable effort has therefore gone into automatically extracting information from text for lung cancer patients using natural language processing (NLP). In this study, we developed and evaluated an NLP system that captures information on stage, histology, grade, chemotherapy, radiotherapy, and surgery in lung cancer patients from narrative EMR data sources, including clinical notes, pathology reports, and surgery reports.
We used an existing cohort of 2,311 lung cancer patients whose stage, histology, grade, and therapies had been manually ascertained. Based on this cohort, we developed an NLP system using the open source clinical NLP pipeline MedTagger as the platform [1]. Specifically, we used the sentence detection and tokenization components of MedTagger. The NLP system then integrated rules and an algorithm to output the final normalized concept name for each data element. We evaluated the output of the NLP system against the human-abstracted results in the existing dataset. Deep learning was also used to predict values for the data elements, taking sentences labeled by the NLP system as input. Finally, for histology extraction, we carried out an error analysis comparing the NLP results, the deep learning predictions, and the reference standard from the existing cohort.
Results
Evaluation showed promising results: recall for stage, histology, grade, and therapies was 89%, 98%, 78%, and 100%, respectively, and precision was 70%, 88%, 90%, and 100%, respectively. Error analysis in 100 patients indicated that the NLP system helped to identify more specific histological types that were not provided in the reference standard (e.g., adenocarcinoma in 8 patients) and to identify the correct histology type in 1 patient who had been assigned another type in the reference standard. Among 4 cases misidentified by the NLP system and deep learning, 2 had no related information recorded, and 2 had related information that was extracted but then missed because of the priority given to different data sources in our algorithm.
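To make the evaluation and the source-priority errors concrete, the following is a minimal sketch, not the system described in this study. It assumes a few hypothetical regular-expression rules that map histology mentions to normalized concept names, a hypothetical priority order over data sources (the actual order is not stated here), and one common way of counting precision and recall per patient: a prediction is correct only when it equals the reference value. All names, rules, and example texts are illustrative assumptions.

```python
import re

# Hypothetical rules: regex pattern -> normalized histology concept name.
HISTOLOGY_RULES = [
    (re.compile(r"\badenocarcinoma\b", re.I), "adenocarcinoma"),
    (re.compile(r"\bsquamous cell carcinoma\b", re.I), "squamous cell carcinoma"),
    (re.compile(r"\blarge cell carcinoma\b", re.I), "large cell carcinoma"),
]

# Assumed priority order of data sources (highest first); illustrative only.
SOURCE_PRIORITY = ["pathology report", "surgery report", "clinical note"]


def extract_histology(text):
    """Return the first normalized concept whose rule matches, else None."""
    for pattern, concept in HISTOLOGY_RULES:
        if pattern.search(text):
            return concept
    return None


def resolve_by_priority(documents):
    """Take the value from the highest-priority source that yields a match.

    `documents` maps a source name to its text for one patient.
    """
    for source in SOURCE_PRIORITY:
        concept = extract_histology(documents.get(source, ""))
        if concept is not None:
            return concept
    return None


def precision_recall(predicted, reference):
    """Patient-level precision and recall against a reference standard."""
    tp = sum(
        1 for pid, ref in reference.items()
        if ref is not None and predicted.get(pid) == ref
    )
    n_pred = sum(1 for value in predicted.values() if value is not None)
    n_ref = sum(1 for value in reference.values() if value is not None)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_ref if n_ref else 0.0
    return precision, recall


# Toy patients (invented text, not study data).
patients = {
    "p1": {"pathology report": "Invasive adenocarcinoma, grade 2."},
    "p2": {"clinical note": "History of squamous cell carcinoma of the lung."},
    # The lower-priority note holds the reference value, so it is missed.
    "p3": {
        "pathology report": "Findings consistent with adenocarcinoma.",
        "clinical note": "Squamous cell carcinoma confirmed on review.",
    },
    # No histology is recorded in any narrative source.
    "p4": {"clinical note": "Follow-up visit, no new complaints."},
}
reference = {
    "p1": "adenocarcinoma",
    "p2": "squamous cell carcinoma",
    "p3": "squamous cell carcinoma",
    "p4": "large cell carcinoma",
}

predicted = {pid: resolve_by_priority(docs) for pid, docs in patients.items()}
precision, recall = precision_recall(predicted, reference)
print(predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy data, patient p3 mirrors the kind of error attributed to source priority (the relevant mention exists but a higher-priority source wins), and patient p4 mirrors a case with no related information recorded.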
This study demonstrated the feasibility and accuracy of extracting lung cancer related information from clinical narratives for lung cancer research.
Learning Objective 1: This study shows the audience how to develop and evaluate an NLP system that captures information on stage, histology, grade, chemotherapy, radiotherapy, and surgery in lung cancer patients from various narrative EMR data sources.
Liwei Wang, Mayo Clinic
Lei Luo, Mayo Clinic
Yanshan Wang, Mayo Clinic
Jason A Wampfler, Mayo Clinic
Ping Yang, Mayo Clinic
Hongfang Liu (Presenter)