Objective: To develop a highly accurate method for extraction of colorectal polyp number, size, and location from colonoscopy electronic medical records.
Materials: Our algorithm was developed using 588 colonoscopy records from a tertiary academic medical center in New Hampshire, with 463 records used for training and 125 records used for testing. Approximately 80% of the records in both the training and testing sets had semi-structured formats, and the remaining 20% were in a free-text format.
Methods: Using a natural language processing (NLP) approach with regular expressions and rule-based methods, we developed an information extraction pipeline to recover polyp location (e.g., cecum, sigmoid, transverse), size (mm), and number from both free-text and semi-structured colonoscopy records.
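To make the regular-expression idea concrete, the following is a minimal sketch of sentence-level extraction, not the pipeline described above. The function name, the patterns, the count-word table, and the sample sentences are all hypothetical, and the location vocabulary is limited to the three examples named in the Methods.

```python
import re

# Assumption: only the three example locations named in the abstract are listed;
# a practical vocabulary would need to cover more colon segments and synonyms.
LOCATION_RE = re.compile(r"\b(cecum|sigmoid|transverse)\b", re.IGNORECASE)

# A number (optionally decimal) followed by the unit "mm", e.g. "5 mm" or "3.5mm".
SIZE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*mm\b", re.IGNORECASE)

# Hypothetical table of count words; "a"/"an" are read as a single polyp.
COUNT_WORDS = {"a": 1, "an": 1, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

# A count word, then up to four intervening tokens, then "polyp" or "polyps".
COUNT_RE = re.compile(
    r"\b(a|an|one|two|three|four|five)\b(?:\s+[\w.-]+){0,4}?\s+polyps?\b",
    re.IGNORECASE,
)


def extract_polyp_descriptors(sentence):
    """Return location, sizes in mm, and polyp count found in one sentence."""
    location = LOCATION_RE.search(sentence)
    count = COUNT_RE.search(sentence)
    return {
        "location": location.group(1).lower() if location else None,
        "sizes_mm": [float(s) for s in SIZE_RE.findall(sentence)],
        "number": COUNT_WORDS[count.group(1).lower()] if count else None,
    }


if __name__ == "__main__":
    # Invented example sentences, not taken from any real record.
    print(extract_polyp_descriptors("A 5 mm polyp was found in the cecum."))
    print(extract_polyp_descriptors(
        "Two sessile polyps, 3 mm and 4 mm, were removed from the sigmoid colon."
    ))
```

Rules of this kind tend to work best on semi-structured records, where descriptors appear in predictable positions; free-text records usually need additional patterns and handling of negation.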
Results: In an all-or-none manual evaluation on a test set of 125 records that included both free-text and semi-structured formats, our method extracted polyp size, location, and number with an accuracy of 0.90. Additionally, for detecting the correct number of polyp observations, model precision, accuracy, recall, and F1-score were 0.94, 0.94, 0.99, and 0.96, respectively. Finally, the accuracy for capturing polyp location, size, and number individually each exceeded 0.90. As expected, model performance was higher for semi-structured than for free-text records.
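As a sketch of how such metrics can be computed, the snippet below defines the F1-score as the harmonic mean of precision and recall and gives one possible reading of an all-or-none accuracy, in which a record counts as correct only if every extracted (location, size, number) tuple matches the reference. That reading, the function names, and the toy records are assumptions for illustration; the evaluation reported above was performed manually.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def all_or_none_accuracy(predicted, gold):
    """Fraction of records whose full list of polyp tuples matches the reference.

    Assumption: a record is scored correct only when every
    (location, size_mm, number) tuple agrees with the gold annotation.
    """
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


if __name__ == "__main__":
    # Consistency check with the reported precision (0.94) and recall (0.99).
    print(round(f1_score(0.94, 0.99), 2))  # 0.96

    # Toy records, invented for illustration only.
    predicted = [[("cecum", 5.0, 1)], [("sigmoid", 3.0, 2)]]
    gold = [[("cecum", 5.0, 1)], [("sigmoid", 4.0, 2)]]
    print(all_or_none_accuracy(predicted, gold))  # 0.5
```

The first check shows that the reported F1-score of 0.96 agrees with the reported precision and recall.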
Conclusion: Our rule-based method can potentially help automate extraction of polyp size, number, and location with high accuracy for applications such as updating colonoscopy registries, developing decision support systems, and evaluating clinical hypotheses for colorectal polyp prevention and treatment. Our method improves upon existing approaches [1,2], which considered only general regions of the colon (e.g., left versus right) and the size of the largest adenoma, whereas we extract the exact size, location, and number for all recorded polyps.
Brief Abstract: Extraction of relevant descriptors of polyps – such as number, size, and location – from colonoscopy records is a time-consuming and potentially error-prone task. We have developed a natural language processing (NLP) approach using rule-based methods and regular expressions to automatically extract these descriptors from colonoscopy records. Our method exhibits promising accuracy, precision, recall and F1-score.

1. Imler TD, Morea J, Kahi C, Imperiale TF. Natural language processing accurately categorizes findings from colonoscopy and pathology reports. Clin Gastroenterol Hepatol. 2013;11(6):689–94.
2. Raju GS, Lum PJ, Slack RS, Thirumurthi S, Lynch PM, Miller E, et al. Natural language processing as an alternative to manual reporting of colonoscopy quality metrics. Gastrointest Endosc. 2015;82(3):512–9.

Learning Objective 1: Apply rigorous natural language processing to electronic medical records to assess colorectal cancer risk in a vulnerable population.


Lia Harrington, Dartmouth College
Arief Suriawinata, Dartmouth
Todd MacKenzie, Dartmouth College
John Higgins, Dartmouth College
Saeed Hassanpour (Presenter), Dartmouth College
