Much of the critical information in a patient’s electronic health record (EHR) is hidden in unstructured text. As a result, automated text extraction and summarization play an increasing role in making this information quick and easy to understand. While many techniques for extracting text from clinical notes have been examined, most are either narrowly targeted or focus primarily on concept-level extraction, potentially missing important contextual information. In contrast, in this work we examine the extraction of several clinical categories at the phrase level, attempting to provide the necessary context while still keeping the extracted elements concise. To do so, we employ a three-stage pipeline that extracts categorized phrases of interest using clinical concepts as anchor points. Results suggest that the proposed method achieves performance comparable to that of individual human annotators.

Learning Objective 1: After participating in this session, the learner should be better able to understand the difficulties associated with automatically extracting text phrases of arbitrary length from clinical notes, and to discuss possible machine-learning-based solutions.


Tyler Baldwin (Presenter)
IBM Almaden Research Center

Yufan Guo, IBM Almaden Research Center
Vandana Mukherjee, IBM Almaden Research Center
Tanveer Syeda-Mahmood, IBM Almaden Research Center
