There is increasing interest in developing prediction models capable of identifying rare disease patients in population-scale databases such as electronic health records (EHRs). Deriving these models is challenging for many reasons, perhaps the most important being the limited number of patients with ‘gold standard’ confirmed diagnoses from which to learn. This paper presents a novel cascade learning methodology that induces accurate prediction models from noisy ‘silver standard’ labeled data, that is, patients provisionally labeled as positive for the target disease on the basis of unconfirmed evidence. The algorithm combines unsupervised feature selection, supervised ensemble learning, and unsupervised clustering to enable robust learning from noisy labels. The efficacy of the approach is illustrated through a case study involving the detection of lipodystrophy patients in a country-scale database of EHRs. The case study demonstrates that our algorithm outperforms state-of-the-art prediction techniques and permits discovery of previously undiagnosed patients in large EHR databases.
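To make the three-stage structure named in the abstract concrete, the following is a minimal toy sketch and not the authors' algorithm, whose details are not given here. It assumes simple stand-ins for each stage: a variance filter for unsupervised feature selection, bagged one-feature rules for the supervised ensemble, and one-dimensional two-means on the ensemble scores for the clustering step. All function names, parameters, and the synthetic data are hypothetical.

```python
import random
from statistics import mean, pvariance

def select_features(X, min_var=0.01):
    """Unsupervised stage (assumed stand-in): keep columns with variance above min_var."""
    return [j for j, col in enumerate(zip(*X)) if pvariance(col) > min_var]

def fit_stump(X, y, candidates):
    """Fit a one-feature rule: the silver-label rate for each value of a binary feature."""
    base = mean(y)
    best = None
    for j in candidates:
        rates = []
        for v in (0, 1):
            labels = [yi for xi, yi in zip(X, y) if xi[j] == v]
            rates.append(mean(labels) if labels else base)
        gap = abs(rates[1] - rates[0])
        if best is None or gap > best[0]:
            best = (gap, j, rates)
    return best[1], best[2]

def fit_ensemble(X, y, features, n_models=50, seed=0):
    """Supervised stage (assumed stand-in): bag stumps over bootstrap rows and feature pairs."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        rows = [rng.randrange(n) for _ in range(n)]
        candidates = rng.sample(features, k=min(2, len(features)))
        models.append(fit_stump([X[i] for i in rows], [y[i] for i in rows], candidates))
    return models

def score(models, x):
    """Average the stump outputs for one patient."""
    return mean(rates[x[j]] for j, rates in models)

def high_score_cluster(scores, n_iter=20):
    """Clustering stage (assumed stand-in): 1-D two-means; return indices of the high cluster."""
    lo, hi = min(scores), max(scores)
    for _ in range(n_iter):
        high = [s for s in scores if abs(s - hi) < abs(s - lo)]
        low = [s for s in scores if abs(s - hi) >= abs(s - lo)]
        if not high or not low:
            break
        lo, hi = mean(low), mean(high)
    return [i for i, s in enumerate(scores) if abs(s - hi) < abs(s - lo)]

# Synthetic data (purely illustrative): 30 true cases among 300 patients.
rng = random.Random(1)
X, truth, silver = [], [], []
for i in range(300):
    case = i < 30
    p = 0.9 if case else 0.1
    row = [int(rng.random() < p) for _ in range(3)]      # informative features
    row += [int(rng.random() < 0.5) for _ in range(2)]   # noise features
    row.append(0)                                        # constant feature
    X.append(row)
    truth.append(case)
    # Noisy silver-standard label: misses some cases, flags some non-cases.
    silver.append(int(rng.random() < (0.8 if case else 0.03)))

kept = select_features(X)
models = fit_ensemble(X, silver, kept)
# Simplification: patients are scored on the same data used for fitting.
scores = [score(models, x) for x in X]
flagged = high_score_cluster(scores)
```

In this toy setup, `flagged` holds the indices of patients whose ensemble scores fall in the higher-scoring cluster. These are candidates for review rather than confirmed diagnoses, and a realistic pipeline would use held-out data and far richer features.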

Learning Objectives: After participating in this session, the learner should:

* Better understand the need to detect undiagnosed rare disease patients and the challenges of carrying out such detection in large electronic health record (EHR) databases.

* Be able to discuss existing solutions to this problem, including their respective strengths and weaknesses.

* Learn about a novel approach to detecting rare disease patients in EHRs which does not require knowledge of patients with confirmed diagnoses for model training.

* Be better able to apply modern machine learning methods to the tasks of rare disease patient identification and cohort selection in clinical settings.


Rich Colbaugh (Presenter)
Volv Global

Kristin Glass, Volv Global
Christopher Rudolf, Volv Global
Mike Tremblay, Volv Global
