Biases in electronic health records

Title Biases in electronic health records
Summary To evaluate the impact of sample bias on the predictive value of machine learning models built using EHR data
Keywords Machine Learning, Electronic health records, Sample bias
TimeFrame Spring 2019
References 1. Verheij, Robert A., et al. "Possible Sources of Bias in Primary Care Electronic Health Record Data Use and Reuse." Journal of Medical Internet Research 20.5 (2018).

2. Gianfrancesco, Milena A., et al. "Potential Biases in Machine Learning Algorithms Using Electronic Health Record Data." JAMA Internal Medicine (2018).

3. Johnson, Alistair E. W., et al. "MIMIC-III, a freely accessible critical care database." Scientific Data 3 (2016): 160035.

Prerequisites Good knowledge of applied mathematics. An ability to implement state-of-the-art algorithms in a suitable programming environment. An interest in machine learning algorithms and medical data analysis.
Supervisor Awais Ashfaq, Sławomir Nowaczyk
Level Master
Status Open

Predictive modeling with electronic health records (EHRs) is considered an essential step towards precision medicine and better care quality. However, prediction models built without a thorough understanding of the complexities and limitations of EHR data risk being biased and incorrect. For instance, data collection in EHRs depends on each patient's needs and health state: sicker patients tend to have more data in EHRs than healthier patients, so prediction models are likely to be biased towards the sicker population. In machine learning this is referred to as 'sample bias', because the distribution of the available data does not reflect the environment in which the model is meant to operate. The goal of the project is to evaluate the impact of sample bias on different prediction models built using EHR data. You will use the MIMIC-III database (see references) for this project.
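
The mechanism can be illustrated with a small simulation. The following is only a sketch on synthetic data, not on MIMIC-III: the severity score, the rule linking severity to the number of visits, and the outcome rule are all invented for illustration. It compares the outcome rate computed per patient with the rate obtained when every recorded visit counts as one sample.

```python
import random


def simulate(n_patients=10000, seed=0):
    """Return (outcome rate per patient, outcome rate per record)."""
    rng = random.Random(seed)
    patient_deaths = 0
    record_deaths = 0
    n_records = 0
    for _ in range(n_patients):
        severity = rng.random()               # hypothetical severity in [0, 1)
        died = rng.random() < 0.3 * severity  # hypothetical outcome rule
        visits = 1 + int(10 * severity)       # assumption: sicker patients leave more records
        patient_deaths += died
        record_deaths += died * visits        # each visit of a deceased patient counts once
        n_records += visits
    return patient_deaths / n_patients, record_deaths / n_records


patient_rate, record_rate = simulate()
print(f"outcome rate per patient: {patient_rate:.3f}")
print(f"outcome rate per record:  {record_rate:.3f}")
```

Under these assumptions the per-record rate comes out higher than the per-patient rate, because patients with high severity contribute more rows. A model trained on the rows would see a population that looks sicker than the real one.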

Tentative project plan:

1- Review potential sources of bias in EHRs (see references for a starting point).

2- Build models (logistic regression, neural networks, random forests, etc.) to predict an outcome of interest (such as in-hospital death) using the MIMIC-III database.

3- Sub-sample the training and test sets based on your knowledge from step 1 and re-run the developed models. For instance, you may sub-sample the data by age, gender, frequency of visits, or time and place of visit.

4- Evaluate the change (if any) in model performance and discuss it in detail.

5- Suggest possible solutions to overcome the impact of sample bias when building EHR-driven prediction models.
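
Steps 2 to 4 of the plan could be prototyped along the following lines. This is a minimal sketch, not a prescribed method: it uses a synthetic cohort instead of MIMIC-III, and the two features (`age`, `lab`), the outcome rule, and the age threshold of 65 are hypothetical choices made only so the example runs on its own. It assumes NumPy and scikit-learn are installed. A logistic regression is trained once on the full training set and once on a sub-sample of older patients, and the area under the ROC curve is then compared on the whole test set and on each age group.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 4000

# Synthetic stand-in for an EHR cohort; nothing here comes from MIMIC-III.
age = rng.uniform(18, 90, n)
lab = rng.normal(0.0, 1.0, n)  # hypothetical standardized lab value
# Hypothetical outcome rule: the lab value matters more for older patients.
logit = -2.0 + 0.03 * (age - 50) + np.where(age >= 65, 1.5, 0.3) * lab
died = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

X = np.column_stack([age, lab])
X_train, X_test, y_train, y_test = train_test_split(
    X, died, test_size=0.3, random_state=0, stratify=died
)


def evaluate(train_mask):
    """Fit on the selected training rows, report AUROC on test subgroups."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[train_mask], y_train[train_mask])
    scores = model.predict_proba(X_test)[:, 1]
    older = X_test[:, 0] >= 65
    return {
        "all": roc_auc_score(y_test, scores),
        "age >= 65": roc_auc_score(y_test[older], scores[older]),
        "age < 65": roc_auc_score(y_test[~older], scores[~older]),
    }


results = {
    "full training set": evaluate(np.ones(len(X_train), dtype=bool)),
    "older patients only": evaluate(X_train[:, 0] >= 65),
}

for name, aucs in results.items():
    summary = ", ".join(f"{group}: {auc:.3f}" for group, auc in aucs.items())
    print(f"{name} -> {summary}")
```

The same pattern extends to the other models and sub-sampling criteria listed in the plan: swap the estimator, or replace the age mask with one based on gender, visit frequency, or time of visit. Reporting performance per subgroup rather than only overall makes it easier to see where a biased training sample hurts.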


Expected deliverables:

1- A succinct review of different biases in electronic health records and their impact on predictive models.

2- A summary of recent (2017-2018) studies on predicting in-hospital death using EHR data.

3- Results: prediction performance of the developed models on the complete EHR data and on its sub-samples.

4- A critical analysis of the bias problem in light of your results.

Should you require more information about the project, feel free to contact Awais Ashfaq (