[Paper] BEHRT: Transformer for Electronic Health Records (2020)
22 Jan 2024 #bio #ehr #transformer
Li, Yikuan, et al. “BEHRT: transformer for electronic health records.” Scientific reports 10.1 (2020): 7155.
Points
- BEHRT (BERT for EHR): a BERT-based architecture
- In EHR, certain diseases can be reversed, or the time interval between two diagnoses can be shorter or longer than recorded.
→ Bidirectional contextual awareness of the model’s representation is a big advantage with EHR data.
- Transfer learning: pre-training on prediction of masked disease words (Masked Language Modeling, MLM), then fine-tuning on three disease prediction tasks
- Disease embeddings: show the relations between the various diseases
Background
- In traditional research on EHR data, individuals are represented as features for the models; experts had to define the appropriate features by hand.
- Studies applying Deep Learning (DL) to EHR started to show that DL models can outperform the traditional Machine Learning (ML) methods
- With DL architectures for sequence data, such as Recurrent Neural Networks (RNNs), the application of DL models for EHR was improved in terms of capturing the long-term dependencies among events.
- Similarities between sequences in EHR and natural language led to the successful transferability of techniques
Method
Data
Clinical Practice Research Datalink (CPRD)
- One of the largest linked primary care EHR systems
- Contains longitudinal data from a network of 674 general practitioner (GP) practices in the UK
- Cohort narrowed from 8 million patients (eligible for linkage to HES and meeting CPRD’s quality standards) to 1.6 million patients with at least 5 visits in their EHR
- Only data from GP practices that consented to record linkage with HES were considered in this study
Input Features
- Only diagnoses and ages are considered as input features
The patient $p$’s EHR: \[ V_p = \{v^1_p, v^2_p, v^3_p, \dots, v^{n_p}_p\} \]
The patient $p$’s EHR of the $j$th visit: $v^j_p$
$v^j_p$ is a list of one or more ($m$) diagnoses: $v^j_p = \{d_1, \dots, d_m\}$
Input sequence: \[ I_p = \{CLS, v^1_p, SEP, v^2_p, SEP, \dots, v^{n_p}_p, SEP\} \]
- $CLS$ token: the start of a medical history
- $SEP$ token: between visits
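As a minimal illustration (not code from the paper), flattening a patient's visits into this token sequence could look like the sketch below; the diagnosis names and the helper function are hypothetical.

```python
# Sketch: flatten a patient's visits into a BEHRT-style token sequence.
# Diagnosis names below are hypothetical illustrations.

def build_input_sequence(visits):
    """visits: list of visits, each a list of diagnosis codes (strings)."""
    tokens = ["CLS"]                 # CLS marks the start of the medical history
    for visit in visits:
        tokens.extend(visit)         # one or more diagnoses per visit
        tokens.append("SEP")         # SEP separates consecutive visits
    return tokens

visits = [["hypertension"], ["type2_diabetes", "hyperlipidaemia"], ["angina"]]
print(build_input_sequence(visits))
# ['CLS', 'hypertension', 'SEP', 'type2_diabetes', 'hyperlipidaemia', 'SEP', 'angina', 'SEP']
```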
Embeddings
- A combination of 4 embeddings: disease + position + age + visit segment
- Disease embeddings: past diseases can improve the accuracy of the prediction for future diagnoses
- Positional encodings: determine the relative position in the EHR sequence
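A rough PyTorch sketch, under my own assumptions about sizes and module names, of how the four embeddings could be combined by element-wise summation (as in BERT); the dimensions below are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class BEHRTEmbedding(nn.Module):
    """Sketch: sum of disease, position, age, and visit-segment embeddings.
    All sizes here are illustrative assumptions."""

    def __init__(self, n_diseases=305, n_ages=120, max_len=512, dim=288):
        super().__init__()
        self.disease = nn.Embedding(n_diseases, dim)   # diagnosis vocabulary
        self.age = nn.Embedding(n_ages, dim)           # age at each visit
        self.segment = nn.Embedding(2, dim)            # alternating visit segment (A/B)
        self.position = nn.Embedding(max_len, dim)     # position in the sequence

    def forward(self, disease_ids, age_ids, segment_ids):
        # disease_ids, age_ids, segment_ids: (batch, seq_len) integer tensors
        positions = torch.arange(disease_ids.size(1), device=disease_ids.device)
        return (self.disease(disease_ids)
                + self.age(age_ids)
                + self.segment(segment_ids)
                + self.position(positions))
```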
Tasks
Pre-training: prediction of masked disease words (MLM)
- 86.5% of the disease words were left unchanged, 12% were replaced with the mask token, and the remaining 1.5% were replaced with randomly chosen disease words
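The split above can be implemented roughly as follows; this is my own sketch (the handling of special tokens and whether unchanged tokens contribute to the loss are assumptions, not details stated in the notes).

```python
import random

def mask_diseases(tokens, disease_vocab, p_keep=0.865, p_mask=0.12):
    """Sketch of the MLM corruption: keep 86.5% of disease tokens unchanged,
    replace 12% with MASK, and the remaining 1.5% with a random disease."""
    corrupted, labels = [], []
    for tok in tokens:
        if tok in ("CLS", "SEP"):                 # never corrupt special tokens
            corrupted.append(tok)
            labels.append(None)
            continue
        r = random.random()
        if r < p_keep:
            corrupted.append(tok)                 # unchanged, no prediction target
            labels.append(None)
        elif r < p_keep + p_mask:
            corrupted.append("MASK")              # model must recover the original
            labels.append(tok)
        else:
            corrupted.append(random.choice(disease_vocab))  # random replacement
            labels.append(tok)
    return corrupted, labels
```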
Fine-tuning: 3 different disease prediction tasks
- Prediction of diseases in the next visit (T1)
- Randomly choose an index $j$ $(3 < j < n_p)$ for each patient
- Form the input as $x_p = \{v^1_p, \dots, v^j_p\}$
- Output $y_p = w_{j+1}$, where $w_{j+1}$ is a multi-hot vector indicating the diseases that occur in $v^{j+1}_p$ (see the label-construction sketch after this list)
- Each patient contributes only one input-output pair to the training and evaluation process.
- Prediction of diseases in the next 6 months (T2) & in the next 12 months (T3)
- Patients that don’t have at least 6 or 12 months’ worth of EHR are not included in the analysis
- Choose $j$ randomly from $(3, n^*)$, where $n^*$ is the highest index that still leaves 6 or 12 months of follow-up
- Output $y_p = w_{6m}$ (T2) or $y_p = w_{12m}$ (T3)
- The number of patients for each task: 699K, 391K, 342K
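For all three tasks the target is a multi-hot vector over the disease vocabulary; a small sketch of building it from the diagnoses in the prediction window (next visit for T1, next 6/12 months for T2/T3) follows, with a hypothetical disease-to-index mapping.

```python
import numpy as np

def multi_hot_label(future_diagnoses, disease_to_index):
    """Sketch: turn the diseases appearing in the prediction window
    into a multi-hot target vector y_p."""
    y = np.zeros(len(disease_to_index), dtype=np.float32)
    for d in set(future_diagnoses):
        y[disease_to_index[d]] = 1.0
    return y

# Hypothetical tiny vocabulary for illustration:
vocab = {"hypertension": 0, "type2_diabetes": 1, "angina": 2, "asthma": 3}
print(multi_hot_label(["angina", "hypertension"], vocab))   # [1. 0. 1. 0.]
```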
Results
Disease Embeddings
t-SNE Results
- The natural stratification of gender-specific diseases
- Natural clusters are formed that in most cases consist of diseases from the same chapter
- The 10 closest diseases by cosine similarity of their embeddings were found to be as similar as those provided by a clinical researcher (0.757 overlap) - Supplementary table S3
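Given the learned embedding matrix, this nearest-neighbour comparison can be reproduced with a straightforward cosine-similarity lookup; the sketch below is my own illustration of the procedure, not the authors' code.

```python
import numpy as np

def closest_diseases(embeddings, names, query, k=10):
    """Sketch: the k diseases whose embeddings have the highest cosine
    similarity to the query disease's embedding."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed[names.index(query)]
    order = np.argsort(-sims)
    return [(names[i], float(sims[i])) for i in order if names[i] != query][:k]
```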
Attention and Interpretability
Found relationships among events that go beyond temporal/sequence adjacency
Analysed the attention patterns using the visualization approach of Vig
- Strong connections between rheumatoid arthritis and enthesopathies and synovial disorders → attention can go beyond recent events and find long-range dependencies among diseases
Disease Prediction
Metrics: average precision score (APS), AUROC
- APS: a weighted mean of the precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight
- Performance scores
→ Outperformed the best existing model by more than 8% (in APS) when predicting over a range of more than 300 diseases
- Comparing performance for each disease
APS and AUROC scores computed over all the $y_p$ and $y^*_p$ vectors of a disease
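Both metrics are available in scikit-learn; below is a sketch of the per-disease computation over a label matrix and a predicted-probability matrix (the array names are assumptions for illustration).

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def per_disease_scores(y_true, y_score):
    """Sketch: APS and AUROC for each disease (column), averaged over
    the diseases that have both positive and negative labels."""
    aps, auroc = [], []
    for d in range(y_true.shape[1]):
        labels = y_true[:, d]
        if len(np.unique(labels)) < 2:     # skip degenerate columns
            continue
        aps.append(average_precision_score(labels, y_score[:, d]))
        auroc.append(roc_auc_score(labels, y_score[:, d]))
    return float(np.mean(aps)), float(np.mean(auroc))
```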
Discussion
- Flexibility of BEHRT comes from employing 4 key concepts from EHR: disease, age, segment, position
- Gains insights about the underlying generative process of EHR
- Distributed/complex representations that are capable of capturing concepts of disease
- Future work: adding new concepts to the embeddings
- Disease embeddings provide great insights into how various diseases are related to each other: co-occurrence and closeness of diseases
- Could be used for future research as reliable disease vectors
- Important features of EHR for prediction
- Robust, gender-specific predictions without the inclusion of gender
- Position was important
- Age embeddings might be vital in diagnosing age-related diseases