Coffee Chat Brewing AI Knowledge

eng kor

[Paper] BEHRT: Transformer for Electronic Health Records (2020)

Li, Yikuan, et al. “BEHRT: transformer for electronic health records.” Scientific reports 10.1 (2020): 7155.

Paper Link

Points

  1. BEHRT (BERT for EHR): BERT-based architecture

    • In EHR, certain diseases can be reversed, or the time interval between two diagnoses can be shorter or longer than recorded.

      → Bidirectional contextual awareness of the model’s representation is a big advantage with EHR data.

  2. Transfer Learning: pre-training on predicting of masked disease words, such as Masked Language Modeling (MLM), and then fine-tuning on three disease prediction tasks
  3. Disease embeddings: show the relations between the various diseases

Background

  • In traditional research on EHR data, individuals are represented by models as features. Experts had to define the appropriate features.

  • Studies applying Deep Learning (DL) to EHR started to show that DL models can outperform the traditional Machine Learning (ML) methods
  • With DL architectures for sequence data, such as Recurrent Neural Networks (RNNs), the application of DL models for EHR was improved in terms of capturing the long-term dependencies among events.
  • Similarities between sequences in EHR and natural language lead to the successful transferability of techniques

Method

Data

Clinical Practice Research Datalink (CPRD)

  • one of the largest linked primary care EHR systems

  • Contains longitudinal data from a network of 674 general practitioner practices in the UK

스크린샷 2024-02-29 오후 8.29.23

  • from ​8 million patients (eligible for linkage to HES, meet CPRD’s quality standards) to 1.6 million patients (having at least 5 visits in the EHR)

  • Only the data from GP practices considered in this study, which consented to record linkage with HES

Input Features

스크린샷 2024-02-29 오후 8.30.30

  • Only consider the diagnoses and ages

The patient $p$’s EHR: \[ V_p={v^1_p},{v^2_p},{v^3_p}, …,{v^n_p} \]

The patient $p$’s EHR of the $j$th visit: $v^j_p$

$v^j_p$ is a list consisting of single or multiple $m$ diagnoses: $v^j_p={d_1, …d_m}$

Input sequence: \[ I_p={CLS, v^1_p, SEP, v^2_p, SEP,…,v^{n_p}_p, SEP} \]

  • $CLS$ token: the start of a medical history

  • $SEP$ token: between visits

Embeddings

스크린샷 2024-02-29 오후 8.31.09

  • A combination of 4 embeddings: disease + position + age + visit segment

  • Disease embeddings: past diseases can improve the accuracy of the prediction for future diagnoses

  • Positional encodings: determine the relative position in the EHR sequence

Tasks

Pre-training: prediction of masked disease words (MLM)

  • 86.5% of the disease words were unchanged, 12% of them replaced with the mask token, and the remaining 1.5% words replaced with randomly-chosen disease words

Fine-tuning: 3 different disease prediction tasks

  • Prediction of diseases in the next visit (T1)

    1. Randomly choose an index $j (3<j<n_p)$ for each patient
    2. Form input as $x_p={v^1_p, …, v^j_p}$
    3. Output $y_p=w_{j+1}$, $w_{j+1}$ is a multi-one-hot vector, indexed for disease that exist in $v^{j+1}_p$
    • Each patient contributes only an input-output pair to the training and evaluation process.
  • Prediction of diseases in the next 6 months (T2) & in the next 12 months (T3)

    1. Not include patients that don’t have 6 or 12 months worth of EHR in the analysis
    2. choose $j$ randomly from $(3, n*)$, where $n*$ is the highest index after 6 or 12 months
    3. output $y_p=w_{6m}$ and $y_p=w_{12m}$
  • The number of patients for each task: 699K, 391K, 342K

Results

Disease Embeddings

스크린샷 2024-03-05 오전 9.39.18

T-SNE Results

  • The natural stratification of gender-specific diseases

  • The natural clusters are formed that in most cases consist fo disease of the same chapter

  • 10 closest diseases by cosine similarity of their embeddings are founded as similar as those provided by a clinical researcher (0.757 overlab) - Supplementary table S3

Attention and Interpretability

Found the relationships among events which goes beyond temporal/sequence adjacency

Analysed the attention-based patterns by Vig

  • Strong connections between rheumatoid arthritis and enthesopathies and synovial disorders → Attention can go beyond recent events and find long-range dependencies among diseases

Disease Prediction

Metrics: average precision score(APS), AUROC

  • APS: a weighted mean of precision and recall achieved at different thresholds
  1. Performance scores

    스크린샷 2024-03-05 오전 9.49.39

    → Outperformed the best model by more than around 8% in predicting for a range of more than 300 diseases

  2. Comparing performance for each disease

    APS and AUROC scores with the all \(y_p\) and \(y*_p\) vectors of a disease

스크린샷 2024-03-05 오전 10.08.39

스크린샷 2024-03-05 오전 10.08.53

Discussion

  • Flexibility of BEHRT from employed 4 key concepts from EHR: disease, age, segment, position
    • gains insights about underlying generating process of EHR
    • Distributed/complex representations that are capable of capturing concepts of disease
    • Future work: adding new concepts to the embeddings
  • Disease embeddings provide great insights into how various diseases are related to each other: co-occurrence and closeness of diseases
    • Could be used for future research as reliable disease vectors
  • Important features of EHR for prediction

    • Robust, gender-specific predictions without inclusion of gender

    • Position was important

    • Age embeddings might be vital in diagnosing age-related diseases