26 Dec 2023
#bio
#brainImaging
#mri
I’d like to write posts summarizing various aspects of MRI and its modalities, as well as the Alzheimer’s disease-related features observable through MRI, as I have studied them.
Magnetic Resonance Imaging (MRI)
Inside a device built around strong magnets, high-frequency (radiofrequency) waves are directed at the human body to make the hydrogen nuclei in the body’s tissues resonate; the differences in the signals emitted by each tissue are converted into digital information to produce images.
Principles of MRI Imaging
Human tissues contain a significant amount of water, and the hydrogen nuclei within water molecules possess magnetic properties. By emitting high-frequency radiofrequency (RF) waves, these hydrogen nuclei can be made to resonate. When an RF pulse is applied and then turned off, the nuclei absorb and subsequently release the RF energy. Analyzing and amplifying the differences in the signals returning to the MRI device yields a two-dimensional image, which is the essence of MRI.
The magnitude and waveform of the emitted signal vary depending on factors such as the concentration of water molecules, blood flow, and the binding state with surrounding chemical structures. Consequently, the relaxation times, T1 and T2, differ based on the composition of tissues and blood. Since the composition varies with different diseases, the signals obtained also differ accordingly. By capturing these signal variations, various types of MRI images can be obtained, including T1-weighted images (T1WI), T2-weighted images (T2WI), FLAIR, and others.
The T1 and T2 relaxation times are measured based on different criteria after a 90-degree RF pulse is applied to the protons. When the magnetization of the protons is flipped from the longitudinal axis ($Mz$) to the transverse plane, an $Mxy$ vector is formed. Both relaxation times are measured from the moment when the $Mz$ vector reaches 0% and the $Mxy$ vector reaches 100%.
- T1 relaxation time: The time it takes for $Mz$ to recover up to 63%.
- Recovery is faster in fat, brain tissue, and cerebrospinal fluid (CSF) in that order (shorter T1 relaxation time).
- T2 relaxation time: The time it takes for $Mxy$ to decay down to 37%, relatively unaffected by magnetic field strength.
- Signal decay is faster in fat, brain tissue, and CSF, in that order.
- Tissues with shorter T1 relaxation times also exhibit a faster decline in the T2 curve (see the relaxation curves sketched below).
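As a reference, the two definitions above correspond to simple mono-exponential relaxation curves (a standard textbook simplification; $M_0$ denotes the equilibrium magnetization):
\[\begin{align*}
M_z(t) &= M_0\left(1-e^{-t/T1}\right), & M_z(T1) &\approx 0.63\,M_0 \\
M_{xy}(t) &= M_0\,e^{-t/T2}, & M_{xy}(T2) &\approx 0.37\,M_0
\end{align*}\]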
Water and fat show opposite signal intensities on T1- and T2-weighted images.
Spin echo is a technique for acquiring images by manipulating the repetition time (TR) and echo time (TE) while applying 90-degree and 180-degree RF pulses. TR is the time from one 90-degree pulse to the next, and TE is the time from the 90-degree pulse until the signal is acquired. By repeating the pulses during image acquisition and adjusting TR and TE, various types of images can be obtained.
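As a rough, commonly used simplification (my addition, not from the text above), the spin-echo signal from a tissue with proton density $\rho$ is approximately
\[S \propto \rho\left(1-e^{-TR/T1}\right)e^{-TE/T2},\]
so a short TR and short TE emphasize T1 differences (T1-weighted), while a long TR and long TE emphasize T2 differences (T2-weighted).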
Pros and Cons of MRI
Pros
- Better contrast of soft tissues compared to CT.
- Ability to observe anatomical, physiological, and functional information.
Cons
- Ferromagnetic artifacts: Even small amounts of ferromagnetic materials in the body can disrupt the homogeneity of the magnetic field, causing distortion in the images.
- Presence of dental fillings or other inserted materials can reduce image quality.
Contraindications
- MRI should not be used for patients with implants or other materials inside the body that may be affected by the magnetic field.
05 Nov 2023
#llm
#transformer
#nlp
Lewis, Patrick, et al. “Retrieval-augmented generation for knowledge-intensive nlp tasks.” Advances in Neural Information Processing Systems 33 (2020): 9459-9474.
Paper Link
Points
- Retrieval Augmented Generation (RAG) model combines a retriever and a generator for enhanced performance on knowledge-intensive tasks.
- RAG Variants: RAG-Sequence uses a single document for output; RAG-Token integrates multiple documents per token.
- RAG models outperform baselines in open-domain QA, abstractive QA, Jeopardy question generation, and fact verification.
- RAG models demonstrate practical benefits with easy updates to the non-parametric memory.
Background
- Large pre-trained language models (LLMs) store factual knowledge in their parameters, functioning as an implicit knowledge base.
- LLMs, however, have limitations: they cannot easily expand or revise their memory, they cannot readily provide insight into their predictions, and they may produce ‘hallucinations’.
- Recently, hybrid models such as REALM and ORQA have addressed these issues by using a differentiable retriever so that knowledge can be revised and expanded, showing promising results, primarily in open-domain question answering (QA).
Method
Retrieval-augmented generation (RAG) fine-tunes pre-trained generation models with a non-parametric memory for general-purpose tasks.
- Parametric memory: a pre-trained seq2seq transformer
- Non-parametric memory: a dense vector index of Wikipedia, accessed with a pre-trained neural retriever.
- Dense passage retriever (DPR): retrieves latent documents conditioned on the input.
- BART: the generator conditions on the latent documents together with the input to generate the output. Other seq2seq models like T5 can also be used and fine-tuned with the retriever.
- Latent documents: marginalized using a top-K approximation, either on a per-output basis or a per-token basis.
- RAG-Sequence Model: assumes the same document is responsible for all tokens.
- RAG-Token Model: considers different documents for different tokens.
Models
RAG models use the input sequence $x$ to retrieve text documents $z$ and use them as additional context when generating the target sequence $y$. RAG has two components:
- Retriever $p_\eta(z\mid x)$: returns distributions over text passages given a query $x$ with parameters $\eta$.
- Truncated with a top-K assumption.
- Generator $p_\theta(y_i\mid x,z,y_{1:i-1})$: generates a current token based on the previous $i-1$ tokens $y_{1:i-1}$, the input $x$, and a retrieved passage $z$ with parameters $\theta$.
The retriever and the generator are trained end-to-end, treating the retrieved document as a latent variable. To marginalize over the latent documents, two methods are proposed: RAG-Sequence and RAG-Token (a toy numerical sketch of both follows at the end of this subsection).
RAG-Sequence and RAG-Token
RAG-Sequence Model uses the same retrieved document to generate the complete sequence.
- The retrieved document is a single latent variable to get the seq2seq probability $p(y\mid x)$ via a top-K approximation.
- The top-K documents are retrieved using the retriever, and the generator produces the output sequence probability for each document.
\[p_{RAG-Sequence}(y\mid x) \approx \sum_{z\in top-k(p(\cdot|x))}{p_\eta(z|x)p_\theta(y|x,z)} \\ = \sum_{z\in top-k(p(\cdot|x))}{p_\eta(z|x)}\prod_i^N p_\theta(y_i|x,z,y_{1:i-1})\]
- Use cases: Better suited for tasks where the context of entire documents is crucial, like summarization tasks.
RAG-Token Model uses different latent documents for each target token.
- The generator chooses content from several documents for the answer.
- The top-K documents are retrieved using the retriever, and the generator produces a distribution for the next output token for each document before marginalizing.
\[p_{RAG-Token}(y|x)\approx \prod_i^N \sum_{z\in top-k(p(\cdot\mid x))}p_\eta(z\mid x)p_\theta(y_i\mid x,z_i,y_{1:i-1})\]
- Use cases: More suitable for tasks that benefit from integrating detailed information from multiple sources, like open-domain QA.
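To make the two marginalizations concrete, here is a toy NumPy sketch (my own illustration, not the paper’s code): `doc_probs` stands in for $p_\eta(z\mid x)$ over the top-K documents and `token_probs[z, i]` for $p_\theta(y_i\mid x,z,y_{1:i-1})$ along a fixed target sequence.

```python
import numpy as np

K, N = 3, 4  # top-K retrieved documents, N target tokens (toy sizes)
rng = np.random.default_rng(0)

doc_probs = rng.dirichlet(np.ones(K))        # stand-in for p_eta(z | x) over the top-K docs
token_probs = rng.uniform(0.1, 0.9, (K, N))  # stand-in for p_theta(y_i | x, z, y_{1:i-1})

# RAG-Sequence: marginalize over documents once, at the sequence level.
# p(y|x) ~= sum_z p(z|x) * prod_i p(y_i | x, z, y_{1:i-1})
p_rag_sequence = np.sum(doc_probs * np.prod(token_probs, axis=1))

# RAG-Token: marginalize over documents at every token, then multiply over tokens.
# p(y|x) ~= prod_i sum_z p(z|x) * p(y_i | x, z, y_{1:i-1})
p_rag_token = np.prod(doc_probs @ token_probs)

print(f"RAG-Sequence p(y|x) ~= {p_rag_sequence:.4f}")
print(f"RAG-Token    p(y|x) ~= {p_rag_token:.4f}")
```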
Retriever and Generator
Retriever $p_\eta(z\mid x)$ is based on DPR, which follows a bi-encoder architecture (a toy scoring sketch follows the bullets below):
\[p_\eta(z\mid x)\propto \exp\big(\mathbf{d}(z)^\top \mathbf{q}(x)\big) \\
\mathbf{d}(z)=\mathrm{BERT}_d(z), \ \mathbf{q}(x)=\mathrm{BERT}_q(x)\]
- $\mathbf{d}(z)$: a dense representation of a document produced by a document encoder based on $\rm BERT_{BASE}$.
- $\mathbf{q}(x)$: a query representation produced by a query encoder based on $\rm BERT_{BASE}$.
- Maximum inner product search (MIPS): calculates the top-k $p_\eta(\cdot\mid x)$ approximately in sub-linear time.
- Non-parametric memory: the document index. The retriever is trained to retrieve documents containing answers to TriviaQA questions and Natural Questions.
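A toy sketch of this bi-encoder scoring (my own illustration; random vectors stand in for the $\rm BERT_d$ and $\rm BERT_q$ outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_docs, k = 768, 1000, 5               # toy sizes; 768 matches BERT_BASE's hidden size

doc_vecs = rng.normal(size=(n_docs, d_model))   # stand-ins for d(z) = BERT_d(z)
query_vec = rng.normal(size=d_model)            # stand-in for q(x) = BERT_q(x)

# p_eta(z|x) is proportional to exp(d(z)^T q(x)): score every document by inner product,
# keep the top-k (exact search here; MIPS libraries approximate this in sub-linear time).
scores = doc_vecs @ query_vec
top_k_ids = np.argsort(-scores)[:k]
top_k_probs = np.exp(scores[top_k_ids] - scores[top_k_ids].max())
top_k_probs /= top_k_probs.sum()                # renormalize over the retrieved top-k only

print(top_k_ids, np.round(top_k_probs, 3))
```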
Generator $p_\theta(y_i\mid x,z,y_{1:i-1})$ can be any encoder-decoder model, based on BART in the paper.
- $\rm BART_{large}$ is used: a pre-trained seq2seq transformer with 400M parameters, pre-trained using a denoising objective with various noising functions.
- The input $x$ and the retrieved document $z$ are concatenated and then fed into the $\rm BART$ model to generate the output.
- Parametric memory: $\rm BART$ generator parameters $\theta$.
Training
The retriever and generator are trained jointly, without direct supervision on which document should be retrieved (a rough sketch of the trainable-parameter setup follows this list).
- Objective: Minimize the negative marginal log-likelihood of each target with a corpus of input/output pairs $(x_j, y_j)$, $\sum_j-\log(p(y_j\mid x_j))$.
- Fine-tuning only the query encoder $\rm BERT_q$ and the generator $\rm BART$ during training.
- Updating the document encoder $\rm BERT_d$ is costly and not particularly effective:
- It requires periodically updating the document index (as in REALM).
- It is not necessary for strong performance.
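A rough PyTorch-style sketch of this parameter setup (the modules below are small placeholders, not the actual BERT/BART implementations):

```python
import torch

# Placeholder modules standing in for BERT_q, BERT_d, and BART.
query_encoder = torch.nn.Linear(768, 768)   # stand-in for BERT_q: fine-tuned
doc_encoder = torch.nn.Linear(768, 768)     # stand-in for BERT_d: kept frozen
generator = torch.nn.Linear(768, 50265)     # stand-in for BART: fine-tuned

# Freezing the document encoder means the document index never changes,
# which avoids the costly periodic re-indexing used by REALM.
for p in doc_encoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(
    list(query_encoder.parameters()) + list(generator.parameters()), lr=3e-5
)
```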
Decoding
For testing, RAG-Sequence and RAG-Token require different methods to approximate $\arg \max_y{p(y\mid x)}$.
RAG-Sequence model utilizes beam search for each document $z$. It can’t be solved with a single beam search, as the likelihood $p(y\mid x)$ does not break into a conventional per-token likelihood.
- Each hypothesis of $z$ is scored by $p_\theta(y_i\mid x,z,y_{1:i-1})$.
- Some hypotheses $y$ in the candidate set $Y$ may not have appeared in the beams of all documents.
- Thorough Decoding: To estimate the probability of $y$, (1) Run an additional forward pass for each $z$ where $y$ doesn’t appear in the beam, (2) multiply the generator probability with $p_\eta(z\mid x)$, and (3) sum the probabilities across beams.
- Fast Decoding: For efficient decoding, approximate $p_\theta(y\mid x,z_i) \approx 0$ wherever $y$ was not generated during beam search from $x, z_i$, avoiding additional forward passes once the candidate set $Y$ is generated (a toy sketch appears at the end of this section).
- For longer output sequences, $\left\vert Y \right\vert$ can become large, requiring many forward passes under Thorough Decoding.
RAG-Token model is a basic autoregressive seq2seq generator with transition probability:
\[p'_\theta(y_i\mid x,y_{1:i-1})=\sum_{z\in top-k(p(\cdot \mid x))}p_\eta(z_i \mid x)p_\theta(y_i\mid x,z_i,y_{1:i-1})\]
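To make the RAG-Sequence Fast Decoding approximation above concrete, here is a toy sketch (my own illustration with made-up numbers): each document contributes its beam hypotheses to the candidate set, and a candidate’s score sums $p_\eta(z\mid x)\,p_\theta(y\mid x,z)$ only over the documents whose beams actually produced it, treating the rest as 0.

```python
from collections import defaultdict

# Toy p_eta(z|x) for 3 retrieved documents
doc_probs = {"z1": 0.5, "z2": 0.3, "z3": 0.2}

# Toy beam outputs per document: hypothesis -> p_theta(y|x,z)
beams = {
    "z1": {"answer A": 0.6, "answer B": 0.2},
    "z2": {"answer A": 0.5, "answer C": 0.3},
    "z3": {"answer B": 0.4, "answer C": 0.4},
}

# Fast Decoding: no extra forward passes; if y never appeared in a document's beam,
# p_theta(y|x,z) is approximated as 0 for that document.
scores = defaultdict(float)
for z, hyps in beams.items():
    for y, p_y_given_z in hyps.items():
        scores[y] += doc_probs[z] * p_y_given_z

print(dict(scores))                            # aggregated p(y|x) per candidate
print("argmax:", max(scores, key=scores.get))  # "answer A" here
```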
Experiments
The experiments were conducted on several datasets to evaluate the model’s performance in knowledge-intensive NLP tasks.
- Wikipedia December 2018 dump was used as the non-parametric knowledge source.
- Wikipedia articles were split into 100-word chunks, totaling 21M documents.
- An embedding for each document was computed by the document encoder $\rm BERT_d$, and a single MIPS index was built with the Hierarchical Navigable Small World (HNSW) approximation for fast retrieval (a hedged indexing sketch follows this list).
- When retrieving the top $k$ documents for each query, $k\in \{5,10\}$ was considered for training, and $k$ was set using dev data at test time.
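A hedged sketch of how such an index could be built with faiss (my own illustration; the exact index parameters are not given in the summary above, so the values below are assumptions):

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 768                                                  # document embedding dimension (BERT_BASE)
doc_embs = np.random.rand(10_000, d).astype("float32")   # toy stand-ins for BERT_d outputs
query_embs = np.random.rand(4, d).astype("float32")      # toy stand-ins for BERT_q outputs

# HNSW graph index over inner-product scores: approximate MIPS in sub-linear time.
index = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)  # 32 = neighbors per node (assumed)
index.add(doc_embs)

k = 5
scores, doc_ids = index.search(query_embs, k)  # top-k approximate inner products per query
print(doc_ids.shape)                           # (4, 5)
```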
Tasks
- Open-domain Question Answering (QA): an important real-world application and common testbed for knowledge-intensive tasks.
- Text pairs $(x,y)$ are matched as questions and answers.
- RAG is trained to minimize the negative log-likelihood of answers.
- Closed-book QA is also compared: generating answers purely with parametric knowledge, without retrieval.
- Datasets: Natural Questions, TriviaQA, WebQuestions, CuratedTREC
- Abstractive Question Answering: tests natural language generation (NLG) ability with free-form, abstractive answers.
- Uses the MSMARCO NLG Task v2.1 with only the questions and answers (not the gold passages in the dataset), treating it as an open-domain abstractive QA task.
- Jeopardy Question Generation: evaluates the generation ability in a non-QA setting.
- Jeopardy: guessing an entity from a fact about that entity.
- e.g., “In 1986 Mexico scored as the first country to host this international sports competition twice.” where the answer is “The World Cup”.
- Jeopardy questions are precise and factual, making it a challenging, knowledge-intensive task to generate them conditioned on the answer entities.
- Fact Verification (FEVER): a retrieval problem coupled with a challenging entailment reasoning task.
- Requires classifying whether a text is supported or refuted by Wikipedia or whether there’s not enough information to decide.
- Provides an appropriate testbed for exploring a model’s ability to handle classification rather than generation.
- Two variants: 3-way classification (supports/refutes/not enough info) and 2-way classification (supports/refutes).
Results
The results demonstrated that both RAG-Sequence and RAG-Token models outperformed baseline models across various datasets and tasks.
Open-Domain QA
- RAG models significantly outperformed the baselines, showing higher EM and F1 scores.
- The RAG-Token model, in particular, performed well due to its ability to integrate detailed information from multiple documents.
Abstractive Question Answering
- RAG models achieved SOTA performance, even though many questions are unanswerable without the gold passages.
- RAG models hallucinated less and generated more factually correct and diverse text compared to BART (Table 3).
Jeopardy Question Generation
- Both RAG models outperformed BART on Q-BLEU-1 (Table 2).
- Human evaluators indicated that RAG-generated content was more factual in 42.7% of cases, demonstrating the effectiveness of RAG over the SOTA generation model (Table 4).
- RAG-Token model performed better than RAG-Sequence, combining content from several documents effectively (Fig 2).
- The generator’s parametric knowledge sufficed to complete the generation after initially referencing the document (Fig 2).
Fact Verification
- For 3-way classification, RAG achieved scores within 4.3% of SOTA models that are trained with intermediate retrieval supervision and domain-specific architectures.
- For 2-way classification, RAG achieved performance within 2.7% of the SOTA model, which was trained to classify claims as true or false given the gold evidence.
- The documents retrieved by RAG overlap significantly with FEVER’s gold evidence.
Additional Results
- Generation Diversity: When investigating generation diversity by calculating the ratio of distinct n-grams to total n-grams generated by different models, RAG models generated more diverse outputs compared to BART. RAG-Sequence produced slightly more diverse outputs than RAG-Token (Table 5).
- Retrieval Ablations: Freezing the retriever during training resulted in lower performance compared to the original RAG models. Replacing the retriever with a BM25 system showed that learned retrieval improved performance for all tasks (Table 6).
- Index hot-swapping: Demonstrated the advantage of non-parametric memory by using an index built from the December 2016 Wikipedia dump. RAG models still answered 70% of questions correctly, showing that knowledge can be updated simply by replacing the non-parametric memory.
- Effect of Retrieving more documents: Adjusting the number of retrieved documents at test time showed improved performance up to a certain point, demonstrating the benefits of retrieving more relevant documents (Fig 3).
15 Sep 2023
#bio
#ehr
#transformer
Tipirneni, Sindhu, and Chandan K. Reddy. “Self-supervised transformer for sparse and irregularly sampled multivariate clinical time-series.” ACM Transactions on Knowledge Discovery from Data (TKDD) 16.6 (2022): 1-17.
Paper Link
Points
Self-supervised Transformer for Time-Series (STraTS) model
- Using observation triplets as time-series components: avoids the problems faced by aggregation and imputation methods for sparse and sporadic multivariate time-series
- Continuous Value Embedding: encodes continuous time and variable values without the need for discretization
- Transformer-based model: learns contextual triplet embeddings
- Time-series forecasting as a proxy task: leverages unlabeled data to learn better generalized representations
Background
Problems
- Multivariate time-series data are frequently observed in critical care settings and are typically characterized by sparsity (missing information) and irregular time intervals.
- Existing approaches, such as aggregation or imputation of values, suppress the fine-grained information and add undesirable noise/overhead into the model.
- The problem of limited availability of labeled data is easily observed in healthcare applications.
The clinical domain portrays a unique set of challenges:
- Missingness and Sparsity: Not all the variables are observed for every patient. Also, the time-series matrices are very sparse.
- Irregular time intervals and Sporadicity: Not all clinical variables are measured at regular time intervals. The measurements may occur sporadically in time, depending on the patient’s condition.
- Limited labeled data: expensive and even more limited for specific tasks.
Existing methods
- Aggregation: could suppress important fine-grained information
- Imputation/Interpolation: not reasonable, as it does not take into account the domain knowledge about each variable
Method
Self-supervised Transformer for Time-Series (STraTS)

Embeddings
Triplet Embeddings = Feature embedding + Value embedding + Time embedding
\[T=\{(t_i, j_i, u_i)\}^n_{i=1}, \qquad e_i=e_i^f+e_i^v+e_i^t\]
Continuous Value Embedding (CVE)
Used for the continuous values of both variable values and times.
A one-to-many feed-forward network (a small embedding sketch follows):
\[\mathrm{FFN}(x) = U\tanh(Wx+b)\]
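A minimal PyTorch sketch of how these triplet embeddings could be assembled (my own reading of the formulas above; the embedding dimension, the CVE hidden size, and the feature vocabulary size are assumptions):

```python
import torch
import torch.nn as nn

class ContinuousValueEmbedding(nn.Module):
    """Embed a scalar (a time stamp or a variable value) into a d-dimensional vector:
    FFN(x) = U tanh(Wx + b), with no discretization of the continuous input."""
    def __init__(self, d: int = 64):
        super().__init__()
        self.w = nn.Linear(1, d)              # W x + b (scalar in, d hidden units; width assumed)
        self.u = nn.Linear(d, d, bias=False)  # U

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.u(torch.tanh(self.w(x.unsqueeze(-1))))

cve_time, cve_value = ContinuousValueEmbedding(), ContinuousValueEmbedding()
feature_emb = nn.Embedding(100, 64)           # lookup table for the variable (feature) id

t = torch.tensor([1.5, 3.0])                  # observation times
j = torch.tensor([7, 12])                     # variable ids
u = torch.tensor([0.8, 110.0])                # observed values
e = feature_emb(j) + cve_value(u) + cve_time(t)  # triplet embedding e_i = e^f + e^v + e^t
print(e.shape)                                # torch.Size([2, 64])
```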
Demographics Embedding
The prediction models performed better when demographics were processed separately.
\[e^d = \tanh\big(W^d_2\tanh(W^d_1 d + b^d_1) + b^d_2\big) \in \mathbb{R}^d\]
where the hidden layer has a dimension of $2d$.
Self-Supervision

Pre-training Tasks: Both masking and forecasting as pretext tasks for providing self-supervision
Forecasting improved the results on the target tasks.
The loss is:
\[L_{ss}=\frac{1}{|N'|}\sum_{k=1}^{N'}\sum_{j=1}^{|F|}m_j^k\Big(\tilde{z}_j^k-z_j^k\Big)^2\]
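A rough sketch of this masked forecasting loss (my own reading of the formula; `mask` marks which of the $|F|$ forecast variables were actually observed for each of the $N'$ samples):

```python
import torch

def forecast_loss(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Masked MSE over N' samples and |F| variables: only variables that were
    actually observed in the forecast window (mask = 1) contribute to the loss."""
    n_samples = pred.shape[0]
    return (mask * (pred - target) ** 2).sum() / n_samples

# Toy example: 2 samples, 3 forecast variables; the 2nd variable of sample 0 is unobserved.
pred = torch.tensor([[0.5, 1.0, 2.0], [1.5, 0.0, 0.2]])
target = torch.tensor([[0.4, 9.9, 2.1], [1.0, 0.0, 0.0]])
mask = torch.tensor([[1.0, 0.0, 1.0], [1.0, 1.0, 1.0]])
print(forecast_loss(pred, target, mask))  # the unobserved entry is ignored
```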
Interpretability
I-STraTS: an interpretable version of STraTS
- The output can be expressed using a linear combination of components that are derived from individual features
Differences with STraTS
- Combines the initial triplet embeddings in the fusion self-attention module
- Directly uses the raw demographics vector as the demographics embedding
\[\tilde{y}=\mathrm{sigmoid}\Big(\sum_{j=1}^{D}{\mathbf{w}_o[j]\,d[j]}+\sum_{i=1}^{n}\sum_{j=1}^{d}\alpha_i\,\mathbf{w}_o[j+D]\,\mathbf{e}_i[j]+b_o\Big)\]
Experiments
Target Task: Prediction of in-hospital mortality
Datasets: 2 EHR datasets; MIMIC-III and PhysioNet Challenge 2012
- MIMIC-III: 46,000 patients
- PhysioNet-2012: 11,988 patients
Baselines: Gated Recurrent Unit (GRU), Temporal Convolutional Network (TCN), Simply Attend and Diagnose (SAnD), GRU with trainable Decays (GRU-D), Interpolation-prediction Network (InterpNet), Set Functions for Time Series (SeFT)
- Used 2 dense layers for demographics encoding
- Concatenated it to the time-series representation before the last dense layer
Metrics
- ROC-AUC: Area under ROC curve
- PR-AUC: Area under precision-recall curve
- min(Re, Pr): the maximum of min(recall, precision) across all thresholds

- Trained each model using 10 different random samplings of 50% labeled data from the train and validation sets
- STraTS uses the entire labeled data and additional unlabeled data if available
- STraTS achieves the best performance
- GRU showed better performance than interpolation-based models (GRU-D, InterpNet) on the MIMIC-III dataset, which was not expected
Generalizability test of models
Lower proportions of labeled data are common in real-world settings when there are many right-censored samples.

- STraTS has an advantage compared to others in scarce labeled data settings, which can be attributed to self-supervision
Ablation Study
Compared STraTS and I-STraTS with and without self-supervision: ‘ss+’ and ‘ss-’ indicate each case

- I-STraTS showed slightly worse performance, as its representations are constrained
- Adding self-supervision improves performance of both models
- I-STraTS(ss+) outperforms STraTS(ss-): self-supervision can compensate for the performance drop that comes with introducing interpretability
Interpretability
How I-STraTS explains its predictions
A case study: an 85-year-old female patient from MIMIC-III
- expired on the 6th day after ICU admission
- had 380 measurements corresponding to 58 time-series variables
The model predicts the probability of her in-hospital mortality as 0.94 using only the data collected on the first day

- Average contribution score: the average score, reported along with the range of values for variables with multiple observations or the single value for variables with only one observation
- The top 5 variables are the most important factors the model observed in predicting that she is at high risk of mortality
→ Can be helpful to identify high-risk patients, to understand the contributing factors, and to make better diagnoses, especially at the early stages of treatment
29 Aug 2023
#bio
#brainImaging
#demensia
#atn
A brain imaging paper study on Alzheimer’s dementia, by an AI major
These are notes from reading the following papers to understand the mechanisms of Alzheimer’s Disease (AD) related to Amyloid Beta (A), Tau (T), and Neurodegeneration (N). Having little background knowledge, I read them while looking up visual materials and dictionaries. Each pdf is the paper file containing the images I looked up and my notes.
Ittner, Lars M., and Jürgen Götz. “Amyloid-β and tau—a toxic pas de deux in Alzheimer’s disease.” Nature Reviews Neuroscience 12.2 (2011): 67-72. link pdf
Vogel, Jacob W., et al. “Four distinct trajectories of tau deposition identified in Alzheimer’s disease.” Nature medicine 27.5 (2021): 871-881. link pdf
Lee, Wha Jin, et al. “Regional Aβ-tau interactions promote onset and acceleration of Alzheimer’s disease tau spreading.” Neuron 110.12 (2022): 1932-1943. link pdf
Amyloid Beta (A) is one of the peptides produced when the Amyloid Precursor Protein (APP), which is made by neurons, is cleaved into four parts by proteases. It is known to be present near neurons and to cause dysfunction. A deposition begins 10-20 years before the onset of Alzheimer’s Disease (AD).
- A forms dimers, oligomers, and fibrils, and eventually plaques. It is not clear at which of these forms A begins to be toxic. Anti-amyloid dementia drugs aim to reduce these plaques and to prevent their growth and formation.
- The toxicity of A acts mainly on the postsynaptic compartment, i.e., the dendrites (somatodendritic region), and can affect neurons indirectly through the cell membrane depending on the properties of specific receptors. A representative such receptor is NMDAR.
Tau (T) is a protein that binds to microtubules in neurons. It is located mainly in axons, where it stabilizes microtubules and regulates axonal transport.
A small amount is also present in the dendrites of neurons in the normal state.
T is hyperphosphorylated by A (hyperphosphorylated Tau), and hyperphosphorylated T forms Neurofibrillary Tangles (NFTs).
- Hyperphosphorylation of T interferes with microtubule assembly, impairing neuronal function.
- NFTs are frequently observed in the somatodendritic region. As T levels rise, more T is observed in the dendrites.
In the dendrites, T interacts with various proteins located there, ultimately making neurons vulnerable to the toxicity of A.
- When T is phosphorylated, it interacts strongly with the tyrosine protein kinase FYN. As hyperphosphorylated T increases in the dendrites, FYN also increases in the soma.
- FYN phosphorylates NMDAR, and phosphorylated NMDAR interacts with Postsynaptic Density Protein 95 (PSD95).
- As a result, NMDAR excitotoxicity occurs. This receptor excitotoxicity makes neurons more sensitive to the toxicity of A.
Consequently, A and T act synergistically in weakening neurons: A promotes the hyperphosphorylation of T, and hyperphosphorylated T makes neurons vulnerable to the toxicity of A.
In this process, A and T damage different parts of the cell (Complex I and Complex IV, respectively), impairing mitochondrial respiration and ultimately causing Neurodegeneration (N).
Therefore, the deposition of A and the propagation of T are key factors in AD.
The propagation pattern of T has been systematized in the Braak staging system.
- Transentorhinal cortex → medial and basal temporal lobe → neocortical associative regions → unimodal sensory and motor cortex
However, propagation patterns that do not fit this system have also been observed. T propagation can be classified into four subtypes based on disease progression and the spatiotemporal characteristics of the affected brain regions.
- S1 limbic (Braak system), S2 MTL, S3 posterior, S4 Lateral Temporal
Deposited A influences the propagation of T.
- A is deposited in the heteromodal association cortex, and T propagation starts in the entorhinal cortex (EC) and gradually spreads across the brain. ( ← Braak system; S1 type ? )
- Remote interaction: while A and T are not yet in the same region, A first affects T in the EC region through connected neurons. Under the influence of A, T gradually spreads to the surrounding regions.
- Local interaction: T propagates to neurons that are in direct contact with A, and this encounter accelerates T propagation (acceleration). The corresponding brain region is the Inferior Temporal Gyrus (ITG), a propagation hub.
- Once the acceleration of T propagation proceeds, A-T interactions occur throughout the brain, and it becomes difficult to prevent the worsening of N and AD.
Looking at A and T PET data together with MRI data over the course of disease progression,
- A deposition will become increasingly severe throughout the brain,
- T will be observed starting in a specific brain region and gradually spreading, and
- after the super-propagation of T is observed, the degree of overall brain atrophy (N) on MRI will become severe.
20 May 2023
#nlp
A deep learning model architecture for analyzing sequence data, rooted in Rumelhart et al., 1986. Unlike deep neural networks (DNNs), it is designed so that information from previous time steps can be used at the current time step through connections between hidden state nodes.
Information from the previous node $s_{t-1}$ flows into the current node $s_t$. Together with the current input $x_t$, this information is used to compute the value that will be passed on to the next node $s_{t+1}$. This process is carried out recurrently.
Weight sharing
The weights $U$, $W$, and $V$ are identical at every time step. As a result,
- the number of weights required for training is reduced;
- the model is flexible with respect to sequence length: a single model can be applied to sequences of different lengths;
- next-token generation is possible, because the same weights are applied repeatedly to sequences of different lengths.
RNN computation
As shown in the figure above, the hidden state $s^t$ and the output $o^t$ are computed as follows.
\[\begin{align*}
s^t&=\tau(Ws^{t-1}+Ux^t) \\
o^t&=\mathrm{softmax}(Vs^t)
\end{align*}\]
When the numbers of nodes are $D$, $J$, and $K$, respectively, the dimensions of the variables are as follows.
\[x\in\mathbb{R}^D,\ s\in\mathbb{R}^J,\ o\in\mathbb{R}^K,\ U\in\mathbb{R}^{J\times D},\ W\in\mathbb{R}^{J\times J},\ V\in\mathbb{R}^{K\times J}\]
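A minimal NumPy sketch of this recurrent computation (toy dimensions and random weights):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

D, J, K = 5, 8, 3                        # input, hidden, and output dimensions (toy sizes)
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(J, D))   # input-to-hidden weights
W = rng.normal(scale=0.1, size=(J, J))   # hidden-to-hidden weights (shared across time)
V = rng.normal(scale=0.1, size=(K, J))   # hidden-to-output weights

xs = rng.normal(size=(10, D))            # a toy sequence of 10 input vectors
s = np.zeros(J)                          # initial hidden state
for x in xs:                             # the same U, W, V are reused at every time step
    s = np.tanh(W @ s + U @ x)           # s^t = tanh(W s^{t-1} + U x^t)
    o = softmax(V @ s)                   # o^t = softmax(V s^t)
print(o.shape)                           # (3,)
```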
Long-term dependency problem
The hidden state computation can be written as:
\[s^t=\tau(Ux^t+W\tau(Ux^{t-1}+Ws^{t-2}))\]
We can see that $s$ is nested inside the tanh activation function $\tau$. As a result,
- during the forward pass, information entered at earlier steps is gradually lost: the output of tanh satisfies $\tau(\cdot)\in(-1,1)$, so nesting tanh operations amounts to repeatedly multiplying by values whose magnitude is less than 1, and the values multiplied in earlier become smaller and smaller.
- during back-propagation, gradient vanishing or gradient explosion can occur: the tanh function can push gradients close to 0 or let them grow too large; small gradients get smaller and large gradients get larger (a small numerical illustration follows below).
*Gradient vanishing: the problem in which the gradient propagated during back-propagation shrinks as it passes through more layers.
*Gradient explosion: the problem in which the gradient is amplified beyond its actual value and is therefore reflected excessively in the parameter update.
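A tiny numerical illustration of the vanishing case, using a scalar RNN with toy values: the gradient $\partial s^T/\partial s^0$ is a product of per-step factors $\tau'(a_t)\,w$, each at most $\lvert w\rvert$ in magnitude (since $\tau'(\cdot)\le 1$), so it shrinks rapidly as the sequence gets longer.

```python
import numpy as np

# Scalar RNN: s_t = tanh(w * s_{t-1} + u * x_t), so
# d s_T / d s_0 = prod_t tanh'(a_t) * w, with tanh'(a) = 1 - tanh(a)^2 <= 1.
rng = np.random.default_rng(0)
w, u = 0.9, 1.0
s, grad = 0.0, 1.0
for t in range(50):
    a = w * s + u * rng.normal()
    s = np.tanh(a)
    grad *= (1 - np.tanh(a) ** 2) * w    # per-step factor, at most |w| in magnitude
    if t in (0, 9, 24, 49):
        print(f"step {t + 1:2d}: |d s_t / d s_0| ~= {abs(grad):.2e}")
```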
Various RNN architectures
RNNs can be configured in various ways depending on the input and output formats.