
[Paper] TabTransformer: Tabular data modeling using contextual embeddings (2020)

Huang, Xin, et al. “TabTransformer: Tabular data modeling using contextual embeddings.” arXiv preprint arXiv:2012.06678 (2020).

Paper Link: https://arxiv.org/abs/2012.06678


Points

  • TabTransformer: a tabular data model that applies Transformer layers to learn contextual embeddings of categorical features.
  • Trained in two phases: pre-training on unlabeled data followed by fine-tuning on labeled data, yielding robust feature representations.
  • Achieves state-of-the-art performance among deep tabular models in both supervised and semi-supervised learning.
  • Handles missing and noisy data robustly, ensuring reliable performance.


Background

The current state-of-the-art (SOTA) models for tabular data are primarily tree-based ensemble methods, notably gradient boosted decision trees (GBDT). However, these models exhibit several limitations compared to deep learning models:

  • Not suitable for continual learning from streaming data.
  • Ineffective for end-to-end learning on multi-modal tabular data, e.g., incorporating image or text features.
  • Not suitable for semi-supervised learning.

On the other hand, while multi-layer perceptrons (MLPs) offer the potential for end-to-end learning with image or text encoders, they are constrained by several drawbacks:

  • Lack of interpretability.
  • Vulnerability to missing and noisy data.
  • Limited performance in semi-supervised learning scenarios.
  • Inability to match the performance of tree-based models.


Method

[Figure: TabTransformer architecture overview]

  • The Transformer layers receive only categorical inputs $x_{cat}$.
  • Continuous inputs $x_{cont}$ are concatenated with the Transformer outputs (the contextual embeddings of the categorical inputs).
  • During the pre-training phase, the Transformer layers are trained on two different tasks using unlabeled data:
    • Only the categorical inputs are utilized for pre-training, with the exclusion of the continuous inputs.

    [Code: pre-training step (a combined sketch of both phases follows this list)]

  • The pre-trained model is fine-tuned alongside the MLP head, utilizing labeled data to predict a target $y$.
  • Continuous values are incorporated during the fine-tuning phase by concatenating them with the categorical values.

    [Code: fine-tuning step (see the sketch below)]
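
In place of the original code snippets, here is a minimal PyTorch sketch of the two phases under the interface described above. The module names, dimensions, and the pre-training head are illustrative stand-ins, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Illustrative sizes: embedding dim d, m categorical columns, c continuous features.
d, m, c, n_classes = 32, 5, 3, 2
encoder = nn.TransformerEncoder(          # stands in for the Transformer layers
    nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True), num_layers=6)
pretrain_head = nn.Linear(d, 10)          # per-position predictions for the pre-training task
l = d * m + c
mlp_head = nn.Sequential(nn.Linear(l, 4 * l), nn.ReLU(), nn.Linear(4 * l, n_classes))

# Phase 1 (pre-training): only the categorical embeddings are used; no labels, no x_cont.
emb_cat = torch.randn(8, m, d)            # stand-in for the column embeddings E_phi(x_cat)
contextual = encoder(emb_cat)             # contextual embeddings {h_1, ..., h_m}
mask_logits = pretrain_head(contextual)   # scored against the masked/replaced targets

# Phase 2 (fine-tuning): concatenate x_cont with the contextual embeddings and
# train the MLP head (together with the pre-trained encoder) on labeled data.
x_cont = torch.randn(8, c)
features = torch.cat([encoder(emb_cat).flatten(1), x_cont], dim=-1)
logits = mlp_head(features)               # fed to a cross-entropy loss against y
```

The pre-training head is only used in phase 1; in phase 2 the MLP head is trained from scratch alongside the pre-trained Transformer layers.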


Model Architecture

[Figure 1: TabTransformer architecture]

  • Each instance $x\equiv \lbrace x_{cat}, x_{cont}\rbrace$ is paired with its corresponding label $y$: $(x, y)$.
  • $x_{cat} \equiv \lbrace x_1, x_2, ..., x_m\rbrace$ represents the categorical features, with each $x_i$, $i \in \lbrace 1, ..., m\rbrace$, being a single categorical feature.
  • $x_{cat}$ undergoes transformation into column embedding $E_\phi$:

    \[E_\phi(x_{cat}) \equiv \lbrace e_{\phi_1}(x_1), ..., e_{\phi_m}(x_m) \rbrace, \ e_{\phi_i}(x_i) \in \mathbb{R}^d\]
  • The embeddings are fed into the multiple Transformer layers $f_\theta$, producing contextual embeddings:

    \[\{h_1, ..., h_m\}=f_\theta(E_\phi(x_{cat})), \ h\in \mathbb{R}^d\]
  • Contextual embeddings of $x_{cat}$ are concatenated with $x_{cont} \in \mathbb{R}^c$ to form a vector of dimension $(d\times m+c)$.
  • The vector is passed through an MLP layer $g_\psi$ and a cross-entropy loss $H$ is computed between the predicted output and the target $y$:

    \[L(x, y) \equiv H(g_\psi(f_\theta(E_\phi(x_{cat})), x_{cont}), y)\]
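
The pipeline above can be summarized in a short PyTorch sketch. The class and argument names are illustrative, and the hyperparameters follow the setup reported later (hidden dimension 32, 6 Transformer layers, 8 heads, MLP sizes $\lbrace 4l, 2l\rbrace$); this is a sketch of the described architecture, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TabTransformerSketch(nn.Module):
    """Illustrative forward pass: E_phi -> f_theta -> concat with x_cont -> g_psi."""
    def __init__(self, cardinalities, n_cont, d=32, n_layers=6, n_heads=8, n_classes=2):
        super().__init__()
        # One embedding table per categorical column; the extra row is for missing values.
        self.embeds = nn.ModuleList([nn.Embedding(card + 1, d) for card in cardinalities])
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)  # f_theta
        l = d * len(cardinalities) + n_cont
        self.mlp = nn.Sequential(nn.Linear(l, 4 * l), nn.ReLU(),
                                 nn.Linear(4 * l, 2 * l), nn.ReLU(),
                                 nn.Linear(2 * l, n_classes))                  # g_psi

    def forward(self, x_cat, x_cont):
        e = torch.stack([emb(x_cat[:, i]) for i, emb in enumerate(self.embeds)], dim=1)
        h = self.transformer(e)                        # contextual embeddings h_1, ..., h_m
        z = torch.cat([h.flatten(1), x_cont], dim=-1)  # (d * m + c)-dimensional vector
        return self.mlp(z)

model = TabTransformerSketch(cardinalities=[3, 5, 7], n_cont=4)
x_cat = torch.randint(0, 3, (16, 3))                   # toy batch of class indices per column
x_cont = torch.randn(16, 4)
y = torch.randint(0, 2, (16,))
loss = F.cross_entropy(model(x_cat, x_cont), y)        # L(x, y) = H(g_psi(...), y)
```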


Column Embedding

[Figure: column embedding layout]

  • Each categorical feature $x_i$ has its own embedding lookup table $e_{\phi_i}(.)$.
  • For the $i$th feature with $d_i$ classes, the embedding table $e_{\phi_i}(.)$ contains $(d_i+1)$ embeddings. The additional $(d_i+1)$th embedding is reserved for representing missing (masked) values.
  • Each embedding $e_{\phi_i}(j)$ is represented as $[c_{\phi_i}, w_{\phi_{ij}}]$, where:
    • $c_{\phi_i}$ helps distinguish the classes in column $i$ from those in the other columns.
    • $w_{\phi_{ij}}$ distinguishes the class of the feature $j$ within the $i$th column from the other classes within the same column.
  • *The embedding dimension $d$ appears to be set equal to the Transformer hidden dimension, judging from the released code.

    [Code: column embedding (see the sketch below)]
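
A minimal sketch of the split column embedding $e_{\phi_i}(j) = [c_{\phi_i}, w_{\phi_{ij}}]$ described above. The size of the column-identifier part (here $d/8$) and the class name are illustrative choices, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class ColumnEmbedding(nn.Module):
    """Per-column identifier c_phi_i (shared by all classes of column i) concatenated
    with a per-class embedding w_phi_ij; the last row of each table stands for missing."""
    def __init__(self, cardinalities, d=32):
        super().__init__()
        d_col = d // 8                            # dims for the column identifier (assumed split)
        self.col_id = nn.Parameter(torch.randn(len(cardinalities), d_col))
        self.class_emb = nn.ModuleList(
            [nn.Embedding(card + 1, d - d_col) for card in cardinalities])

    def forward(self, x_cat):                     # x_cat: (batch, m) of class indices
        cols = []
        for i, emb in enumerate(self.class_emb):
            c_i = self.col_id[i].expand(x_cat.size(0), -1)  # same for every class of column i
            w_ij = emb(x_cat[:, i])                          # distinguishes classes within column i
            cols.append(torch.cat([c_i, w_ij], dim=-1))
        return torch.stack(cols, dim=1)           # (batch, m, d), i.e. E_phi(x_cat)

emb = ColumnEmbedding(cardinalities=[3, 5], d=32)
out = emb(torch.tensor([[0, 4], [2, 5]]))         # index 5 is the missing slot of column 2
print(out.shape)                                  # torch.Size([2, 2, 32])
```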


Pre-training

The Transformer layers are trained using inputs consisting of the categorical values $x_{cat}=\lbrace x_1, x_2, ..., x_m\rbrace$ on two pre-training tasks (both corruptions are sketched after the list):

  1. Masked language modeling (MLM)
    • Randomly masks $k\%$ of the input features, where $k$ is set to 30 in the experiments.
    • Minimizes the cross-entropy loss of a multi-class classifier $g_\psi$ that predicts the original classes of the masked features.
  2. Replaced token detection (RTD)
    • Replaces the original value of a feature with a random value drawn from that feature's column.
    • Minimizes the loss of a binary classifier predicting whether the feature has been replaced.
    • Each column has its own embedding lookup table, necessitating the definition of a separate binary classifier for each column.
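
A sketch of how the two corruptions might be constructed. The function names are ours; masking reuses the per-column "missing" index mentioned in the Column Embedding section, and for brevity the RTD sketch does not exclude replacements that coincide with the original value.

```python
import torch

def mask_for_mlm(x_cat, cardinalities, k=0.30):
    """MLM: mask ~k of the entries by pointing them at the extra 'missing' row;
    a multi-class head is then trained to recover the original class indices."""
    mask = torch.rand(x_cat.shape) < k
    missing_idx = torch.tensor(cardinalities).expand_as(x_cat)   # per-column missing index
    x_masked = torch.where(mask, missing_idx, x_cat)
    return x_masked, mask               # cross-entropy is computed only on masked positions

def corrupt_for_rtd(x_cat, cardinalities, k=0.30):
    """RTD: replace ~k of the entries with random classes drawn from the same column;
    a per-column binary head then predicts which entries were replaced."""
    replaced = torch.rand(x_cat.shape) < k
    random_vals = torch.stack(
        [torch.randint(0, card, (x_cat.size(0),)) for card in cardinalities], dim=1)
    x_corrupt = torch.where(replaced, random_vals, x_cat)
    return x_corrupt, replaced.float()  # binary targets for the per-column classifiers

x_cat = torch.randint(0, 3, (4, 3))     # toy batch with three categorical columns
print(mask_for_mlm(x_cat, cardinalities=[3, 3, 3])[0])
```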


Experiments

Settings

Data

  • Models were evaluated on 15 publicly available binary classification datasets sourced from the UCI repository, the AutoML Challenge, and Kaggle.
  • Each dataset was divided into 5 cross-validation splits.
  • Training:Validation:Testing proportion was set to 65:15:20 (%).
  • The number of categorical features ranged from 2 to 136.
  • Semi-supervised and supervised experiments
    • Semi-supervised: Training data consisted of $p$ labeled data points + the remaining unlabeled data, with $p\in \lbrace 50, 200, 500\rbrace$ for 3 different scenarios.
    • Supervised: Fully labeled training data was used.

Setup

  • Hidden dimension: 32
  • Number of Transformer layers: 6
  • Number of attention heads: 8
  • MLP layer architecture: $\lbrace 4\times l, \ 2\times l \rbrace$ (where $l$ represents the size of its input).
  • Hyperparameter optimization (HPO) was conducted with 20 rounds for each cross-validation split.
  • Metrics: Area under the curve (AUC).
  • Pre-training was exclusively applied in the semi-supervised scenario.
    • It was not found to be significantly beneficial when the entire dataset was labeled.
    • Its benefits were more apparent when there were many unlabeled examples and only a few labeled ones, since pre-training provided representations of the data that could not be learned from the labeled examples alone.

Baseline model: An MLP without the Transformer layers was employed to isolate the contribution of the Transformers.
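
For reference, the setup above corresponds roughly to the following configuration; the key names are ours, not the paper's.

```python
# Configuration sketch mirroring the setup above (key names are illustrative).
config = {
    "hidden_dim": 32,                              # embedding / Transformer dimension
    "num_transformer_layers": 6,
    "num_attention_heads": 8,
    "mlp_hidden_sizes": lambda l: [4 * l, 2 * l],  # as a function of the MLP input size l
    "hpo_rounds_per_split": 20,
    "metric": "auc",
    "cv_splits": 5,
    "split_ratio": (0.65, 0.15, 0.20),             # train / validation / test
}
```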


The effectiveness of the Transformer Layers

  1. Performance comparison

    [Table 1: AUC comparison between TabTransformer and the baseline MLP]

    • Conducted in a supervised learning scenario, comparing TabTransformer to MLP.
    • TabTransformer outperforms the baseline MLP on 14 datasets, achieving an average 1.0% gain in AUC.
  2. t-SNE visualization of contextual embeddings

    [Figure 2: t-SNE of contextual embeddings from TabTransformer and MLP]

    • Each marker in the plot represents an average of 2D points over the test data points for a certain class.
    • In the t-SNE plot of the last layer of TabTransformer (Left), semantically similar classes are closely grouped, forming clusters in the embedding space.
    • In the embeddings before the Transformer layers (Center), features with different characteristics only begin to separate.
    • The embeddings of MLP (Right) do not reveal any discernible pattern.
  3. Prediction performance of linear models using the embeddings from different Transformer layers

    [Figure: prediction performance of linear models using embeddings from different Transformer layers]

    • Logistic regression models are employed to evaluate the quality of learned embeddings.
    • Each model predicts $y$ using embedding features along with continuous values.
    • Metrics: Cross-validation score in AUC on the test data.
    • Normalization: Each prediction score is normalized by the best score from an end-to-end trained TabTransformer for the corresponding dataset.
    • Features: The contextual embeddings are aggregated by average/max pooling over the columns rather than concatenated (sketched below).
    • The effectiveness of the embeddings improves as the Transformer layers progress.
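
A sketch of this linear-probe evaluation on synthetic stand-in data. In practice `layer_embeddings` would be the contextual embeddings extracted from a chosen Transformer layer, and each score would additionally be normalized by the end-to-end TabTransformer score; both are omitted here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, m, d, c = 500, 5, 32, 3
layer_embeddings = rng.normal(size=(n, m, d))   # stand-in for {h_1, ..., h_m} from one layer
x_cont = rng.normal(size=(n, c))
y = rng.integers(0, 2, size=n)

# Pool the per-column contextual embeddings (max pooling here) instead of concatenating,
# append the continuous features, and fit a logistic-regression probe.
features = np.concatenate([layer_embeddings.max(axis=1), x_cont], axis=1)
probe = LogisticRegression(max_iter=1000)
auc = cross_val_score(probe, features, y, scoring="roc_auc", cv=5).mean()
print(f"probe AUC: {auc:.3f}")
```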


The robustness of TabTransformer

The robustness of TabTransformer was evaluated by assessing its performance on datasets containing noisy data and data with missing values.

[Figures 4 & 5: performance on noisy data and data with missing values]

  1. Noisy data
    • Method: Values were replaced with randomly generated ones from corresponding columns, introducing noise into datasets.
    • Findings: As the noise increased, TabTransformer demonstrated significantly superior performance compared to the MLP (see fig. 4).
    • The contextual property of embeddings likely contributes to TabTransformer’s robustness in noisy environments.
  2. Data with missing values
    • Method: Some values were artificially made missing, and the models were evaluated on these modified datasets.
      • Missing values were embedded using the average of the learned embeddings over all classes in the corresponding column (see the sketch below).
    • Findings: TabTransformer exhibited better stability than MLP in handling missing values (see fig. 5).
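
A sketch of the missing-value handling above (the noise experiment reuses the same random-replacement scheme as the RTD corruption sketched in the Pre-training section). The helper below is illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

def embed_with_missing(x_cat, tables, missing_mask):
    """Missing entries fall back to the average of the learned embeddings over
    all classes of the corresponding column; other entries use their own embedding."""
    cols = []
    for i, table in enumerate(tables):                      # one nn.Embedding per column
        e = table(x_cat[:, i])
        col_mean = table.weight.mean(dim=0, keepdim=True)   # average over all class embeddings
        e = torch.where(missing_mask[:, i:i + 1], col_mean.expand_as(e), e)
        cols.append(e)
    return torch.stack(cols, dim=1)                         # (batch, m, d)

tables = nn.ModuleList([nn.Embedding(card, 8) for card in [3, 5]])
x_cat = torch.tensor([[0, 1], [2, 4]])
missing = torch.tensor([[False, True], [False, False]])
print(embed_with_missing(x_cat, tables, missing).shape)     # torch.Size([2, 2, 8])
```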


Supervised learning

TabTransformer’s performance was compared against four categories of methods:

  • Logistic Regression and GBDT
  • MLP and sparse MLP
  • TabNet model
  • Variational Information Bottleneck (VIB) model

[Table 2: supervised learning performance comparison]

Findings:

  • TabTransformer demonstrated comparable performance with GBDT.
  • It significantly outperformed recent deep learning models designed for tabular data, including TabNet and VIB.


Semi-supervised learning

TabTransformer was evaluated under the semi-supervised learning scenario and compared against other semi-supervised models, including baseline models:

  • Entropy Regularization (ER)
  • Pseudo Labeling (PL) combined with MLP, TabTransformer, and GBDT
  • MLP (DAE): An unsupervised pre-training method designed for deep models on tabular data, specifically the swap noise Denoising AutoEncoder

[Tables 3 & 4: semi-supervised learning results]

Method:

  • Pre-trained models (TabTransformer-RTD/MLM and MLP): pre-trained on the unlabeled data and then fine-tuned on labeled data.
  • Semi-supervised learning methods (ER and PL): trained on the mix of labeled and unlabeled training data.

Findings:

  • TabTransformer-RTD/MLM outperformed all the other models.
  • TabTransformer (ER), TabTransformer (PL) and GBDT (PL) performed worse than the average of all the models.
  • TabTransformer-RTD consistently surpassed TabTransformer-MLM as the amount of unlabeled data decreased.
    • This could be attributed to the easier pre-training task of a binary classification compared to the multi-class classification of MLM.
  • With only 50 labeled data points, MLP (ER) and MLP (PL) outperformed the TabTransformer models.
    • This suggests that the proposed pre-training yields informative embeddings but does not allow the classifier weights themselves to be trained on unlabeled data.
  • Overall, TabTransformer models show promise in extracting useful information from unlabeled data to aid supervised training, and are particularly useful when the amount of unlabeled data is large.