
[Paper] TabTransformer: Tabular data modeling using contextual embeddings (2020)

Huang, Xin, et al. “TabTransformer: Tabular data modeling using contextual embeddings.” arXiv preprint arXiv:2012.06678 (2020).

Paper Link: https://arxiv.org/abs/2012.06678


Points

  • TabTransformer: a tabular data model that applies Transformer layers to learn contextual embeddings of categorical features.
  • Trained in two phases: pre-training on unlabeled data followed by fine-tuning on labeled data, yielding robust feature representations.
  • Achieves state-of-the-art performance among deep tabular models in both supervised and semi-supervised learning.
  • Handles missing and noisy data robustly, ensuring reliable performance.


Background

The current state-of-the-art (SOTA) models for tabular data are primarily tree-based ensemble methods, notably gradient boosted decision trees (GBDT). However, these models exhibit several limitations compared to deep learning models:

  • Not suitable for continual learning from streaming data.
  • Ineffective for end-to-end learning on multi-modal tabular data, e.g., incorporating image or text features.
  • Not suitable for semi-supervised learning.

On the other hand, while multi-layer perceptrons (MLPs) offer the potential for end-to-end learning with image or text encoders, they are constrained by several drawbacks:

  • Lack of interpretability.
  • Vulnerability to missing and noisy data.
  • Limited performance in semi-supervised learning scenarios.
  • Inability to match the performance of tree-based models.


Method

[Figure: TabTransformer architecture overview]

  • The Transformer layers receive only categorical inputs $x_{cat}$.
  • Continuous inputs $x_{cont}$ are concatenated with the Transformer outputs (the contextual embeddings of the categorical inputs).
  • During the pre-training phase, the Transformer layers are trained on two different tasks using unlabeled data:
    • Only the categorical inputs are utilized for pre-training, with the exclusion of the continuous inputs.

    [Code: pre-training step (a combined sketch of both phases follows this list)]

  • The pre-trained model is fine-tuned alongside the MLP head, utilizing labeled data to predict a target $y$.
  • Continuous values are incorporated during the fine-tuning phase by concatenating them with the categorical values.

    [Code: fine-tuning step (see the sketch below)]
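
In place of the original code snippets, here is a minimal PyTorch sketch of the two phases under the interface described above. The module names, dimensions, and the pre-training head are illustrative stand-ins, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Illustrative sizes: embedding dim d, m categorical columns, c continuous features.
d, m, c, n_classes = 32, 5, 3, 2
encoder = nn.TransformerEncoder(          # stands in for the Transformer layers
    nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True), num_layers=6)
pretrain_head = nn.Linear(d, 10)          # per-position predictions for the pre-training task
l = d * m + c
mlp_head = nn.Sequential(nn.Linear(l, 4 * l), nn.ReLU(), nn.Linear(4 * l, n_classes))

# Phase 1 (pre-training): only the categorical embeddings are used; no labels, no x_cont.
emb_cat = torch.randn(8, m, d)            # stand-in for the column embeddings E_phi(x_cat)
contextual = encoder(emb_cat)             # contextual embeddings {h_1, ..., h_m}
mask_logits = pretrain_head(contextual)   # scored against the masked/replaced targets

# Phase 2 (fine-tuning): concatenate x_cont with the contextual embeddings and
# train the MLP head (together with the pre-trained encoder) on labeled data.
x_cont = torch.randn(8, c)
features = torch.cat([encoder(emb_cat).flatten(1), x_cont], dim=-1)
logits = mlp_head(features)               # fed to a cross-entropy loss against y
```

The pre-training head is only used in phase 1; in phase 2 the MLP head is trained from scratch alongside the pre-trained Transformer layers.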


Model Architecture

[Figure 1: TabTransformer architecture]

  • Each instance $x\equiv \lbrace x_{cat}, x_{cont}\rbrace$ is paired with its corresponding label $y$: $(x, y)$.
  • $x_{cat} \equiv \lbrace x_1, x_2, ..., x_m\rbrace$ represents the categorical features, with each $x_i$, $i \in \lbrace 1, ..., m\rbrace$, being a single categorical feature.
  • $x_{cat}$ undergoes transformation into column embedding $E_\phi$:

    \[E_\phi(x_{cat}) \equiv \lbrace e_{\phi_1}(x_1), ..., e_{\phi_m}(x_m) \rbrace, \ e_{\phi_i}(x_i) \in \mathbb{R}^d\]
  • The embeddings are fed into the multiple Transformer layers $f_\theta$, producing contextual embeddings:

    \[\{h_1, ..., h_m\}=f_\theta(E_\phi(x_{cat})), \ h\in \mathbb{R}^d\]
  • Contextual embeddings of $x_{cat}$ are concatenated with $x_{cont} \in \mathbb{R}^c$ to form a vector of dimension $(d\times m+c)$.
  • The vector is passed through an MLP layer $g_\psi$ and a cross-entropy loss $H$ is computed between the predicted output and the target $y$:

    \[L(x, y) \equiv H(g_\psi(f_\theta(E_\phi(x_{cat})), x_{cont}), y)\]
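
The pipeline above can be summarized in a short PyTorch sketch. The class and argument names are illustrative, and the hyperparameters follow the setup reported later (hidden dimension 32, 6 Transformer layers, 8 heads, MLP sizes $\lbrace 4l, 2l\rbrace$); this is a sketch of the described architecture, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TabTransformerSketch(nn.Module):
    """Illustrative forward pass: E_phi -> f_theta -> concat with x_cont -> g_psi."""
    def __init__(self, cardinalities, n_cont, d=32, n_layers=6, n_heads=8, n_classes=2):
        super().__init__()
        # One embedding table per categorical column; the extra row is for missing values.
        self.embeds = nn.ModuleList([nn.Embedding(card + 1, d) for card in cardinalities])
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)  # f_theta
        l = d * len(cardinalities) + n_cont
        self.mlp = nn.Sequential(nn.Linear(l, 4 * l), nn.ReLU(),
                                 nn.Linear(4 * l, 2 * l), nn.ReLU(),
                                 nn.Linear(2 * l, n_classes))                  # g_psi

    def forward(self, x_cat, x_cont):
        e = torch.stack([emb(x_cat[:, i]) for i, emb in enumerate(self.embeds)], dim=1)
        h = self.transformer(e)                        # contextual embeddings h_1, ..., h_m
        z = torch.cat([h.flatten(1), x_cont], dim=-1)  # (d * m + c)-dimensional vector
        return self.mlp(z)

model = TabTransformerSketch(cardinalities=[3, 5, 7], n_cont=4)
x_cat = torch.randint(0, 3, (16, 3))                   # toy batch of class indices per column
x_cont = torch.randn(16, 4)
y = torch.randint(0, 2, (16,))
loss = F.cross_entropy(model(x_cat, x_cont), y)        # L(x, y) = H(g_psi(...), y)
```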


Column Embedding

[Figure: column embedding layout]

  • Each categorical feature $x_i$ has its own embedding lookup table $e_{\phi_i}(.)$.
  • For the $i$th feature with $d_i$ classes, the embedding table $e_{\phi_i}(.)$ contains $(d_i+1)$ embeddings. The additional $(d_i+1)$th embedding is reserved for representing missing (masked) values.
  • Each embedding $e_{\phi_i}(j)$ is represented as $[c_{\phi_i}, w_{\phi_{ij}}]$, where:
    • $c_{\phi_i}$ helps distinguish the classes in column $i$ from those in the other columns.
    • $w_{\phi_{ij}}$ distinguishes the class of the feature $j$ within the $i$th column from the other classes within the same column.
  • *The embedding dimension $d$ appears to be set equal to the Transformer hidden dimension, judging from the released code.

    [Code: column embedding (see the sketch below)]
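
A minimal sketch of the split column embedding $e_{\phi_i}(j) = [c_{\phi_i}, w_{\phi_{ij}}]$ described above. The size of the column-identifier part (here $d/8$) and the class name are illustrative choices, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class ColumnEmbedding(nn.Module):
    """Per-column identifier c_phi_i (shared by all classes of column i) concatenated
    with a per-class embedding w_phi_ij; the last row of each table stands for missing."""
    def __init__(self, cardinalities, d=32):
        super().__init__()
        d_col = d // 8                            # dims for the column identifier (assumed split)
        self.col_id = nn.Parameter(torch.randn(len(cardinalities), d_col))
        self.class_emb = nn.ModuleList(
            [nn.Embedding(card + 1, d - d_col) for card in cardinalities])

    def forward(self, x_cat):                     # x_cat: (batch, m) of class indices
        cols = []
        for i, emb in enumerate(self.class_emb):
            c_i = self.col_id[i].expand(x_cat.size(0), -1)  # same for every class of column i
            w_ij = emb(x_cat[:, i])                          # distinguishes classes within column i
            cols.append(torch.cat([c_i, w_ij], dim=-1))
        return torch.stack(cols, dim=1)           # (batch, m, d), i.e. E_phi(x_cat)

emb = ColumnEmbedding(cardinalities=[3, 5], d=32)
out = emb(torch.tensor([[0, 4], [2, 5]]))         # index 5 is the missing slot of column 2
print(out.shape)                                  # torch.Size([2, 2, 32])
```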


Pre-training

The Transformer layers are trained using inputs consisting of the categorical values $x_{cat}=\lbrace x_1, x_2, ..., x_m\rbrace$ on two pre-training tasks (both corruptions are sketched after the list):

  1. Masked language modeling (MLM)
    • Randomly masks $k\%$ of the input features, where $k$ is set to 30 in the experiments.
    • Minimizes the cross-entropy loss of a multi-class classifier $g_\psi$ that predicts the original classes of the masked features.
  2. Replaced token detection (RTD)
    • Replaces the original value of a feature with a random value drawn from that feature's column.
    • Minimizes the loss of a binary classifier predicting whether the feature has been replaced.
    • Each column has its own embedding lookup table, necessitating the definition of a separate binary classifier for each column.
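
A sketch of how the two corruptions might be constructed. The function names are ours; masking reuses the per-column "missing" index mentioned in the Column Embedding section, and for brevity the RTD sketch does not exclude replacements that coincide with the original value.

```python
import torch

def mask_for_mlm(x_cat, cardinalities, k=0.30):
    """MLM: mask ~k of the entries by pointing them at the extra 'missing' row;
    a multi-class head is then trained to recover the original class indices."""
    mask = torch.rand(x_cat.shape) < k
    missing_idx = torch.tensor(cardinalities).expand_as(x_cat)   # per-column missing index
    x_masked = torch.where(mask, missing_idx, x_cat)
    return x_masked, mask               # cross-entropy is computed only on masked positions

def corrupt_for_rtd(x_cat, cardinalities, k=0.30):
    """RTD: replace ~k of the entries with random classes drawn from the same column;
    a per-column binary head then predicts which entries were replaced."""
    replaced = torch.rand(x_cat.shape) < k
    random_vals = torch.stack(
        [torch.randint(0, card, (x_cat.size(0),)) for card in cardinalities], dim=1)
    x_corrupt = torch.where(replaced, random_vals, x_cat)
    return x_corrupt, replaced.float()  # binary targets for the per-column classifiers

x_cat = torch.randint(0, 3, (4, 3))     # toy batch with three categorical columns
print(mask_for_mlm(x_cat, cardinalities=[3, 3, 3])[0])
```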


Experiments

Settings

Data

  • Models were evaluated on 15 publicly available binary classification datasets sourced from the UCI repository, the AutoML Challenge, and Kaggle.
  • Each dataset was divided into 5 cross-validation splits.
  • Training:Validation:Testing proportion was set to 65:15:20 (%).
  • The number of categorical features ranged from 2 to 136.
  • Semi-supervised and supervised experiments
    • Semi-supervised: Training data consisted of $p$ labeled data points + the remaining unlabeled data, with $p\in \lbrace 50, 200, 500\rbrace$ for 3 different scenarios.
    • Supervised: Fully labeled training data was used.

Setup

  • Hidden dimension: 32
  • Number of Transformer layers: 6
  • Number of attention heads: 8
  • MLP layer architecture: $\lbrace 4\times l, \ 2\times l \rbrace$ (where $l$ represents the size of its input).
  • Hyperparameter optimization (HPO) was conducted with 20 rounds for each cross-validation split.
  • Metrics: Area under the curve (AUC).
  • Pre-training was exclusively applied in the semi-supervised scenario.
    • It was not found to be significantly beneficial when the entire dataset was labeled.
    • Its benefits were more apparent when there were many unlabeled examples and only a few labeled ones, since pre-training provided representations of the data that could not be learned from the labeled examples alone.

Baseline model: An MLP without the Transformer layers was employed to isolate the contribution of the Transformers.
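
For reference, the setup above corresponds roughly to the following configuration; the key names are ours, not the paper's.

```python
# Configuration sketch mirroring the setup above (key names are illustrative).
config = {
    "hidden_dim": 32,                              # embedding / Transformer dimension
    "num_transformer_layers": 6,
    "num_attention_heads": 8,
    "mlp_hidden_sizes": lambda l: [4 * l, 2 * l],  # as a function of the MLP input size l
    "hpo_rounds_per_split": 20,
    "metric": "auc",
    "cv_splits": 5,
    "split_ratio": (0.65, 0.15, 0.20),             # train / validation / test
}
```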


The effectiveness of the Transformer Layers

  1. Performance comparison

    [Table 1: AUC comparison between TabTransformer and the baseline MLP]

    • Conducted in a supervised learning scenario, comparing TabTransformer to MLP.
    • TabTransformer outperforms the baseline MLP on 14 datasets, achieving an average 1.0% gain in AUC.
  2. t-SNE visualization of contextual embeddings

    [Figure 2: t-SNE of contextual embeddings from TabTransformer and MLP]

    • Each marker in the plot represents an average of 2D points over the test data points for a certain class.
    • In the t-SNE plot of the last layer of TabTransformer (Left), semantically similar classes are closely grouped, forming clusters in the embedding space.
    • In the embeddings before the Transformer layers (Center), features with different characteristics only begin to separate.
    • The embeddings of MLP (Right) do not reveal any discernible pattern.
  3. Prediction performance of linear models using the embeddings from different Transformer layers

    [Figure: prediction performance of linear models using embeddings from different Transformer layers]

    • Logistic regression models are employed to evaluate the quality of learned embeddings.
    • Each model predicts $y$ using embedding features along with continuous values.
    • Metrics: Cross-validation score in AUC on the test data.
    • Normalization: Each prediction score is normalized by the best score from an end-to-end trained TabTransformer for the corresponding dataset.
    • Features: The contextual embeddings are aggregated by average/max pooling over the columns rather than concatenated (sketched below).
    • The effectiveness of the embeddings improves as the Transformer layers progress.
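
A sketch of this linear-probe evaluation on synthetic stand-in data. In practice `layer_embeddings` would be the contextual embeddings extracted from a chosen Transformer layer, and each score would additionally be normalized by the end-to-end TabTransformer score; both are omitted here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, m, d, c = 500, 5, 32, 3
layer_embeddings = rng.normal(size=(n, m, d))   # stand-in for {h_1, ..., h_m} from one layer
x_cont = rng.normal(size=(n, c))
y = rng.integers(0, 2, size=n)

# Pool the per-column contextual embeddings (max pooling here) instead of concatenating,
# append the continuous features, and fit a logistic-regression probe.
features = np.concatenate([layer_embeddings.max(axis=1), x_cont], axis=1)
probe = LogisticRegression(max_iter=1000)
auc = cross_val_score(probe, features, y, scoring="roc_auc", cv=5).mean()
print(f"probe AUC: {auc:.3f}")
```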


The robustness of TabTransformer

The robustness of TabTransformer was evaluated by assessing its performance on datasets containing noisy data and data with missing values.

[Figures 4 & 5: performance on noisy data and data with missing values]

  1. Noisy data
    • Method: Values were replaced with randomly generated ones from corresponding columns, introducing noise into datasets.
    • Findings: As the noise increased, TabTransformer demonstrated significantly superior performance compared to the MLP (see fig. 4).
    • The contextual property of embeddings likely contributes to TabTransformer’s robustness in noisy environments.
  2. Data with missing values
    • Method: Some values were artificially made missing, and the models were evaluated on these modified datasets.
      • Missing values were embedded using the average of the learned embeddings over all classes in the corresponding column (see the sketch below).
    • Findings: TabTransformer exhibited better stability than MLP in handling missing values (see fig. 5).
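
A sketch of the missing-value handling above (the noise experiment reuses the same random-replacement scheme as the RTD corruption sketched in the Pre-training section). The helper below is illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

def embed_with_missing(x_cat, tables, missing_mask):
    """Missing entries fall back to the average of the learned embeddings over
    all classes of the corresponding column; other entries use their own embedding."""
    cols = []
    for i, table in enumerate(tables):                      # one nn.Embedding per column
        e = table(x_cat[:, i])
        col_mean = table.weight.mean(dim=0, keepdim=True)   # average over all class embeddings
        e = torch.where(missing_mask[:, i:i + 1], col_mean.expand_as(e), e)
        cols.append(e)
    return torch.stack(cols, dim=1)                         # (batch, m, d)

tables = nn.ModuleList([nn.Embedding(card, 8) for card in [3, 5]])
x_cat = torch.tensor([[0, 1], [2, 4]])
missing = torch.tensor([[False, True], [False, False]])
print(embed_with_missing(x_cat, tables, missing).shape)     # torch.Size([2, 2, 8])
```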


Supervised learning

TabTransformer’s performance was compared against four categories of methods:

  • Logistic Regression and GBDT
  • MLP and sparse MLP
  • TabNet model
  • Variational Information Bottleneck (VIB) model

[Table 2: supervised learning performance comparison]

Findings:

  • TabTransformer demonstrated comparable performance with GBDT.
  • It significantly outperformed recent deep learning models designed for tabular data, including TabNet and VIB.


Semi-supervised learning

TabTransformer was evaluated under the semi-supervised learning scenario and compared against other semi-supervised models, including baseline models:

  • Entropy Regularization (ER)
  • Pseudo Labeling (PL) combined with MLP, TabTransformer, and GBDT
  • MLP (DAE): An unsupervised pre-training method designed for deep models on tabular data, specifically the swap noise Denoising AutoEncoder

[Tables 3 & 4: semi-supervised learning results]

Method:

  • Pre-trained models (TabTransformer-RTD/MLM and MLP): pre-trained on the unlabeled data and then fine-tuned on labeled data.
  • Semi-supervised learning methods (ER and PL): trained on the mix of labeled and unlabeled training data.

Findings:

  • TabTransformer-RTD/MLM outperformed all the other models.
  • TabTransformer (ER), TabTransformer (PL) and GBDT (PL) performed worse than the average of all the models.
  • TabTransformer-RTD consistently surpassed TabTransformer-MLM as the amount of unlabeled data decreased.
    • This could be attributed to the easier pre-training task of a binary classification compared to the multi-class classification of MLM.
  • With only 50 labeled data points, MLP (ER) and MLP (PL) outperformed the TabTransformer models.
    • This suggests that the proposed pre-training yields informative embeddings but does not allow the classifier weights themselves to be trained on unlabeled data.
  • Overall, TabTransformer models show promise in extracting useful information from unlabeled data to aid supervised training, and are particularly useful when the amount of unlabeled data is large.