The Impact of Data Size on Transformer Training: Overfitting & Loss Dynamics

Figure 4: Vanilla Transformers trained on the 2M Question-Formation dataset, following the settings of Murty et al. (2023). The training losses stabilize at approximately 1, which corroborates the result presented in Proposition 4.

:::info
Authors:

(1) Xueyan Niu, Theory Laboratory, Central Research Institute, 2012 Laboratories, Huawei Technologies Co., Ltd.;

(2) Bo Bai (baibo8@huawei.com);

(3) Lei Deng (deng.lei2@huawei.com);

(4) Wei Han (harvey.hanwei@huawei.com).

:::

:::info
This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.

:::

