The GPT paper, "Improving Language Understanding by Generative Pre-Training", notes that NLP at the time relied mainly on supervised training, which requires large amounts of labeled data. The paper's contribution is a semi-supervised recipe: unsupervised pre-training followed by supervised fine-tuning on a specific task. Pre-training produces a general-purpose neural network model, and fine-tuning then sharpens its ability in a particular domain.
In essence, this is a bit like starting with a child / elementary-school student (general-purpose "hardware") and then training them into an expert in some field, much like the way people learn.
In this paper, we explore a semi-supervised approach for language understanding tasks using a combination of unsupervised pre-training and supervised fine-tuning. Our goal is to learn a universal representation that transfers with little adaptation to a wide range of tasks. We assume access to a large corpus of unlabeled text and several datasets with manually annotated training examples (target tasks). Our setup does not require these target tasks to be in the same domain as the unlabeled corpus. We employ a two-stage training procedure. First, we use a language modeling objective on the unlabeled data to learn the initial parameters of a neural network model. Subsequently, we adapt these parameters to a target task using the corresponding supervised objective.
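The two-stage procedure quoted above can be sketched in a few lines. The snippet below is a minimal illustration in PyTorch, not the paper's actual architecture, data, or hyperparameters: a tiny causally masked Transformer (stand-in for GPT's 12-layer decoder) is first trained with a next-token language-modeling loss on unlabeled token sequences, then the same weights are fine-tuned with a supervised classification loss. All names and dimensions here are placeholder assumptions.

```python
import torch
import torch.nn as nn

# Toy dimensions for illustration only; the real GPT uses a 12-layer, 768-dim decoder.
VOCAB, D_MODEL, N_CLASSES, SEQ_LEN = 1000, 64, 2, 32

class TinyCausalLM(nn.Module):
    """A causally masked Transformer with an LM head (stage 1) and a task head (stage 2)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        block = nn.TransformerEncoderLayer(D_MODEL, nhead=4, dim_feedforward=256,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=2)
        self.lm_head = nn.Linear(D_MODEL, VOCAB)       # used in stage 1
        self.cls_head = nn.Linear(D_MODEL, N_CLASSES)  # newly added for stage 2

    def forward(self, x):
        L = x.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        return self.blocks(self.embed(x), mask=causal)

model = TinyCausalLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

# Stage 1: unsupervised pre-training with a next-token language-modeling objective.
unlabeled = torch.randint(0, VOCAB, (8, SEQ_LEN))             # stand-in for raw text
logits = model.lm_head(model(unlabeled[:, :-1]))
loss = ce(logits.reshape(-1, VOCAB), unlabeled[:, 1:].reshape(-1))
loss.backward(); opt.step(); opt.zero_grad()

# Stage 2: supervised fine-tuning on a labeled target task, starting from the
# pre-trained weights; only the small task-specific head is new.
labeled_x = torch.randint(0, VOCAB, (8, SEQ_LEN))
labeled_y = torch.randint(0, N_CLASSES, (8,))
logits = model.cls_head(model(labeled_x)[:, -1])              # last-token state -> class
loss = ce(logits, labeled_y)
loss.backward(); opt.step(); opt.zero_grad()
```

The key point the sketch tries to convey is that stage 2 starts from the parameters learned in stage 1 and only adds a small task-specific output layer, which is what makes transfer to a new task cheap.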
In addition, the Related Work section points out that unsupervised training, semi-supervised training, and even the pre-training + fine-tuning recipe were not invented by this paper; they existed before. What is new is combining them with the Transformer. (The Transformer was proposed by Google in 2017; this paper was published in 2018.) So GPT, too, integrated and combined prior work from different lines of research and ended up with a notably better result.
The closest line of work to ours involves pre-training a neural network using a language modeling objective and then fine-tuning it on a target task with supervision. Dai et al. [13] and Howard and Ruder [21] follow this method to improve text classification. However, although the pre-training phase helps capture some linguistic information, their usage of LSTM models restricts their prediction ability to a short range. In contrast, our choice of transformer networks allows us to capture longer-range linguistic structure, as demonstrated in our experiments.
Note that GPT (GPT-1) has about 117M parameters.
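A back-of-the-envelope check shows where the ~117M figure comes from, assuming the configuration reported in the paper (12 Transformer blocks, 768-dimensional states, 3072-dimensional feed-forward layers, 512-token context, and a BPE vocabulary of roughly 40,000 merges). The exact vocabulary size used below (40,478) is an assumption for illustration.

```python
# Rough parameter count for a GPT-1-sized decoder; vocab size is an approximation.
vocab, ctx, d, d_ff, layers = 40478, 512, 768, 3072, 12

embeddings = vocab * d + ctx * d                   # token + position embeddings
attention  = 4 * (d * d + d)                       # Q, K, V, output projections (+ biases)
ffn        = (d * d_ff + d_ff) + (d_ff * d + d)    # two feed-forward matrices (+ biases)
layernorms = 2 * 2 * d                             # two LayerNorms per block
per_block  = attention + ffn + layernorms

total = embeddings + layers * per_block
print(f"{total / 1e6:.1f}M parameters")            # ~116.5M, i.e. the quoted 117M
```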