关于大模型

从GPT3，正式进入了大模型时代。

GPT 117M
GPT2 1.5B
GPT3 175B

GPT3参数大小是GPT2的100倍，GPT的1000倍。

GPT
- 利用两阶段训练（无监督预训练 + 监督微调），结合Transformer，提升了在语言理解（Language Understanding）领域的效果。
- 关键点
  - 预训练（pre-training）
  - Transformer
- 论文
  - 《Improving Language Understanding by Generative Pre-Training》
- 参数大小
  - 117M
GPT2
- 沿着GPT思路，重点研究了无监督训练，发现无监督训练+大数据集，即可达到SOTA。
- 关键点
  - 无监督（unsupervised）
  - 大数据集（40G）
  - 大参数（1.5B）
- 论文
  - 《Language Models are Unsupervised Multitask Learners》
- 参数大小
  - 1.5B
GPT3
- 沿着GPT3思路，发现通过进一步扩大模型规模，结合few-shot可以达到SOTA。
- 关键点
  - 大模型（175B）
  - few-shot
- 论文
  - 《Language Models are Few-Shot Learners》

引言强调：

虽然两阶段训练（无监督训练+监督微调）可以提升效果。但在微调阶段，仍需要相当的标注样本来训练（上千甚至上万）。而人类只需要几个例子，就可以学习到新技能。

本论文发现，通过扩大模型规模，就可以达到类人的few-shot能力，而不需要模型微调。（只要无监督训练就可得到模型，然后测试的时候结合few-shot）

Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions – something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art finetuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3’s few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.

关于大模型的实用性

GPT-3 在现实中可能不太实用（太贵太重），一个办法是使用蒸馏的小模型