
GPT-3 Note

Table of contents

Language Models are Few-shot Learners
ChatGPT
InstructGPT


Language Models are Few-shot Learners

Paper address

Abstract Takeaway

Large task-specific datasets and fine-tuning are still generally needed today. Taking human ability as the reference point, task-agnostic and few-shot performance are the two main goals to achieve. Here, a very large amount of training data (scaling the language model up to 175 billion parameters) is the key ingredient.

Problems to notice: methodological issues related to training on large web corpora, and failures on some specific datasets.

Overall strengths: translation, question answering, cloze tasks, on-the-fly reasoning, domain adaptation.

Text generation conditioned on input prompts: news articles, product descriptions, creative writing… (text completion), as well as language translation.

Introduction Takeaway

Single-layer representations (word vectors) → task-specific architectures

RNNs with multiple layers of representations + contextual state → task-specific architectures

Pre-trained recurrent or transformer language models → task-agnostic architectures, but still task-specific datasets & fine-tuning

Meta-learning is one possible solution: repeated sub-tasks (in-context learning) can be embedded within a single sequence and shared across sequences (a sequence corresponds to one forward pass in the paper). Still, fine-tuned results remain the winner.

Note that the term meta-learning is used here intentionally broadly: it does not distinguish whether the model genuinely learns new tasks from scratch at inference time or merely recognizes patterns correlated with its training samples.

Another way to ease the limitation is scaling: training transformer language models of increasing capacity (100 million params → 300 million → 1.5 billion → 8 billion → 11 billion → 17 billion). The log loss has been shown to improve smoothly with scale and to correlate well with performance on many downstream tasks.
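The scaling trend is usually summarized as a power law in parameter count. Below is a minimal sketch of that shape; the constants `ALPHA` and `N_C` are purely hypothetical placeholders (not values from the paper), chosen only to illustrate how log loss is claimed to shrink smoothly as model size grows.

```python
# Illustrative power-law scaling sketch. ALPHA and N_C are hypothetical
# constants for demonstration only -- not figures from the GPT-3 paper.
ALPHA = 0.08   # assumed power-law exponent
N_C = 1e14     # assumed reference parameter count

def illustrative_loss(n_params: float) -> float:
    """Loss decreases smoothly (as a power law) when parameters increase."""
    return (N_C / n_params) ** ALPHA

# The model sizes mentioned above, from 100M up to GPT-3's 175B parameters.
for n in [1e8, 3e8, 1.5e9, 8e9, 1.1e10, 1.7e10, 1.75e11]:
    print(f"{n:.1e} params -> illustrative loss {illustrative_loss(n):.3f}")
```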

Three evaluation settings are used for this test (a minimal prompt-construction sketch follows the list):

a) few-shot learning / in-context learning: a few demonstrations (i.e. input-output pairs showing what the prompt looks like and what output is expected) are given to the model as conditioning at inference time, with no weight updates; the number of demonstrations is typically 10 to 100 (i.e. as many examples as fit in the model's context window).

b) one-shot learning: one demonstration only.

c) zero-shot learning: no demonstrations, only a natural-language instruction (e.g. telling the model 'Generate a summary for the attached article' directly).
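To make the three settings concrete, here is a minimal prompt-construction sketch. The translation demonstrations echo the style of the paper's English→French example, but the `build_prompt` helper and its exact formatting are hypothetical; the only point is that the settings differ in how many input-output pairs are packed into the context, with no weight updates in any case.

```python
# Hypothetical English->French demonstrations (input-output pairs).
DEMOS = [
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
    ("cheese", "fromage"),
]

def build_prompt(instruction: str, query: str, n_demos: int) -> str:
    """Pack an instruction, n_demos demonstrations, and the query into one prompt.

    n_demos = 0  -> zero-shot (instruction only)
    n_demos = 1  -> one-shot
    n_demos > 1  -> few-shot (bounded by how much fits in the context window)
    """
    lines = [instruction]
    for src, tgt in DEMOS[:n_demos]:
        lines.append(f"{src} => {tgt}")
    lines.append(f"{query} =>")  # the model is asked to complete this line
    return "\n".join(lines)

instruction = "Translate English to French:"
for k in (0, 1, 3):
    print(f"--- {k}-shot prompt ---")
    print(build_prompt(instruction, "otter", k))
```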

Typical development of a model: architecture → datasets → training → post-training (fine-tuning toward the desired task). Within this process there is a trend of moving from task-specific to task-agnostic, in other words, generalizing the model so that it can serve as the basis of solutions across applications.

Methodologies Takeaway

Figure picked from the paper: how to differentiate the 3+1 settings (fine-tuning vs. few- / one- / zero-shot) used for evaluating GPT-3.

This work evaluates the three settings (i.e. few- / one- / zero-shot), which all rely only on the 'training' stage of the pipeline above, with no fine-tuning stage afterwards. The question is: without fine-tuning, how well can the pre-trained model perform?

Results Takeaway

Results on the specific tasks suggest possible concrete applications: code & writing auto-completion, grammar assistance, narrative generation, improving search-engine responses, question answering…

Analysis Takeaway

Large models should be trained with attention to training-data contamination, specifically overlap between the test & evaluation datasets and the training data. Memorization of benchmark data should be prevented intentionally. However, due to a bug that slipped through and the high cost of retraining, this work instead studies the impact of the detected dataset overlap.

Limitations and Future Works Takeaway

  1. Weaknesses in text synthesis: semantic repetition at the document level, loss of coherence over sufficiently long passages, self-contradiction, occasional non-sequitur sentences / paragraphs, and difficulty with common-sense physics.
  2. Reading comprehension: difficulty determining whether two words are used the same way in a sentence, and with resolving pronouns.
  3. Deeper structural & algorithmic limitations: the autoregressive language model has no bidirectional architecture (it only attends to left context; see the attention-mask sketch after this list) and no other training objectives such as denoising or re-reading and revising its response.
  4. Still limited by the pre-training objective itself. Consider piping in inputs other than plain text (other modalities / forms of interaction such as video and images).
  5. Poor sample efficiency: an enormous corpus is needed for training, far more text than a human absorbs in a lifetime.
  6. It is still unclear whether the model learns new tasks from scratch at inference time or simply recognizes patterns correlated with its training samples.
  7. Inference is expensive and inconvenient at this scale. Distillation might help.
  8. Decisions are not interpretable. Bias and prejudice present in the training content are retained.
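To make item 3 concrete, here is a minimal sketch of the attention-mask difference: an autoregressive (GPT-style) model masks out future positions, whereas a bidirectional (BERT-style) encoder lets every position attend to every other. The helper names are hypothetical; numpy is used only for brevity.

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Autoregressive mask: position i may only attend to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def bidirectional_mask(seq_len: int) -> np.ndarray:
    """Bidirectional mask: every position may attend to every other position."""
    return np.ones((seq_len, seq_len), dtype=bool)

n = 4
print("causal (GPT-style):\n", causal_mask(n).astype(int))
print("bidirectional (BERT-style):\n", bidirectional_mask(n).astype(int))
```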

Other Impacts Takeaway

Generated text is good enough that it is hard to distinguish synthetic text from human-written text.

  1. intentional misuse
  2. bias, fairness, and ethics issues
  3. energy efficiency (~1000 vs ~10 petaflop/s-days of compute, GPT-3 vs GPT-2, 175B vs 1.5B; see the unit-conversion sketch below)
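For scale, a rough unit conversion of the compute figures above (the ~1000 vs ~10 petaflop/s-days are the note's approximate numbers, not exact paper values): one petaflop/s-day is 10^15 floating-point operations per second sustained for one day.

```python
PFLOP_S = 1e15            # one petaflop/s = 1e15 floating-point operations per second
SECONDS_PER_DAY = 86_400

def pfs_days_to_flops(pfs_days: float) -> float:
    """Convert petaflop/s-days into a total number of floating-point operations."""
    return pfs_days * PFLOP_S * SECONDS_PER_DAY

# Approximate figures quoted above: ~1000 for GPT-3 175B vs ~10 for GPT-2 1.5B.
for name, pfs_days in [("GPT-3 175B (~1000 pf/s-days)", 1000),
                       ("GPT-2 1.5B (~10 pf/s-days)", 10)]:
    print(f"{name}: ~{pfs_days_to_flops(pfs_days):.2e} total FLOPs")
```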

ChatGPT

Reference: Introducing ChatGPT

A sibling model to InstructGPT, whose task is to follow an instruction in a prompt and provide a detailed response.

Methods

Trained using RLHF (Reinforcement Learning from Human Feedback), with the same methods as InstructGPT.

InstructGPT

Reference: Aligning language models to follow instructions

Paper address: Training language models to follow instructions with human feedback

Keywords: #language models, #follow intentional instructions, #GPT-3

GPT-3 can handle natural-language tasks given text inputs formatted as 'prompts'. However, GPT-3 is not aligned with users; in other words, it may generate toxic responses.

RLHF (reinforcement learning from human feedback), combined with supervised fine-tuning, is used to fine-tune GPT-3; the result is InstructGPT. The 'prompts' here are submitted by users through OpenAI's API and receive manual review before being added to the training dataset. Here is how the RLHF method is used to fine-tune GPT-3:

collect prompts from users → add them to the training dataset → labelers write demonstration outputs, which are used to fine-tune GPT-3 in a supervised way → the fine-tuned model generates several outputs per prompt → labelers rank these outputs → the ranking data is used to train a reward model → the reward model's scores are used with PPO (Proximal Policy Optimization) to fine-tune GPT-3 again.
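A minimal sketch of the reward-model step in this pipeline, under the usual pairwise-ranking formulation: for a labeler-preferred output w and a less-preferred output l, the reward model is trained to minimize -log σ(r_w - r_l). The toy scores and the `ranking_loss` helper below are hypothetical; the subsequent PPO step, not shown, then optimizes the policy against this learned reward.

```python
import math

def ranking_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Pairwise ranking loss for a reward model: -log sigmoid(r_w - r_l).

    Small when the preferred output already scores higher,
    large when the reward model ranks the pair the wrong way round.
    """
    diff = reward_preferred - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# Toy reward scores for two labeler-ranked pairs (hypothetical numbers).
print(ranking_loss(reward_preferred=2.0, reward_rejected=-1.0))  # correctly ordered -> small loss
print(ranking_loss(reward_preferred=-0.5, reward_rejected=1.5))  # wrongly ordered -> large loss
```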

Figure picked from InstructGPT’s reference post: the three-step pipeline for fine-tuning GPT-3 with human feedback.

Flaws of the original GPT-3 to tackle: making up facts, generating toxic output.

Labelers prefer the outputs of the 1.3B-parameter InstructGPT over those of the 175B GPT-3.

The InstructGPT work is best seen as an 'alignment' exercise. Its main limitations: aligning to labelers' preferences is itself questionable, the model is biased toward the cultural values of English-speaking people, it is sensitive to how instructions are phrased, and the flaws of the original GPT-3 still exist.

Figure picked from ChatGPT’s reference post: almost the same pipeline as InstructGPT’s.

Limitations
