The journey of OpenAI GPT models (2023)


OpenAI's Generative Pre-trained Transformer (GPT) models have taken the natural language processing (NLP) community by storm by introducing very powerful language models. These models can perform various NLP tasks, e.g. question answering, textual entailment, text summarization, etc., without any supervised training. These language models need very few or no examples to understand the tasks and perform on par with, or even better than, state-of-the-art models trained in a supervised fashion.

In this article we will cover the journey of these models and understand how they have evolved over a two-year period. We will discuss the following topics:

1. Discussion of the GPT-1 paper (Improving Language Understanding by Generative Pre-Training).

2. Discussion of the GPT-2 paper (Language Models are Unsupervised Multitask Learners) and its improvements over GPT-1.

3. Discussion of the GPT-3 paper (Language Models are Few-Shot Learners) and the improvements that have made it one of the most powerful models NLP has seen so far.

This article assumes familiarity with basic NLP terminology and the Transformer architecture.

Let's start by understanding these papers one by one. To make this journey easier to follow, I have divided each paper into four sections: the goals and concepts discussed in the paper, the dataset used, the model architecture and implementation details, and the performance on benchmarks.

Prior to GPT-1, most state-of-the-art NLP models were trained for one particular task, such as sentiment classification or textual entailment, using supervised learning. However, supervised models have two main limitations:


i. They need a large amount of annotated data to learn a particular task, which is often not readily available.

ii. They fail to generalize to tasks other than the ones they were trained on.

The GPT-1 paper proposed learning a generative language model from unlabeled data and then fine-tuning that model by providing examples of specific downstream tasks such as classification, sentiment analysis, textual entailment, etc.

Unsupervised learning served as the pre-training objective for the supervised fine-tuned models, hence the name generative pre-training.

Let's review the concepts and approaches covered in this paper.

1. Learning Objectives and Concepts: This semi-supervised learning (unsupervised pre-training followed by supervised fine-tuning) for NLP tasks has the following three components:

a. Unsupervised Language Modeling (Pre-training): For unsupervised learning, the standard language model objective was used.

$$L_1(T) = \sum_{i} \log P(t_i \mid t_{i-k}, \ldots, t_{i-1}; \theta) \tag{i}$$

where T was the set of tokens in unsupervised data {t_1,...,t_n}, k was the size of the context window, θ were the neural network parameters trained with stochastic gradient descent.
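The language modeling objective described above can be illustrated with a small sketch. It sums the log-probability of every token given at most k preceding tokens. The `cond_prob` callable and the `uniform_model` toy are hypothetical stand-ins for the neural network with parameters θ; they are not part of the GPT-1 paper.

```python
import math

def lm_log_likelihood(tokens, k, cond_prob):
    """Sum of log P(t_i | t_{i-k}, ..., t_{i-1}) over a token sequence.

    cond_prob(context, token) is a hypothetical stand-in for the
    neural network parameterized by theta.
    """
    total = 0.0
    for i, token in enumerate(tokens):
        # The context window holds at most the k previous tokens.
        context = tuple(tokens[max(0, i - k):i])
        total += math.log(cond_prob(context, token))
    return total

def uniform_model(context, token, vocab_size=4):
    # Toy model (assumption): ignores the context and spreads
    # probability evenly over a vocabulary of vocab_size tokens.
    return 1.0 / vocab_size

tokens = ["the", "cat", "sat", "down"]
objective = lm_log_likelihood(tokens, k=2, cond_prob=uniform_model)
print(objective)
```

Training maximizes this quantity, which is the same as minimizing the average negative log-likelihood per token.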

b. Supervised fine-tuning: This part aimed to maximize the likelihood of observing label y, given the features or tokens x_1,…,x_n.

$$L_2(C) = \sum_{(x, y)} \log P(y \mid x_1, \ldots, x_n) \tag{ii}$$

where C was the labeled data set consisting of training samples.

Instead of simply maximizing the objective in equation (ii), the authors added an auxiliary learning objective to supervised fine-tuning to get better generalization and faster convergence. The modified training objective was stated as:

$$L_3(C) = L_2(C) + \lambda L_1(C) \tag{iii}$$

where L₁(C) was the auxiliary objective of learning the language model and λ was the weight given to this secondary learning objective. λ was set to 0.5.

Supervised fine-tuning was achieved by adding a linear layer and a softmax layer to the transformer model to obtain the task labels for downstream tasks.
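To make the fine-tuning step concrete, here is a minimal sketch of a linear layer followed by softmax, combined with the auxiliary language modeling term weighted by λ = 0.5. The function names, the plain-list vectors, and the single-example setup are illustrative assumptions; the paper applies this to the final transformer activation over a whole labeled dataset.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(h, weights, bias):
    # weights holds one row per class; h is the transformer's final activation.
    return [sum(w * x for w, x in zip(row, h)) + b
            for row, b in zip(weights, bias)]

def fine_tuning_objective(h, label, weights, bias, lm_objective, lam=0.5):
    """Supervised log-likelihood of one example plus lam times the
    auxiliary language modeling objective."""
    probs = softmax(linear(h, weights, bias))
    supervised = math.log(probs[label])
    return supervised + lam * lm_objective

# Toy usage: two classes, zero weights, so both classes get probability 0.5.
value = fine_tuning_objective(
    h=[1.0, 0.0], label=1,
    weights=[[0.0, 0.0], [0.0, 0.0]], bias=[0.0, 0.0],
    lm_objective=-2.0,
)
print(value)
```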

c. Task-specific input transformations: In order to make minimal changes to the architecture of the model during fine-tuning, inputs to the specific downstream tasks were transformed into ordered sequences. The tokens were rearranged as follows:

  • Start and end tokens were added to the input sequences.

  • A delimiter token was added between the different parts of an example so that the input could be sent as an ordered sequence. For tasks such as question answering and multiple-choice questions, multiple sequences were sent for each example. For instance, a training example for the question answering task consisted of sequences for the context, the question, and the answer.
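The transformations above can be sketched as follows. The token strings `<s>`, `$` and `<e>` and both helper functions are hypothetical choices made for illustration; the only thing taken from the text is the layout: start token, the parts of the example separated by a delimiter, end token, and one sequence per candidate answer.

```python
# Hypothetical special tokens (assumptions, not the model's actual vocabulary).
START, DELIM, END = "<s>", "$", "<e>"

def classification_input(text_tokens):
    # Single-text tasks only need the start and end tokens.
    return [START, *text_tokens, END]

def qa_inputs(context, question, answers):
    # Question answering: one ordered sequence per candidate answer,
    # with a delimiter between the question and the answer.
    return [
        [START, *context, *question, DELIM, *answer, END]
        for answer in answers
    ]

sequences = qa_inputs(
    context=["Paris", "is", "in", "France"],
    question=["Where", "is", "Paris", "?"],
    answers=[["France"], ["Spain"]],
)
for seq in sequences:
    print(" ".join(seq))
```

Each sequence is scored by the model independently, and the scores are then normalized across the candidate answers.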

2. Dataset: GPT-1 used the BooksCorpus dataset to train the language model. BooksCorpus had around 7,000 unpublished books, which helped train the language model on unseen data. This data was unlikely to be found in the test sets of downstream tasks. In addition, this corpus contained large stretches of contiguous text, which helped the model learn long-range dependencies.

3. Model architecture and implementation details: GPT-1 used a 12-layer decoder-only transformer with masked self-attention to train the language model. The architecture of the model remained largely the same as described in the original transformer paper. Masking helped achieve the language model objective, in which the language model did not have access to subsequent words to the right of the current word.
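The effect of masking in self-attention can be shown with a small sketch. Position i may only attend to positions j ≤ i, which is enforced by setting the scores of later positions to negative infinity before the softmax. The function names are illustrative, and the raw score matrix is assumed to be given.

```python
import math

def causal_mask(n):
    # mask[i][j] is True when position i is allowed to attend to position j.
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_attention_weights(scores):
    """Row-wise softmax over raw attention scores with a causal mask."""
    n = len(scores)
    mask = causal_mask(n)
    weights = []
    for i in range(n):
        # Hidden (future) positions get -inf, so exp() turns them into 0.
        visible = [scores[i][j] if mask[i][j] else -math.inf for j in range(n)]
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

weights = masked_attention_weights([[0.0, 0.0, 0.0] for _ in range(3)])
for row in weights:
    print(row)
```

With equal scores, the first token attends only to itself, the second splits its attention over the first two tokens, and so on; no weight is ever placed on a token to the right.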

The implementation details follow:


a. For unsupervised training:

  • A byte pair encoding (BPE) vocabulary with 40,000 merges was used.
  • The model used a 768-dimensional state to encode tokens into word embeddings. Position embeddings were also learned during training.
  • A 12-layer model was used, with 12 attention heads in each self-attention layer.
  • A 3072-dimensional state was used for the position-wise feed-forward layer.
  • The Adam optimizer was used with a learning rate of 2.5e-4.
  • Attention, residual, and embedding dropouts were used for regularization, with a dropout rate of 0.1. A modified version of L2 regularization was also used for non-bias weights.
  • GELU was used as the activation function.
  • The model was trained for 100 epochs on mini-batches of size 64 and sequence length 512. The model had a total of 117 million parameters.
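As a sanity check on the hyperparameters listed above, the parameter count can be estimated from them. This is a rough sketch under stated assumptions: the vocabulary size is approximated by the 40,000 merges, the output layer is assumed to share weights with the token embedding, and each layer is assumed to contain the standard attention projections, a two-layer feed-forward network, and two layer normalizations.

```python
def transformer_params(vocab_size, context, d_model, d_ff, n_layers):
    """Itemized parameter estimate for a decoder-only transformer."""
    token_emb = vocab_size * d_model
    position_emb = context * d_model
    attention = (d_model * 3 * d_model + 3 * d_model)   # query, key, value
    attention += (d_model * d_model + d_model)          # output projection
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)                     # gain and bias, twice
    per_layer = attention + feed_forward + layer_norms
    return token_emb + position_emb + n_layers * per_layer

# vocab_size=40_000 is an approximation (assumption) based on the merge count.
gpt1 = transformer_params(vocab_size=40_000, context=512,
                          d_model=768, d_ff=3072, n_layers=12)
print(f"{gpt1:,}")
```

Under these assumptions the estimate comes out at roughly 116 million, which is consistent with the 117 million parameters reported for GPT-1.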

b. For supervised fine-tuning:

  • Supervised fine-tuning took as few as 3 epochs for most of the downstream tasks. This showed that the model had already learned a lot about the language during pre-training, so minimal fine-tuning was enough.
  • Most of the hyperparameters from unsupervised pre-training were reused for fine-tuning.

4. Performance and Summary:

GPT-1 performed better than specifically trained, supervised state-of-the-art models on 9 of the 12 tasks on which the models were compared.

Another significant achievement of this model was its decent zero-shot performance on various tasks. The paper demonstrated that, thanks to pre-training, the model's zero-shot performance had improved on different NLP tasks such as question answering, Winograd schema resolution, sentiment analysis, etc.

GPT-1 showed that language modeling served as an effective pre-training objective that could help the model generalize well. The architecture facilitated transfer learning and could perform various NLP tasks with very little fine-tuning. This model demonstrated the power of generative pre-training and paved the way for other models that could better unleash this potential with larger datasets and more parameters.

The developments in the GPT-2 model were mostly in terms of using a larger dataset and adding more parameters to the model in order to learn an even stronger language model. Let's look at the main developments in the GPT-2 model and the concepts discussed in the paper:

  1. Learning Objectives and Concepts: The following are the two important concepts discussed in this paper in the context of NLP.
  • Task conditioning: We have seen that the training objective of the language model is formulated as P(output | input). However, GPT-2 aimed to learn multiple tasks using the same unsupervised model. To achieve that, the learning objective has to be modified to P(output | input, task). This modification is known as task conditioning, where the model is expected to produce different outputs for the same input for different tasks. Some models implement task conditioning at the architectural level, where the model is fed both the input and the task. For language models, the output, the input, and the task are all sequences of natural language. Thus, task conditioning for language models is performed by providing examples or natural language instructions to the model to perform a task. Task conditioning forms the basis for zero-shot task transfer, which we will cover next.
  • Zero-Shot Learning and Zero-Shot Task Transfer: An interesting capability of GPT-2 is zero-shot task transfer. Zero-shot learning is a special case of zero-shot task transfer in which no examples are provided at all and the model understands the task based on the given instruction. Instead of rearranging the sequences as was done for fine-tuning GPT-1, the input to GPT-2 was given in a format that expected the model to understand the nature of the task and provide answers. This was done to emulate zero-shot task transfer behavior. For example, for the English-to-French translation task, the model was given an English sentence followed by the word French and a prompt (:). The model was supposed to understand that this was a translation task and give the French counterpart of the English sentence.
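Task conditioning through plain text can be illustrated with a tiny prompt builder. The exact layout (a newline between the input and the task word, and the helper name `zero_shot_prompt`) is an assumption made for illustration; the text above only specifies that the input is followed by a task word and a colon.

```python
def zero_shot_prompt(text, task_word):
    """Condition the model on a task purely through the prompt text.

    The same input paired with a different task word yields a different
    prompt, which is how P(output | input, task) is expressed in language.
    """
    return f"{text}\n{task_word}:"

sentence = "The weather is nice today."
translation_prompt = zero_shot_prompt(sentence, "French")
summary_prompt = zero_shot_prompt(sentence, "Summary")

print(translation_prompt)
print(summary_prompt)
```

No gradient update is involved: the language model simply continues the prompt, and the task word steers what kind of continuation is likely.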

2. Dataset: In order to create a large, high-quality dataset, the authors scraped the Reddit platform and pulled data from the outbound links of highly upvoted posts. The resulting dataset, called WebText, contained 40 GB of text data from over 8 million documents. This dataset was used to train GPT-2 and was huge compared to the BooksCorpus dataset used to train the GPT-1 model. All Wikipedia articles were removed from WebText, since many test sets contain Wikipedia articles.
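The selection rule described above can be sketched as a simple filter. The record layout (`url` and `karma` keys), the function name, and the default threshold are assumptions for illustration: the GPT-2 paper mentions a minimum of 3 karma, while this article only says "highly upvoted". The real pipeline also involved content extraction and cleaning steps that are not shown here.

```python
from urllib.parse import urlparse

def select_links(posts, min_karma=3):
    """Keep outbound links from sufficiently upvoted posts, skipping
    Wikipedia pages and duplicate URLs."""
    selected, seen = [], set()
    for post in posts:
        url = post["url"]
        host = urlparse(url).netloc.lower()
        if post["karma"] < min_karma:
            continue  # not upvoted enough
        if host.endswith("wikipedia.org"):
            continue  # avoid overlap with common test sets
        if url in seen:
            continue  # drop duplicates
        seen.add(url)
        selected.append(url)
    return selected

# Hypothetical example records.
posts = [
    {"url": "https://example.com/a", "karma": 12},
    {"url": "https://en.wikipedia.org/wiki/GPT", "karma": 50},
    {"url": "https://example.com/b", "karma": 1},
    {"url": "https://example.com/a", "karma": 7},
]
links = select_links(posts)
print(links)
```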

3. Model architecture and implementation details: GPT-2 had 1.5 billion parameters, more than 10 times as many as GPT-1 (117 million parameters). The main differences from GPT-1 were:

  • GPT-2 had 48 layers and used 1600-dimensional vectors for word embeddings.
  • A larger vocabulary of 50,257 tokens was used.
  • A larger batch size of 512 and a larger context window of 1024 tokens were used.
  • Layer normalization was moved to the input of each sub-block, and an additional layer normalization was added after the final self-attention block.
  • At initialization, the weights of the residual layers were scaled by 1/√N, where N was the number of residual layers.
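The last two architectural changes can be sketched in a few lines. This is a simplified illustration: the layer normalization below omits the learned gain and bias, `sublayer` is a placeholder for an attention or feed-forward block, and how N is counted is left to the caller.

```python
import math

def layer_norm(x, eps=1e-5):
    # Simplified layer normalization without learned gain and bias (assumption).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_ln_residual(x, sublayer):
    """GPT-2 ordering: normalize the input of the sub-block first,
    then add the sub-block output back onto the residual stream."""
    return [xi + yi for xi, yi in zip(x, sublayer(layer_norm(x)))]

def residual_init_scale(n_residual_layers):
    # Factor applied to residual-layer weights at initialization: 1/sqrt(N).
    return 1.0 / math.sqrt(n_residual_layers)

x = [1.0, 3.0]
unchanged = pre_ln_residual(x, lambda h: [0.0 for _ in h])
scale = residual_init_scale(4)
print(unchanged, scale)
```

Scaling by 1/√N keeps the accumulated contribution of the residual branches from growing with depth, which matters more as the number of layers increases.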

The authors trained four language models with 117M (the same as GPT-1), 345M, 762M and 1.5B (GPT-2) parameters. Each subsequent model had lower perplexity than the previous one. This established that the perplexity of language models on the same dataset decreases as the number of parameters increases. In addition, the model with the highest number of parameters performed better on every downstream task.
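Since perplexity is the metric used to compare these models, a minimal sketch of how it is computed may help. It is the exponential of the average negative log-probability that the model assigns to the tokens of a test set. The input here is assumed to be a list of per-token probabilities already produced by some model.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to each
    actual token of the test set (lower is better)."""
    n = len(token_probs)
    avg_neg_log = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log)

# A model that is always unsure between 4 equally likely tokens.
uncertain = perplexity([0.25, 0.25, 0.25, 0.25])
# A model that assigns higher probability to the correct tokens.
confident = perplexity([0.9, 0.8, 0.95, 0.85])
print(uncertain, confident)
```

A perplexity of 4 can be read as the model being as uncertain as if it were choosing uniformly among 4 tokens at each step.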

4. Performance and Summary: GPT-2 was evaluated on several datasets of downstream tasks such as reading comprehension, summarization, translation, question answering, etc. Let's take a detailed look at some of these tasks and how GPT-2 performed on them:


  • GPT-2 improved the existing state of the art on 7 out of 8 language modeling datasets in the zero-shot setting.
  • The Children's Book Test dataset evaluates the performance of language models on categories of words such as nouns, prepositions, named entities, etc. GPT-2 increased the state-of-the-art accuracy by about 7% for common noun and named entity recognition.
  • The LAMBADA dataset evaluates the performance of models in identifying long-range dependencies and predicting the last word of a sentence. GPT-2 reduced the perplexity from 99.8 to 8.6 and improved the accuracy significantly.
  • GPT-2 outperformed 3 out of 4 baseline models on reading comprehension tasks in the zero-shot setting.
  • On the French-to-English translation task, GPT-2 performed better than most unsupervised models in the zero-shot setting, but did not outperform the state-of-the-art unsupervised model.
  • GPT-2 did not perform well on text summarization; its performance was similar to or worse than that of classic models trained for summarization.

GPT-2 achieved state-of-the-art results on 7 out of 8 tested language modeling datasets in the zero-shot setting.

GPT-2 showed that training on a larger dataset with more parameters improved the language model's ability to understand tasks and to surpass the state of the art on many tasks in the zero-shot setting. The paper found that performance increased in a log-linear fashion as model capacity increased. Furthermore, the drop in perplexity of the language models did not saturate and kept going down as the number of parameters increased. In fact, GPT-2 underfit the WebText dataset, and training for longer might have reduced perplexity even further. This showed that the size of the GPT-2 model was not the limit and that building even larger language models would reduce perplexity and make language models better at natural language understanding.

In its quest to build very strong and powerful language models that would need no fine-tuning and only a few demonstrations to understand and perform tasks, OpenAI built the GPT-3 model with 175 billion parameters. This model had 10 times more parameters than Microsoft's powerful Turing NLG language model and 100 times more parameters than GPT-2. Due to the large number of parameters and the extensive dataset GPT-3 was trained on, it performs well on downstream NLP tasks in the zero-shot and few-shot settings. Owing to its large capacity, it has capabilities such as writing articles that are hard to distinguish from those written by humans. It can also perform on-the-fly tasks that it was never explicitly trained on, such as summing up numbers, writing SQL queries and code, unscrambling the words in a sentence, writing React and JavaScript code given a natural language description of the task, etc. Let's understand the concepts and developments mentioned in the GPT-3 paper, along with some of the broader implications and limitations of this model:

  1. Learning Objectives and Concepts: Let's discuss the two concepts covered in this paper.
  • In-context learning: Large language models develop pattern recognition and other skills using the text data they are trained on. While learning the primary objective of predicting the next word given the context words, the language models also start recognizing patterns in the data that help them minimize the loss of the language modeling task. Later, this ability helps the model during zero-shot task transfer. When presented with a few examples and/or a description of what it needs to do, the language model matches the pattern of the examples with what it has learned in the past for similar data, and uses that knowledge to perform the task. This is a powerful capability of large language models that increases with the number of model parameters.
  • Few-shot, one-shot and zero-shot settings: As discussed above, the few-, one- and zero-shot settings are special cases of zero-shot task transfer. In the few-shot setting, the model is provided with a task description and as many examples as fit into the model's context window. In the one-shot setting, the model is provided with exactly one example, and in the zero-shot setting, no example is provided. As the capacity of the model increases, its few-, one- and zero-shot capabilities also improve.
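The three settings differ only in how many examples are placed in the prompt, which can be sketched as follows. The prompt layout, the `=>` separator, and counting tokens by whitespace splitting are illustrative assumptions; a real system would count tokens with the model's own tokenizer.

```python
def build_prompt(description, examples, query, max_tokens, shots=None):
    """Assemble a prompt: shots=0 is zero-shot, shots=1 is one-shot,
    shots=None is few-shot with as many examples as fit the context window."""
    def count(text):
        # Crude token count by whitespace (assumption).
        return len(text.split())

    budget = max_tokens - count(description) - count(query)
    chosen = []
    for example in examples[:shots]:
        cost = count(example)
        if cost > budget:
            break  # the context window is full
        chosen.append(example)
        budget -= cost
    return "\n".join([description, *chosen, query])

description = "Translate English to French:"
examples = ["sea otter => loutre de mer", "cheese => fromage"]
query = "plush giraffe =>"

zero_shot = build_prompt(description, examples, query, max_tokens=2048, shots=0)
one_shot = build_prompt(description, examples, query, max_tokens=2048, shots=1)
few_shot = build_prompt(description, examples, query, max_tokens=2048)
tight = build_prompt(description, examples, query, max_tokens=14)
print(few_shot)
```

In all three cases the model weights stay fixed; the examples only occupy part of the context window.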

2. Dataset: GPT-3 was trained on a mix of five different corpora, each assigned a certain weight. High-quality datasets were sampled more often, and the model was trained on them for more than one epoch. The five datasets used were Common Crawl, WebText2, Books1, Books2 and Wikipedia.
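Weighted mixing of corpora can be sketched with weighted random sampling. The mixture weights below are the approximate percentages listed in the GPT-3 paper's dataset table; they are not stated in this article, and they sum to slightly more than 1 because of rounding. The function name and the seeding are illustrative assumptions.

```python
import random

# Approximate sampling weights from the GPT-3 paper (rounded percentages).
MIXTURE = {
    "Common Crawl": 0.60,
    "WebText2": 0.22,
    "Books1": 0.08,
    "Books2": 0.08,
    "Wikipedia": 0.03,
}

def sample_sources(n, seed=0):
    """Pick the source corpus for each of n training documents."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    weights = list(MIXTURE.values())
    # random.choices normalizes the weights, so they need not sum to 1.
    return rng.choices(names, weights=weights, k=n)

draws = sample_sources(10_000)
for name in MIXTURE:
    print(name, draws.count(name))
```

Because the weights are not proportional to corpus size, a small high-quality corpus can be seen several times during training while a very large one is seen less than once.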

3. Model and implementation details: The architecture of GPT-3 is the same as that of GPT-2. Some important differences from GPT-2 are:

  • GPT-3 has 96 layers, and each layer has 96 attention heads.
  • The word embedding size was increased from 1600 for GPT-2 to 12288 for GPT-3.
  • The context window size was increased from 1024 tokens for GPT-2 to 2048 tokens for GPT-3.
  • The Adam optimizer was used with β_1=0.9, β_2=0.95 and ε=10^(-8).
  • Alternating dense and locally banded sparse attention patterns were used.
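The layer counts and embedding sizes can be related to the headline parameter counts with a common rule of thumb: a transformer layer holds roughly 12·d² weights (4·d² for attention and 8·d² for a feed-forward layer that is four times wider), so the model holds about 12·L·d² parameters, ignoring embeddings, biases and layer normalization. This is an approximation, not the papers' exact accounting, and it uses an embedding size of 12288 for GPT-3 as reported in the GPT-3 paper.

```python
def approx_params(n_layers, d_model):
    # Rule of thumb: 4*d^2 (attention) + 8*d^2 (feed-forward) per layer.
    return 12 * n_layers * d_model ** 2

gpt2 = approx_params(n_layers=48, d_model=1600)
gpt3 = approx_params(n_layers=96, d_model=12288)

print(f"GPT-2: about {gpt2 / 1e9:.2f} billion")
print(f"GPT-3: about {gpt3 / 1e9:.1f} billion")
print(f"ratio: about {gpt3 / gpt2:.0f}x")
```

The estimates land close to the reported 1.5 billion and 175 billion parameters, and their ratio is in line with the roughly 100-fold increase mentioned earlier.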

4. Performance and Summary: GPT-3 was evaluated on a host of language modeling and NLP datasets. GPT-3 performed better than the state of the art on language modeling datasets such as LAMBADA and Penn Tree Bank in the few- or zero-shot setting. On other datasets it could not beat the state of the art, but it improved on the zero-shot state of the art. GPT-3 also performed reasonably well on NLP tasks such as closed-book question answering, schema resolution, translation, etc., often beating the state of the art or performing comparably to fine-tuned models. For most tasks, the model performed better in the few-shot setting than in the one-shot and zero-shot settings.

In addition to being evaluated on conventional NLP tasks, the model was also tested on synthetic tasks such as arithmetic addition, word unscrambling, news article generation, learning and using novel words, and so on. On these tasks, too, performance increased with the number of parameters, and the model performed better in the few-shot setting than in the one-shot and zero-shot settings.

5. Limitations and Broader Impacts: The paper discusses several weaknesses of the GPT-3 model and areas open for improvement. Let's summarize them here.

  • Although GPT-3 is capable of producing high-quality text, it sometimes loses coherence when formulating long sentences and repeats sequences of text over and over. Also, GPT-3 does not perform very well on tasks such as natural language inference (determining whether one sentence implies another), fill-in-the-blank tasks, some reading comprehension tasks, etc. The paper cites the unidirectionality of GPT models as a probable cause of these limitations and suggests training bidirectional models at this scale to overcome these problems.
  • Another limitation pointed out in the paper is GPT-3's generic language modeling objective, which weights each token equally and lacks the notion of task- or goal-oriented token prediction. To counter this, the paper suggests approaches such as augmenting the learning objective, using reinforcement learning to fine-tune models, adding other modalities, etc.
  • Other limitations of GPT-3 include complex and costly inference due to its heavy architecture, lower interpretability of the language and results generated by the model, and uncertainty about what helps the model achieve its few-shot learning behavior.
  • In addition to these limitations, GPT-3 carries a potential risk of misuse of its human-like text generation capability for phishing, spamming, spreading misinformation, or other fraudulent activities. Additionally, the text generated by GPT-3 carries the biases of the language it is trained on. Articles generated by GPT-3 may have gender, ethnic, racial or religious bias. Therefore, it is extremely important to use such models with care and to monitor the text they generate before using it.

Final remarks:

This article summarizes the journey of the OpenAI GPT models and their evolution across three papers. These models are undoubtedly very powerful language models and have revolutionized the field of natural language processing by performing a multitude of tasks using only instructions and a few examples. Although these models are not on par with humans in natural language understanding, they have certainly shown a way forward toward that goal.


Glossary:

  1. An auxiliary objective is an additional training objective or task that is learned alongside the primary objective to improve the performance of models by making them more general. This paper contains more details on this concept.
  2. Masking refers to removing words in a sentence, or replacing them with a dummy token, so that the model does not have access to those words at training time.
  3. Byte pair encoding is a data compression technique in which frequently occurring pairs of consecutive bytes are replaced with a byte that is not present in the data. To reconstruct the original data, a table containing the mapping of replaced bytes is used. This blog explains BPE in detail.
  4. Zero-shot learning or behavior refers to a model's ability to perform a task without having seen any example of that kind in the past. No gradient updates occur during zero-shot learning, and the model is expected to understand the task without looking at any examples.
  5. Zero-shot task transfer or meta-learning refers to the setting in which the model is presented with few to no examples to make it understand the task. The term zero-shot comes from the fact that no gradient updates are performed. The model is expected to understand the task based on the examples and instructions.
  6. Perplexity is the standard evaluation metric for language models. Perplexity is the inverse probability of the test set, normalized by the number of words in the test set. Language models with lower perplexity are considered better than those with higher perplexity. Read this blog for more explanation of perplexity.
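The byte pair encoding entry above can be illustrated with a minimal sketch of the merge procedure. It works on characters rather than raw bytes and merges the most frequent adjacent pair into a new symbol, which is the variant used to build subword vocabularies; the function names and the stopping rule (stop when no pair occurs at least twice) are assumptions for illustration.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every occurrence of the adjacent pair with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def learn_bpe(text, num_merges):
    """Greedily merge the most frequent adjacent pair, num_merges times."""
    symbols = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing left worth merging
        merges.append(pair)
        symbols = merge_pair(symbols, pair)
    return symbols, merges

text = "aaabdaaabac"
symbols, merges = learn_bpe(text, num_merges=3)
print(symbols)
print(merges)
```

The list of merges plays the role of the mapping table mentioned in the glossary: replaying it in order re-creates the same segmentation, and joining the symbols recovers the original text.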


References:

  1. Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I., 2018. Improving Language Understanding by Generative Pre-Training.
  2. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I., 2019. Language Models are Unsupervised Multitask Learners. OpenAI Blog, 1(8), p. 9.
  3. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., et al., 2020. Language Models are Few-Shot Learners. arXiv preprint arXiv:2005.14165.
  4. Rei, M., 2017. Semi-supervised Multitask Learning for Sequence Labeling. arXiv preprint arXiv:1704.07156.
  5. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I., 2017. Attention Is All You Need. In NIPS.

Note: For brevity, links to blogs have not been repeated in references.

Article information

Author: Msgr. Refugio Daniel

Last Updated: 03/06/2023
