GPT-J, the newest GPT model, is making the rounds in the NLP community and raising some questions along the way. This article answers the most basic one: what is GPT-J?
GPT-J-6B is an open source, autoregressive language model created by a group of researchers called EleutherAI. It's one of the most advanced alternatives to OpenAI's GPT-3 and performs well on a wide array of natural language tasks such as chat, summarization, and question answering, to name a few.
For a deeper dive, GPT-J is a transformer model trained using Ben Wang's Mesh Transformer JAX. "GPT" is short for generative pre-trained transformer, "J" distinguishes this model from other GPT models, and "6B" represents the 6 billion trainable parameters. Transformers are increasingly the model of choice for NLP problems, replacing recurrent neural network (RNN) models such as long short-term memory (LSTM). Because a transformer processes the tokens of a sequence in parallel rather than one at a time, its training parallelizes better, which allows training on larger datasets than was once possible.
The model consists of 28 layers with a model dimension of 4096, and a feedforward dimension of 16384. The model dimension is split into 16 heads, each with a dimension of 256. Rotary Position Embedding (RoPE) is applied to 64 dimensions of each head. The model is trained with a tokenization vocabulary of 50257, using the same set of BPEs as GPT-2/GPT-3.
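As a sanity check on these figures, here is a minimal Python sketch that stores them in a config object and estimates the weight count. `GPTJConfig` and `estimate_parameters` are illustrative names, not part of any GPT-J codebase. The estimate ignores biases and layer norms and assumes separate input-embedding and output-projection matrices, so treat it as rough arithmetic rather than the exact parameter count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GPTJConfig:
    """Hyperparameters quoted in the paragraph above."""
    n_layers: int = 28
    d_model: int = 4096
    d_ff: int = 16384
    n_heads: int = 16
    head_dim: int = 256
    rotary_dim: int = 64
    vocab_size: int = 50257


def estimate_parameters(cfg: GPTJConfig) -> int:
    """Rough weight count; biases and layer norms are ignored (assumption)."""
    attention = 4 * cfg.d_model * cfg.d_model   # Q, K, V, and output projections
    feedforward = 2 * cfg.d_model * cfg.d_ff    # up and down projections
    per_layer = attention + feedforward
    # Assumption: untied input embedding and output projection matrices.
    embeddings = 2 * cfg.vocab_size * cfg.d_model
    return cfg.n_layers * per_layer + embeddings


cfg = GPTJConfig()
heads_fill_model_dim = cfg.n_heads * cfg.head_dim == cfg.d_model
total = estimate_parameters(cfg)
print(heads_fill_model_dim)               # True: 16 heads x 256 = 4096
print(f"{total / 1e9:.2f}B parameters")   # 6.05B parameters
```

The 16 heads of 256 dimensions account for the full model dimension of 4096, and the rough total lands close to the 6 billion in the model's name.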
GPT-J was trained on the Pile, a large-scale curated dataset created by EleutherAI.
GPT-J was trained for 402 billion tokens over 383,500 steps on a TPU v3-256 pod. It was trained as an autoregressive language model, using cross-entropy loss to maximize the likelihood of predicting the next token correctly.
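To make the training objective concrete, the sketch below computes the cross-entropy loss for a single next-token prediction in plain Python. The logits and the four-token vocabulary are made up for illustration, and `next_token_cross_entropy` is a hypothetical helper. Actual training averages this quantity over every position in every batch.

```python
import math


def next_token_cross_entropy(logits, target_index):
    """Return -log(softmax(logits)[target_index]), computed stably."""
    peak = max(logits)  # subtract the max before exponentiating to avoid overflow
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target_index]


# Hypothetical scores over a toy vocabulary of 4 tokens.
logits = [2.0, 0.5, -1.0, 0.1]
loss_when_right = next_token_cross_entropy(logits, target_index=0)
loss_when_wrong = next_token_cross_entropy(logits, target_index=2)

# With no preference between tokens, the loss equals log(vocabulary size).
uniform_loss = next_token_cross_entropy([0.0] * 4, target_index=1)

print(loss_when_right < loss_when_wrong)  # True
```

The loss is small when the model assigns a high score to the token that actually comes next, and large otherwise. Minimizing it is the same as maximizing the likelihood of the correct next token.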
GPT-J learns an inner representation of the English language that can be used to extract features useful for downstream tasks. The model is best at generating text from a prompt, because its core function is to take a string of text and predict the next token. When prompting GPT-J, remember that the model will often return the statistically most likely next token.
GPT-J can perform various language processing tasks without any further training, including tasks it was never explicitly trained for. It can handle many use cases, such as language translation, code completion, chat, blog post writing, and more. Through fine-tuning (discussed later), GPT-J can be further specialized on any task to significantly increase performance.
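The loop below sketches what "predict the next token" means when repeated: pick the most likely token, append it, and predict again (greedy decoding). `TOY_MODEL` is an invented probability table that looks only at the previous token. A real model like GPT-J conditions on the whole prompt and often samples rather than always taking the top token, so this is an illustration of the idea and not GPT-J's actual decoding code.

```python
# Hypothetical next-token probabilities, keyed by the previous token only.
TOY_MODEL = {
    "the": {"cat": 0.6, "dog": 0.3, "<eos>": 0.1},
    "cat": {"sat": 0.7, "ran": 0.2, "<eos>": 0.1},
    "sat": {"down": 0.5, "<eos>": 0.4, "the": 0.1},
    "down": {"<eos>": 0.9, "the": 0.1},
}


def greedy_generate(prompt, max_new_tokens=10):
    """Append the most likely next token until <eos> or the length limit.

    `prompt` is a non-empty list of tokens.
    """
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        distribution = TOY_MODEL.get(tokens[-1])
        if distribution is None:
            break  # the toy table has no entry for this token
        next_token = max(distribution, key=distribution.get)
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens


generated = greedy_generate(["the"])
print(" ".join(generated))  # the cat sat down
```

Because each step takes the highest-probability continuation, the way a prompt is worded strongly shapes what comes back.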
Let's look at some example tasks:
Open ended conversations with an AI support agent.
Create question + answer structure for answering questions based on existing knowledge.
English to French
Translate English text into French.
Parse unstructured data
Create tables from long form text by specifying a structure and supplying some examples.
Translate natural language to SQL queries.
Python to natural language
Explain a piece of Python code in human understandable language.
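Tasks like these are usually posed to the model as a prompt containing an instruction, a few worked examples, and an open slot for the model to complete. Here is one possible way to assemble such a prompt for the translation task. The `build_few_shot_prompt` helper and the "English:" / "French:" label layout are illustrative assumptions, not a format GPT-J requires.

```python
def build_few_shot_prompt(instruction, examples, query,
                          input_label="English", output_label="French"):
    """Assemble an instruction, worked examples, and an open slot to complete."""
    parts = [instruction, ""]
    for source, target in examples:
        parts.append(f"{input_label}: {source}")
        parts.append(f"{output_label}: {target}")
        parts.append("")  # blank line between examples
    parts.append(f"{input_label}: {query}")
    parts.append(f"{output_label}:")  # left open for the model to continue
    return "\n".join(parts)


prompt = build_few_shot_prompt(
    "Translate English text into French.",
    [("Good morning.", "Bonjour."), ("Thank you very much.", "Merci beaucoup.")],
    "Where is the train station?",
)
print(prompt)
```

The same helper covers the other tasks by swapping the labels, for example "Question" and "SQL" for natural language to SQL.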
As you can tell, the standard GPT-J model adapts and performs well on a number of different NLP tasks. However, things get more interesting when you explore fine-tuning.
While the standard GPT-J model is proficient at performing many different tasks, the model's capabilities improve significantly when fine-tuned. Fine-tuning refers to the practice of further training GPT-J on a dataset for a specific task. While scaling parameters of transformer models consistently yields performance improvements, the contribution of additional examples of a specific task can greatly improve performance beyond what additional parameters can provide. Especially for use cases like classification, extractive question answering, and multiple choice, collecting a few hundred examples is often "worth" billions of parameters.
To see what fine-tuning looks like, here's a demo (2m 33s) on how to fine-tune GPT-J on Forefront. There are two variables in fine-tuning that, when handled well, can lead to GPT-J outperforming GPT-3 Davinci (175B parameters) on a variety of tasks. Those variables are the dataset and the training duration.
For a comprehensive tutorial on preparing a dataset to fine-tune GPT-J, check out our guide.
At a high level, the following best practices should be considered regardless of your task:
Let's look at some example datasets:
Classify customer support messages by topic.
Analyze sentiment for product reviews.
Generate blog ideas given a company's name and product description.
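To give a feel for what a fine-tuning dataset can look like, here is a small sketch that turns labeled support messages into prompt and completion pairs and serializes them as JSON Lines. The messages, the topic labels, and the prompt/completion layout are all assumptions made for illustration. The exact format Forefront expects is described in the dataset guide, not here.

```python
import json

# Hypothetical labeled examples for the support-topic classification task.
labeled_messages = [
    ("I was charged twice this month.", "billing"),
    ("The app crashes when I open settings.", "bug"),
    ("How do I reset my password?", "account"),
]


def to_training_record(message, topic):
    """Format one example as prompt + completion (an assumed layout)."""
    return {
        "prompt": f"Message: {message}\nTopic:",
        "completion": f" {topic}",
    }


records = [to_training_record(m, t) for m, t in labeled_messages]
# One JSON object per line; json.dumps escapes the newline inside each prompt.
jsonl = "\n".join(json.dumps(r) for r in records)
print(jsonl)
```

Keeping every example in the same layout lets the model learn that the text after "Topic:" should be one of the known labels.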
How long you should fine-tune depends largely on your task and the number of training examples in your dataset. For smaller datasets, fine-tuning for 5-10 minutes per 100 KB is a good place to start. For larger datasets, fine-tuning for 45-60 minutes per 10 MB is recommended. These are rough rules of thumb, and more complex tasks will require longer training durations.
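The two rules of thumb can be turned into quick arithmetic. The helper names below are hypothetical, and since the text does not say where "smaller" ends and "larger" begins, the sketch keeps the two rules as separate functions and leaves that choice to you.

```python
def small_dataset_minutes(size_kb):
    """5-10 minutes per 100 KB, the starting point for smaller datasets."""
    units = size_kb / 100
    return (5 * units, 10 * units)


def large_dataset_minutes(size_mb):
    """45-60 minutes per 10 MB, the recommendation for larger datasets."""
    units = size_mb / 10
    return (45 * units, 60 * units)


small_range = small_dataset_minutes(300)  # a 300 KB dataset
large_range = large_dataset_minutes(50)   # a 50 MB dataset
print(small_range)  # (15.0, 30.0)
print(large_range)  # (225.0, 300.0)
```

So a 300 KB dataset suggests roughly 15 to 30 minutes, and a 50 MB dataset roughly 4 to 5 hours, before adjusting for task complexity.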
GPT-J is notoriously difficult and expensive to deploy in production. When considering deployment options, there are two things to keep in mind: cost and response speed. The most common hardware for deploying GPT-J is a T4, V100, or TPU, all of which come with less-than-ideal tradeoffs. At Forefront, we experienced these tradeoffs ourselves and started experimenting to see what we could do about them. Several low-level machine code optimizations later, we built a one-click GPT-J deployment, offering the best cost, performance, and throughput available. Here's a quadrant to compare the different deployment methods by cost and response speed:
Large transformer language models like GPT-J are increasingly being used for a variety of tasks, and further experimentation will inevitably uncover more use cases where these models prove effective. At Forefront, we believe a simple experience for fine-tuning and deploying GPT-J can help companies enhance their products with minimal work. To start using GPT-J, get in touch with our team.