Preparing a Dataset to Fine-tune GPT-J

Forefront Team
September 23, 2021

Fine-tuning is a powerful technique to create a new GPT-J model that is specific to your use case. When done correctly, fine-tuning GPT-J can achieve performance that exceeds significantly larger, general models like OpenAI's GPT-3 Davinci.

To fine-tune GPT-J on Forefront, all you need is a set of training examples formatted in a single text file, with each training example generally consisting of a single input and its associated output. Fine-tuning can solve a variety of problems, and the optimal way to format your dataset will depend on your specific use case. Below, we'll list the most common use cases for fine-tuning GPT-J, corresponding guidelines, and example text files.

Before diving into the most common use cases, there are a few best practices that should be followed regardless of the specific use case:

  • Your data must be in a single text file.
  • You should provide at least one hundred high-quality examples, ideally vetted by a human knowledgeable in the given task.
  • You should use some kind of separator at the end of each prompt + completion in your dataset to make it clear to the model where each training example begins and ends. A simple separator which works well in almost all cases is " <|endoftext|> ".
  • Ensure that the prompt + completion doesn't exceed 2048 tokens, including the separator.


Classification

Classification is the process of assigning text to one of a predefined group of categories. In classification problems, each input in the prompt should be classified into one of your predefined classes.

Choose classes that map to a single token. At inference time, set the parameter length=1, since you only need the first token for classification.

Let's say you'd like to organize your customer support messages by topic. You may want to fine-tune GPT-J to filter incoming support messages so they can be routed appropriately.

The dataset might look like the following:
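For instance, with hypothetical categories like Billing, Login, and Shipping, each entry could pair the instructions and message with its category:

```text
Classify the customer support message into one of the following categories: Billing, Login, Shipping.

Message: I was charged twice for my subscription this month.
Category: Billing <|endoftext|> Classify the customer support message into one of the following categories: Billing, Login, Shipping.

Message: I can't reset my password no matter what I try.
Category: Login <|endoftext|>
```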

In the example above, we provided instructions for the model followed by the input containing the customer support message and the output to classify the message to the corresponding category. As a separator we used " <|endoftext|> " which clearly separated the different examples. The advantage of using " <|endoftext|> " as the separator is that the model natively uses it to indicate the end of a completion. It does not need to be set as a stop sequence either because the model automatically stops a completion before outputting " <|endoftext|> ".

Now we can query our model by making a Completion request.
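Since Forefront's exact request format isn't covered here, the sketch below only assembles a hypothetical JSON payload; the endpoint, field names (prompt, length, temperature), and categories are assumptions, not the documented API.

```python
import json

# Placeholder endpoint -- substitute your deployed model's URL and API key.
API_URL = "https://shared-api.forefront.link/completions"

def build_classification_request(message: str) -> str:
    """Build a Completion request body for the support-message classifier.

    length=1 because each category maps to a single token, so only the
    first generated token is needed.
    """
    prompt = (
        "Classify the customer support message into one of the following "
        "categories: Billing, Login, Shipping.\n\n"
        f"Message: {message}\nCategory:"
    )
    return json.dumps({"prompt": prompt, "length": 1, "temperature": 0.0})

body = build_classification_request("I was charged twice this month.")
print(body)
# The body could then be POSTed to API_URL with any HTTP client,
# e.g. urllib.request from the standard library.
```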

Sentiment Analysis

Sentiment Analysis is the act of identifying and extracting opinions within a given text across blogs, reviews, social media, forums, news, etc. Let's say you'd like to gauge the degree to which a particular product review is positive or negative.

The dataset might look like the following:
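For instance, with invented reviews and single-token sentiment labels:

```text
Review: The headphones broke after two days and support never replied.
Sentiment: Negative <|endoftext|> Review: Absolutely love this blender. Quiet, fast, and easy to clean.
Sentiment: Positive <|endoftext|>
```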

Now we can query our model by making a Completion request.
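As with classification, the sketch below just assembles a hypothetical payload; the field names are assumptions rather than Forefront's documented API.

```python
import json

def build_sentiment_request(review: str) -> str:
    """Build a Completion request body for the review sentiment model."""
    prompt = f"Review: {review}\nSentiment:"
    # length=1 works if each sentiment label maps to a single token.
    return json.dumps({"prompt": prompt, "length": 1, "temperature": 0.0})

body = build_sentiment_request("The blender is quiet, fast, and easy to clean.")
print(body)
```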


Chatbot

The purpose of a chatbot is to simulate human-like conversations with users via text message or chat. You could fine-tune GPT-J to imitate a specific person, or to respond in certain ways given the context of a conversation, as in a customer support situation. First, let's look at getting GPT-J to imitate Elon Musk.

The dataset might look like the following:
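For instance (the exchanges below are invented for illustration, not real quotes):

```text
User: What do you think about the future of space travel?
Elon Musk: Making life multiplanetary is one of the most important things we can do. The goal is to get the cost low enough that a self-sustaining city on Mars is possible.
User: How do you stay productive?
Elon Musk: I try to spend most of my time on engineering and design. Meetings should be short, and if you're not adding value to a meeting, just walk out.
```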

Here we purposely left out separators between examples. When imitating a specific person, you can instead compile long-form conversations, since the goal is to capture a wide variety of responses in an open-ended format.

You could query the model by making a Completion request.
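A sketch of the payload for that request, with the stop sequences described below; field names and lengths are placeholder assumptions.

```python
import json

def build_chat_request(history: list[str], user_message: str) -> str:
    """Build a Completion request body for the persona chatbot.

    "User:" and "Elon Musk:" are passed as stop sequences so generation
    halts as soon as the model starts a new conversational turn.
    """
    prompt = "\n".join(history + [f"User: {user_message}", "Elon Musk:"])
    return json.dumps({
        "prompt": prompt,
        "length": 100,
        "stop": ["User:", "Elon Musk:"],
    })

body = build_chat_request([], "What do you think about the future of space travel?")
print(body)
```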

Notice that we provide "User:" and "Elon Musk:" as stop sequences. It's important to anticipate how the model may continue generating beyond the desired output and use stop sequences to stop it. Given the pattern of the dataset, where a user's message is always followed by Elon Musk's reply, it makes sense to use "User:" and "Elon Musk:" as stop sequences.

A similar but different chatbot use case would be that of a customer support bot. Here, we'll go back to providing specific examples with separators so the model can identify how to respond in different situations. Depending on your customer support needs, this use case could require a few thousand examples, as it will likely deal with different types of requests and customer issues.

The dataset might look like the following:
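For instance, using "#####" as the separator between invented example exchanges:

```text
Customer: My order arrived damaged. What should I do?
Agent: Sorry to hear that! Please reply with your order number and a photo of the damage, and we'll send a replacement right away.
#####
Customer: How do I change the email on my account?
Agent: You can update your email address under Settings > Account. Let me know if you run into any trouble.
#####
```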

An optional improvement that could be made to the above dataset would be to provide more context and exchanges leading up to the resolution for each example. However, this depends on the role you're hoping to fill with your customer support chatbot.

Now we can query our model by making a Completion request.
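A payload sketch for the support bot, again with placeholder field names; the stop sequences match the dataset's pattern.

```python
import json

def build_support_request(customer_message: str) -> str:
    """Build a Completion request body for the support chatbot.

    "Customer:" and "#####" are stop sequences so the model stops after
    the agent's reply instead of inventing the customer's next message.
    """
    prompt = f"Customer: {customer_message}\nAgent:"
    return json.dumps({
        "prompt": prompt,
        "length": 150,
        "stop": ["Customer:", "#####"],
    })

body = build_support_request("My order arrived damaged. What should I do?")
print(body)
```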

As with the previous example, we're using "Customer:" and "#####" as stop sequences so the model stops after providing the relevant completion.

Entity Extraction

The main purpose of entity extraction is to pull information from given text to understand the subject, theme, or other pieces of information like names, places, etc. Let's say you'd like to extract names from provided text.

The dataset might look like the following:
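For instance, with invented sentences and names:

```text
Text: After the meeting, Sarah Chen and David Okafor flew to Berlin to meet their supplier.
Names: Sarah Chen, David Okafor <|endoftext|> Text: The contract was signed by Maria Lopez on behalf of the firm.
Names: Maria Lopez <|endoftext|>
```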

Now we can query our model by making a Completion request.
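A payload sketch for the extraction model; field names and the length value are placeholder assumptions.

```python
import json

def build_extraction_request(text: str) -> str:
    """Build a Completion request body for the name-extraction model."""
    prompt = f"Text: {text}\nNames:"
    # No explicit stop sequence is needed for <|endoftext|>: the model
    # stops on its own before emitting it.
    return json.dumps({"prompt": prompt, "length": 30, "temperature": 0.0})

body = build_extraction_request("Sarah Chen and David Okafor flew to Berlin.")
print(body)
```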

Idea Generation

A common use case is to use GPT-J to generate ideas provided specific information. Whether it's copy for ads or websites, blog ideas, or products, generating ideas is a useful task for GPT-J. Let's look at the aforementioned use case of generating blog ideas.

The dataset might look like the following:
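For instance (topics and ideas invented for illustration):

```text
Topic: Remote work
Blog ideas:
1. How to run effective meetings across time zones
2. Setting boundaries when your home is your office
3. The best tools for asynchronous collaboration <|endoftext|> Topic: Personal finance
Blog ideas:
1. Budgeting methods that actually stick
2. How to start investing with a small paycheck
3. Common mistakes when building an emergency fund <|endoftext|>
```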

Now we can query our model by making a Completion request.
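A payload sketch for idea generation; field names are placeholder assumptions, and a higher temperature is used to encourage more varied ideas.

```python
import json

def build_idea_request(topic: str) -> str:
    """Build a Completion request body for the blog-idea model."""
    prompt = f"Topic: {topic}\nBlog ideas:"
    # Generation tasks benefit from a higher temperature than classification.
    return json.dumps({"prompt": prompt, "length": 80, "temperature": 0.8})

body = build_idea_request("Remote work")
print(body)
```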

Following the above examples for preparing a dataset for each use case should lead to well-performing fine-tuned GPT-J models when you have a sufficient amount of data (a dataset larger than 100MB). For datasets smaller than 100MB, it is recommended to also provide explicit instructions with each example, like the following dataset for blog idea generation.
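For instance, with a hypothetical instruction line prepended to each entry:

```text
Generate blog post ideas for the given topic.

Topic: Remote work
Blog ideas:
1. How to run effective meetings across time zones
2. Setting boundaries when your home is your office
3. The best tools for asynchronous collaboration <|endoftext|>
```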

Start fine-tuning

Ready to get started?

Start fine-tuning and deploying language models or explore Forefront Solutions.
