Introducing GPT-NeoX

The Forefront Team
February 7, 2022
Image by EleutherAI

A few days ago, EleutherAI announced their latest open source language model, GPT-NeoX-20B. Today, we’re excited to announce that GPT-NeoX is live on the Forefront platform, and the model looks to outperform any previous open source language model on virtually any natural language processing or understanding task.

We are bringing the same relentless focus on cost efficiency, throughput, and response speed that we brought to GPT-J. Today, you can host GPT-NeoX on our flat-rate dedicated GPUs at 2x better cost efficiency than any other platform.

The full model weights for GPT-NeoX will be downloadable for free from The Eye starting February 9, under a permissive Apache 2.0 license. Until then, you can use the model on the Forefront platform. We look forward to seeing all the ways our customers use GPT-NeoX to build world-changing applications and solve difficult NLP problems.

Let's take a more technical look at the model.

What is GPT-NeoX?

GPT-NeoX-20B is a transformer model trained using EleutherAI’s fork of Microsoft’s DeepSpeed, which they have named "DeeperSpeed". "GPT" is short for generative pre-trained transformer, "NeoX" distinguishes this model from its predecessors, GPT-Neo and GPT-J, and "20B" refers to its 20 billion trainable parameters. Training the 20B parameter model combines data, pipeline, and model parallelism ("3D parallelism") to maximize performance and training speed on a fixed amount of hardware. Transformers have increasingly become the model of choice for NLP problems, replacing recurrent neural network (RNN) models such as long short-term memory (LSTM), and GPT-NeoX is the newest and largest open source model of this kind.
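To make "3D parallelism" a little more concrete, here is a minimal sketch of how a pool of GPUs can be arranged along three axes: data, pipeline, and model (tensor) parallelism. The 96-GPU figure comes from the training procedure described later in this post. The split into 2-way tensor and 4-way pipeline parallelism is a hypothetical choice made only for illustration, and the rank layout is an assumption, not a description of how DeeperSpeed assigns ranks.

```python
from collections import defaultdict

WORLD_SIZE = 96          # total GPUs, as quoted in the training procedure section
TENSOR_PARALLEL = 2      # hypothetical: GPUs that share each layer's weights
PIPELINE_PARALLEL = 4    # hypothetical: consecutive groups of layers (stages)
DATA_PARALLEL = WORLD_SIZE // (TENSOR_PARALLEL * PIPELINE_PARALLEL)


def rank_to_coords(rank):
    """Map a flat GPU rank to (data, pipeline, tensor) coordinates.

    Assumed layout: the tensor axis varies fastest, then pipeline, then data.
    """
    tp = rank % TENSOR_PARALLEL
    pp = (rank // TENSOR_PARALLEL) % PIPELINE_PARALLEL
    dp = rank // (TENSOR_PARALLEL * PIPELINE_PARALLEL)
    return dp, pp, tp


coords = [rank_to_coords(rank) for rank in range(WORLD_SIZE)]

# Group ranks by data-parallel index: each group holds one full model copy.
replicas = defaultdict(list)
for rank, (dp, pp, tp) in enumerate(coords):
    replicas[dp].append(rank)

print(f"{DATA_PARALLEL} model replicas, "
      f"{len(replicas[0])} GPUs per replica")
```

Under these assumed degrees, each replica of the model spans 8 GPUs (4 pipeline stages, each split across 2 GPUs), and 12 such replicas process different batches of data in parallel.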

The model consists of 44 layers with a model dimension of 6144 and a feedforward dimension of 24576. The model dimension is split into 64 heads, each with a size of 96. Rotary Position Embedding (RoPE) is applied to 24 dimensions of each head. The model is trained with a tokenization vocabulary of roughly 50,000 tokens. Unlike previous models, GPT-NeoX uses a tokenizer that was trained on the Pile, with added special tokens for runs of multiple whitespace characters so that code is tokenized more efficiently.
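The figures above are consistent with one another, and a quick back-of-the-envelope calculation shows where the "20B" comes from. The sketch below assumes a standard transformer block (four attention projection matrices plus a two-layer feedforward network), ignores biases and layer norms, and assumes separate input and output embedding matrices. The vocabulary size of 50,000 is the rounded figure from the text, so the total is an estimate rather than an exact count.

```python
# Figures quoted in the text above.
n_layers = 44
d_model = 6144
d_ff = 24576
n_heads = 64
rotary_dims = 24
vocab_size = 50_000  # "roughly 50,000"; the exact size is not given here

# Each head gets an equal slice of the model dimension.
head_dim = d_model // n_heads

# Fraction of each head's dimensions that receive rotary embeddings.
rotary_fraction = rotary_dims / head_dim

# Assumption: Q, K, V, and output projections, each d_model x d_model.
attn_params = 4 * d_model * d_model
# Assumption: two feedforward matrices, d_model x d_ff and d_ff x d_model.
mlp_params = 2 * d_model * d_ff
block_params = attn_params + mlp_params

# Assumption: untied input and output embeddings.
embed_params = 2 * vocab_size * d_model

total_params = n_layers * block_params + embed_params

print(f"head size: {head_dim}")
print(f"rotary fraction per head: {rotary_fraction:.0%}")
print(f"estimated parameters: {total_params / 1e9:.1f}B")
```

Under these assumptions the estimate lands at about 20.5 billion parameters, with the transformer blocks alone accounting for roughly 19.9 billion.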

Training data

GPT-NeoX was trained on the Pile, a large-scale curated dataset created by EleutherAI.

Training procedure

GPT-NeoX was trained as a causal, autoregressive language model for 3 months on 96 NVIDIA A100s interconnected by NVSwitch, using cross-entropy loss to maximize the likelihood of predicting the next token correctly.
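As a rough illustration of the objective, the sketch below computes a next-token cross-entropy loss on a toy sequence in plain Python. The three-token vocabulary, the logits, and the function names are all made up for the example; real training operates on batches of tensors on the GPU, not Python lists.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_cross_entropy(logits_per_position, token_ids):
    """Mean cross-entropy where the logits at position t predict token t + 1."""
    losses = []
    for t in range(len(token_ids) - 1):
        probs = softmax(logits_per_position[t])
        target = token_ids[t + 1]
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


# Toy sequence over a hypothetical 3-token vocabulary.
token_ids = [0, 2, 1]

# A model with no preference: every next token is equally likely.
uniform_logits = [[0.0, 0.0, 0.0] for _ in token_ids]
loss_uniform = next_token_cross_entropy(uniform_logits, token_ids)

# A model that strongly favors the correct next token at each step.
# The last position has no following token, so its logits are unused.
confident_logits = [[0.0, 0.0, 5.0], [0.0, 5.0, 0.0], [0.0, 0.0, 0.0]]
loss_confident = next_token_cross_entropy(confident_logits, token_ids)

print(f"uniform model loss:   {loss_uniform:.4f}")
print(f"confident model loss: {loss_confident:.4f}")
```

The uniform model's loss equals the natural log of the vocabulary size, and the loss falls as more probability is placed on the token that actually comes next, which is what minimizing cross-entropy rewards.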

Intended Use

GPT-NeoX learns an inner representation of the English language that can be used to extract features useful for downstream tasks. The model is best at generating text from a prompt, because its core function is to take a string of text and predict the next token. When prompting GPT-NeoX, keep in mind that the model will often return the statistically most likely next token.
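The idea of repeatedly picking the most likely next token can be sketched with a toy example. The probability table below is entirely hypothetical, and it conditions only on the previous token, whereas a real language model conditions on the whole prompt. The loop shows greedy selection only; sampling parameters such as temperature would change which token is returned.

```python
# Hypothetical next-token probabilities, keyed by the previous token.
toy_model = {
    "the": {"cat": 0.5, "dog": 0.3, "<end>": 0.2},
    "cat": {"sat": 0.7, "<end>": 0.3},
    "dog": {"ran": 0.6, "<end>": 0.4},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
}


def greedy_complete(prompt_token, max_new_tokens=5):
    """Extend the prompt by always taking the most probable next token."""
    tokens = [prompt_token]
    for _ in range(max_new_tokens):
        distribution = toy_model.get(tokens[-1])
        if distribution is None:
            break  # the toy table has no entry for this token
        next_token = max(distribution, key=distribution.get)
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens


completion = greedy_complete("the")
print(" ".join(completion))
```

Because "cat" is more likely than "dog" after "the" in this table, the greedy completion never explores the "dog" branch, which is why prompt wording matters so much: it shifts which continuation is most likely.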

Model comparisons

See how GPT-NeoX compares on task accuracy, factual accuracy, and real-world use cases.

Task accuracy

While Davinci still outperforms it thanks to a parameter count roughly 10x larger, GPT-NeoX holds up well and outpaces other models on most standard NLP benchmarks.

Factual accuracy

The model excels at knowledge-based, factual tasks, given that the Pile contains a large amount of code, scientific papers, and medical papers.

Completion comparisons

The following comparisons between GPT-J and GPT-NeoX use the same prompts and parameters. Completions are provided by the general model weights for each model. Keep in mind that fine-tuning will achieve significantly better performance.

Text to command

Translate text into programmatic commands.

Product description rewriting

Generate a new product description based on a given tone.

Named Entity Recognition

Locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, and locations.

Content generation

Generate structured HTML blog content.


Summarization

Summarize complex text into a few words.

Product review generation

Generate a product review based on a product description.

Code generation

Create code based on text instructions.

Fine-tuning GPT-NeoX

A unique aspect of GPT-NeoX is that it fills the gap between GPT-3 Curie and Davinci, pushing the limit of how large a language model can be while still being reasonable to fine-tune without incurring significant training or hosting costs. We’ve seen a majority of our customers get the most value out of fine-tuning GPT-J, and we expect GPT-NeoX to be no different. For this reason, we’re enabling customers to fine-tune GPT-NeoX models for free. Stay tuned for a blog post comparing fine-tuned GPT-NeoX models with the standard GPT-NeoX model.

What’s next

Our team is currently working to release GPT-NeoX fine-tuning within 24-48 hours. If you already have access to the Forefront platform, you can start using GPT-NeoX now; if not, you can request access. Please contact our team with specific questions or for help related to your use case.
