A few days ago, EleutherAI announced their latest open source language model, GPT-NeoX-20B. Today, we’re excited to announce that GPT-NeoX is live on the Forefront platform. The model looks set to outperform every previous open source language model on virtually every natural language processing and understanding task.
We are bringing the same relentless focus on cost efficiency, throughput, and response speed to GPT-NeoX that we brought to GPT-J. Today, you can host GPT-NeoX on our flat-rate dedicated GPUs at 2x better cost efficiency than any other platform.
The full model weights for GPT-NeoX will be downloadable for free from The Eye starting February 9, under a permissive Apache 2.0 license. Until then, you can use the model on the Forefront platform. We look forward to seeing all the ways our customers use GPT-NeoX to build world-changing applications and solve difficult NLP problems.
Let's take a more technical look at the model.
GPT-NeoX-20B is a transformer model trained using EleutherAI’s fork of Microsoft’s DeepSpeed, which they have named “DeeperSpeed”. "GPT" is short for generative pre-trained transformer, "NeoX" distinguishes this model from its predecessors, GPT-Neo and GPT-J, and "20B" represents the 20 billion trainable parameters. Training the 20B parameter model combines data, pipeline, and model parallelism ("3D parallelism") to maximize performance and training speed on a fixed amount of hardware. Transformers have increasingly become the model of choice for NLP problems, replacing recurrent neural network (RNN) models such as long short-term memory (LSTM), and GPT-NeoX is the newest and largest open source language model of this kind.
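To make the idea of 3D parallelism concrete, here is a minimal sketch of how a flat pool of devices can be arranged along data, pipeline, and model axes. The three degrees of parallelism below (12, 4, and 2) and the rank ordering are hypothetical values chosen for illustration; they are not taken from EleutherAI's configuration, and the helper names are our own.

```python
# Hypothetical degrees of parallelism (assumptions, not the real training config).
# Their product must equal the total number of devices.
DATA_PARALLEL = 12      # number of model replicas, each seeing different batches
PIPELINE_PARALLEL = 4   # number of consecutive layer groups (stages) per replica
MODEL_PARALLEL = 2      # number of shards each layer's weights are split into
WORLD_SIZE = DATA_PARALLEL * PIPELINE_PARALLEL * MODEL_PARALLEL


def rank_to_coords(rank):
    """Map a flat device rank to (data, pipeline, model) coordinates."""
    if not 0 <= rank < WORLD_SIZE:
        raise ValueError(f"rank must be in [0, {WORLD_SIZE}), got {rank}")
    data, rest = divmod(rank, PIPELINE_PARALLEL * MODEL_PARALLEL)
    pipe, model = divmod(rest, MODEL_PARALLEL)
    return data, pipe, model


def coords_to_rank(data, pipe, model):
    """Inverse of rank_to_coords."""
    return (data * PIPELINE_PARALLEL + pipe) * MODEL_PARALLEL + model


if __name__ == "__main__":
    for rank in (0, 1, 2, 8, WORLD_SIZE - 1):
        print(rank, rank_to_coords(rank))
```

In this layout, devices that share the same data and pipeline coordinates hold shards of the same layers, while devices that differ only in the data coordinate hold identical weights and process different batches.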
The model consists of 44 layers with a model dimension of 6144 and a feedforward dimension of 24576. The model dimension is split into 64 heads, each with a size of 96. Rotary Position Embedding (RoPE) is applied to 24 dimensions of each head. The model is trained with a tokenization vocabulary of roughly 50,000 tokens. Unlike previous models, GPT-NeoX uses a tokenizer that was trained on the Pile, with added special tokens, such as runs of multiple whitespace characters, so that code is tokenized more efficiently.
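As a sanity check on the "20B" in the name, the dimensions above can be turned into a rough parameter count. This is a back-of-the-envelope sketch, not the exact figure: it ignores biases and layer norms, takes the vocabulary as exactly 50,000, and assumes separate input and output embedding matrices.

```python
N_LAYERS = 44
D_MODEL = 6144
D_FF = 24576
N_HEADS = 64
HEAD_DIM = 96
ROTARY_DIMS = 24
VOCAB = 50_000  # the text says "roughly 50,000"; the exact value is an assumption

# The heads partition the model dimension exactly.
assert N_HEADS * HEAD_DIM == D_MODEL

# Attention: query, key, value, and output projections, each D_MODEL x D_MODEL.
attention = 4 * D_MODEL * D_MODEL
# Feedforward: one projection up to D_FF and one back down to D_MODEL.
feedforward = 2 * D_MODEL * D_FF
per_layer = attention + feedforward

# Assumption: input and output embeddings are separate (untied) matrices.
embeddings = 2 * VOCAB * D_MODEL

total = N_LAYERS * per_layer + embeddings

print(f"per layer:  {per_layer:,}")
print(f"all layers: {N_LAYERS * per_layer:,}")
print(f"total:      {total / 1e9:.2f}B")
print(f"rotary fraction per head: {ROTARY_DIMS / HEAD_DIM:.2f}")
```

Under these assumptions the estimate lands a little above 20 billion, with the transformer layers accounting for almost all of it. The last line also shows that RoPE covers a quarter of each head's dimensions.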
GPT-NeoX was trained on the Pile, a large-scale curated dataset created by EleutherAI.
GPT-NeoX was trained as a causal, autoregressive language model for 3 months on 96 NVIDIA A100s interconnected by NVSwitch, using cross-entropy loss to maximize the likelihood of predicting the next token correctly.
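The training objective can be illustrated with a few lines of plain Python. This is a toy sketch of next-token cross-entropy, not EleutherAI's training code: the function names, the three-token vocabulary, and the logits are all made up for the example.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_cross_entropy(logits_per_position, target_ids):
    """Mean negative log-probability assigned to the token that actually came next."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    targets = [2, 0]  # hypothetical ids of the true next tokens

    # A model with no preference among 3 tokens scores ln(3) per position.
    unsure = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    # A model that puts most of its weight on the right token scores lower.
    confident = [[0.0, 0.0, 5.0], [5.0, 0.0, 0.0]]

    print(next_token_cross_entropy(unsure, targets))
    print(next_token_cross_entropy(confident, targets))
```

Minimizing this loss is the same as maximizing the probability the model assigns to the correct next token at every position.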
GPT-NeoX learns an inner representation of the English language that can be used to extract features useful for downstream tasks. The model is best at generating text from a prompt, because its core function is to take a string of text and predict the next token. When prompting GPT-NeoX, keep in mind that the model will often return the statistically most likely next token.
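A toy example shows what "most likely next token" means in practice. The probability table below is invented for illustration, and unlike a real language model it looks only at the previous token rather than the whole prompt. The completion loop is a sketch of greedy decoding, which always picks the top-scoring token.

```python
# Hypothetical next-token probabilities, keyed by the previous token.
TOY_MODEL = {
    "the": {"cat": 0.6, "dog": 0.3, "<end>": 0.1},
    "cat": {"sat": 0.7, "ran": 0.2, "<end>": 0.1},
    "dog": {"ran": 0.8, "<end>": 0.2},
    "sat": {"<end>": 0.9, "the": 0.1},
    "ran": {"<end>": 1.0},
}


def greedy_complete(prompt_token, max_tokens=5, stop="<end>"):
    """Repeatedly append the single most probable next token."""
    tokens = [prompt_token]
    for _ in range(max_tokens):
        distribution = TOY_MODEL.get(tokens[-1])
        if distribution is None:
            break  # the toy table has no entry for this token
        next_token = max(distribution, key=distribution.get)
        if next_token == stop:
            break
        tokens.append(next_token)
    return tokens


if __name__ == "__main__":
    print(greedy_complete("the"))
    print(greedy_complete("dog"))
```

Starting from "the", the loop always follows "cat" and then "sat", even though "dog" was a plausible continuation. This is why the wording of a prompt matters: it determines which continuation is the most probable one.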
See how GPT-NeoX compares on task accuracy, factual accuracy, and real-world use cases.
While Davinci still outperforms it thanks to a roughly 10x larger parameter count, GPT-NeoX holds up well and outpaces the other models on most standard NLP benchmarks.
The model excels at knowledge-based, factual tasks because the Pile contains a large amount of code, scientific papers, and medical papers.
The following comparisons between GPT-J and GPT-NeoX use the same prompts and parameters. Completions are provided by the general model weights for each model. Keep in mind that fine-tuning will achieve significantly better performance.
Text to command
Translate text into programmatic commands.
Product description rewriting
Generate a new product description based on a given tone.
Named Entity Recognition
Locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations.
Generate structured HTML blog content.
Summarize complex text into a few words.
Product review generation
Generate a product review based on a product description.
Create code based on text instructions.
A unique aspect of GPT-NeoX is that it fills the gap between GPT-3 Curie and Davinci, pushing the edge of how large a language model can be while remaining reasonable to fine-tune without incurring significant training or hosting costs. We’ve seen a majority of our customers get the most value out of fine-tuning GPT-J, and we expect GPT-NeoX to be no different. For this reason, we’re enabling customers to fine-tune GPT-NeoX models for free. Stay tuned for a blog post comparing fine-tuned GPT-NeoX models with the standard GPT-NeoX model.
Start fine-tuning and deploying language models or explore Forefront Solutions.