We could run out of data to train AI language programs

Researchers may have to get creative to make training data stretch further.

Tammy Xu

November 24, 2022

Stephanie Arnett/MITTR

Large language models are one of the hottest areas of AI research right now, with companies racing to release programs like GPT-3 that can write impressively coherent articles and even computer code. But there’s a problem looming on the horizon, according to a team of AI forecasters: we might run out of data to train them on.

Language models are trained using texts from sources like Wikipedia, news articles, scientific papers, and books. In recent years, the trend has been to train these models on more and more data in the hope that it’ll make them more accurate and versatile.

The trouble is, the types of data typically used for training language models may be used up in the near future, as early as 2026, according to a paper by researchers from Epoch, an AI research and forecasting organization. The paper has yet to be peer reviewed. The shortage arises because, as researchers build more powerful models with greater capabilities, they have to find ever more texts to train them on. Large language model researchers are increasingly concerned that they are going to run out of this sort of data, says Teven Le Scao, a researcher at AI company Hugging Face, who was not involved in Epoch’s work.

The issue stems partly from the fact that language AI researchers filter the data they use to train models into two categories: high quality and low quality. The line between the two categories can be fuzzy, says Pablo Villalobos, a staff researcher at Epoch and the lead author of the paper, but text from the former is viewed as better-written and is often produced by professional writers.

Data in the low-quality category consists of texts like social media posts or comments on websites like 4chan, and these examples greatly outnumber those considered to be high quality. Researchers typically train models only on data that falls into the high-quality category because that is the type of language they want the models to reproduce. This approach has produced some impressive results for large language models such as GPT-3.

One way to overcome these data constraints would be to reassess what’s defined as “low” and “high” quality, according to Swabha Swayamdipta, a University of Southern California machine learning professor who specializes in data-set quality. If data shortages push AI researchers to incorporate more diverse data sets into the training process, it would be a “net positive” for language models, Swayamdipta says.

Researchers may also find ways to extend the life of data used for training language models. Currently, these models are trained on the same data just once, owing to performance and cost constraints. But it may be possible to train a model several times using the same data, says Swayamdipta.

Some researchers believe bigger may not mean better when it comes to language models anyway. Percy Liang, a computer science professor at Stanford University, says there’s evidence that making models more efficient, rather than just increasing their size, may improve their ability. “We’ve seen how smaller models that are trained on higher-quality data can outperform larger models trained on lower-quality data,” he explains.
