AI Model Decodes Genetic Mutations to Trace Evolutionary History

Evolution did its thing. Now we can ask when.
Kern describes how the AI tool enables researchers to trace when insecticide resistance genes emerged in malaria mosquitoes.

At the University of Oregon, computational biologists have taught a machine to read the oldest text humanity has never written — the genetic record of life's branching history. By adapting the architecture behind large language models to scan DNA mutation patterns, they have compressed what once took days of calculation into minutes, without sacrificing accuracy. It is a quiet but consequential crossing: the tools built to predict the next word in a sentence now help us understand when two lineages last shared a common ancestor, with immediate stakes for diseases like malaria that still claim hundreds of thousands of lives each year.

  • Classical methods for tracing evolutionary ancestry are mathematically sound but punishingly slow — a single mosquito chromosome can demand hours or days of computation, creating a bottleneck that limits the scale of genetic research.
  • Insecticide-resistant malaria mosquitoes are spreading globally, and scientists urgently need faster tools to pinpoint when and how resistance genes emerged in order to redesign control strategies.
  • University of Oregon researchers retrained GPT-2, an earlier model in the family that later produced ChatGPT, on simulations of genetic evolution across bacteria, rodents, mosquitoes, and primates, teaching it to read mutation patterns the way it once learned to read words.
  • The model matches the accuracy of gold-standard statistical methods while reducing processing time from hours to minutes and outperforming classical approaches on the incomplete datasets that real-world genetic research routinely produces.
  • The team is now moving to reconstruct full genealogical trees across multiple lineages simultaneously, positioning machine learning as a transformative force in a field long governed by classical mathematics.

At the University of Oregon, a team of computational biologists has borrowed an insight from the world of chatbots and applied it to something far older: the language of evolution. They built an AI model that reads genetic code the way ChatGPT reads English — scanning DNA mutation patterns to determine when different organisms last shared a common ancestor. Published in April in the Proceedings of the National Academy of Sciences, it is the first language model designed specifically for population genetics.

The logic is elegant. Mutations accumulate over time and leave a trail: stretches of DNA with many mutations point to distant common ancestors, while those with few suggest recent shared lineage. Traditional statistical methods translate these patterns into evolutionary history with mathematical rigor, but they are slow — processing a single mosquito chromosome can take hours or days. Lead author Kevin Korfmann recognized the bottleneck and, with computational biologist Andrew Kern, adapted GPT-2 by training it not on text but on simulations of genetic evolution across bacteria, rodents, mosquitoes, and primates.

The results surprised the researchers. The model matched state-of-the-art statistical methods in accuracy while completing the same analysis in minutes. It also handled incomplete or fragmented genetic datasets better than classical approaches — a practical advantage in fields where perfect data is rarely available.

The urgency is real. Insecticide resistance has spread through malaria-carrying mosquito populations worldwide, and understanding precisely when resistance genes emerged could reshape how the disease is fought. The AI model makes that kind of historical reconstruction fast enough to be practical at scale.

Kern and Korfmann plan to extend the tool from tracing pairs of lineages to reconstructing full genealogical trees across multiple species simultaneously — pursuing from a machine-learning angle what classical methods have only partially achieved. The work hints at something larger: that the architectures powering AI may be on the verge of reshaping fields that have long relied on classical mathematics alone.

At the University of Oregon, a team of computational biologists has borrowed a trick from the world of chatbots and applied it to something far older: the language of evolution itself. They've built an artificial intelligence model that reads genetic code the way ChatGPT reads English, scanning for patterns in DNA mutations to trace when different organisms last shared a common ancestor. The work, published in April in the Proceedings of the National Academy of Sciences, represents the first language model designed specifically for population genetics—a field that has relied on classical statistical methods for decades.

The insight is elegant. Genomes, like written text, follow patterns. DNA's four-letter alphabet—A, T, C, and G—combines to form genes and chromosomes. But what interests Andrew Kern, the computational biologist leading the research, are the misspellings: mutations, or changes in DNA sequences, that accumulate over evolutionary time. These mutations leave a trail. Stretches of DNA with many mutations likely trace back to a distant common ancestor; those with few mutations probably share a more recent one. This is why chimpanzees are our closest living relatives, with similar DNA, while sea sponges diverged from us more than 700 million years ago.
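The link between mutation counts and time can be made concrete with the textbook molecular-clock estimate, a far simpler calculation than either the classical methods or the language model described in this article. What follows is a minimal Python sketch that assumes a constant mutation rate. The function name, the rate of 1e-8 per base pair per generation, and the 10,000-base window are illustrative assumptions, not values from the study.

```python
def pairwise_tmrca_generations(num_differences, window_length_bp, mutation_rate):
    """Molecular-clock point estimate of the time to a shared ancestor.

    Two lineages that split T generations ago have had 2 * T generations
    between them to accumulate mutations, so the expected number of
    differences in a window is 2 * T * mutation_rate * window_length_bp.
    Solving for T gives the estimate returned here.
    """
    if window_length_bp <= 0 or mutation_rate <= 0:
        raise ValueError("window length and mutation rate must be positive")
    return num_differences / (2.0 * mutation_rate * window_length_bp)


# Assumed, illustrative values (not taken from the study).
MU = 1e-8          # mutations per base pair per generation
WINDOW = 10_000    # window length in base pairs

few = pairwise_tmrca_generations(2, WINDOW, MU)    # few differences: recent ancestor
many = pairwise_tmrca_generations(40, WINDOW, MU)  # many differences: distant ancestor

print(f"2 differences:  about {few:,.0f} generations")
print(f"40 differences: about {many:,.0f} generations")
```

With these assumed numbers, two differences in the window correspond to roughly 10,000 generations and forty differences to roughly 200,000. Real analyses must also deal with recombination, changing population sizes, and noise, which is where the heavier statistical machinery comes in.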

Traditional methods for translating mutations into evolutionary history are mathematically rigorous and remain the gold standard in most cases. But they have a weakness: they're slow, especially when dealing with large or incomplete genomic datasets. Kevin Korfmann, the lead author of the study and a former postdoctoral researcher at Oregon, recognized the bottleneck. Classical probabilistic approaches require reasoning about every mutation individually, a computationally expensive task that can take hours or even days to process a single mosquito chromosome. So the team modified GPT-2, an earlier language model from the same family as the models behind ChatGPT, and trained it not on English text but on simulations of genetic evolution across bacteria, rodents, mosquitoes, and primates. The model learned to recognize mutation patterns the way a language model learns to recognize word patterns.
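To give a feel for what training on simulations can mean, here is a toy Python sketch that generates labeled examples: per-window mutation counts for a pair of lineages, together with the true time to their common ancestor. It is not the study's simulator or encoding. It treats windows as independent, ignoring the correlations that recombination creates along a chromosome. The population size, mutation rate, window length, and the `VOCAB_SIZE` token cap are all hypothetical choices made for illustration.

```python
import random

VOCAB_SIZE = 16  # hypothetical cap on the number of distinct count tokens


def sample_poisson(rng, mean):
    """Draw a Poisson variate by counting unit-rate arrivals before `mean`."""
    count = 0
    elapsed = rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def simulate_pair(rng, effective_size, mu, window_bp, num_windows):
    """Return (mutation counts, true coalescence times), one entry per window."""
    counts, times = [], []
    for _ in range(num_windows):
        # Two lineages in a diploid population of size N coalesce after an
        # exponentially distributed time with mean 2N generations.
        t = rng.expovariate(1.0 / (2 * effective_size))
        # Mutations fall on both branches, so the expected count is 2 * mu * L * t.
        counts.append(sample_poisson(rng, 2 * mu * window_bp * t))
        times.append(t)
    return counts, times


def to_tokens(counts):
    """Map mutation counts to integer tokens, clipping large counts."""
    return [min(c, VOCAB_SIZE - 1) for c in counts]


rng = random.Random(42)  # fixed seed so the example is repeatable
counts, times = simulate_pair(
    rng, effective_size=10_000, mu=1e-8, window_bp=10_000, num_windows=8
)
tokens = to_tokens(counts)
for token, t in zip(tokens, times):
    print(f"token={token:2d}  true_tmrca={t:,.0f} generations")
```

A sequence model would be trained to predict the times from the token sequence. Because the simulator supplies the true answer for every example, such training data can be produced in unlimited quantities.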

When tested, the results surprised the researchers themselves. The AI tool performed as well as state-of-the-art statistical methods—a genuine achievement when borrowing techniques from an entirely different field. But the real advantage was speed. The new approach can decode a mosquito chromosome in minutes, not hours or days. This efficiency matters enormously for scientists handling vast amounts of genetic data. The model also handles incomplete datasets better than classical methods, a practical advantage for researchers like Kern who frequently work with mosquito genetic databases that have missing or fragmented code.

The immediate application is urgent. For decades, insecticides have been the primary tool for controlling malaria-spreading mosquitoes. But evolution, as Kern notes, "did its thing." Insecticide resistance is now widespread across mosquito populations worldwide, and understanding how and when these resistance genes emerged is critical for malaria control. With the AI model, researchers can now ask precisely when resistance genes arose in a population and trace their evolutionary history—information that could reshape strategies for combating one of the world's deadliest diseases.

Looking forward, Kern and Korfmann plan to expand the model beyond tracing ancestry between two lineages to reconstructing full genealogical trees across multiple species and populations simultaneously. Some traditional methods can already do this, but the researchers want to pursue it from a machine-learning angle, tapping into advances in the field that haven't yet been applied to biology. The work signals something broader: that the architectures powering artificial intelligence have uses far beyond chatbots, and that fields long reliant on classical mathematics may be on the edge of a significant shift.

Advances in generative AI and the architectures behind them are potentially useful to a number of fields outside a chatbot. We're borrowing strengths from the world of AI and applying them in this different context that's largely been untapped.
— Andrew Kern, computational biologist, University of Oregon
Insecticide resistance is being observed in all of these mosquito populations today. Now, we can go in with our AI model, ask how long ago these resistance genes arose in the population, and learn about the evolutionary history of this critical carrier of malaria.
— Andrew Kern
The Hearth Conversation
Another angle on the story
Inventor

Why does speed matter so much here? Couldn't researchers just run the classical methods overnight and get the same answer?

Model

They could, but scale changes everything. When you're working with thousands of mosquito populations or trying to understand resistance patterns across a continent, overnight becomes impractical. Minutes instead of hours means you can iterate, test hypotheses, explore variations. It's the difference between a tool you use once and a tool you use constantly.

Inventor

So you're saying this is really about enabling a different kind of science, not just making the old science faster.

Model

Exactly. The speed unlocks new questions. Kern can now ask "when did this resistance gene emerge" and get an answer in time to act on it. With the old methods, that answer could take hours or days for a single chromosome.

Inventor

The model was trained on simulations, not real genetic data. How does that work? Aren't simulations always wrong?

Model

They're simplified, yes, but they capture the essential logic of how mutations accumulate over time. The model learns the patterns, not the specifics. It's like training a language model on synthetic text—the patterns of grammar and meaning transfer to real text.

Inventor

What happens when the model encounters something evolution actually did that the simulations didn't predict?

Model

The honest answer is that we don't fully know yet. The tests show it performs as well as classical methods, but those methods have been refined over decades, and this approach is new. The researchers are careful about that uncertainty.

Inventor

And the incomplete data problem—that seems like it could be huge for real-world research.

Model

It is. Most genetic databases in the field are messy. Missing sequences, degraded samples, incomplete records. Classical methods struggle with that. This model, because it learned from simulations that included variation, handles gaps more gracefully. That's not a small thing.
