Evolution did its thing. Now we can trace when it happened.
At the University of Oregon, researchers have taught a machine to read the oldest language on Earth — DNA — by borrowing the same architecture that gives modern AI its fluency with human words. The tool reconstructs evolutionary ancestry from mutation patterns in minutes rather than days, matching the precision of classical methods while tolerating the imperfections of real-world genetic data. In a moment when insecticide-resistant mosquitoes are outpacing our defenses against malaria, this acceleration of biological understanding arrives as something more than a technical curiosity.
- Malaria-carrying mosquitoes are evolving resistance to insecticides faster than researchers can trace the genetic history of that resistance using traditional methods.
- Classical statistical approaches — the gold standard for decoding evolutionary ancestry — can take hours or days per chromosome, creating a bottleneck in urgent public health research.
- A GPT-style AI retrained on thousands of simulated evolutionary scenarios now reads mutation patterns across a genome in minutes, matching classical accuracy without the computational wait.
- The model handles incomplete genetic datasets better than traditional tools, a critical advantage when working with real-world mosquito databases full of gaps and missing sequences.
- Researchers are already planning to scale the model toward reconstructing full multi-species genealogical trees, pushing machine learning into territory biology has not yet explored.
At the University of Oregon, computational biologist Andrew Kern and his team have retrained the machine-learning architecture behind ChatGPT to read something far older than human language: the four-letter alphabet of DNA. The result is the first language model built specifically for population genetics — one that scans a genome for mutation patterns and reconstructs when two genes last shared a common ancestor.
The logic is elegant. DNA, like text, follows patterns. Many accumulated mutations suggest a distant common ancestor; few suggest a recent split. The team trained their model not on prose but on thousands of simulated evolutionary scenarios spanning bacteria, rodents, mosquitoes, and primates, until the AI learned to recognize mutation signatures the way a fluent reader recognizes words.
Lead researcher Kevin Korfmann, whose study appeared in April in the Proceedings of the National Academy of Sciences, notes that classical methods remain the gold standard — but they're slow, sometimes taking days to analyze a single mosquito chromosome. The AI completes the same work in minutes, because it absorbed the statistical heavy lifting during training rather than performing it anew each time. In head-to-head tests, it matched state-of-the-art accuracy. It also proved more resilient with incomplete data, a persistent problem in real-world genetic databases.
The urgency is concrete. Insecticide resistance has spread through malaria-carrying mosquito populations worldwide, and tracing exactly when and how resistance genes emerged is essential to designing new interventions. This tool makes that tracing fast enough to be practically useful. The team's next step is expanding the model to reconstruct full genealogical trees across multiple species simultaneously, work that is unglamorous and potentially transformative.
At the University of Oregon, computational biologist Andrew Kern and his team have taken the same machine-learning architecture that powers ChatGPT and retrained it to read something far older than human language: the four-letter alphabet of DNA itself. The result is an artificial intelligence tool that can scan a genome for the telltale marks of mutation and, by tracing those patterns backward through time, reconstruct when two genes last shared a common ancestor. It's the first language model ever built specifically for population genetics, and it works.
The insight behind the approach is deceptively simple. DNA, like written text, follows patterns. Stretches of genetic code with many mutations accumulated over time likely trace back to a distant common ancestor, while those with few mutations suggest a more recent split. A chimpanzee and a human share similar DNA because they diverged relatively recently in evolutionary time. A sea sponge and a human, by contrast, parted ways more than 700 million years ago, leaving their genomes far more different. The researchers trained their model not on English text but on thousands of simulated evolutionary scenarios—playing out genetic change across bacteria, rodents, mosquitoes, and primates—until the AI learned to recognize these mutation signatures the way a language model learns to predict the next word in a sentence.
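The clock-like reasoning in this paragraph can be written down as a back-of-the-envelope calculation. The Python sketch below is not the researchers' model. It is the textbook molecular-clock estimate, which assumes mutations accumulate at a constant rate on both lineages, so the expected per-site difference between two sequences is 2 × rate × time. The function name, the toy sequences, and the mutation rate are illustrative assumptions.

```python
def estimate_tmrca(seq_a: str, seq_b: str, mu_per_site_per_gen: float) -> float:
    """Estimate generations back to the common ancestor of two aligned sequences.

    Uses the simple molecular-clock relation: expected per-site difference
    = 2 * mu * T, because both lineages accumulate mutations independently.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    if mu_per_site_per_gen <= 0:
        raise ValueError("mutation rate must be positive")

    # Count the sites where the two sequences disagree.
    differences = sum(a != b for a, b in zip(seq_a, seq_b))
    per_site_divergence = differences / len(seq_a)

    # Solve 2 * mu * T = divergence for T.
    return per_site_divergence / (2 * mu_per_site_per_gen)


# Toy example: 2 differences over 10 sites, with an assumed (unrealistically
# high) mutation rate chosen only to keep the arithmetic easy to follow.
few_mutations = estimate_tmrca("ACGTACGTAC", "ACGTTCGTAA", 1e-3)
many_mutations = estimate_tmrca("ACGTACGTAC", "TCGAACTTAA", 1e-3)
print(f"few differences:  ~{few_mutations:.0f} generations")
print(f"many differences: ~{many_mutations:.0f} generations")
```

More differences translate directly into an older estimated split. Real analyses have to deal with recombination, varying mutation rates, and repeated mutations at the same site, which is where the heavier statistical machinery comes in.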
The practical payoff is speed. Kevin Korfmann, the lead researcher on the study published in April in the Proceedings of the National Academy of Sciences, explains that traditional statistical methods for decoding evolutionary history are the gold standard, but they're slow. A classical approach can take hours or even days to analyze a single mosquito chromosome. The new AI model does the same work in minutes. That speed comes from a fundamental difference in how the two approaches operate. Classical methods must reason through every single mutation individually, a computationally expensive process. The AI model, by contrast, has already done all that statistical heavy lifting during training. It simply reads the patterns, the way a fluent reader doesn't sound out every letter.
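The idea of paying the statistical cost once, during training, can be illustrated with a deliberately tiny stand-in. The sketch below is not the team's transformer: it swaps the neural network for a lookup table, and every constant, distribution, and function name is an assumption chosen for illustration. It simulates many toy histories (a time to the common ancestor, then a mutation count produced by that time), records the average time seen for each mutation count, and afterward answers new queries with a single lookup.

```python
import math
import random
from collections import defaultdict

# All constants are illustrative assumptions, not values from the study.
MEAN_TMRCA = 1000.0        # assumed average generations back to the common ancestor
MUTATIONS_PER_GEN = 0.002  # assumed expected mutations per generation in the window


def sample_poisson(rng: random.Random, lam: float) -> int:
    """Draw a Poisson count with Knuth's method (fine for the small rates here)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate(rng: random.Random) -> tuple[int, float]:
    """One toy scenario: draw a coalescence time, then mutations given that time."""
    time_to_ancestor = rng.expovariate(1.0 / MEAN_TMRCA)
    mutations = sample_poisson(rng, MUTATIONS_PER_GEN * time_to_ancestor)
    return mutations, time_to_ancestor


def train(n_sims: int, seed: int = 0) -> dict[int, float]:
    """The expensive step: average the simulated time for each mutation count."""
    rng = random.Random(seed)
    totals, counts = defaultdict(float), defaultdict(int)
    for _ in range(n_sims):
        mutations, time_to_ancestor = simulate(rng)
        totals[mutations] += time_to_ancestor
        counts[mutations] += 1
    return {k: totals[k] / counts[k] for k in counts}


def predict(table: dict[int, float], mutations: int) -> float:
    """The cheap step: look up the closest mutation count seen in training."""
    nearest = min(table, key=lambda seen: abs(seen - mutations))
    return table[nearest]


table = train(20_000)
for observed in (0, 3, 6):
    print(f"{observed} mutations -> ~{predict(table, observed):.0f} generations")
```

Training walks through every simulated scenario, while each later prediction is nearly free. A real model replaces the table with a network that can generalize across whole genomes, but the division of labor is the same one the paragraph above describes.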
The performance surprised even Kern's team. In head-to-head tests, the AI tool matched the accuracy of state-of-the-art statistical methods. "You never really know what's going to work when you're essentially borrowing techniques from a totally different world," Kern said. "But this was a case where things worked really well." The model also handles a problem that has long plagued real-world genetic research: incomplete data. Mosquito genetic databases, which Kern works with frequently, often contain gaps and missing sequences. Classical methods struggle with these holes. The simulation-based training of the AI approach allows it to work around them.
The timing is urgent. For decades, insecticides have been the primary tool for controlling malaria-carrying mosquitoes. But evolution has done what evolution does. Insecticide resistance has now emerged in mosquito populations worldwide, and understanding how and when these resistance genes arose is critical to developing new control strategies. With this AI tool, Kern can now ask precisely when resistance genes emerged in a population and trace their evolutionary history, information that could reshape malaria prevention efforts. The researchers are already planning the next phase: expanding the model to reconstruct entire family trees across multiple species simultaneously, a capability that would push the boundaries of what machine learning can do in biology. "There's so much going on in the machine learning field that we haven't applied yet in our field," Korfmann said. The work ahead is translational, unglamorous, and potentially transformative.
Notable Quotes
Advances in generative AI and the architectures behind them are potentially useful to a number of fields outside a chatbot. We're borrowing strengths from the world of AI and applying them in this different context that's largely been untapped. — Andrew Kern, computational biologist at the University of Oregon
Insecticide resistance is being observed in all of these mosquito populations today. Now, we can go in with our AI model, ask how long ago these resistance genes arose in the population, and learn about the evolutionary history of this critical carrier of malaria. — Andrew Kern
The Hearth Conversation: Another angle on the story
So you've taken a chatbot architecture and pointed it at DNA. Why does that work at all? Aren't genes and language completely different things?
They're more similar than you'd think. Both are sequences with patterns. DNA has four letters (A, T, C, G), and the English alphabet has 26. Both accumulate changes over time. A language model learns to predict the next word; our model learns to read mutation patterns and infer how long ago two sequences shared an ancestor. The underlying architecture is the same.
But the source material is so different. Text is written by humans with intention. Mutations are random.
True, but randomness has structure. Mutations accumulate at predictable rates. The AI learns those rates by studying thousands of simulated evolutionary histories. It's not learning intention; it's learning the statistical signature of time itself.
And it's faster than the old methods. How much faster are we talking?
Hours or days down to minutes. For a single mosquito chromosome. That matters when you're trying to understand insecticide resistance in real time, when populations are evolving faster than we can track them.
What's the catch? There's always a catch.
The catch is that this is still new. Classical methods are the gold standard for a reason: they rest on well-established statistical theory. This AI approach matches them in accuracy, which is remarkable, but it's not yet clear how far it scales or what edge cases might break it. The researchers are already thinking about the next problem: reconstructing entire family trees, not just the shared ancestry of pairs of genes.
So this is just the beginning.
Exactly. This is one tool for one problem. But it shows that machine learning architectures built for one domain can be retrained for another. That opens a lot of doors in biology that have been closed.