We can't fully close the gap with new methods alone
For decades, the genetic data shaping disease prediction has been drawn almost exclusively from people of European descent, quietly encoding inequality into the very tools meant to protect human health. Researchers at Johns Hopkins and the National Cancer Institute have now introduced CT-SLEB, an algorithm that blends machine learning with Bayesian modeling to improve genetic risk scoring across African, Latino, East Asian, and South Asian populations. Published in Nature Genetics in September 2023, the work is both a meaningful advance and a candid admission: better methods can narrow the gap, but they cannot close it without a deeper commitment to gathering diverse genetic data in the first place.
- Genetic risk algorithms trained almost entirely on European ancestry data have long failed non-European populations, quietly widening healthcare disparities in disease prevention.
- The stakes are not abstract — millions of people may be missing early warnings for conditions like heart disease and depression simply because the science was not built with them in mind.
- CT-SLEB, tested on data from more than five million individuals across five ancestry groups, outperformed existing models, especially for African ancestry populations, where accuracy had historically been worst.
- The algorithm is faster and openly available on GitHub, lowering barriers for researchers worldwide to apply and build upon it.
- Senior author Nilanjan Chatterjee issued a clear-eyed warning: AI cannot perform miracles on flawed data, and closing the gap will require far larger genome-wide studies in non-European populations.
For decades, the genetic studies underpinning disease prediction have drawn overwhelmingly from people of European descent, creating a medical blind spot. Algorithms designed to identify who is most at risk for heart disease, cancer, or depression work reasonably well for Europeans but stumble when applied to African, Latino, East Asian, or South Asian populations. That gap is not a minor technical inconvenience — it is a source of healthcare inequality, potentially leaving millions without access to the preventive care they need.
A team at Johns Hopkins Bloomberg School of Public Health and the National Cancer Institute set out to narrow that gap with a new algorithm called CT-SLEB, published in Nature Genetics in September 2023. The method combines machine learning with Bayesian statistical modeling to retrain genetic risk scores across ancestry groups. Tested on data from more than five million individuals — drawn from 23andMe, the NIH's All of Us program, UK Biobank, and other sources — CT-SLEB outperformed existing models across 13 traits, with the most significant gains for African ancestry populations where accuracy had historically been lowest.
The algorithm also proved faster than competing approaches, making large-scale analysis more computationally feasible, and the team has made the code publicly available on GitHub. But senior author Nilanjan Chatterjee was careful not to oversell the achievement. Better methods can help, he acknowledged, but they cannot fully close the performance gap without larger, more diverse genome-wide studies. The root problem is structural: most foundational genetic research has been conducted in European populations simply because those populations were more accessible to researchers.
The algorithm, then, is a genuine step forward — and an honest map of how far the field still has to travel. Equitable genetic medicine will require not just smarter tools, but a sustained commitment to building the diverse datasets those tools need to learn from.
For decades, the genetic studies that underpin disease prediction have drawn overwhelmingly from people of European descent. The result is a medical blind spot: algorithms designed to flag who is at highest risk for heart disease, cancer, depression, and other conditions work reasonably well for Europeans but stumble when applied to African, Latino, East Asian, or South Asian populations. A team at Johns Hopkins Bloomberg School of Public Health and the National Cancer Institute has now developed a new method to narrow that gap, publishing their work in Nature Genetics in September 2023.
Genetic risk-scoring algorithms work by identifying DNA variants linked to disease and calculating an individual's cumulative risk based on how many of those variants they carry. The logic is sound: find the people most likely to get sick, and you can intervene early. But the algorithms are only as good as the data that built them. When most of that data comes from one ancestry group, the models learn patterns specific to that group and fail to generalize. The performance gap between European-ancestry and other-ancestry populations is not a minor technical problem—it is a source of healthcare inequality, potentially leaving millions of people without access to the preventive care they need.
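The cumulative-risk idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: it assumes the common formulation in which each variant's risk-allele count (0, 1, or 2) is multiplied by a per-variant weight and the products are summed. The function name, the weights, and the two example genotypes are all hypothetical.

```python
from typing import Sequence


def polygenic_risk_score(genotypes: Sequence[int], weights: Sequence[float]) -> float:
    """Return the weighted sum of risk-allele counts for one person.

    genotypes: risk-allele count per variant (0, 1, or 2).
    weights:   per-variant effect weight (hypothetical values here).
    """
    if len(genotypes) != len(weights):
        raise ValueError("genotypes and weights must have the same length")
    if any(g not in (0, 1, 2) for g in genotypes):
        raise ValueError("each genotype must be 0, 1, or 2 risk alleles")
    return sum(g * w for g, w in zip(genotypes, weights))


# Hypothetical weights for three variants; not taken from any real study.
weights = [0.25, 0.5, 0.125]

# Two made-up individuals: risk-allele counts at the same three variants.
person_a = [2, 0, 1]
person_b = [0, 1, 0]

score_a = polygenic_risk_score(person_a, weights)
score_b = polygenic_risk_score(person_b, weights)
print(score_a, score_b)
```

The score is only as informative as the weights. If the weights were estimated in one ancestry group, the same sum can rank people poorly in another group, which is the gap the article describes.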
The new algorithm, called CT-SLEB, combines machine learning with Bayesian statistical modeling to retrain genetic risk scores for different ancestry groups. The researchers tested it on data from more than five million individuals across five ancestry categories: European, African, Latino, East Asian, and South Asian. They drew from 23andMe, the Global Lipids Genetics Consortium, the National Institutes of Health's All of Us research program, and UK Biobank. The method was applied to 13 traits, including coronary artery disease and depression. When the team benchmarked their results against standard approaches, CT-SLEB outperformed existing models, particularly for African ancestry populations where accuracy had historically been lowest.
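The article does not spell out how CT-SLEB's Bayesian step works, so the sketch below is not the authors' method. It only illustrates, under simplifying assumptions, the general Bayesian idea of borrowing information across groups: a noisy effect estimate from a smaller target-population study is pulled toward a prior centered on an estimate from a larger study, using the standard normal-normal update. The function name and every number are hypothetical.

```python
def shrink_effect(target_est: float, target_se: float,
                  prior_mean: float, prior_sd: float) -> tuple[float, float]:
    """Posterior mean and sd for a normal likelihood with a normal prior.

    target_est, target_se: effect estimate and standard error from the
                           target-population study (hypothetical values).
    prior_mean, prior_sd:  prior centered on an estimate from another,
                           larger study (also hypothetical).
    """
    if target_se <= 0 or prior_sd <= 0:
        raise ValueError("standard error and prior sd must be positive")
    w_data = 1.0 / target_se ** 2    # precision of the target estimate
    w_prior = 1.0 / prior_sd ** 2    # precision of the prior
    post_var = 1.0 / (w_data + w_prior)
    post_mean = post_var * (w_data * target_est + w_prior * prior_mean)
    return post_mean, post_var ** 0.5


# Equal precision: the posterior mean lands halfway, at roughly 0.20.
balanced_mean, balanced_sd = shrink_effect(0.30, 0.10, 0.10, 0.10)

# Noisier target study (larger standard error): the posterior mean moves
# closer to the prior, to roughly 0.14.
noisy_mean, noisy_sd = shrink_effect(0.30, 0.20, 0.10, 0.10)

print(balanced_mean, noisy_mean)
```

The second call shows why dataset size matters even with a better method: the less precise the target population's own estimate, the more the result leans on information from elsewhere.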
Nilanjan Chatterjee, the study's senior author and a Bloomberg Distinguished Professor at Johns Hopkins, emphasized that the algorithm's success came with an important caveat. "We showed that our method can help close the risk-scoring performance gap for non-European-ancestry populations," he said. "At the same time, we also concluded that we can't fully close the gap with new methods alone—we also need larger datasets on these populations." The problem, in other words, is not just computational. It is structural. Most genome-wide association studies—the large-scale genetic investigations that feed into risk algorithms—have been conducted in European populations simply because those populations were more accessible to researchers. Building equitable algorithms requires building equitable science.
The CT-SLEB method also proved faster than competing approaches, making it feasible to analyze larger numbers of DNA variants and more populations without prohibitive computational costs. The researchers have made the code publicly available on GitHub, and they are already working on more advanced versions that maintain speed while improving accuracy further. Chatterjee was direct about the limits of technological optimism: "A lot of people think machine-learning and AI can do magic but without large, well-designed studies, algorithms will not be as useful." The real work ahead is not in the lab but in the field: conducting more genetic research in non-European populations and building the datasets that will allow algorithms to learn fairly across all groups. The algorithm is a tool; equity requires commitment.
Notable Quotes
"We can't fully close the gap with new methods alone; we also need larger datasets on these populations." (Nilanjan Chatterjee, Bloomberg Distinguished Professor, Johns Hopkins Bloomberg School of Public Health)
"A lot of people think machine-learning and AI can do magic but without large, well-designed studies, algorithms will not be as useful." (Nilanjan Chatterjee)
The Hearth Conversation: Another perspective on the story
Why does it matter that genetic risk algorithms have been built mostly on European data?
Because when you train an algorithm on one group, it learns the genetic patterns specific to that group. Those patterns don't always translate. A variant that predicts heart disease in Europeans might not predict it the same way in Africans or Asians. So the algorithm fails for them—and they don't get identified as high-risk, so they don't get preventive care.
And this new method, CT-SLEB, fixes that?
It helps. It retrains the algorithm for each ancestry group using their own data. But here's the catch: it only works as well as the data you feed it. If you have five million Europeans in your dataset but only half a million Africans, the algorithm will still be better at Europeans.
So the algorithm is not the real problem.
Right. The algorithm's poor performance is a symptom. The real problem is that we haven't done enough genetic research in non-European populations. That's a choice we made as a field, and it's a choice we have to unmake.
What does that look like in practice?
It means funding more genome-wide association studies in Africa, in Latin America, in Asia. It means recruiting more diverse participants. It means treating genetic diversity as a research priority, not an afterthought. Without that, no algorithm—no matter how clever—can close the gap.
And if we don't do that work?
Then we keep reproducing the same disparity. We keep having better disease prediction for some people than others. We keep letting preventable illness happen to people we could have helped.