57 percent of medical AI skills fell below acceptable standards
As artificial intelligence quietly assumes roles once held by human researchers—screening literature, running analyses, drafting manuscripts—a Singapore-based firm and a Chinese university hospital have asked a question the field had long deferred: who checks the tools before the tools check the science? In mid-2026, AIPOCH and Fudan University's Zhongshan Hospital released MedSkillAudit, a structured audit framework designed to intercept unreliable AI capabilities before they enter medical research workflows. Their first test of 75 skills found more than half unfit for deployment, suggesting that the infrastructure of trust in AI-assisted science has been built on largely unexamined ground.
- Medical research is quietly delegating critical tasks to AI agents whose individual skills have never been systematically verified for safety or accuracy.
- A validation study of 75 AI skills across five research domains found 57.3% failed to meet even a Limited Release standard—a figure that reframes confidence in current AI research tools as largely unearned.
- Failures ranged from fabricated citations and confused causality to broken code and diagnostic overreach, each a different way an unaudited tool could corrupt scientific output.
- MedSkillAudit's two-layer veto system—testing both operational stability and scientific integrity through static design review and live simulation—offers a structured checkpoint where none previously existed.
- The framework's judgments aligned with human expert reviewers and held consistent across repeated runs, lending it credibility as a potential industry standard rather than a one-time experiment.
- Whether MedSkillAudit becomes a required checkpoint or remains a cautionary prototype depends on whether institutions treat the 57% failure rate as an alarm or a footnote.
In early summer 2026, AIPOCH, a Singapore-based company working with pathology researchers at Fudan University's Zhongshan Hospital, released MedSkillAudit—a pre-deployment audit framework built to catch what no one had been systematically catching: AI skills that don't actually work.
Medical research has increasingly outsourced pieces of its process to AI agents capable of screening literature, running statistics, designing protocols, and drafting manuscripts. These agents are assembled from modular skills, each a specialized capability. The problem is that no reliable checkpoint existed to verify those skills before researchers began depending on them. An AI might invent plausible-sounding citations, confuse correlation with causation, generate code with missing dependencies, or make diagnostic claims beyond its scope—errors that traditional model-level evaluations were never designed to catch.
MedSkillAudit addresses this through two layers. The first is operational: stability, consistency, security. The second is scientific: citation integrity, sound reasoning, respect for the boundary between analysis and diagnosis, functional code. A failure on any critical requirement in either layer blocks deployment entirely. Beyond the veto gates, skills are scored through static evaluation of their design and source code, weighted at 40 percent, and dynamic evaluation through simulated research scenarios, weighted at 60 percent. Final scores place each skill into one of four tiers: Production Ready, Limited Release, Beta Only, or Rejected.
When the framework was tested on 75 medical AI skills across five research domains, the results were stark: 57.3 percent fell below the Limited Release threshold. The framework's assessments also aligned closely with human expert reviewers and remained consistent across repeated runs.
AIPOCH's CEO Huimei Wang described the release as a response to a structural gap: AI agents are now part of the scientific workflow, yet the skills they rely on have had no quality-control equivalent. The question now is whether MedSkillAudit becomes a standard, and whether the skills that failed, a majority of those tested, will be rebuilt or quietly set aside.
In early summer 2026, a Singapore-based company called AIPOCH, working alongside researchers at Zhongshan Hospital's pathology department at Fudan University, released something that had been missing from the medical AI landscape: a systematic way to catch broken tools before they reach scientists' hands.
The tool is called MedSkillAudit. It exists because medical research has begun outsourcing pieces of its work to AI agents—systems that can screen literature, run statistical tests, design protocols, draft manuscripts. These agents are built from modular skills, each one a specialized capability. The problem is that no one had built a reliable checkpoint to verify these skills actually work before researchers start relying on them. An AI might fabricate citations that sound plausible. It might confuse correlation with causation. It might generate code with missing dependencies. It might make diagnostic claims it has no business making. These errors slip through because traditional AI evaluation methods focus on the model itself, not on the specific skills it's been assembled to perform.
MedSkillAudit works in two layers. The first layer is operational: Does the skill run stably? Does it produce consistent results? Is it secure? The second layer is scientific: Does it invent citations or DOI numbers? Does it respect the boundary between analysis and diagnosis? Does it reason soundly? Does the code it generates actually work? If a skill fails any critical requirement in either layer, it gets blocked. No deployment.
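The veto logic described above can be sketched in a few lines of Python. This is only an illustration of the "fail any critical check, get blocked" rule; the check names are hypothetical paraphrases of the questions in the article, not identifiers taken from MedSkillAudit itself.

```python
# Hypothetical check names paraphrasing the article's two layers.
# The actual MedSkillAudit criteria are not spelled out in the article.
OPERATIONAL_CHECKS = ("runs_stably", "consistent_results", "secure")
SCIENTIFIC_CHECKS = (
    "no_fabricated_citations",
    "respects_diagnosis_boundary",
    "sound_reasoning",
    "generated_code_works",
)


def failed_checks(results: dict[str, bool]) -> list[str]:
    """Return the critical checks a skill failed, in layer order.

    A check that is missing from `results` is treated as a failure,
    on the assumption that an unverified requirement should not pass.
    """
    all_checks = OPERATIONAL_CHECKS + SCIENTIFIC_CHECKS
    return [name for name in all_checks if not results.get(name, False)]


def is_blocked(results: dict[str, bool]) -> bool:
    """A single failed critical check in either layer blocks deployment."""
    return len(failed_checks(results)) > 0
```

Under these assumptions, a skill that runs stably and securely but invents citations would still be blocked, since the scientific layer vetoes independently of the operational one.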
Beyond the veto gates, the framework uses a two-stage evaluation. Static evaluation looks at the skill's design and source code—how it was built—and counts for 40 percent of the score. Dynamic evaluation actually runs the skill through simulated research scenarios and watches what happens—this counts for 60 percent. The final score places each skill into one of four categories: Production Ready, Limited Release, Beta Only, or Rejected.
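As a rough illustration of that 40/60 weighting, the Python sketch below combines a static and a dynamic score and maps the result to one of the four named tiers. The weights and tier names come from the article; the 0 to 100 scale and the tier cutoffs are placeholder assumptions, since the article does not state them.

```python
STATIC_WEIGHT = 0.40   # design and source-code review (from the article)
DYNAMIC_WEIGHT = 0.60  # simulated research scenarios (from the article)

# ASSUMPTION: these cutoffs are placeholders for illustration only.
# The article names the four tiers but does not give their thresholds.
TIER_CUTOFFS = [
    (85.0, "Production Ready"),
    (70.0, "Limited Release"),
    (50.0, "Beta Only"),
]


def combined_score(static_score: float, dynamic_score: float) -> float:
    """Weighted sum of the two evaluation stages (assumed 0-100 scale)."""
    return STATIC_WEIGHT * static_score + DYNAMIC_WEIGHT * dynamic_score


def tier(score: float) -> str:
    """Map a combined score to a tier using the placeholder cutoffs."""
    for cutoff, name in TIER_CUTOFFS:
        if score >= cutoff:
            return name
    return "Rejected"
```

With these placeholder numbers, a skill scoring 80 on static review and 60 in simulation would combine to roughly 68, just under the assumed Limited Release cutoff. Because the dynamic stage carries more weight, a well-designed skill that behaves poorly in simulation is pulled down more than the reverse.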
When AIPOCH and Fudan tested this framework on 75 medical AI skills across five research domains, including evidence analysis, protocol design, data analysis, and academic writing, the results were sobering. Forty-three of the 75 skills, or 57.3 percent, fell below the Limited Release threshold. They were not ready. The validation also showed that MedSkillAudit's judgments matched those of human expert reviewers, and that the framework produced consistent results across repeated runs.
Huimei Wang, AIPOCH's CEO, framed the release as a response to a gap in how science is being conducted now. "AI agents are becoming part of the scientific workflow," Wang said, "yet there is still no equivalent of a quality-control checkpoint for the skills they rely on." The implication is clear: as these systems embed themselves deeper into research, domain-specific auditing frameworks like this one will become not optional but necessary—a complement to the broader evaluation methods already in place for AI models themselves.
The research was published as a preprint in April 2026. What remains to be seen is whether other institutions adopt the framework, whether it becomes a standard, and whether the 57 percent of skills that failed will be fixed or abandoned.
Notable Quotes
"AI agents are becoming part of the scientific workflow, yet there is still no equivalent of a quality-control checkpoint for the skills they rely on." Huimei Wang, CEO at AIPOCH
The Hearth Conversation: Another perspective on the story
Why does this matter now? AI has been used in research for years.
Because the scale and autonomy have changed. These aren't tools researchers point at a problem anymore—they're agents that make decisions in the workflow. A fabricated citation buried in a literature review could shape which direction a whole research program goes.
So MedSkillAudit is basically a quality-control line for AI components.
Exactly. Like how a pharmaceutical company tests each ingredient before it goes into a drug. Except here, 57 percent of the ingredients they tested were failing.
That's a high failure rate. Does that mean the AI is broken, or the audit is too strict?
The audit is domain-specific—it's asking whether these skills can do medical research safely and accurately. A skill that invents p-values isn't broken in a technical sense. It runs fine. But it's broken for this purpose.
And the two-layer veto—operational first, then scientific—that seems designed to catch different kinds of problems.
Right. The first layer catches things like instability or security flaws. The second catches the subtle stuff: logical errors, boundary violations, things that look right but are scientifically unsound. You need both.
What happens to the 43 skills that failed?
That's the open question. Some might be fixed. Some might be abandoned. But now researchers know not to use them, which is the whole point.