LLM Performance Collapses Under Conflicting Instructions, Study Warns

The effective control window is four times shorter than total capacity
Models degrade under conflicting instructions far earlier than their nominal context limits suggest.

Research published in PNAS Nexus this June shows that the most advanced language models (GPT-4o, Claude 3.5 Sonnet, and their successors) collapse in their ability to resolve contradictory instructions with as little as 40 words of context, falling below 50% accuracy. This is not a failure of training or scale but a limitation built into the transformer architecture itself: the models lack a functional equivalent of the human brain's frontal networks, which maintain priorities under interference. At a time when startups around the world are building critical systems on these models, the finding turns a technical assumption into a real operational risk.

  • The most capable models on the market fail at a task that any attentive human handles effortlessly: following instructions when those instructions compete with each other.
  • The effective window for executive control is four times shorter than the model's nominal capacity, so the collapse arrives far earlier than any technical specification would suggest.
  • Startups that rely on LLMs for legal classification, contract review, or regulatory compliance are exposed to silent degradation that does not show up in standard benchmarks.
  • Neither scaling to larger models nor expanding the context window solves the problem, because the limitation is structural rather than one of capacity.
  • The research community is already working on solutions such as Elastic Attention, but none has reached production models yet, leaving founders to handle the problem on their own today.

A study published in PNAS Nexus this June exposed something uncomfortable about large language models: they fail catastrophically when they must resolve contradictory instructions in texts of just 40 words. GPT-4o and Claude 3.5 Sonnet fall below 50% accuracy under these conditions. Even GPT-5 and Claude Opus 4.1 show only marginal improvements. The researchers adapted the classic Stroop test from cognitive psychology, which measures the ability to follow instructions when they conflict with automatic impulses, and the results were decisive: where humans maintain solid performance, the models collapse.

The most unsettling finding concerns what the researchers call the "effective window for executive control." A model trained to process 128,000 tokens can technically do so, but its ability to maintain priority instructions under interference breaks down much earlier: degradation begins with just 10 words of conflicting instructions. The real control window is roughly four times shorter than the model's total capacity. The problem is neither perceptual nor one of memory (the study explicitly ruled this out) but architectural: transformers lack mechanisms that dynamically adjust attention as demands grow, which cognitive neuroscience calls adaptive up-regulation.

For founders building on language models, this is not an abstract limitation. Legal classification systems, contract review, financial data extraction, and support management with overlapping rules are exposed to this risk today. The mitigation strategies are concrete: implement external validation with deterministically coded rules, fragment complex tasks into atomic steps with minimal context, build internal benchmarks that measure degradation under conflict, and consider hybrid architectures that combine language models with rule-based systems. Research on Elastic Attention points toward future solutions, but they are not yet in production. Those who recognize these limits and design their systems accordingly will have a real advantage when the limitations become impossible to ignore.

A study published in PNAS Nexus this June exposed something uncomfortable about the large language models that have become central to startup operations: they fail catastrophically at a task that any attentive human handles without thinking. When asked to resolve conflicting instructions embedded in just 40 words of text, models like GPT-4o and Claude 3.5 Sonnet dropped below 50% accuracy. Even the newest versions, GPT-5 and Claude Opus 4.1, showed only marginal improvement. The researchers adapted a classic cognitive psychology test called the Stroop task, which measures how well a mind can follow instructions when those instructions conflict with automatic impulses. The results were stark. Humans maintained strong performance throughout. The models excelled at simple reading, but degraded toward what the authors called "near-total performance collapse" as the conflicting instructions grew longer.
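To make the kind of test described above concrete, here is a minimal sketch of a Stroop-style item generator. It is an illustration only, not the study's actual protocol: the `word(ink)` text encoding, the `COLORS` list, and the `make_conflict_item` helper are all assumptions introduced for this example.

```python
import random

# Hypothetical color vocabulary for the illustration.
COLORS = ["red", "blue", "green", "yellow"]


def make_conflict_item(n_words, rng):
    """Build one Stroop-style item made of n_words incongruent tokens.

    Each token is a color word paired with a different ink color, written
    as word(ink). The task is to report the ink of the LAST token while
    ignoring what the words themselves say.
    """
    if n_words < 1:
        raise ValueError("n_words must be at least 1")
    tokens = []
    ink = None
    for _ in range(n_words):
        word = rng.choice(COLORS)
        # Choose an ink that differs from the word, so every token conflicts.
        ink = rng.choice([c for c in COLORS if c != word])
        tokens.append(f"{word}({ink})")
    prompt = (
        "Each item below is a color word followed by its ink color in "
        "parentheses. Ignore the words and report only the ink color of "
        "the last item.\n" + " ".join(tokens)
    )
    return prompt, ink


# Seeded generator so the same items can be regenerated for repeated runs.
rng = random.Random(0)
prompt, expected = make_conflict_item(10, rng)
print(prompt)
print("expected:", expected)
```

Varying `n_words` (for example 10, 20, 40) gives a family of items whose conflict length grows while the instruction stays the same, which is the dimension along which the article reports degradation.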

For founders building systems that depend on language models, this is not an abstract limitation. If your startup classifies legal documents, reviews contracts, extracts financial data, or manages support systems with multiple overlapping rules, this finding is an operational risk that demands immediate attention. The problem is not that the models are stupid or that more training will fix it. The problem is architectural—baked into the fundamental design of transformer attention mechanisms.

The most unsettling discovery concerns what researchers call the "effective window for executive control." A model trained to handle 128,000 tokens of context can technically process information at that scale. But its ability to maintain priority instructions when faced with interference collapses far earlier—degradation begins at just 10 words of conflicting information. The effective control window is roughly four times shorter than the model's total processing capacity. This is not a perceptual problem or a memory problem. The study explicitly ruled out difficulty in perceiving the text. The limitation lies in the absence of mechanisms that dynamically adjust attention and control as demands increase—what cognitive neuroscience calls adaptive up-regulation. Transformers simply lack the equivalent of the brain's frontal networks, which humans use to flexibly reconfigure how they process competing signals.

The implications ripple outward. In high-stakes applications—contract review, medical diagnosis support, regulatory compliance—you cannot assume consistency under long context. The model's performance will degrade before you expect it to, even when its nominal capacity seems sufficient. Scaling to a larger model or expanding the context window does not solve the underlying problem if the architecture itself lacks adaptive executive control mechanisms.

What can be done? The researchers and the startup community they address offer concrete mitigation strategies. First: implement external validation. Before delivering a model's output to a user, run it through explicit guardrails that verify the result against hard-coded rules. For critical tasks, use double verification—a second model or a deterministic rule checking the first model's work. Second: fragment complex tasks into atomic steps. Instead of one prompt containing ten potentially conflicting instructions, break the process into sequential steps, each validated individually, each with minimal context to reduce interference. Third: test deliberately for conflict and interference. Build an internal benchmark with cases where instructions compete, measure not just overall accuracy but degradation as context grows longer, and identify the specific breaking point for your use case. Fourth: consider hybrid architectures that combine language models with rule-based systems, letting the model handle semantic understanding while deterministic logic handles the application of conflicting rules.
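The first and third strategies can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: `model_fn` stands for any callable that maps a prompt to a text answer, and `accuracy_by_length`, `breaking_point`, `guarded_answer`, and the `toy_model` stub are hypothetical names invented for the example. The stub is a fake stand-in whose behavior is hard-coded, so its numbers say nothing about any real model.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Case = Tuple[str, str]  # (prompt, expected answer)


def accuracy_by_length(
    model_fn: Callable[[str], str],
    cases: Dict[int, List[Case]],
) -> Dict[int, float]:
    """Score the model separately for each conflict length (in words)."""
    scores: Dict[int, float] = {}
    for length, items in sorted(cases.items()):
        if not items:
            continue  # skip empty buckets rather than divide by zero
        correct = sum(model_fn(p).strip().lower() == exp for p, exp in items)
        scores[length] = correct / len(items)
    return scores


def breaking_point(scores: Dict[int, float], threshold: float = 0.5) -> Optional[int]:
    """Return the shortest length whose accuracy is below threshold, if any."""
    for length in sorted(scores):
        if scores[length] < threshold:
            return length
    return None


def guarded_answer(
    model_fn: Callable[[str], str],
    prompt: str,
    allowed: Set[str],
    fallback: str = "NEEDS_REVIEW",
) -> str:
    """Deterministic guardrail: accept the output only if it is an allowed label."""
    answer = model_fn(prompt).strip().lower()
    return answer if answer in allowed else fallback


def toy_model(prompt: str) -> str:
    # Stand-in only: right on short prompts, wrong on long ones, by construction.
    return "blue" if len(prompt.split()) <= 10 else "purple"


# Four identical dummy cases per length; real cases would be conflict prompts.
cases = {n: [(" ".join(["x"] * n), "blue")] * 4 for n in (5, 10, 20, 40)}
scores = accuracy_by_length(toy_model, cases)
print(scores)
print(breaking_point(scores))
print(guarded_answer(toy_model, " ".join(["x"] * 40), {"red", "blue", "green"}))
```

Reporting accuracy per length rather than one aggregate number is the point of the design: an overall score can look acceptable while hiding the length at which a particular use case starts to fail. The guardrail routes any out-of-vocabulary answer to a fallback label instead of passing it to the user.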

The research community is already working on solutions. Recent work on Elastic Attention, published on arXiv in January 2026, proposes mechanisms to dynamically adjust attention sparsity based on input, precisely to address adaptive attention allocation. But these solutions are not yet in production models. For now, the founders who acknowledge these architectural limits and engineer systems to work around them will have a competitive advantage when these limitations become impossible to ignore in production.

The effective window for executive control is roughly four times shorter than the model's total processing capacity
— PNAS Nexus study findings
This is not a bug that fine-tuning will fix—it is structural
— Study interpretation for production systems
The Hearth Conversation: another perspective on the story
Inventor

So the study is saying that even the newest models just... give up when instructions conflict?

Model

Not give up exactly. They degrade. It's like their attention mechanism can't hold multiple competing priorities at once. A human reads "ignore the word red, tell me the color" and sees the word red printed in blue ink—easy. The model sees the same setup in 40 words and starts guessing.

Inventor

But these models handle 128,000 tokens. Why does 40 words break them?

Model

That's the crucial part. Token capacity and executive control are different things. You can hold a lot of information in working memory, but that doesn't mean you can manage conflicting demands on that information. The model's attention can process the tokens, but it can't dynamically adjust which instructions take priority when they clash.

Inventor

Is this fixable by just training harder?

Model

The researchers don't think so. It's not a capacity problem—it's architectural. The transformer's attention mechanism doesn't have the equivalent of what a human brain does with its frontal networks. You'd need to redesign how attention itself works, not just feed it more data.

Inventor

What does a startup do right now if they're using these models for something critical?

Model

Validate externally. Don't trust the model alone. Wrap it in guardrails, fragment complex tasks into smaller steps, test specifically for where it breaks under conflict. And seriously consider hybrid systems—use the model for what it's good at, use rules for what requires executive control.

Inventor

So the model is still useful, just not in the way we thought?

Model

Exactly. It's useful for semantic understanding, for generation, for pattern matching. But for tasks where you need to hold competing instructions and apply them consistently—that's where you need human judgment or deterministic logic backing it up.
