
Are Large Language Models Truly Intelligent? A Structural Critique of Next-Token Prediction Architecture

Abstract

Despite the increasing capabilities of large language models (LLMs) such as OpenAI's GPT family, their underlying architecture is fundamentally rooted in next-token prediction. This paper critically examines the notion that LLMs possess “intelligence” by exploring theoretical, empirical, and architectural limitations. We argue that the autoregressive design of these models—while effective at mimicking intelligent behavior—lacks semantic understanding, goal-directed reasoning, and cognitive grounding. Through a synthesis of recent research, we show that the behaviors observed in LLMs are emergent properties of statistical pattern recognition, not indicative of true intelligence. We conclude that LLMs simulate aspects of intelligent behavior without exhibiting intelligence itself, and we outline future directions for building systems that go beyond next-token prediction.

1. Introduction

Large Language Models (LLMs) like GPT-4, Claude, and Gemini have achieved remarkable performance in natural language processing tasks, from question answering to code generation. Their fluency and apparent reasoning have prompted public and academic discourse around their intelligence. However, these models are designed to optimize a single objective: the prediction of the next token in a sequence. This architectural constraint, we argue, is a fundamental limitation that prevents LLMs from achieving true intelligence as defined in cognitive science and philosophy of mind.

This paper aims to provide a structured critique of LLMs as intelligent agents by analyzing their operational foundations and comparing them with theoretical models of cognition.

2. Methodology

Our analysis proceeds along three axes:

  1. Architectural Analysis: We examine the Transformer model and its token-level objective function.
  2. Theoretical Limitations: We analyze reasoning, planning, and generalization deficiencies in LLMs.
  3. Empirical Case Studies: We include examples from recent benchmarks and experiments demonstrating failure modes inconsistent with intelligent behavior.

The literature reviewed comprises peer-reviewed journal articles, arXiv preprints, and recent whitepapers from leading AI research labs.

3. The Architecture of LLMs: An Overview

3.1 Autoregressive Next-Token Modeling

LLMs from GPT-3 and GPT-4 through more recent models such as GPT-4o and o3 use the Transformer architecture (Vaswani et al., 2017) to compute a probability distribution over the next token $t_i$ given the prior context $t_1, \ldots, t_{i-1}$. The training objective is to minimize the cross-entropy loss over massive text corpora:

\[
\mathcal{L}(\theta) = -\sum_{i} \log P_\theta\left(t_i \mid t_1, \ldots, t_{i-1}\right)
\]

This setup induces statistical fluency but not semantic grounding. The model has no mechanisms for verifying truth, understanding consequences, or maintaining long-term objectives.
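
To make this objective concrete, the following minimal PyTorch sketch runs one training step under the loss above; the tiny stand-in model, vocabulary size, and random token data are illustrative assumptions, not details of any production LLM.

```python
# Minimal sketch of the autoregressive next-token objective. The stand-in model,
# vocabulary, and random data are illustrative assumptions, not any specific LLM;
# positional encodings and other details are omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model, seq_len, batch = 100, 32, 16, 4

embed = nn.Embedding(vocab_size, d_model)                        # token embeddings
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
head  = nn.Linear(d_model, vocab_size)                           # projection to logits

tokens = torch.randint(0, vocab_size, (batch, seq_len))          # t_1 ... t_n (random stand-in data)

# Additive causal mask: position i may only attend to positions <= i.
causal = torch.triu(torch.full((seq_len - 1, seq_len - 1), float("-inf")), diagonal=1)

hidden = layer(embed(tokens[:, :-1]), src_mask=causal)           # encode context t_1 .. t_{i-1}
logits = head(hidden)                                            # unnormalized P(t_i | t_1 .. t_{i-1})

# Cross-entropy between each predicted distribution and the actual next token.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
loss.backward()                                                  # one step of "learning to predict"
```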

3.2 Token Democracy and Shallow Semantics

Every token in the sequence is handled by the same attention mechanism and weighted equally in the training loss; nothing in the architecture privileges tokens that encode goals, constraints, or abstract structure. As a result, planning, recursion, and causal modeling are not inherent to the architecture but must, at best, be learned implicitly from surface statistics.
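
The following self-contained sketch (with random stand-in logits) illustrates this uniformity: the objective reduces to an unweighted mean of per-position cross-entropy terms, so a token carrying a pivotal fact and a piece of punctuation are weighted identically.

```python
# Illustrative sketch: the training loss is an unweighted mean over token positions,
# so no token, however semantically important, is architecturally privileged.
import torch
import torch.nn.functional as F

vocab_size, seq_len = 100, 8
logits = torch.randn(1, seq_len, vocab_size)        # stand-in model outputs
targets = torch.randint(0, vocab_size, (1, seq_len))

per_token = F.cross_entropy(                         # one loss term per position
    logits.reshape(-1, vocab_size), targets.reshape(-1), reduction="none"
)
print(per_token)                                     # e.g. a negation token and a comma...
print(per_token.mean())                              # ...are averaged with equal weight
```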

4. Theoretical Constraints on Intelligence in LLMs

4.1 Lack of Grounded Semantics

LLMs lack referential grounding: they map words not to real-world experiences or objects but to token co-occurrence statistics. Harnad (1990) argues that such grounding is essential for meaning; by this criterion, LLMs are ungrounded symbol manipulators.
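
A toy distributional example makes the contrast concrete. In the sketch below, built on an invented co-occurrence table rather than any real corpus, "apple" ends up close to "banana" purely because their contexts overlap; nothing in the representation refers to an actual fruit.

```python
# Toy distributional semantics: a word's "meaning" is its co-occurrence profile.
# The counts below are invented purely for illustration.
import numpy as np

words = ["apple", "banana", "justice"]
contexts = ["eat", "fruit", "court", "law"]
C = np.array([[9, 7, 0, 0],      # rows: words, cols: context-word counts
              [8, 9, 0, 0],
              [0, 0, 6, 8]], dtype=float)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(C[0], C[1]))   # high: "apple" ~ "banana", driven only by shared contexts
print(cosine(C[0], C[2]))   # ~0:   "apple" vs "justice", no shared contexts
```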


4.2 Absence of Agency or Intentionality

True intelligence entails having beliefs, goals, and the capacity to act in accordance with them (Dennett, 1987). LLMs have no internal states corresponding to desires or beliefs, only attention-weighted representations that are recomputed for each context window.

4.3 Reasoning and Planning Failures

Studies show that LLMs struggle with tasks such as the Tower of Hanoi and multi-step arithmetic unless supported by chain-of-thought prompting (Bachmann & Nagarajan, 2024). Such prompting strategies are brittle workarounds, not evidence of intrinsic reasoning capability.
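
A minimal evaluation harness of the kind used in such studies can be sketched as follows; the model call is left as a caller-supplied function rather than any specific API, and the arithmetic generator is an illustrative stand-in for the benchmarks cited above.

```python
# Sketch of a probe for multi-step arithmetic, with and without chain-of-thought
# prompting. The model call is a caller-supplied placeholder, not a real API;
# problems are generated so the ground truth is known exactly.
import random
from collections.abc import Callable

def make_problem(steps: int = 4, seed: int = 0) -> tuple[str, int]:
    """Build a chained expression like '(((3 + 7) * 2) - 5)' together with its value."""
    rng = random.Random(seed)
    value = rng.randint(1, 9)
    text = str(value)
    for _ in range(steps):
        op, operand = rng.choice("+-*"), rng.randint(1, 9)
        text = f"({text} {op} {operand})"
        value = {"+": value + operand, "-": value - operand, "*": value * operand}[op]
    return text, value

def accuracy(ask: Callable[[str], str], n: int = 50, chain_of_thought: bool = False) -> float:
    """Score a model (via `ask`, a placeholder for the actual call) on n problems."""
    hint = ("Think step by step, then state the final number."
            if chain_of_thought else "Answer with the final number only.")
    correct = 0
    for i in range(n):
        expr, answer = make_problem(seed=i)
        reply = ask(f"Compute {expr}. {hint}")
        correct += str(answer) in reply.split()[-5:]   # crude final-answer extraction
    return correct / n
```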

5. Empirical Limitations

5.1 Hallucinations

Hallucinations, fluent but false statements, are a core issue. Xu et al. (2024) argue formally that hallucination is an inevitable consequence of finite next-token predictors, and even large-scale instruction tuning cannot fully eliminate it.

5.2 Lack of Transferability

LLMs trained on math or logic often fail to apply learned rules to slightly different problem formats. This brittleness contradicts the robust generalization expected of intelligent systems.
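
One way to probe this brittleness is to render the same underlying computation in several surface formats and compare per-format accuracy; the sketch below shows only the rendering step, with formats that are illustrative assumptions rather than an established benchmark.

```python
# Sketch of a format-transfer probe: the same computation rendered in several surface
# formats. A system that has genuinely learned the rule should be insensitive to the
# rendering; the formats here are illustrative assumptions only.
def render_variants(a: int, b: int) -> dict[str, str]:
    return {
        "symbolic":     f"{a} + {b} = ?",
        "verbal":       f"What is the sum of {a} and {b}?",
        "word_problem": f"A shelf holds {a} books and {b} more are added. How many books are there?",
        "code_style":   f"assert add({a}, {b}) == ___  # fill in the blank",
    }

# A transfer study would score the same model on every rendering of every item and
# report per-format accuracy; large gaps indicate brittle, format-bound rules.
for name, prompt in render_variants(17, 26).items():
    print(f"[{name}] {prompt}")
```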

5.3 Symbolic Failure

Transformer models often fail to perform symbolic reasoning tasks unless directly trained on near-identical data distributions. Compositional generalization is poor (Lake & Baroni, 2018).
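
The SCAN benchmark of Lake and Baroni makes this concrete by holding out novel compositions of familiar primitives; the toy grammar below is a heavily simplified stand-in for that setup, not the actual dataset.

```python
# Toy compositional split in the spirit of SCAN (Lake & Baroni, 2018): every primitive
# and modifier appears in training, but one combination is held out for test.
# This grammar is a simplified stand-in, not the real SCAN data.
PRIMITIVES = {"walk": "WALK", "run": "RUN", "jump": "JUMP"}

def interpret(command: str) -> str:
    word, *modifier = command.split()
    action = PRIMITIVES[word]
    repeat = 2 if modifier == ["twice"] else 1
    return " ".join([action] * repeat)

train = ["walk", "run", "jump", "walk twice", "run twice"]   # "jump" and "twice" both seen...
test  = ["jump twice"]                                       # ...but never composed

for cmd in train + test:
    print(f"{cmd!r:15} -> {interpret(cmd)}")
# A learner that induces the compositional rule generalizes to the held-out pair;
# pattern-matching over previously seen strings does not.
```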

6. Emergent Behavior: Simulation, Not Cognition

While LLMs exhibit emergent abilities, such as passing bar exams or generating novel content, these are better understood as statistical interpolation across training data than as cognitive synthesis. The appearance of intelligence does not imply its presence.

7. Toward True Intelligence: Beyond Next-Token Prediction

We argue that true artificial intelligence must include the following components:

  • Multi-token or goal-conditioned architectures
  • Causal reasoning modules
  • Grounded semantic representations (e.g., via embodied learning or sensory inputs)
  • Explicit symbolic manipulation or hybrid neuro-symbolic systems

Such architectures would mark a transition from language simulation to cognitive modeling.

Conclusion

LLMs based on next-token prediction are impressive tools for language generation, but they fall short of qualifying as truly intelligent agents. Their architecture precludes understanding, reasoning, and intentionality. Intelligence, in a scientific sense, requires more than statistical association; it requires grounding, goal-directed behavior, and causal inference. Without these, LLMs remain powerful simulators: stochastic parrots (Bender et al., 2021), not minds.

References

  • Bachmann, P., & Nagarajan, V. (2024). Pitfalls of Next-Token Prediction in Sequential Planning Tasks. arXiv preprint arXiv:2403.06963.
  • Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? FAccT.
  • Dennett, D. (1987). The Intentional Stance. MIT Press.
  • Harnad, S. (1990). The Symbol Grounding Problem. Physica D.
  • Lake, B. M., & Baroni, M. (2018). Generalization Without Systematicity: Insights from Deep Learning. Trends in Cognitive Sciences.
  • Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS.
  • Xu, Y., et al. (2024). Hallucination is Inevitable: A Formal Limit of Language Modeling. arXiv preprint arXiv:2401.01039.