A small failure case — ChatGPT cannot reliably read or generate the time on an analogue clock — opens onto Ned Block’s larger argument that current LLMs operate without the perceptual layer that grounds human spatial reasoning.
Overview
In a clip from Robinson Erhardt’s Podcast #239 (September 2025), Ned Block — Silver Professor of Philosophy and Psychology at NYU — uses an unusually mundane example to make a load-bearing claim about contemporary AI. At the time of the interview, asking ChatGPT to render an analogue watch face showing an arbitrary time would reliably return a watch with hands at ten-past-ten — the canonical position from watch advertising, where the V-shape frames the brand logo. The model had memorised the statistical regularity rather than reasoning about hour and minute hand positions.
By mid-2026 this specific failure has been patched. ChatGPT and Grok can now render arbitrary times on analogue clocks with reasonable accuracy, very likely because the labs trained on synthetic clock-face data once the limitation became a public talking point. Eight months from clean failure to working competence is itself telling: the surface of what these systems can do is moving faster than the conceptual frameworks built to describe them.
The patch is real, but it does not vindicate the architecture, and that is exactly Block’s point. The watch face was never the substantive claim; the claim was about why the failure existed and what that diagnosis implied. The diagnosis still stands; only the demo is worn out.
The seeing–thinking border
Block’s recent work, including his 2023 book The Border Between Seeing and Thinking (Oxford University Press), distinguishes two modes of cognition that humans run continuously and in parallel. Perception — Block’s “seeing” — is fast, format-specific, iconic, and operates over structured representations of space. Cognition — “thinking” — is slow, abstract, conceptual, and operates over propositional representations.
For an embodied human, reading a clock is a perceptual act before it is a conceptual one. We see two angles in space and infer the time from them. The conceptual machinery (hour conventions, AM/PM, time zones) sits on top of a perceptual base that already represents the angles correctly.
A current LLM has no perceptual base. Its “view” of a watch is whatever caption-text and pixel-statistics survived in its training distribution. The most stable signal in that distribution is the marketing convention, not the physical geometry. The model isn’t failing at clock-reading; it is succeeding at the task it was actually trained on, which is predicting watch imagery in aggregate.
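For reference, the physical geometry in question is small enough to write down. The sketch below is an illustration added here, not something from the interview: the function names `hand_angles` and `hand_separation` are hypothetical, and angles are assumed to be measured in degrees clockwise from 12 o'clock.

```python
def hand_angles(hour, minute):
    """Return (hour_angle, minute_angle) in degrees clockwise from 12 o'clock."""
    minute_angle = minute * 6.0                     # 360 degrees / 60 minutes
    hour_angle = (hour % 12) * 30.0 + minute * 0.5  # 30 degrees per hour, plus drift within the hour
    return hour_angle, minute_angle


def hand_separation(hour, minute):
    """Return the smaller of the two angles between the hands, in degrees."""
    hour_angle, minute_angle = hand_angles(hour, minute)
    diff = abs(hour_angle - minute_angle) % 360
    return min(diff, 360 - diff)


print(hand_angles(10, 10))      # (305.0, 60.0)
print(hand_separation(10, 10))  # 115.0
print(hand_angles(3, 40))       # (110.0, 240.0)
```

Ten-past-ten is just one point in this space: a hand pair at 305 and 60 degrees. A system that represents the geometry can produce any other pair as easily; one that has only absorbed the advertising convention keeps returning this one.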
Why this is a load-bearing example
What made the watch-face case useful for Block was not that it was unfixable but that it was diagnosable. Once you understand why the model defaults to ten-past-ten, you understand a class of failure:
- The error was not stochastic. It was not a one-off hallucination — the model failed the same way every time because the failure mode was structural to how it represented the input.
- The error was not solvable by chain-of-thought. Asking the model to reason step-by-step about the watch did not help, because there was no perceptual representation for the reasoning to operate over.
- The error generalises. The fix that closed the watch-face case (synthetic training data on clock geometry) is task-specific. The next demand on iconic representation — mechanical assembly diagrams, anatomical illustrations, novel object configurations, multi-step spatial transformations — re-opens the same gap until that one too is patched in turn.
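One rough way to tell a structural default from stochastic error is to look at how concentrated the wrong answers are. The sketch below is an assumption-laden illustration, not a method Block describes: `default_collapse` is a hypothetical helper, times are assumed to be `(hour, minute)` tuples, and the sample inputs are made up for demonstration rather than measured from any model.

```python
from collections import Counter


def default_collapse(requested, produced):
    """Summarise the error rate and how concentrated the wrong answers are.

    requested, produced: equal-length lists of (hour, minute) tuples.
    A top_wrong_share near 1.0 means almost every error is the same answer,
    which points to a structural default rather than random noise.
    """
    wrong = [p for r, p in zip(requested, produced) if p != r]
    if not wrong:
        return {"error_rate": 0.0, "top_wrong": None, "top_wrong_share": 0.0}
    top, count = Counter(wrong).most_common(1)[0]
    return {
        "error_rate": len(wrong) / len(requested),
        "top_wrong": top,
        "top_wrong_share": count / len(wrong),
    }


# Made-up illustrative inputs (not measurements): every request comes back as 10:10.
requested = [(3, 40), (7, 5), (10, 10), (12, 30)]
produced = [(10, 10)] * 4
print(default_collapse(requested, produced))
# {'error_rate': 0.75, 'top_wrong': (10, 10), 'top_wrong_share': 1.0}
```

Random hallucination would spread the wrong answers across many different times; the pattern described above puts all of them on a single one.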
The watch-face is now a case study in the patching dynamic Block predicted. Specific failures get closed by ingesting more of the right data, which changes the surface behaviour without addressing the underlying representational asymmetry. The architecture continues to translate everything into a linguistic-statistical space; it just gets better at faking the cases people have already noticed.
This is a sharper version of an argument Yann LeCun has made about LLMs lacking world models, but Block’s framing is more philosophically precise: the missing layer is not a “world model” in some general sense, it is the perception–cognition interface that Kant, Marr, and the broader cognitive science tradition have long worked to articulate.
What this implies for AI capability discourse
For practitioners working with LLMs, Block’s example is useful as a diagnostic. If a task requires translating between iconic and propositional representations (between “what the thing looks like” and “what is true about it”), current models will fail in characteristic ways that task-specific patches close only one case at a time. The watch face was a clean test case; mechanical assembly diagrams, anatomical illustrations, and floor plans are messier ones.
This connects to the broader 2025 discourse about what LLMs can and cannot do. The story of converging capabilities and collapsing costs is real, but it sits inside a competence envelope whose boundary is partly defined by exactly the seeing–thinking border Block describes. The question was never whether the watch-face case would be patched (it has been) but whether the architecture itself can ground spatial reasoning in something other than statistical regularities over images. That is unchanged.
Block himself is not an AI doomer or an AI booster. He is a philosopher of mind treating LLMs as an object of study, not a culture-war flashpoint. That posture is rare in the current discourse and is part of what makes the interview worth the full 90 minutes.
Practical applications
For people designing LLM-assisted systems, the watch-face heuristic suggests:
- Treat spatial outputs as suspect. When an LLM produces a diagram, layout, or visual description, verify against ground truth before trusting it. The error mode is silent and confident.
- Prefer symbolic intermediates. When the task is spatial, route through a representation the model can manipulate symbolically (coordinates, SVG primitives, structured layout descriptions) rather than asking for the spatial output directly.
- Don’t conflate fluency with competence. A model that describes a watch face fluently in English is not therefore representing it correctly internally. The English is the strong signal; the spatial representation is the weak one.
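The first two suggestions above can be sketched together for the clock case: compute hand positions symbolically, emit SVG primitives, then read the geometry back and compare it with the requested time. This is a minimal sketch under stated assumptions, not a reference implementation. The function names (`hand_endpoint`, `clock_svg`, `read_svg_time`), the `id` attributes on the hands, and the sizing constants are all hypothetical choices made for illustration.

```python
import math
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def hand_endpoint(angle_deg, length, cx, cy):
    """Endpoint of a hand at angle_deg clockwise from 12 o'clock (SVG y grows downward)."""
    rad = math.radians(angle_deg)
    return cx + length * math.sin(rad), cy - length * math.cos(rad)


def clock_svg(hour, minute, size=200):
    """Render a bare clock face (circle plus two hands) as an SVG string."""
    cx = cy = size / 2
    r = size * 0.45
    hour_angle = (hour % 12) * 30 + minute * 0.5
    minute_angle = minute * 6
    hx, hy = hand_endpoint(hour_angle, r * 0.5, cx, cy)
    mx, my = hand_endpoint(minute_angle, r * 0.85, cx, cy)
    return (
        f'<svg xmlns="{SVG_NS}" width="{size}" height="{size}">'
        f'<circle cx="{cx}" cy="{cy}" r="{r}" fill="none" stroke="black"/>'
        f'<line id="hour" x1="{cx}" y1="{cy}" x2="{hx:.2f}" y2="{hy:.2f}" '
        'stroke="black" stroke-width="4"/>'
        f'<line id="minute" x1="{cx}" y1="{cy}" x2="{mx:.2f}" y2="{my:.2f}" '
        'stroke="black" stroke-width="2"/>'
        "</svg>"
    )


def read_svg_time(svg_text):
    """Recover (hour, minute) from the hand geometry, as a ground-truth check."""
    root = ET.fromstring(svg_text)
    angles = {}
    for line in root.iter(f"{{{SVG_NS}}}line"):
        dx = float(line.get("x2")) - float(line.get("x1"))
        dy = float(line.get("y2")) - float(line.get("y1"))
        # atan2(dx, -dy) gives the angle clockwise from 12 o'clock.
        angles[line.get("id")] = math.degrees(math.atan2(dx, -dy)) % 360
    minute = round(angles["minute"] / 6) % 60
    hour = round((angles["hour"] - minute * 0.5) / 30) % 12
    return (hour or 12), minute


svg = clock_svg(3, 40)
print(read_svg_time(svg))  # (3, 40)
```

The same `read_svg_time` check could be applied to SVG produced by a model rather than by `clock_svg`, provided the output follows the same assumed structure. A renderer that silently defaults to ten-past-ten would fail the comparison for any other requested time.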
Limitations and open questions
Block’s argument is about LLMs as currently architected. It does not foreclose:
- Multimodal models with native vision. GPT-4V, Claude with vision, and Gemini all have perceptual front-ends, but Block’s claim is that these are still bolted-on rather than constitutive — they translate images into the same linguistic representation rather than maintaining a parallel iconic one. Whether this is a difference in degree or in kind is open.
- Embodied agents. Robotics-adjacent work that grounds models in continuous sensorimotor data may approach the seeing–thinking border from the other side. None of this is settled.
- Hybrid neurosymbolic systems. Neurosymbolic approaches explicitly try to bridge perceptual and propositional layers. Whether they can do so without inheriting the brittleness of classical symbolic AI is the open empirical question.
Sources
- Robinson Erhardt’s Podcast #239, Ned Block: Consciousness, Artificial Intelligence, and the Philosophy of Mind (Sep 2025) — https://www.youtube.com/watch?v=wM1fcZr0iSk
- Block, N. (2023). The Border Between Seeing and Thinking. Oxford University Press. https://a.co/d/fqVb7gj
- Ned Block’s website: https://www.nedblock.us