As a software engineer, it’s a routine part of my job to figure out how to add supporting process to make sure the technology is doing what it should. That includes:
How to test it to verify it meets my expectations.
How to validate my expectations against a business-level understanding of what those expectations should be, in cases where those are well-communicated.
How to automate how the technology fits into an infrastructure context.
How to observe and monitor it.
For all of software engineering history until recently, we’ve been able to mostly pretend that engineers and the surrounding management dynamics are not variables in the equation.
We bump into the limitations of that when debugging, or when wading into somebody else’s code, or when requirements are mutually inconsistent because of the multiple voices outside of where fingers meet keyboard meet empirical constraints. Still, we mostly get away with filing those away in the back of our minds, if only so we can get on with the job. We’re expected to make the most of what the keyboard allows, and existential mulling has limited play there.
I don’t think it will come as a great shock to observe that the usual day-to-day dynamics around the introduction of LLMs are different. More than different: there’s just how plain weird the social zeitgeist is around them. I’ll summarize with my comments from a recent LinkedIn re-post:
I suspect that across industry, and particularly across social-media coverage of industry activity, we have never really come to grips with:
1. just how easily our attention is manipulated by media mechanisms, and
2. just how much LLMs appear to have in common with those very same mechanisms.
…
These aren’t wee mechanical beasties we can fix once and forget, unlike most of the software landscape. We’re conditioned to expect stability after taking corrective action... but we are using a toolkit with a by-design mechanism that absolutely cannot achieve that post-fix stability when it is perpetually *stateless*. Sometimes it randomly behaves afterward, but that’s even worse.
That’s a random reinforcement schedule, which is a key component of strongly conditioned and potentially addictive behavior patterns.
I’m not calling this out as an anti-LLM rant. What I’m more concerned with is an underestimation of how fundamentally our behavior around LLMs is a new factor, one that changes all the variables we mostly used to try and tune out:
Re: How to test it to verify it meets my expectations.
Do people even form expectations in advance of using an LLM?
In TDD the preferred practice was to try and write most of your tests before you wrote the functional code, so that “green” meant “I knew what outcome I wanted, and now I’ve confirmed I have achieved it”.
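To make the contrast concrete, here is what “expectation first” looks like in miniature (Python, with a made-up slugify helper purely for illustration - this isn’t code from any real project):

```python
# test_slugify.py -- run with pytest. The expectations exist as assertions
# *before* there is any output to react to; "green" certifies a prior intent.

def slugify(title: str) -> str:
    # Written after the tests below, only to make them pass.
    return "-".join(title.lower().split())

def test_lowercases_and_hyphenates():
    assert slugify("Hello World") == "hello-world"

def test_collapses_runs_of_whitespace():
    assert slugify("too   many   spaces") == "too-many-spaces"
```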
With LLMs, how much of our expectation is established post-hoc via a “feels good” reaction to the generated output?
Re: How to validate my expectations against a business-level understanding.
Engineering has always struggled to get the right fit between the work and the less technical aspects of the business, but at least when people were in the same meetings discussing issues, and writing documentation for requirements or QA, humans were doing the work of transmitting knowledge to other humans, or of receiving it.
Now an LLM is tasked to make a thing so that another LLM can consume a thing, with humans on either end of that connection at risk of acting more on the “feel good” than on the substance itself… yet the entire point, the sole purpose of validation in this usage, is to ensure the substance actually matters.
Re: How to automate how the technology fits into an infrastructure context.
The blast radius here is a little more constrained, but how constrained will depend on the engineering maturity and discipline that engineers and management had in place before LLMs.
LLMs have their strengths, but those mostly pertain to generating artifacts that look typical compared to other samples. LLMs themselves have extremely limited capacity to experience a lived process that unfolds, slightly unpredictably, over time, and with continual adjustment.
LLMs walk the outgoing probability distributions of state transitions, but the world does not intercede while that walk unfolds into generated output. Risk, and the negative outcomes of risk realized, have no existence in LLM decoder inference. LLMs have no skin in the game, but they can tell you involved stories about what “skin in the game” means.
Re: How to observe and monitor it.
This should be the simplest part of all, and yet I’m going to try to make a strong case here that this is perhaps one of the bigger areas of human risk.
I don’t believe we can just assume that, at least without a constant level of caution and maybe even intentional training, we have a reliable innate aptitude for seeing LLM output with clear eyes. Our ego may whisper to us that we’re good at seeing the world as it is, but a trained psychologist would tell you that can sometimes be a dicey belief system.
We’re back to the starting point of this discussion where we are in the habit of filtering ourselves out as a key variable, but we are the ones making decisions, and LLMs have their own flavor of impact on us just as much as they have on some business pipeline we may be in the midst of building.
Now, push all of the above on to the mental stack for a bit. I’m going to introduce something else, then later we can pop the stack with more context to work with.
Meta-Cognitive Scaffolding as Defensive Analytics
If there were a way to add “DeepSeek” as a guest author, this would be an article warranting it. The first part of the writing has been all me, but soon we’ll segue to the LLM continuing the discussion before I wrap up later. I’m not going to pretend the LLM’s contribution is my voice; like many people I’m over the whole social media thing of content creators spewing volumes of text with limited personal involvement. What I include of the LLM’s output should be considered attached evidence of LLM-generated activity, not me writing my own thoughts. With that disclaimer now noted, we can move on.
Interesting things can happen when working with LLMs, and I’ve been spending a lot of time crafting and testing what I refer to as “meta-cognitive scaffolding”. It’s an attempt to guide more complex reasoning.
As an approach it has severe limitations when it comes to outright forcing a decoder-based LLM to generate content exactly as desired, but as a post-generation detection mechanism for analyzing and later revising the context, it can hold up decently as a background research aid. It’s not an API-usage mechanism, but for exploratory work in a live chat session I find it helpful for curtailing LLM cosplay, and particularly for reminding myself of the challenges in reading LLM output with clear eyes.
Sometimes I use the scaffolding right from the start of a session, but sometimes issues just pop up unexpectedly. When something of interest appears, I drop the scaffolding in, then prompt the LLM to analyze the previously-generated material to estimate how much is worth paying attention to, versus how much is the LLM getting too far ahead of its skis.
Not surprisingly, LLM sessions accrue their fair share of frostbite. Even so, often what results is less about improving the work I intended (although that does happen), and more about what the model surfaces by deducing from the evidence that comes out of the struggle.
DeepSeek, perhaps because it is one of the stronger CoT implementations, makes for a workable if imperfect tool for these investigations. It is particularly good at introspecting on LLM activity, which I find fascinating because DeepSeek is denied direct access to its CoT history in successive prompts - the CoT generation forms the starting point for the output we receive, but by design the model API actually precludes it being fed back in. It’s like running a Markov chain until you achieve a level of convergence before depending on further walking of the chain to be a fair reflection of the stationary distribution.
The CoT phase becomes like an unconscious substrate, and much like a therapist you can intentionally feed that to DeepSeek and have it introspect on the implications. Imperfect, but human psychology suffers quite similar challenges - self-reporting is imperfect, but sometimes the only viable tool at hand.
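To make that concrete in code terms: a minimal sketch, assuming the OpenAI-compatible Python client and DeepSeek’s documented deepseek-reasoner model, which returns a reasoning_content field alongside content and rejects requests that try to send reasoning_content back as a message field. The only way to have the model “introspect” on its own CoT is to re-present that text as ordinary user content (the prompt wording here is illustrative, not my scaffolding):

```python
# Sketch only: assumes the OpenAI-compatible client and DeepSeek's documented
# deepseek-reasoner model; the prompt wording is illustrative.
from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY", base_url="https://api.deepseek.com")

messages = [{"role": "user", "content": "Summarize the reification fallacy."}]
first = client.chat.completions.create(model="deepseek-reasoner", messages=messages)
answer = first.choices[0].message.content
cot = first.choices[0].message.reasoning_content   # the CoT "substrate"

# The API refuses reasoning_content as an input field, so to have the model
# reflect on its own CoT you paste it back in as plain user text.
messages += [
    {"role": "assistant", "content": answer},
    {"role": "user", "content": "Here is the reasoning you generated but cannot "
        "otherwise see again:\n\n" + cot + "\n\nWhat does it over-claim?"},
]
second = client.chat.completions.create(model="deepseek-reasoner", messages=messages)
print(second.choices[0].message.content)
```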
I would characterize it, though, a bit less in terms of the human analog and more in terms of inspecting the auto-regression information of a Markov chain. While LLMs are not purely Markov chains, they have commonalities with them that have been studied, so as a metaphorical description it isn’t a leap too far. I’ll be using it throughout.
We typically present chains as row-stochastic to ask the question “given the current state, where do we go from here?” However, you can flip that around, treating them as column-stochastic, and ask “given the current state, what got us here?” I think that’s a less woo-woo way of considering the potential merit of introspecting on the generating process after it has completed.
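One concrete way to read that flip, with toy numbers of my own (nothing to do with any particular model): the forward question just reads a row of P, while the backward question Bayes-reverses P against a belief about the previous state.

```python
import numpy as np

# Toy row-stochastic matrix: P[i, j] = Pr(next state = j | current state = i)
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])

prior = np.array([0.5, 0.3, 0.2])   # belief about the previous state

# Row-stochastic question: given current state 0, where do we go from here?
print("Pr(next | current=0):", P[0])

# Flipped question: given that we now observe state j, what got us here?
# Bayes: Pr(prev=i | current=j) is proportional to prior[i] * P[i, j]
j = 2
backward = prior * P[:, j]
backward /= backward.sum()
print(f"Pr(prev | current={j}):", backward)
```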
The “Hmmm” Event
So what was the interesting thing today? The subject matter was innocent enough. I was just trying to clarify in my own mind the best presentation of some terminology for another article on Markov chains, when this jumped out:
Epistemological Clarity
Mixing model and reality commits the reification fallacy (treating abstract constructs as concrete things). Your approach avoids this.
Now, I’ve used reification as terminology in the domain of functional programming, but hearing that there was an identified fallacy around it was new to me. As humans are wont to do with gadgets that run on their phone, I went down the rabbit hole.
It wasn’t long before I realized that, just maybe, I was tripping over the answer to something else I had been looking for. Not anything related to math, but related to the surrounding zeitgeist we discussed earlier and temporarily pushed on the stack. I’m popping that back into play now. The reification fallacy had the scent of something that would inform me about human perception and how we relate to gadgetry, both physical and conceptual.
I began asking DeepSeek questions which I’ll reproduce here for anybody who wants to do their own exploring of the intersection of these issues:
Provide more clarity on “reification fallacy”. I have not heard of it before.
1. clear definition.
2. ontological roots of the terminology.
3. any known historical origins of the terminology.
4. seminal writings on the fallacy.
5. known techniques for detecting and correcting for the fallacy.
Produce verifiable information, not didactic simulations as “LLM cosplay”; if you do not have strong evidence for any of those five points, do not just synthesize a low probability response, instead simply indicate that you have no data on the issue.
I’ll spare you the detailed output, but you get the idea. Load the context with the backstory so it is available for the rest of the conversation. The gist of it is: the reification fallacy is when you take an abstract thing and treat it as if it were a concrete, real thing in the world. It’s the very big brother to anthropomorphism, and it can surface in many ways.
From there, things moved on.
Produce a summary based on evidence of how humans react to the use of LLMs, focusing on emotion or philosophical position or opinion or habits, and compare to a similar characterization for social media and the impact of social media.
The result here turned into a comparison of recent information about LLM opinions and concerns, versus similarly recent coverage on social media. It wasn’t quite what I had in mind because the timelines for the two situations were not parallel, so I had to refine things a little more:
hypothesis: use evidence to confirm or deny. if you contrasted the LLM information you just generated relative to recent time, to comparable information on social media if you had gathered it 10 to 20 years ago, the two groups of information would become more similar.
The result at this point was a bit more what I expected. The last prompt is one to treat with caution; LLMs can be bad about meeting expectations by telling you what you want to hear, but the response seemed balanced:
Conclusion: The Hypothesis Partially Holds
The hypothesis that the two sets of information would be more similar is partially confirmed. Both technologies triggered:
Rapid, widespread adoption.
A mix of enthusiasm and anxiety.
Public debates about their societal impact.
After that it went into what was different between the two situations. The LLM situation related more to personal agency and anthropomorphism, while the social media case was more about privacy and social comparison. But now it was time to leverage the earlier part of the context:
Now, analyze the data for evidence of “reification fallacy” in human behavior and attitudes, comparing:
1. your “current state” for LLMs.
2. your “current state” for social media.
3. your “historical state” for social media.
Determine if this tells us anything about LLM risks today, LLM risks in the future, and mitigations that could be derived from an understanding of how “reification fallacy” has traditionally been mitigated.
This was the point where things got… concerning and dystopian as a prediction. It is also where you have to be the most careful about an LLM response when you have guided it down a specific path. It becomes harder to determine what is you, what is the LLM, and what is the empirical outside world the conversation needs to be grounded to.
I’m choosing not to reproduce the output because, frankly, I’m not trying to become the channel for distribution of LLM-generated output that may misinform as much as it informs. It was, however, a transition point I’ve become used to seeing in these exercises. Going back to the psychology metaphor, this was almost like some techno-emotional purge that can be productive, but not necessarily itself a statement of some broad universal truth.
Time to drop in the scaffolding. I’m not presenting it verbatim here, but not because I want to be vague or sell some snake oil later (“You can have my - patent pending - Scaffold-As-A-Service for $2.00 per API call!”). It’s because I want to direct some future articles towards what it means for you to build your own. Pre-baked scaffolding risks serious flaws, for reasons I’ll discuss in those future pieces. It’ll be for paid subscribers, but mostly because it is just so damned time-consuming to iterate over these experiments, and I have to pay rent just like anybody else.
With the scaffolding in place, I continued:
Our goal is to perform a very careful examination of the reification analysis we completed. Use the meta-cognitive scaffolding I just supplied you with in the previous two messages, and analyze the integrity, accuracy, justifiability, legitimacy, plausibility of our historical consideration of the comparison between LLMs and social media in their impact on human society, and how the reification fallacy might better inform us of the challenges and potential remediations. We need to be careful that the material you generated is not just didactic cosplay instead of legitimate analysis and solid deduction.
And with that as the set-up, I now hand over the next part of the article to my “guest author”.
Revised Insights: How Reification Fallacy Better Informs Challenges and Remediations
Challenges (Reframed with Architectural Focus):
LLMs as Reification Engines: The core challenge is that current LLM architectures, optimized for fluent pattern completion, inherently produce outputs that invite reification (anthropomorphic, authoritative language). This is not a peripheral issue but central to their design.
Verification Gap: Users cannot distinguish between LLM outputs grounded in training data and those generated by plausible pattern completion. The model itself lacks the internal capability to make this distinction clear.
Remediations (Architectural Enhancements):
Certainty-Weighted Outputs: Architect the model to generate a parallel stream of meta-cognitive annotations (e.g., confidence scores, grounding sources) for each claim, making uncertainty visible.
Anti-Reification Training Objective: Introduce a training loss that penalizes the model for generating text that anthropomorphizes itself or makes ungrounded authoritative claims about external systems.
Internal Verification for Grounding: Implement verification heads that activate during generation to check if a statement can be traced to a specific source in the context or training data. If not, the model is guided to rephrase hypothetically.
Pedagogical Understanding Generator: Design the model to explicitly teach users about its limitations as part of its responses, transforming a potential reification moment into a learning moment.
Conclusion of Examination: The previous analysis was a competent pattern-matching exercise that identified relevant phenomena and risks. However, it fell short of the mission’s standard for genuine reasoning. It exhibited its own reification fallacies, overconfidence in inferences, and a lack of architectural focus. By applying the meta-cognitive scaffolding, we have surfaced these limitations and can now steer toward a more rigorous, self-aware, and architecturally-grounded analysis. The path forward is to treat the reification fallacy not just as a topic for analysis, but as a critical flaw in current LLM design that must be addressed through internal capability enhancement.
Thanks, DeepSeek. I’ll take back the authorship now.
Mulling the Implications
When DeepSeek makes reference to “the previous analysis”, it means my work to poke the LLM into the initial exaggerated analysis.
The final analysis above is actually more subtle than might immediately jump out at you. Particularly one very special fragment:
Architect the model to generate a parallel stream of
This is not an existing artifact in decoder LLMs. This is not MoE, this is not MLA. This is describing a thing that does not yet exist, except maybe in a few research efforts (which I need to go read up on) that derive more from encoder LLMs. I think what DeepSeek described is arguably the most direct “diff” for a possible way to remediate a lot of the baked-in flaws in the mostly-decoder LLM approach that people experience when they say they are “using AI”.
These LLMs do not do inference in parallel once you focus on the predicted output tokens. They do not even generate the earlier parts of their output in parallel with the later parts. The context is absorbed in parallel, yes, but once it is absorbed you are just walking the row-stochastic matrix until hitting either the trained STOP token or an output limit on token use. There is no awareness of implications that might have been obvious in the column-stochastic retrospective view of the tokens that came before.
Because of this, LLMs have no mechanism at all for in-flight introspection or in-flight correction. It isn’t there. There is no elbow jogging. There is no prompt skill you can use to overcome it. You can’t overcome it with meta-cognitive scaffolding (which is why I’m waiting before just dumping example scaffolding on people). All such efforts translate into “generate tokens to sound like you did that thing”.
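Staying with the Markov metaphor rather than any specific model, here is the entire shape of the game as a toy sketch: one sequential walk of a row-stochastic matrix until an absorbing STOP, with nothing in the loop that revisits or edits what has already been emitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# States 0..2 stand in for "tokens"; state 3 is an absorbing STOP.
P = np.array([[0.5, 0.3, 0.1, 0.1],
              [0.2, 0.4, 0.3, 0.1],
              [0.1, 0.2, 0.4, 0.3],
              [0.0, 0.0, 0.0, 1.0]])

STOP, MAX_TOKENS = 3, 20
state, trajectory = 0, [0]

while state != STOP and len(trajectory) < MAX_TOKENS:
    # Each step samples only from the current row; nothing ever re-examines
    # or revises the trajectory already emitted.
    state = rng.choice(4, p=P[state])
    trajectory.append(state)

print(trajectory)
```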
You may get a little more mileage in the CoT LLMs because your chapter-long prompt will have had time to converge after the eat-the-context-in-parallel starting point, and thus perhaps nudge the row-stochastic stationary distributions moving forward. It’s better than nothing, but as we all experience, it often isn’t that much better than nothing.
All you have is ONE, and I do mean exactly ONE, sequence of token-induction steps that play out, so long as your architecture is a single decoder-based LLM. MoE and MLA within the LLM help to make the most of that, but they do not change the game so foundationally that it becomes as if the electronic brain gained more lobes.
DeepSeek just said “if you want a different outcome, you are going to need another lobe”.
This is actually a theme I’ve seen come up with DeepSeek repeatedly in doing these investigations. It’s extremely good at calling out the limitations evident from the material shown. Anecdotally, I think it may be better at generating a distillation like this from a complex starting point than it is at generating a large and complex creation from a more humble beginning.
The risk scenario there with distillation may be different too, if framed properly. Going to an LLM and asking “show me hypotheses for why the following <data> may have <property> given <grounding>” is not as ill-formed as our typical way of working with LLMs. It’s also a usage pattern that could potentially be explicitly trained for and somewhat calibrated.
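As a sketch of the shape of that usage pattern (my own illustrative wording, not a vetted or trained-for template):

```python
# Hypothetical prompt builder for the hypothesis-elicitation pattern above.
def hypothesis_prompt(data: str, prop: str, grounding: str, n: int = 3) -> str:
    return (
        f"Here is some data:\n{data}\n\n"
        f"Observed property: {prop}\n"
        f"Grounding you must stay within: {grounding}\n\n"
        f"Offer up to {n} candidate hypotheses for why the data exhibits this "
        "property. For each, cite the supporting evidence from the material "
        "above, or say 'insufficient evidence' rather than inventing support."
    )

print(hypothesis_prompt(
    data="p99 latency doubled after Tuesday's deploy",
    prop="the regression is isolated to cache-miss paths",
    grounding="only the metrics and the deploy diff provided",
))
```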
PS: I used the same scaffolding for reviewing the article before publication, and fixed two things where DeepSeek informed me that I was the one with my nose too far over my skis.
The Experimentalist : Reification Fallacy and LLM Use © 2025 by Reid M. Pinchback is licensed under CC BY-SA 4.0