LLM Hallucinations Are Compression Artifacts — And That Explains Everything
Imagine being handed 10 terabytes of text and being told to compress it into a 70-gigabyte file. Not just store it—but make it usable. At any moment, someone might ask a question, and you’d need to reconstruct a meaningful answer from that compressed version.
Not perfectly. Not bit-by-bit. But close enough to make sense.
At that point, one realization becomes unavoidable: this is lossy compression. Some information will be lost. It’s not a flaw—it’s a mathematical inevitability.
And that’s exactly what large language models (LLMs) are doing.
Prediction Is Compression (Not a Metaphor)
This idea might sound poetic, but it’s not. It’s grounded in information theory.
Back in 1948, Claude Shannon demonstrated something profound: predicting the next symbol in a sequence and compressing data are mathematically equivalent problems. If you can predict well, you can compress well. And if you can compress well, you inherently understand patterns in the data.
That means when a model like GPT predicts the next token, it’s not just generating text—it’s effectively decompressing a compressed representation of knowledge.
At its most fundamental level, this is what an LLM does:
def predict_next_token(context: str) -> Distribution:
    """This is simultaneously prediction and decompression."""
    pass
The better the prediction, the fewer bits are required to encode information. And fewer bits mean better compression.
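Shannon's equivalence can be made concrete with a toy calculation. A minimal sketch, assuming only the standard formula: under an optimal code, an event with probability p costs -log2(p) bits, so a model that assigns high probability to the actual next token pays far fewer bits than one that doesn't.

```python
import math

def bits_to_encode(prob: float) -> float:
    """Optimal code length for an event with probability `prob`: -log2(p) bits."""
    return -math.log2(prob)

# A good predictor assigns high probability to the token that actually comes next.
good_predictor = bits_to_encode(0.9)   # ~0.15 bits
# A poor predictor is surprised by it, and surprise is expensive.
poor_predictor = bits_to_encode(0.01)  # ~6.64 bits
```

Summed over an entire corpus, that per-token gap is exactly the difference between a good compressor and a bad one.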
So here’s the shift in perspective:
The weights of a language model are not just parameters—they are a compressed version of its training data.
The JPEG Analogy for Language Models
If you’ve ever over-compressed a JPEG image, you’ve seen what happens.
Large, simple structures—like a face or a blue sky—remain recognizable. But small details? They vanish first. Text becomes unreadable. Edges get weird halos. Colors appear that were never there.
And yet, the image still looks plausible.
Now replace pixels with knowledge.
- Large structures → common patterns, general knowledge
- Fine details → rare facts, exact numbers, specific dates
- Artifacts → hallucinations
A hallucination, then, isn’t random nonsense. It’s what happens when the model knows something should be there—a number, a citation, a fact—but the exact information wasn’t preserved during compression. So it reconstructs a plausible approximation.
Just like JPEG invents pixels, LLMs invent facts.
Why LLMs Excel at Code but Struggle with Math
This compression perspective suddenly clarifies something many people notice.
Why are LLMs so good at writing code?
Because code is highly compressible. It’s structured, repetitive, and follows strict rules. Patterns like for i in range(n) appear millions of times. That makes them easy to encode efficiently with minimal loss.
Math, however, is a different story.
Precise numbers don’t compress well. There’s no shortcut for something like 1847 × 9283. It’s not a pattern—it’s a specific computation. Either you compute it exactly, or you store it explicitly.
LLMs do neither. They approximate.
That’s why small calculations might work, but larger ones often fail in subtle ways—off by a digit, slightly incorrect, yet still believable.
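The asymmetry between code and numbers is easy to demonstrate with an ordinary compressor. A sketch using Python's zlib (the sample strings are illustrative): repetitive code-like text shrinks to a small fraction of its size, while a string of arbitrary digits barely compresses at all.

```python
import random
import zlib

random.seed(0)

# Highly patterned, code-like text: the same structure repeats over and over.
code_like = ("for i in range(n):\n    total += data[i]\n" * 500).encode()

# Arbitrary digits of the same length: no repeating pattern to exploit.
digits = "".join(random.choice("0123456789") for _ in range(len(code_like))).encode()

code_ratio = len(zlib.compress(code_like)) / len(code_like)
digit_ratio = len(zlib.compress(digits)) / len(digits)
```

The code-like text compresses to a few percent of its original size; the digit string stays close to its entropy limit. An LLM faces the same wall: patterns fit in the weights, specific numbers mostly don't.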
Model Size, Temperature, and “Creativity”
If hallucinations are compression artifacts, then increasing model size is essentially increasing bitrate.
Think of it like moving from a low-quality JPEG to a high-quality one. More parameters mean more capacity to preserve detail. Fewer artifacts. Better reconstruction.
But never perfect—because compression still exists.
Then there’s temperature, often misunderstood as “creativity.”
In reality, it behaves more like a quality slider:
- Low temperature → sharp, deterministic output (but rigid artifacts)
- Medium → balanced sampling
- High → noisy, diverse, but less accurate
What we call creativity is often just sampling from less probable reconstructions—not true invention, but variation within compressed knowledge.
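The quality-slider behavior falls directly out of the sampling formula. A minimal sketch (the function name and example logits are illustrative): temperature divides the logits before softmax, so low T sharpens the distribution toward the single most probable reconstruction, while high T flattens it toward less probable ones.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Scale logits by 1/T before softmax: low T sharpens, high T flattens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
cold = softmax_with_temperature(logits, 0.1)   # nearly all mass on the top token
hot = softmax_with_temperature(logits, 10.0)   # close to uniform
```

At T near 0 the model almost always emits its best reconstruction; at high T it samples freely from the tail, which reads as "creative" and fails as "inaccurate" for the same reason.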
RAG, Fine-Tuning, and Prompting Through the Lens of Compression
Once you adopt this framework, many AI techniques become surprisingly intuitive.
- RAG (Retrieval-Augmented Generation)
  Injects lossless data into the process. Instead of relying on compressed memory, the model gets access to the original information.
- Fine-tuning
  Reallocates compression priorities. It’s like saying: “Preserve legal language better, even if something else degrades.”
- Prompt engineering
  Guides the decompression process—telling the model where to “look” within its compressed representation.
- RLHF
  Adjusts perceived quality, similar to how audio codecs optimize for human perception.
Seen this way, we’re not “teaching intelligence.” We’re managing compression and reconstruction.
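The RAG idea can be sketched in a few lines. This is a deliberately naive illustration, not any real library's API: retrieval here is plain keyword overlap, and the prompt simply prepends the retrieved source text so the model reconstructs from lossless context instead of compressed weights.

```python
def retrieve(query: str, documents: list[str]) -> str:
    """Naive retrieval: return the document sharing the most words with the query."""
    q_words = set(query.lower().split())
    return max(documents, key=lambda d: len(q_words & set(d.lower().split())))

def build_prompt(query: str, documents: list[str]) -> str:
    """Prepend the retrieved (lossless) text so the model need not rely on memory."""
    context = retrieve(query, documents)
    return f"Context: {context}\n\nQuestion: {query}"

docs = [
    "Paris is the capital of France.",
    "Python is a programming language.",
]
prompt = build_prompt("What is the capital of France?", docs)
```

Real systems swap keyword overlap for embedding similarity, but the principle is the same: move the hard-to-compress specifics out of the weights and into the context window.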
Can Hallucinations Ever Be Eliminated?
Here’s the uncomfortable truth.
If hallucinations are artifacts of lossy compression, then they cannot be completely eliminated.
You can:
- Increase model size (more bits)
- Add external memory (RAG)
- Improve architecture (better codec)
But as long as you’re compressing massive datasets into finite models, information loss is unavoidable.
Anyone claiming otherwise is either oversimplifying—or ignoring information theory.
Humans Are Lossy Codecs Too
This is where things get interesting.
Try recalling what you had for lunch last Thursday. Or what was on slide 14 of yesterday’s presentation.
Chances are, you can’t.
Human memory works the same way. We compress experiences into patterns, discard details, and reconstruct them later. Psychologists call this confabulation—filling in gaps with plausible information.
In other words, we hallucinate too.
The difference is time. Our “codec” has been optimized over millions of years. LLMs have had only a few.
Final Perspective: LLM as Artificial Memory
Maybe the biggest misconception is thinking of LLMs as thinking machines.
They’re not.
They’re closer to something else: artificial memory systems. Extremely dense, incredibly powerful, but inherently imperfect.
Once you accept that, a lot of confusion disappears.
You stop expecting perfect accuracy. You stop fearing sudden sentience. And you start treating LLMs like what they are:
A tool for reconstructing meaning from compressed knowledge.
Not truth. Not understanding. But something surprisingly useful in between.