Do We Finally Have a Model That Gets AI from ‘Artificial’ to ‘Actual’?


For years, the dominant paradigm in AI language models has been autoregressive (AR) learning, where models predict text sequentially—word by word, token by token. While this approach has been tremendously successful, it comes with inherent inefficiencies. It struggles with long-range coherence, it learns knowledge hierarchies through brute-force sequential training, and it often lacks a structured conceptual understanding of language.

But what if there was a better way? What if AI could learn the way biological minds do—building up conceptual frameworks before refining them into detailed knowledge? Enter Diffusion-Based Large Language Models (dLLMs), an approach that could fundamentally change how we think about AI.


From Next-Token Prediction to Hierarchical Understanding

Autoregressive models like GPT learn through next-token prediction, where each word is generated based on the previously generated ones. While this allows for fluency, it has major drawbacks:

Inefficient Learning – Knowledge structures emerge implicitly rather than explicitly.

Lack of Hierarchical Representation – AR models don’t explicitly learn concepts before details; everything is learned simultaneously.

Long-Context Struggles – Since AR models are trained sequentially, they have trouble maintaining coherence across long passages.
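To make the contrast concrete, here is a toy sketch of autoregressive generation. The bigram lookup table stands in for a trained model and is purely illustrative; the point is the one-token-at-a-time loop, not the "model" itself.

```python
# Toy sketch of autoregressive next-token prediction: each token is chosen
# conditioned only on what came before. The bigram table below is a
# made-up stand-in for a trained model.
BIGRAMS = {
    "<s>": "the", "the": "model", "model": "predicts",
    "predicts": "tokens", "tokens": "sequentially",
}

def generate(start="<s>", max_len=5):
    tokens, current = [], start
    for _ in range(max_len):
        if current not in BIGRAMS:
            break
        current = BIGRAMS[current]  # greedy: take the single "most likely" next token
        tokens.append(current)
    return tokens

print(" ".join(generate()))
```

Note that the loop never revisits an earlier choice: once a token is emitted it is fixed, which is exactly the property diffusion-style generation relaxes.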

Diffusion-based models flip this on its head. Rather than predicting the next word one-by-one, dLLMs start with a noisy or rough version of text and refine it step by step—a process called hierarchical denoising.
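A minimal sketch of that refinement loop follows, loosely modeled on a LLaDA-style masked-diffusion sampler. The `predict` function is a hypothetical stand-in for a trained network; here it simply reveals the target token, so the mechanics of iterative unmasking (not the model) are what the sketch demonstrates.

```python
import random

# Toy iterative denoising for a masked-diffusion language model: start from
# a fully masked ("noisy") sequence and commit the most confident
# predictions a few at a time, step by step.
MASK = "<mask>"
TARGET = ["the", "os", "manages", "memory", "and", "processes"]

def predict(sequence, position):
    """Hypothetical model call: return (token, confidence) for a masked slot."""
    return TARGET[position], random.random()

def denoise(steps=3):
    seq = [MASK] * len(TARGET)          # start from pure "noise": all masks
    per_step = max(1, len(seq) // steps)
    for _ in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked:
            break
        # Score every masked slot, then commit only the most confident
        # predictions, mirroring confidence-based unmasking samplers.
        scored = sorted(
            ((predict(seq, i), i) for i in masked),
            key=lambda x: -x[0][1],
        )
        for (token, _conf), i in scored[:per_step]:
            seq[i] = token
    # Fill any masks left over after the fixed number of steps
    for i, tok in enumerate(seq):
        if tok == MASK:
            seq[i] = predict(seq, i)[0]
    return seq

print(" ".join(denoise()))
```

Because every remaining mask is rescored each step, later decisions can condition on tokens committed anywhere in the sequence, not just to the left.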


How Coarse-to-Fine Diffusion Mirrors Human Thinking

A core feature of dLLMs is the coarse-to-fine training approach, which models information hierarchically. This means:

  1. High-Level Concepts First – The model first learns broad categories and relationships (e.g., “What is an operating system?”).
  2. Mid-Level Refinement – The model learns structured breakdowns (e.g., “An OS has process management, memory management, and file systems.”).
  3. Fine-Grained Detail – The model fills in the technical specifics (e.g., “Here’s an implementation of virtual memory paging in C.”).

This structured learning process mimics how humans build understanding—from broad concepts to refined knowledge.
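One simple way to realize such a curriculum is to anneal the masking rate across training phases. The sketch below is an illustrative assumption, not a published recipe: early phases mask most tokens, so only broad structure is learnable, while later phases mask fewer, exposing finer detail.

```python
# Toy coarse-to-fine curriculum for a masked-diffusion model. The phase
# count and the high/low rates are illustrative assumptions chosen for
# this sketch, not values from any published training setup.

def masking_rate(phase: int, total_phases: int = 3,
                 high: float = 0.9, low: float = 0.15) -> float:
    """Linearly anneal the fraction of masked tokens from `high` to `low`."""
    frac = phase / max(1, total_phases - 1)
    return high + frac * (low - high)

for p in range(3):
    # Early phases: mostly masks, so only coarse concepts are recoverable.
    # Later phases: few masks, so fine-grained detail becomes learnable.
    print(f"phase {p}: mask {masking_rate(p):.0%} of tokens")
```

A real schedule would likely be continuous rather than phased, but the monotone high-to-low shape is the essential idea: concepts first, details last.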


Lessons from Diffusion Models in Image Generation

Diffusion models have already revolutionized image generation, creating highly detailed visuals from random noise. The process of gradually removing noise from an image, revealing meaningful structure step by step, parallels how text diffusion models refine linguistic structures. Just as an image begins as a blurry representation before sharpening into a high-resolution output, diffusion-based LLMs start with an approximate idea and progressively refine it into well-formed text.

Both approaches implicitly construct a connected vector space of concepts—a layered representation that encodes relationships between ideas, whether visual or linguistic. This layered refinement, reminiscent of splitting the probability distribution across conceptual levels, suggests that structured learning with progressive refinement will drive networks toward generating coherent details in visual, textual, and even auditory spaces.


What Does Splitting the Distribution Mean?

One key concept in both image and text diffusion models is splitting the distribution—the process where a broad, undifferentiated concept gradually divides into finer details. This mirrors how human perception and cognition work:

  1. Coarse Categorization: At first, an object or idea is recognized as a general category (e.g., “a bird”).
  2. Feature Differentiation: More details emerge to refine the concept (e.g., “a parrot” vs. “a hawk”).
  3. Fine-Grained Specificity: At the final stage, minute distinctions are recognized (e.g., “a blue-and-yellow macaw”).
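One way to picture this splitting is as probability mass flowing down a hierarchy of conditionals. All categories and numbers in the sketch below are invented purely for illustration.

```python
# Toy "splitting the distribution": a coarse category's probability mass
# is divided among ever-finer subcategories via conditional probabilities.
# Every category and number here is made up for illustration.

coarse = {"bird": 0.6, "mammal": 0.4}
fine_given_bird = {"parrot": 0.5, "hawk": 0.5}
finest_given_parrot = {"blue-and-yellow macaw": 0.3, "grey parrot": 0.7}

# Chain rule: P(macaw) = P(bird) * P(parrot | bird) * P(macaw | parrot)
p_macaw = (coarse["bird"]
           * fine_given_bird["parrot"]
           * finest_given_parrot["blue-and-yellow macaw"])
print(f"P(blue-and-yellow macaw) = {p_macaw:.2f}")

# Mass at each level still sums to one: refinement redistributes
# probability, it never creates or destroys it.
assert abs(sum(coarse.values()) - 1.0) < 1e-9
assert abs(sum(fine_given_bird.values()) - 1.0) < 1e-9
```

The fine-grained probability is always a product of coarser choices, which is why getting the high-level category right first constrains everything downstream.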

This is not unlike how diffusion models work.


In images, noise is progressively removed until fine textures and details emerge.
In text models, broad concepts are generated first, before syntax, structure, and vocabulary are refined.
In audio, raw waveforms can first be structured into rhythmic patterns, and later into distinct phonemes and words.

This multi-scale refinement means that, instead of simply predicting the next token, the model is learning the very structure of meaning—how ideas branch and develop across multiple layers of abstraction.


Why This Approach Is More ‘Actual’ Than ‘Artificial’

For AI to move beyond just mimicking human language and start reasoning meaningfully, it must learn the hierarchical relationships between ideas rather than treating all words as independent predictions. Diffusion-based LLMs take a step in this direction by:

Encoding Knowledge in a Coherent Structure – Rather than hoping coherence emerges, dLLMs enforce it through structured denoising.

Solving Long-Term Context Issues – By structuring information retrieval hierarchically, models can better recall earlier concepts.

Generating Meaning Rather Than Memorization – Because the model first forms high-level structures, it is less prone to memorizing data and more capable of generalizing knowledge.


The Future of AI: A Shift in Thinking

This is a radical shift in how we think about AI. If autoregressive models were the first big step in language AI, diffusion models might be the second, because they take a more natural, structured, and efficient approach to learning and generating text.

So, do we finally have a model that gets AI from ‘Artificial’ to ‘Actual’? While we’re not fully there yet, this approach represents one of the most promising shifts in AI architecture in years. If AI is going to transition from predicting text to understanding and reasoning, diffusion-based models might just be the key to unlocking that future.


Further Reading & Resources

🔗 Try a dLLM (Inception Labs Chat): https://www.inceptionlabs.ai/

📄 Research Paper on Diffusion LLMs (LLaDA): https://arxiv.org/abs/2502.09992

🛠️ Open-Source dLLM Code (LLaDA GitHub Repo): https://github.com/ML-GSAI/LLaDA

📚 Introduction to Diffusion Models (Hugging Face Blog): https://huggingface.co/blog/diffusion-models