{"id":343,"date":"2025-03-07T18:34:17","date_gmt":"2025-03-07T18:34:17","guid":{"rendered":"https:\/\/www.mentic.com\/blog\/?p=343"},"modified":"2025-03-08T00:55:59","modified_gmt":"2025-03-08T00:55:59","slug":"do-we-finally-have-a-model-that-gets-ai-from-artificial-to-actual","status":"publish","type":"post","link":"https:\/\/www.mentic.com\/blog\/2025\/03\/do-we-finally-have-a-model-that-gets-ai-from-artificial-to-actual\/","title":{"rendered":"Do We Finally Have a Model That Gets AI from &#8216;Artificial&#8217; to &#8216;Actual&#8217;?"},"content":{"rendered":"\n<p>For years, the dominant paradigm in AI language models has been <strong>autoregressive (AR) learning<\/strong>, where models predict text sequentially\u2014word by word, token by token. While this approach has been tremendously successful, it comes with inherent inefficiencies. It struggles with long-range coherence, it learns knowledge hierarchies through brute-force sequential training, and it often lacks a structured conceptual understanding of language.<\/p>\n\n\n\n<p>But what if there was a better way? What if AI could learn the way biological minds do\u2014building up <strong>conceptual frameworks<\/strong> before refining them into detailed knowledge? Enter <strong>Diffusion-Based Large Language Models (dLLMs)<\/strong>, an approach that could fundamentally change how we think about AI.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>From Next-Token Prediction to Hierarchical Understanding<\/strong><\/h3>\n\n\n\n<p>Autoregressive models like GPT learn through <strong>next-token prediction<\/strong>, where each word is generated based on the previously generated ones. 
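<\/p>\n\n\n\n<p>As a minimal sketch (a hand-written bigram table standing in for a trained network, not any real model&#8217;s API), next-token prediction looks like this: each new token is drawn from a distribution conditioned only on what has already been generated.<\/p>\n\n\n\n
```python
# Toy sketch of autoregressive next-token prediction.
# The "model" is a hand-written bigram table, not a trained network.
BIGRAMS = {
    '<s>':      {'the': 0.9, 'a': 0.1},
    'the':      {'model': 0.6, 'text': 0.4},
    'model':    {'predicts': 1.0},
    'predicts': {'tokens': 1.0},
    'tokens':   {'</s>': 1.0},
}

def generate(max_len=10):
    tokens = ['<s>']
    for _ in range(max_len):
        # Greedily pick the most probable next token given only the last one.
        next_probs = BIGRAMS.get(tokens[-1], {'</s>': 1.0})
        tokens.append(max(next_probs, key=next_probs.get))
        if tokens[-1] == '</s>':
            break
    return tokens[1:-1]  # strip the sentinel tokens

print(' '.join(generate()))  # -> "the model predicts tokens"
```
\n\n\n\n<p>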
While this allows for fluency, it has major drawbacks:<\/p>\n\n\n\n<p><strong>Inefficient Learning<\/strong> \u2013 Knowledge structures emerge implicitly rather than explicitly.<\/p>\n\n\n\n<p><strong>Lack of Hierarchical Representation<\/strong> \u2013 AR models don\u2019t explicitly learn concepts before details; everything is learned simultaneously.<\/p>\n\n\n\n<p><strong>Long-Context Struggles<\/strong> \u2013 Since AR models are trained sequentially, they have trouble maintaining coherence across long passages.<\/p>\n\n\n\n<p>Diffusion-based models flip this on its head. Rather than predicting the next word one-by-one, dLLMs start with <strong>a noisy or rough version of text<\/strong> and refine it step by step\u2014a process called <strong>hierarchical denoising<\/strong>.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>How Gross-to-Fine Diffusion Mirrors Human Thinking<\/strong><\/h3>\n\n\n\n<p>A core feature of dLLMs is the <strong>gross-to-fine training approach<\/strong>, which models information hierarchically. 
This means:<\/p>\n\n\n\n<ol start=\"1\" class=\"wp-block-list\">\n<li><strong>High-Level Concepts First<\/strong> \u2013 The model first learns broad categories and relationships (e.g., &#8220;What is an operating system?&#8221;).<\/li>\n\n\n\n<li><strong>Mid-Level Refinement<\/strong> \u2013 The model learns structured breakdowns (e.g., &#8220;An OS has process management, memory management, and file systems.&#8221;).<\/li>\n\n\n\n<li><strong>Fine-Grained Detail<\/strong> \u2013 The model fills in the technical specifics (e.g., &#8220;Here\u2019s an implementation of virtual memory paging in C.&#8221;).<\/li>\n<\/ol>\n\n\n\n<p>This structured learning process mimics how humans build understanding\u2014from <strong>broad concepts to refined knowledge<\/strong>.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Lessons from Diffusion Models in Image Generation<\/strong><\/h3>\n\n\n\n<p>Diffusion models have already revolutionized <strong>image generation<\/strong>, creating highly detailed visuals from random noise. The process of <strong>denoising<\/strong> an image to gradually add meaningful structure parallels how <strong>text diffusion models refine linguistic structures<\/strong>. Just as <strong>an image begins as a blurry representation before sharpening into a high-resolution output<\/strong>, diffusion-based LLMs start with an approximate idea and <strong>progressively refine it into well-formed text<\/strong>.<\/p>\n\n\n\n<p>Both approaches <strong>implicitly construct a connected vector space of concepts<\/strong>\u2014a layered representation that encodes relationships between ideas, whether visual or linguistic. 
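<\/p>\n\n\n\n<p>A deliberately toy sketch can make that refinement concrete (loosely in the spirit of masked-diffusion LLMs such as LLaDA; the fixed confidence table below stands in for a trained network): start from a fully masked sequence and, at each step, reveal the position the model is most confident about.<\/p>\n\n\n\n
```python
# Toy sketch of iterative text denoising: coarse, high-confidence structure
# is committed first, and finer detail filled in over successive steps.
TARGET = ['diffusion', 'models', 'refine', 'noisy', 'text']
CONFIDENCE = [0.7, 0.9, 0.5, 0.6, 0.8]  # stand-in per-position confidences

def denoise(steps=5):
    seq = ['[MASK]'] * len(TARGET)
    for _ in range(steps):
        masked = [i for i, t in enumerate(seq) if t == '[MASK]']
        if not masked:
            break
        # Reveal the still-masked position the "model" is most confident about.
        best = max(masked, key=lambda i: CONFIDENCE[i])
        seq[best] = TARGET[best]
        print(' '.join(seq))  # show the sequence sharpening step by step
    return seq

denoise()
```
\n\n\n\n<p>Each step commits the highest-confidence token first, so the sequence sharpens from rough outline to finished text rather than growing left to right.<\/p>\n\n\n\n<p>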
This layered refinement, reminiscent of <strong>splitting the probability distribution<\/strong> across conceptual levels, suggests that <strong>structured learning with progressive refinement<\/strong> will drive networks toward generating <strong>coherent details in visual, textual, and even auditory spaces<\/strong>.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>What Does Splitting the Distribution Mean?<\/strong><\/h3>\n\n\n\n<p>One key concept in both <strong>image and text diffusion models<\/strong> is <strong>splitting the distribution<\/strong>\u2014the process where a broad, undifferentiated concept gradually divides into finer details. This mirrors how human perception and cognition work:<\/p>\n\n\n\n<ol start=\"1\" class=\"wp-block-list\">\n<li><strong>Coarse Categorization:<\/strong> At first, an object or idea is recognized as a general category (e.g., &#8220;a bird&#8221;).<\/li>\n\n\n\n<li><strong>Feature Differentiation:<\/strong> More details emerge to refine the concept (e.g., &#8220;a parrot&#8221; vs. 
&#8220;a hawk&#8221;).<\/li>\n\n\n\n<li><strong>Fine-Grained Specificity:<\/strong> At the final stage, minute distinctions are recognized (e.g., &#8220;a blue-and-yellow macaw&#8221;).<\/li>\n<\/ol>\n\n\n\n<p>This is not unlike how diffusion models work.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>In images<\/strong>, noise is progressively removed until fine textures and details emerge.<\/li>\n\n\n\n<li><strong>In text models<\/strong>, broad concepts are generated first, then refined into syntax, structure, and vocabulary.<\/li>\n\n\n\n<li><strong>In audio<\/strong>, raw waveforms can first be structured into rhythmic patterns, and later into distinct phonemes and words.<\/li>\n<\/ul>\n\n\n\n<p>This <strong>multi-scale refinement<\/strong> means that, instead of simply predicting the next token, <strong>the model is learning the very structure of meaning<\/strong>\u2014how ideas branch and develop across multiple layers of abstraction.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Why This Approach Is More \u2018Actual\u2019 Than \u2018Artificial\u2019<\/strong><\/h3>\n\n\n\n<p>For AI to move beyond just mimicking human language and start <strong>reasoning meaningfully<\/strong>, it must learn the <strong>hierarchical relationships between ideas<\/strong> rather than treating all words as independent predictions. Diffusion-based LLMs take a step in this direction by:<\/p>\n\n\n\n<p><strong>Encoding Knowledge in a Coherent Structure<\/strong> \u2013 Rather than hoping coherence emerges, dLLMs enforce it through structured denoising.<\/p>\n\n\n\n<p><strong>Solving Long-Term Context Issues<\/strong> \u2013 By structuring information retrieval hierarchically, models can better recall earlier concepts. 
<\/p>\n\n\n\n<p><strong>Generating Meaning Rather Than Memorization<\/strong> \u2013 Because the model first forms high-level structures, it is less prone to memorizing data and more capable of generalizing knowledge.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>The Future of AI: A Shift in Thinking<\/strong><\/h3>\n\n\n\n<p>This is a radical shift in how we think about AI. <strong>If autoregressive models were the first big step in language AI, diffusion models might be the second.<\/strong> This is because they take a more natural, structured, and efficient approach to learning and generating text.<\/p>\n\n\n\n<p>So, do we finally have a model that gets AI from &#8216;Artificial&#8217; to &#8216;Actual&#8217;? While we&#8217;re not fully there yet, <strong>this approach represents one of the most promising shifts in AI architecture in years.<\/strong> If AI is going to transition from <strong>predicting text<\/strong> to <strong>understanding and reasoning<\/strong>, diffusion-based models might just be the key to unlocking that future.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Further Reading &amp; Resources<\/strong><\/h3>\n\n\n\n<p>&#x1f517; <strong>Try a <em>d<\/em>LLM<\/strong> @ <a href=\"https:\/\/chat.inceptionlabs.ai\/\">Inception Labs Chat<\/a>  from <a href=\"https:\/\/www.inceptionlabs.ai\/\">https:\/\/www.inceptionlabs.ai\/<\/a><\/p>\n\n\n\n<p>&#x1f4c4; <strong>Research Paper on Diffusion LLMs (LLaDA)<\/strong>: <a href=\"https:\/\/arxiv.org\/abs\/2502.09992\">https:\/\/arxiv.org\/abs\/2502.09992<\/a><\/p>\n\n\n\n<p>&#x1f6e0;&#xfe0f; <strong>Open-Source dLLM Code (LLaDA GitHub Repo)<\/strong>: <a href=\"https:\/\/github.com\/ML-GSAI\/LLaDA\">https:\/\/github.com\/ML-GSAI\/LLaDA<\/a><\/p>\n\n\n\n<p>&#x1f4da; <strong>Introduction to Diffusion Models (Hugging Face Blog)<\/strong>: 
<a href=\"https:\/\/huggingface.co\/blog\/diffusion-models\">https:\/\/huggingface.co\/blog\/diffusion-models<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Could diffusion-based, multi-modal large language models embody actual intelligence?<\/p>\n","protected":false},"author":1,"featured_media":344,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3],"tags":[],"class_list":["post-343","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-top-down"],"_links":{"self":[{"href":"https:\/\/www.mentic.com\/blog\/wp-json\/wp\/v2\/posts\/343","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.mentic.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.mentic.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.mentic.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.mentic.com\/blog\/wp-json\/wp\/v2\/comments?post=343"}],"version-history":[{"count":3,"href":"https:\/\/www.mentic.com\/blog\/wp-json\/wp\/v2\/posts\/343\/revisions"}],"predecessor-version":[{"id":349,"href":"https:\/\/www.mentic.com\/blog\/wp-json\/wp\/v2\/posts\/343\/revisions\/349"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.mentic.com\/blog\/wp-json\/wp\/v2\/media\/344"}],"wp:attachment":[{"href":"https:\/\/www.mentic.com\/blog\/wp-json\/wp\/v2\/media?parent=343"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.mentic.com\/blog\/wp-json\/wp\/v2\/categories?post=343"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.mentic.com\/blog\/wp-json\/wp\/v2\/tags?post=343"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}