ai – Pietro Mingotti

How LLMs extract and quote snippets

Pietro Mingotti — Tue, 19 Aug 2025 13:29:09 +0000

TL;DR

Browsing models fan-out multiple short queries, fetch top results, skim titles + intros, and compose a synthetic answer. Citations are added only when the system is confident about attribution.
The most reused fragments: page title, the first ~500–1000 characters, and any definition/answer block directly under a heading. Meta descriptions (your SERP snippet) matter more than you think.
Links are probabilistic. Clear structure, named entities, and “answer-first” copy raise your odds; blended sources and marketing fluff lower them.
Technical SEO still matters: fast HTML-first rendering, schema, SSR/static output. If retrievers can’t parse you quickly, you’re invisible.

If you’re still treating AI answers like “blue links with extra steps,” you’re going to miss where visibility actually happens. LLMs generate answer, they don’t rank or index anything. Then how do llms extract content and when do they quote it and link it?

In browsing-enabled modes (ChatGPT w/ Bing, Bing Copilot, SGE, Perplexity, Claude), models don’t read your whole page like a human. They assemble answers from tiny, extractable fragments, and only sometimes attach a link.

Below I’ll show the pipeline, what gets lifted, when links appear, and how to format pages so they’re quote-friendly.

Enjoy Ad Free

Access the full Research Paper. For free.

This article is just an extract from the full 100 pages independent research I’ve written for Fuel LAB® Research over 2 years of analysis, studying LLMs models, and data collection.

Index

What happens during RAG

I’ve written a dedicated article extracted from the research paper on how LLMs work under the hood; however, what you need to know here is that when a user asks a complex question, the assistant:

Rewrites the prompt into several short sub-queries (≈3–5 words).
Calls search (usually Bing/Google) and gets back titles, URLs, and snippets.
Scrapes partial content from a handful of top results (intros, definition blocks, sometimes FAQs).
Composes an answer; may attach citations if a source fragment is used verbatim or near-verbatim and attribution confidence is high.

This aligns with how ChatGPT’s browsing mode and similar tools are described publicly: search first, skim, then synthesize; links appear depending on product heuristics.

What LLMs extract (and what they don’t)

Models are on a tight budget: limited fetches, timeouts, and small “content windows” per page. That means they’ll often only lift:

Title (and H1 if distinct)
The first ~500–1000 characters of body copy
A tight definition or answer block immediately below a heading
FAQ/HowTo fragments (if clearly marked and near the top)

Practical consequences:

Front-load the definition or direct answer.
Keep early paragraphs short, declarative, and standalone.
Treat meta title + meta description like ad copy: these are sometimes the only words the model sees before deciding whether to fetch.

Think in “info windows”: one heading + 1–2 concise paragraphs + a bulleted list. This maps to how multi-vector retrieval compresses and ranks cohesive segments.

When do LLMs show links to sources?

Linking is not the default; it’s an emergent behavior triggered when internal rules agree that the source is relevant, extractable, and safely attributable:

More likely to link when

You provided a direct quote/definition the answer depends on
The domain is official/high-trust (gov, edu, Wikipedia, major trade sources)
The page shows clear authorship, date, and clean structure

Less likely when

The model blended multiple sources into one sentence
Your layout is messy, interactive, or slow to render
The text reads like “general knowledge” rather than a specific, attributable fact block

Observed platform patterns (abridged):

ChatGPT (Browsing): sometimes cites 1–3 sources; paraphrases heavily.
Bing Copilot: more visible links; favors clean lists/definitions.
SGE: mixes sources; often drops links in the primary summary.
Perplexity: aggressive inline citations; excellent for long-form attribution.
Claude: cites when docs are provided or web context is enabled.

Why structure beats style (every time)

Answer engines reward extractability, not flourish. To raise your quote probability:

Answer first. Put the definition/conclusion in the first 2–3 sentences under each H2.
Keep blocks self-contained. Each section should make sense if lifted in isolation.
Prefer lists and tables. Step-by-steps and comparisons are regularly mirrored in AI output.
Use schema. FAQPage/HowTo/Article/Organization raise machine legibility and attribution confidence.
Brand early. Name, entity, and author metadata near the top helps the model name-drop correctly when it does cite.

The tech behind what LLMs extract and quote

You can’t be quoted if you can’t be fetched:

Robots & llms.txt. Allow GPTBot/ClaudeBot/Gemini/Perplexity unless you intend to be excluded from future retrieval/training.
HTML-first delivery. Avoid JS-gated copy, heavy modals, and client-side redirects.
SSR / static export. Guarantee retrievers get real text on first paint.
Speed + simplicity. Timeouts and fragile hydration mean skipped content.

A quick AI extractability and citation potential checklist

→ Every H2 opens with a one-sentence definition or answer
→ First 500–1000 chars read like a standalone snippet
→ FAQ/HowTo blocks exist and are marked up
→ Meta title/description state the answer, not just tease it
→ Tables for comparisons; lists for steps/principles
→ Org/Author/Article schema + clear dates/ownership
→ SSR/static build; no content behind modals/cookie walls
→ Robots.txt/llms.txt allow AI crawlers you want to influence

Conclusions

You don’t “rank” in an answer engine; you get selected in tiny pieces. Build pages as a series of clean, attributable information windows, and you’ll see your words show up where users actually read: inside the answer itself.

How do LLMs decide which snippet to use?

They fan-out the user prompt into several short sub-queries, fetch top results, skim titles/intros/FAQs, then synthesize an answer using the most extractable fragments (definition-first, lists, short paragraphs). This is generation, not ranking. For a deeper primer on generation vs. retrieval, see your technical overview. → How LLMs Work – Deep Technical Overview.

Do LLMs always include a link when they quote me?

No. Links are not guaranteed. Even when your text influences the answer, the model may paraphrase without attribution, especially on zero-click surfaces. For context on why “ranking” expectations don’t apply, see “Why You Can’t Rank on ChatGPT”.

What parts of a page get extracted most often?

Page title, meta snippet, and the first ~500–1,000 characters, plus any clearly marked FAQ/definition blocks. Put the answer first. For the macro shift to “answer engines,” see “How LLMs are Disrupting Search Marketing.”

Does traditional SEO still matter for citations?

Yes, because retrieval-enabled LLMs pull from search indexes. If you don’t surface in Bing/Google for the fan-out queries, you’re effectively invisible at retrieval time. → See: “How LLMs are Disrupting Search Marketing.”

What page structures increase AI citation likelihood?

Definition-first paragraphs (“X is…”), bullet lists, short steps, Q&A sections, and clean semantic HTML. This aligns with how transformers attend to local structure and how retrieval pipelines skim. → See: “Understanding Transformer Architecture – A Guide for Marketers.”

Do backlinks make content more quotable on AI?

Indirectly at best. They may help you rank in SERPs (thus be seen by the retriever), but the selection is driven by clarity, extractability, and answer fit, not PageRank. → See: “Why You Can’t Rank on ChatGPT”

How LLMs Work – Deep Technical Overview

Pietro Mingotti — Sun, 03 Aug 2025 14:53:31 +0000

Have you ever explored how llms work? If not, how can we talk of AEO and GEO and so on? Most marketing strategies today are still based on metaphors that no longer apply. Terms like “ranking,” “indexing,” and “domain authority” may be appropriate for search, but they have little meaning in the architecture of a Large Language Model.

This is where things get technical. While it might feel overwhelming at first, I encourage you to read slowly, pause, and let the ideas settle.

To influence a system, even indirectly, you need to first understand how it works.

Enjoy Ad Free

Access the full Research Paper. For free.

This article is just an extract from the full 100 pages independent research I’ve written for Fuel LAB® Research over 2 years of analysis, studying LLMs models, and data collection.

Unlike traditional search engines, LLMs like GPT-5, Claude 4.1 Opus, or Gemini w.5 Pro / Flash do not retrieve web pages and choose the “best one.” Instead, they generate answers word by word based on patterns learned during pre-training, shaped during fine-tuning, and generally enhanced by real-time tools like search or code interpreters.

This chapter explains how that process works:

How LLMs are trained
How they encode language and meaning
How their internal structure allows emergent abilities like attempt at reasoning, planning, and memory-like behavior
Why outputs are fluent but unpredictable
And why direct optimization is impossible, but indirect influence is probable.

We’ll cover key concepts such as embeddings, self-attention, transformers, token prediction, and reinforcement learning with human feedback (RLHF).

We’ll also demystify why LLMs “sound smart” even when they’re wrong, and how the illusion of reasoning emerges from statistical computation.

Picture credit: Rateb Al Drobi, https://www.linkedin.com/posts/rateb-al-drobi_post-410-how-llms-actually-generate-text-activity-7300199524680032256-njGZ/

This collection of posts will also include references to retrieval-augmented generation (RAG) and multi-modal capabilities, as many LLMs now use external tools (search, calculators, code) and work with not just text, but images, audio, and even video.

Here below you can find the posts that have been written by extrapolating from the full research paper. For your convenience, here’s the natural order of the articles and how they should be read:

How LLMs extract and quote snippets

TL;DR Browsing models fan-out multiple short queries, fetch top results, skim titles + intros, and compose a synthetic answer. Citations…

Continue Reading How LLMs extract and quote snippets

How LLMs are disrupting Search Marketing

TL;DR LLMs provide direct answers to queries, reducing clicks on traditional SERPs. Ranking signals like backlinks and CTR lose importance;…

Continue Reading How LLMs are disrupting Search Marketing

Why you can’t rank on ChatGPT and other LLMs

TL;DR You can’t rank on ChatGPT like on Google Search. LLMs don’t use ranking systems. LLMs rely on static training…

Continue Reading Why you can’t rank on ChatGPT and other LLMs

Inside LLMs: How Pre‑Training Shapes What ChatGPT Knows

TL;DR Pre-Training teaches LLMs to predict the next word, not to “understand” meaning. Models are trained on trillions of tokens…

Continue Reading Inside LLMs: How Pre‑Training Shapes What ChatGPT Knows

Inside LLMs: Neural Networks & Attention

TL;DR Neural networks are layered architectures where each layer extracts patterns and relationships from text. Attention mechanisms allow LLMs to…

Continue Reading Inside LLMs: Neural Networks & Attention

Inside LLMs: RLHF, RLAIF & the Evolution of Model Alignment

TL;DR RLHF (Reinforcement Learning with Human Feedback) uses human annotators to rank model outputs and train reward models that align…

Continue Reading Inside LLMs: RLHF, RLAIF & the Evolution of Model Alignment

Inside LLMs: why LLMs don’t really “know” things

TL;DR LLMs don’t “know” facts. they predict tokens based on training data. Their knowledge is limited to the pre‑training corpus…

Continue Reading Inside LLMs: why LLMs don’t really “know” things

Inside LLMs: Understanding Transformer Architecture – A Guide for Marketers

TL;DR Transformers are the core architecture of modern LLMs like GPT‑4, Claude, and Gemini. Each transformer block includes multi‑head attention,…

Continue Reading Inside LLMs: Understanding Transformer Architecture – A Guide for Marketers

How LLMs are disrupting Search Marketing

Pietro Mingotti — Sun, 03 Aug 2025 14:53:30 +0000

TL;DR

LLMs provide direct answers to queries, reducing clicks on traditional SERPs.
Ranking signals like backlinks and CTR lose importance; structured content gains weight.
LLMs rely on probabilistic training + retrieval, not live search indexing.
Brand visibility depends on clear, factual, citation‑ready content rather than keyword tricks.
Marketers must shift from “ranking” tactics to AI visibility strategies, focusing on schema, semantic clarity, and authoritative positioning.
Takeaway: SEO isn’t dead, it’s becoming even more crucial. But to be seen in AI answers, optimize for inclusion and citation probability, not just SERPs.

For over two decades, Search Engine Marketing strategies have revolved around one concept: ranking high in search engine results pages (SERPs) for the most valuable queries.

To put it simply, and in this context deliberately not taking into account what happens after the click, success on the Search Marketing was measurable by clicks, impressions, and keyword positions. Marketers competed for blue links. The higher your position, the more attention, authority, and revenue you captured.

That era is now undergoing a seismic transformation.

The rise of Large Language Models (LLMs) deployed inside tools like ChatGPT, Bing Copilot, and Google SGE / AIO / AIM (Search Generative Experience), as well as Claude.ai, Perplexity, and multimodal search in Gemini 2.5 Pro is impacting traffic and performance in such a palpable way, that agencies (because of clients) are starting to invent all sorts of new terms to try and adapt SEO to LLMs.

But, the rules have changed.

Enjoy Ad Free

Access the full Research Paper. For free.

This article is just an extract from the full 100 pages independent research I’ve written for Fuel LAB® Research over 2 years of analysis, studying LLMs models, and data collection.

Index

From search engines to answer engines

Traditional search engines respond to a user query by displaying a ranked list of URLs. The user must choose where to click and decide what content to trust with two highly costly currencies: their time, and their attention.

In contrast, LLM-based answer engines generate a natural language response, often without linking to any source at all. This is the first important step to consider.

Getting links from LLMs is a rare occurrence in the general usage of these systems.

Instead of merely listing links, these models have become the interface for information consumption itself, synthesizing, citing, and sometimes rewriting your content before a user ever sees your site.

This shift creates a radically different marketing environment:

There is no first page. There’s only the answer, which is probabilistic and mathematically irreducible.
There is no guarantee of visibility, even for top-ranked content on Search Engines.
The user may never visit the interested website, even if that very content shaped the AI’s response.

The result? A paradigm where visibility depends not on keyword optimization, but on whether and how the AI “remembers,” “finds,” or “decides to cite” you.

The rise of AI Snapshots and zero-click experiences

With the rise of Google SGE / AIO / AIM or Gemini, ChatGPT, Claude.ai, and Perplexity, we’re seeing the normalization of zero-click experiences across nearly every major LLM interface (specifically for a very defined subset of non-action driven queries).

Users receive summarized information without visiting any external website.

Google SGE / AIO / AIM displays AI-generated answers atop traditional search results, often complete with contextual cards and expandable citations;
Bing Copilot integrates directly into the SERP or Edge sidebar, producing full responses with citations, but not always visibly.
ChatGPT-5 fetches information in real time using a mix of Bing’s API and OpenAI’s in-house synthetic data index, selecting content based on clarity, phrasing, and structural legibility.

Google specifically seems to be trying extra hard to keep the user in the SERP (and it doesn’t take a genius to understand why, from a data retention standpoint in the privacy-centric world wide web of today).

The result is a new ecosystem where traffic no longer follows success and viceversa.

A website might rank first in Google and still be ignored by ChatGPT if its content is too complex, poorly structured, or unavailable to Bing’s index.

This has created friction and confusion for marketers: visibility in traditional search no longer guarantees visibility in AI-generated responses. And worse, AI responses may outrank your brand’s own voice, quoting forums, summaries, or third parties instead.

For a practical playbook on the exact fragments LLMs lift (and how to format yours), see how LLMs extract and quote snippets.

How ChatGPT, Bing Copilot, Google SGE / AIO / AIM and other LLMs differ

Platform	LLM Core Versions	Source of Information	Citation Behavior	Implications
ChatGPT (OpenAI)	GPT‑5 (flagship, replaces all prior models) and GPT‑4o (legacy multimodal)	Pre-training corpora; “Deep Research” tool for real-time web search	Can include citations via “Deep Research” when retrieving live info	Unified model simplifies UX; long-context (256K tokens), strong reasoning, rich agentic tools; citation strength depends on mode
ChatGPT + Copilot (Microsoft)	GPT-5 integrated across Microsoft 365, GitHub Copilot, etc.; Smart mode automatically selects optimal variant	Accesses both pre-training data and real-time Web (via Copilot)	Frequently includes live citations especially in search-powered tasks	Deep integration with productivity tools; excels in coding, rich context handling, and multi-modal tasks
Claude (Anthropic)	Claude 4 family: Opus 4, Sonnet 4, now Opus 4.1	Pre-training corpora, plus web search for paid users; recently added memory across sessions	Offers citations when sourcing from the web; “Artifacts” can embed sources directly	Extremely capable in coding, complex reasoning, and creative tasks; long context (200K+ tokens) and session memory improve continuity
Google Gemini	Gemini 1.5 / 2.x models (multimodal)	Search-indexed pages, tightly integrated with Google ecosystem	Typically includes citations or links; performance varies	Strong for real-time, multimodal Google-ecosystem queries; excels in translation and integration‚ less suited for deeply technical reasoning
Perplexity.ai	Uses mixture of models (OpenAI, Claude, DeepSeek)	Live web search + internal document graph	Always includes citations clearly displaying sources	Ideal for research or fact-checking; transparent sourcing; interface can be dense but great for accuracy
xAI Grok (by Elon Musk)	Grok 4 (and Grok 4 Heavy) with native tool use, real-time search, reasoning “Think” / “SuperGrok” modes	Live search (X/Twitter and web), tool integrations	Typically includes context, less filtered voice; citation style less formalized	Bold, multimodal, embedded in Tesla/X ecosystem; flexible but may lack the citation discipline of others

These differences aren’t just technical. They determine whether your brand appears at all. Understanding which LLM is powering a tool, and how it chooses to cite, is now a prerequisite for modern Technical Marketing.

Picture credit: Pietro Mingotti, CEO & Head of Digital @Fuel LAB® – Miro Sketches from Google I-O presentation

The Real-World Impacts on Marketing

This evolution in how content is delivered and consumed is producing measurable effects, throwing companies and agencies alike into panic and a nonsensical posting frenzy:

Brand authority is decoupled from web traffic Your expertise may be acknowledged in the AI summary, yet no user visits your page. In this context, branded searches gain a different traction and meaning. Increase in branded searches is something to start considering and correlate.
Clicks are being intercepted AI-generated content answers the user’s query before they see your meta title. Therefore informative content is what actually looses traction for sure. Prioritizing the strategy for things you can “get”, instead of things you can “learn”, can be an important consideration to make.
SEO metrics are becoming misleading Traditional impressions and position reports will not reflect actual visibility inside LLMs.
Content visibility is now probabilistic It depends on your site’s presence in the training data, its structure, and how the AI interprets relevance in real time, plus a large amount of other factors we’ll exlplain.
Visibility occurs outside analytics tools AI is barely tracked by Google Analytics or Search Console. You may appear in hundreds of AI answers and never see a traffic signal.

In truth, this creates both a threat and an opportunity. Brands that will fail to adapt, will likely become invisible in the AI Query Fan Out layer, even if they dominate classical SERPs, and we will see how.

But those who invest the time in learning and understanding how LLMs work, and thus select, cite, and synthesize information will have a chance at reclaim visibility by designing content that machines can parse effortlessly, and humans find immediately useful

Enjoy Ad Free

Access the full Research Paper. For free.

This article is just an extract from the full 100 pages independent research I’ve written for Fuel LAB® Research over 2 years of analysis, studying LLMs models, and data collection.

Faqs

How are LLMs disrupting traditional SEO?

LLMs bypass ranking by generating direct answers, which reduces organic clicks from search engines and reshapes how users discover brands.

Do backlinks and CTR still matter in the age of LLMs?

They matter for Search Engines, but not directly for LLMs. What matters is structured, authoritative content that AI can parse and cite.

What is the difference between ranking in Google and being cited by ChatGPT?

Google and similar Search Engines use ranking algorithms; LLMs generate answers based on training data and retrieval. There is no ranking — only inclusion and citation.

What should marketers do to adapt?

First of all, stop spreading misinformation; we have to accept that we have to do the heavy lifting of studying an entirely different system, and develop tactics based on facts, not on easy ready-made checklists. This, is the scope of our Research at Fuel LAB®

Does this mean SEO is dead?

No. Once again, not. SEO becomes even more important for RAG in LLMs, and Organic Search Engineering evolves to encompass AI visibility strategy — preparing content for both search engines and large language models.

Why you can’t rank on ChatGPT and other LLMs

Pietro Mingotti — Sun, 03 Aug 2025 14:53:29 +0000

TL;DR

You can’t rank on ChatGPT like on Google Search. LLMs don’t use ranking systems.
LLMs rely on static training data, not live crawling, if not for single prompts.
Attribution is inconsistent; your content may be used without citation.
Authority signals like backlinks or keywords don’t drive visibility for LLMs.

Probably the most important misunderstanding in the search marketing industry today is not whether we should respond to the rise of generative AI. Response is inevitable.

The real problem is that most teams are treating this shift as just another algorithm update, assuming these models behave like search engines. They don’t.

This reaction is rooted in discomfort; faced with a black box most don’t really understand, teams default to familiar tactics and clients stress agencies with impossible expectations.

Impossible in the most technical term, like “ranking” on LLMs.

Agencies reach for what they already know, rebrand it as AEO, and hope that the old playbook still brings the same measurable results. But it doesn’t. If we want to influence how LLMs engage with our content, we have to start learning how the system actually works, thus accepting that it can only be partially and probabilistically influenced; not controlled.

Impossible in the most technical term, like “ranking on ChatGPT” or generally on LLMs.

Instead of ‘ranking,’ you’ll need to optimize your pages for the fragments models actually use (see how LLMs extract and quote snippets.)

Agencies reach for what they already know, rebrand it as AEO, close their eyes and hope that the old playbook still brings the same measurable results. But it doesn’t.

If we want to influence how LLMs engage with our content, we have to start learning how the system actually works, thus accepting that it can only be partially and probabilistically influenced; not controlled.

You can watch my podcast episode with Dinghy Studio, or keep reading for the full breakdown.

Enjoy Ad Free

Access the full Research Paper. For free.

This article is just an extract from the full 100 pages independent research I’ve written for Fuel LAB® Research over 2 years of analysis, studying LLMs models, and data collection.

Index

Reason 1: LLMs don’t use a Ranking Algorithm like Search Engines do

Large Language Models like GPT-5 are trained using a process called pre-training, where the model processes hundreds of billions of tokens from massive datasets, including Common Crawl, Wikipedia, books, academic papers, and code repositories.

The model learns statistical patterns, not exact facts, and encodes them into billions of parameters (weights) inside its neural network.

I’ll explain exactly what this means in Chapter IV.

Once pre-training ends, the model’s weights are frozen. This means:

You cannot inject new facts into the model after training.
You cannot “submit your site” to be added to its knowledge base.
You cannot ask for an update unless OpenAI & co retrains or fine-tunes the model with your content (which rarely happens, if not for the phenomenon of Synthetic Data).

While the core weights of models like GPT-5 remain frozen after pre-training (meaning you cannot inject new knowledge into their neural network without retraining), some new architectures use modular systems that allow parts of the model to be updated or extended via:

Retrieval-Augmented Generation (RAG): dynamically fetching current information from search APIs or vector databases.
Memory Modules (e.g., ChatGPT’s “custom GPTs” or persistent chat memory): storing user-specific preferences or facts outside the model weights.
Tool Use and Plug-ins: calling APIs or calculators to fetch real-time information, not generated from internal knowledge.

This could lead to think something has changed, however, none of these mechanisms alter the model’s internal knowledge base (the billions of statistical weights learned during training).

RAG and Memory Modules only influences that one user’s specific experience, and won’t persist outside of that chat session. That core knowledge is still static.

Even when models appear “updated” (as with GPT5 or Claude 4) the model will not know about it, unless it’s accessed via a live retrieval mechanism. Even then, it’s not guaranteed to cite you.

✓ The model can “look up” new facts via tools or retrieval.

The model cannot “learn” new facts in the traditional way unless it is retrained or fine-tuned.

Reason 2: Training Data isn’t a live index of the web

Unlike search engines, LLMs, even as of August 2025 with the release of GPT-5 and Gemini 2.5 Pro:

Don’t systematically use a crawler that updates constantly.
Don’t respond to sitemap.xml or Search Consoles submissions.
Don’t use page freshness as a visibility signal.
Don’t store full documents, but statistical abstractions.

This means there is no equivalent to SEO submission protocols like:

Indexing APIs
URL inspection tools
Backlink tracking
… and more

However, it is important to note that as of August 2025 OpenAI’s GPTBot and ClaudeBot by Anthropic do crawl the web on an ongoing basis.

While these crawls are not tied to live model updates, they might feed future snapshots of the model, or more likely support faster retrieval-augmented systems like ChatGPT-5 and similar releases.

In other words, your content will not be seen by the model unless it was:

Present in the training dataset (which is closed and static), or
Retrieved via search during a user interaction with RAG (and so in truth only for that chat session)

Even in those cases, being seen does not guarantee being cited.
Citation remains a probabilistic and extractability-dependent behavior.

Reason 3: LLMs are not Search Engines. They are Generators.

As introduced, it’s tempting to think that platforms like ChatGPT, Claude, or Google SGE / AIO / AIM work like search engines. But they don’t.

They are Large Language Models (LLMs) → statistical generators, not retrieval engines.

LLMs do not evaluate, rank, or index documents like search engines do. They generate responses based on either:

Pre-trained internal knowledge, or
On-the-fly retrieval of external text, used to guide or ground the generation.

This second approach, known as retrieval-augmented generation (RAG), is what powers systems like:

Platform	Core Model	Retrieval Behavior	Citation Behavior
ChatGPT (GPT‑5)	Deep Research (autonomous web reports) & Agent Mode (multi-step tool execution)	Links shown in research reports and agent interactions	Cites sources in Deep Research outcomes; Agent Mode shows process steps and reference context
Microsoft (Copilot Suite)	GPT-5 integrated with real-time router across M365, GitHub, VS Code	Includes links in structured outputs	Citations are clear when referencing external content; integrated across apps
Google Gemini (1.5 Pro / 2.x)	Google search index + ecosystem tools	Selective linking	Snippet-based answers; links more prominent when using tools
Perplexity.ai	Live crawling + custom index	Frequently includes URLs	Strong inline citations with multiple sources visible
Claude (Anthropic)	Optionally live web search for paid users	Links when context provided	Generates citations only under explicit source context
xAI Grok 4	Real-time web and X/Twitter search	Contextual storytelling	Informal, narrative citation style—less formal sourcing

So if you don’t rank on chatGPT, what’s actually happening?

Most of these systems don’t “rank” documents the way Google Search does. Instead, they:

Rewrite the user’s query into internal sub-questions or semantic fragments
Retrieve supporting content, often in the form of titles, snippets, and intros
Compose an answer, selecting and synthesizing fragments based on internal scoring: fluency, clarity, and helpfulness; not traditional authority.

Even when citations are shown, they are:

Not guaranteed to have a link
Not complete or reliable
Not always attributed

In other words: your site might be quoted, paraphrased, or ignored and you won’t know unless you test the prompt yourself.

And the core issue remains: LLMs don’t understand what they’re saying

Whether pre-trained or grounded in live search, LLMs do not reason or verify. They don’t actually “understand” or “know” facts. LLMs complete patterns.

Their answers are fluent approximations, not factual commitments, and that’s true across:

ChatGPT (OpenAI)
Claude (Anthropic)
Gemini (Google)
Perplexity (independent)
Bing Copilot (Microsoft)

Some simulate reasoning better than others. Actually, from experience, I would say depending on your luck with one specific chat session, one might simulate reasoning better, and another could collapse, resulting in an infuriating behavior. But none of them retrieve or cite in the way marketers are used to from traditional search engines.

Picture credit: Harsh Gupta, Medium, Revolutionize Fine-Tuning: Reduce Training Time and Memory Usage with LoRA

Reason 4: your web or app property isn’t “In ChatGPT”unless it was in the Training Set or Search Results

As we are introducing the concept that citations are not guaranteed or reliable, one would want to be “present” in the internal knowledge of the model. A critical realization for any digital team: you are invisible to LLMs by default.

Your content exists only if it was present in the open training corpus (e.g., in Common Crawl, Wikipedia, maybe Synthetic Data in future releases). These LLMs simply cannot “remember” you unless you were in the data it was trained on.

Differently, your content could be used on the fly and for a single user experience, if:

It is discoverable through a live search engine API (Bing Search, Google Programmable Search, Perplexity’s crawler, Claude’s web lookup)
It is structured in a way that is machine-readable and extractable in real time
You are a commonly cited source in public datasets (e.g., Stack Overflow, Reddit, academic databases)

Even in browsing mode, visibility depends on how search engines index and display your content, not on LLM awareness.

With this, I am starting to introduce the point for SEO and Organic Search Engineering. SEO is not dead, as usual. It’s even more important than we could fathom years ago, and it’s going to stay crucial for every search API call in the future.

Enjoy Ad Free

Access the full Research Paper. For free.

This article is just an extract from the full 100 pages independent research I’ve written for Fuel LAB® Research over 2 years of analysis, studying LLMs models, and data collection.

In Summary:

You cannot force inclusion into ChatGPT’s knowledge.
You cannot update the model manually.
You cannot push fresh content (unless it’s linked or included in a source ChatGPT can retrieve in browsing mode).
You cannot learn what the model “knows”; only test what it says
Optimization must shift from “ranking” to “being found , understood and chosen.”

It’s necessary to stop thinking in terms of “tricking the model” and start thinking in terms of clarity, availability, and machine readability, all of which we’ll explore in the next chapters.

FAQs

Can I make my website rank inside ChatGPT like on Google?

No. ChatGPT and other LLMs don’t use ranking systems like Google. Instead, they generate answers based on probabilities learned during training or retrieval. You can, however, increase visibility by structuring content so it is more likely to be included and cited.
How is ChatGPT’s training data different from Google’s index?

Google constantly crawls and updates live web pages. LLMs like ChatGPT are trained on static snapshots plus curated datasets. Unless retrieval systems (RAG) are used, updates to your website won’t be reflected in the model until the next training cycle.
Why doesn’t ChatGPT always cite my website when it uses my content?

Attribution in LLMs is inconsistent. Sometimes sources are cited, sometimes not. This depends on the model provider’s interface, the dataset, and the prompt. Structured blocks, FAQs, and authoritative definitions increase your chance of being cited.
Does link-building or keyword optimization help with AI visibility?

Traditional SEO tactics like backlinks or keyword density do not influence LLMs directly. Instead, focus on semantic clarity, schema markup, and extractable content structures.
How can I improve my chances of being mentioned by LLMs?

Download the full Research Paper, study and understand how generative models work, and build your framework based on facts, not on easy copy-paste solutions. Or, get in touch with us and get the blueprint we’ve created.

Inside LLMs: How Pre‑Training Shapes What ChatGPT Knows

Pietro Mingotti — Sun, 03 Aug 2025 14:53:28 +0000

TL;DR

Pre-Training teaches LLMs to predict the next word, not to “understand” meaning.
Models are trained on trillions of tokens from sources like Common Crawl, Wikipedia, GitHub, and books.
Training uses loss functions and gradient descent across hundreds of billions of parameters, costing millions of dollars.
Once pre-training ends, model weights are frozen: LLMs cannot update themselves without retrieval or fine-tuning.
Marketers must focus on content visibility in training sets and retrieval systems to be “seen” by LLMs.

The foundation of any Large Language Model (LLM) lies in a process called pre-training.

This is where the model learns how language works by processing an immense volume of human-generated text. Pre-training is self-supervised, non-interactive, and results in a static model: it defines what the model “knows”, and more importantly, what it doesn’t.

Pre-training teaches the model how language works, not by reading for meaning, but by guessing the next word over and over until it becomes really good at it.

See it as a like a hyper-fast version of the autocomplete on your phone keyboard, trained on most of the internet.”

Enjoy Ad Free

Access the full Research Paper. For free.

This article is just an extract from the full 100 pages independent research I’ve written for Fuel LAB® Research over 2 years of analysis, studying LLMs models, and data collection.

What is Pre-Training in LLMs for?

At its core, pre-training is based on a single, elegant objective:

Given a sequence of tokens, predict the next token.

This task is called causal language modeling or autoregressive token prediction. The model is shown fragments of text from short phrases to full paragraphs, and learns to estimate the probability distribution of the next word (or sub-word token) based on all preceding context.

But what is a token?

A token is a chunk of content, in this case text, often a word, part of a word, or even a punctuation mark, that the model processes individually.

For example, the sentence “ChatGPT is smart!” might be split into tokens like:

JavaScript

["Chat", "G", "PT", " is", " smart", "!"].

["Chat", "G", "PT", " is", " smart", "!"].

Tokens are the atomic units of meaning for the model. They don’t align perfectly with words or syllables, but they let the model work with language in a flexible and compressed form.

We’ll keep referring to tokens throughout this paper, so just remember: a token is the model’s version of a word. Sometimes bigger, sometimes smaller.

Put into practice:

JavaScript

Input: "The capital of France is"
Output: ["Paris" = 0.91, "London" = 0.02, "Berlin" = 0.01, ...]

Input: "The capital of France is"
Output: ["Paris" = 0.91, "London" = 0.02, "Berlin" = 0.01, ...]

The numbers you see here are token probabilities. The model assigns high probability to the word Paris, and lower probabilities to other possible continuations like London or Berlin.

It doesn’t know Paris is the capital. It has just learned that, in most texts, that phrase tends to be followed by Paris.

This deceptively simple task trains the model to master everything from grammar and syntax to world knowledge, and trough fine-tuning and reinforcement learning we get to reasoning patterns, analogies, and seemingly even emotion. Not because it understands them, but because predicting text requires modeling all the latent structure of language.

The model doesn’t ‘understand’ what comes next. It just gets really good at predicting it, similarly to how your phone guesses the next word in your text message

Training Datasets: Scale and Source

To learn these patterns effectively, LLMs are trained on massive datasets often containing hundreds of billions to trillions of tokens.

These datasets are compiled from public web content, licensed databases, books, code repositories, academic papers, and forum-like discussions from platforms like Reddit and Quora.

Common sources include:

Common Crawl, which includes billions of pages scraped from across the web
Wikipedia, offering clean, structured factual content
BooksCorpus and Project Gutenberg, providing long-form literary and narrative language
GitHub, used for training code-centric models like Codex
ArXiv, PubMed, Stack Overflow, and Reddit, contributing community Q&A, technical writing, and scientific discourse
Synthetic Data, as of 2025

GPT-3, for instance, was trained on ~300 billion tokens.

GPT-4 and Claude 3 have exceeded 1–2 trillion tokens, and models like Gemini 2.5 Pro operate with context windows of up to 10 million tokens, allowing them to “read entire books” during training or inference.

Importantly, companies like OpenAI, Anthropic, and Google have stopped publishing full dataset lists. However, leaked reports and research suggest a continued reliance on large-scale mixtures of Common Crawl, Wikipedia, code, academic papers, and publicly available Q&A sources.

The training goal is to expose the model to diverse styles, subjects, domains, and discourse forms, so it becomes a generalist capable of generating coherent output on almost any topic.

The application goal is instead to simulate human-like interaction and produce answers that feel satisfying to users. This is also where things get ethically complex.

The model is not trained to say “I don’t know,” because in truth, it never knows.

Instead, it will often prefer to confidently generate a wrong answer rather than admit uncertainty.

Optimization: Loss Functions and Gradient Descent

To adjust its internal behavior during training, the model needs a way to measure whether it’s making good predictions. This is where concepts like tokens, weights, and loss come into play.

Don’t worry, I’ll explain each of these in more depth later. For now, here’s what you need to know.

In optimization, every prediction made by the model during pre-training is compared against the actual next token. The difference between the model’s prediction and the correct token is measured using a loss function, typically cross-entropy loss.

Loss function is used to measure whether LLM correctly predicts the next token. According to the loss score, weights in LLM will be slightly increased or decreased. The weights will keep changing until the prediction of LLM can’t be improved anymore. This process forces the model to generate exactly the same token as what’s in the training data so that LLM can fully grasp the pattern distribution of the training data.

Over millions of training steps, the model gradually adjusts its internal weights using gradient-based optimization (most often AdamW or LAMB, instead of classical SGD) to reduce error and build a statistical model of how language behaves.

The process is computationally enormous:

Hundreds of GPUs or TPUs operating for weeks or months
Optimizing hundreds of billions of parameters
Training costs exceeding millions of dollars per model (but sure, people just want a free PDF with a quick optimization guide and call it a day… or a drawing on a LinkedIn Post)

Yet the outcome is a model that has internalized linguistic priors, not by memorizing text, but by learning which types of language structures are more likely to follow others.

Put simply:

Each time the model guesses a word wrong, it gets a score telling it how far off it was. Then it nudges itself to get better, millions of times over.

Frozen Models vs Modular Systems

Once pretraining is complete, the result is a frozen foundation model: a large, self-contained statistical structure that cannot update itself with new facts or content.

However, many LLMs in production today (e.g., ChatGPT, Claude, Gemini) are no longer “just” frozen models. They are modular systems that include:

Retrieval components, which pull fresh content from search engines or internal knowledge bases
Tool use capabilities, such as calculators or code interpreters
Memory modules, which can store user-specific information across sessions

As a result, newer LLMs can “appear” up to date, but only if they were:

Connected to a live retrieval API (e.g., Bing Api in ChatGPT),
Passed the content through a long-context input (e.g., full PDF upload),
Or retrained or fine-tuned explicitly with new datasets (rare, expensive).

For marketers, this means: you are still invisible to the base model unless your content enters a retrieval pipeline or fine-tuning dataset.

There is no pinging, no reindexing, no real-time discoverability unless retrieval is on. This is truly important to understand, as it means:

✓ The base model is the same for everyone.
The chat session with retrieval on, is for that chat session only.

Enjoy Ad Free

Access the full Research Paper. For free.

This article is just an extract from the full 100 pages independent research I’ve written for Fuel LAB® Research over 2 years of analysis, studying LLMs models, and data collection.

Summary

Element	Description (2025)
Model Goal	Predict next token (causal modeling)
Method	Autoregression with attention
Training Data	Enormous web corpora + curated datasets
Optimization	Cross-entropy loss + AdamW / LAMB
Scale	100B–1T+ parameters; trillions of tokens
Limitations	No self-updating; no model memory of web crawling
Extensions	Some models support tools, memory, and live search
Outcome	A statistical engine that simulates understanding efficiently

Faqs

What is pre-training in large language models?

Pre-training is the first stage of training an LLM, where the model processes massive datasets to learn statistical patterns of language. It does not give the model factual knowledge or real understanding, but it teaches it to predict the next word in a sequence.

Which datasets are used in LLM pre-training?

LLMs are typically trained on a mix of large-scale public and licensed datasets, including Common Crawl, Wikipedia, BooksCorpus, Project Gutenberg, GitHub code, ArXiv research papers, PubMed articles, and Reddit discussions.

How do LLMs actually learn during pre-training?

They learn by predicting the next token (a word or word fragment) in a text sequence. Through repeated predictions across trillions of tokens, the model adjusts its weights using loss functions and gradient descent, improving its ability to generate coherent language.

What is the difference between pre-training and fine-tuning?

Pre-training builds the general language foundation of the model. Fine-tuning and reinforcement learning (RLHF or RLAIF) are later stages that adapt the pre-trained model to follow instructions, align with human preferences, or specialize in certain tasks.

Why can’t LLMs update themselves after pre-training?

Once pre-training is complete, the model’s weights are frozen. The system cannot update its knowledge automatically like a search engine. Any updates require fine-tuning, retraining, or connecting the model to retrieval systems that bring in external, live data.

Why does pre-training matter for SEO and AI visibility?

Understanding pre-training helps marketers see why you can’t “rank” in ChatGPT like on Google. Since models rely on static data and probability, visibility strategies must focus on making content clear, structured, and authoritative enough to be included in training or cited via retrieval.

Inside LLMs: Neural Networks & Attention

Pietro Mingotti — Sun, 03 Aug 2025 14:53:27 +0000

TL;DR

Neural networks are layered architectures where each layer extracts patterns and relationships from text.
Attention mechanisms allow LLMs to weigh the importance of each word relative to others, enabling context‑aware responses.
Transformers combine neural networks with attention, replacing older sequential models (RNNs, LSTMs) with parallel processing that scales.
Self‑attention lets models process entire sequences at once, understanding long‑range dependencies in language.
Positional encoding provides order awareness, since attention alone does not capture word sequence.
For marketers, understanding attention is key: models prioritize structured, contextually clear, and semantically rich content.
Takeaway: You cannot “rank” in ChatGPT, but you can design content that aligns with how attention mechanisms process information, improving inclusion and citation probability.

At the heart of every Large Language Model lies a special kind of neural network architecture called the transformer.

Originally designed for natural language processing, transformers have since become the foundational architecture across AI domains, powering not only text generation, but also image recognition, video understanding, audio synthesis, and multimodal reasoning.

But before we explore how the transformer revolutionized natural language processing, we must first understand the building blocks it evolved from, and why attention became the key breakthrough that allowed models like GPT to scale beyond anything that came before.

Enjoy Ad Free

Access the full Research Paper. For free.

This article is just an extract from the full 100 pages independent research I’ve written for Fuel LAB® Research over 2 years of analysis, studying LLMs models, and data collection.

Index

What Are Neural Networks in Large Language Models?

A neural network is, at its core, a stack of mathematical operations layered to extract patterns. Early language models relied on architectures like recurrent neural networks (RNNs) and long short-term memory networks (LSTMs).

These models processed text sequentially, one token at a time, and powered early applications like Apple’s Siri, voice-to-text tools, and predictive typing, and were even used in Google Translate before 2017. But they had major limitations:

Limited memory of earlier tokens
Vanishing gradients during training
Poor parallelization, making them inefficient to scale

The breakthrough came in 2017, when a well known team at Google Brain and Google Research published a paper titled “Attention Is All You Need.”

The authors, Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin, introduced a completely new architecture: the transformer.

Unlike RNNs and LSTMs, transformers eliminated recurrence altogether. They used a mechanism called self-attention, which allowed the model to weigh the importance of all tokens in a sequence regardless of their position, and do so in parallel.

The result? Dramatically improved performance, better long-term memory, and the foundation for models like GPT, BERT, Claude, and LLaMA.

The transformer wasn’t just an upgrade, it was a turning point.

Within a year of its introduction, the transformer architecture had replaced RNNs in virtually every state-of-the-art language model.

Today, in late 2025, transformer architectures remain the universal backbone of advanced AI systems, powering not only text generation (as in GPT-5, Claude 4.1, and LLaMA 3), but also image synthesis (e.g., DALL·E 4, Midjourney 7), video generation and analysis (Gemini 2.0 Pro, Veo 2), multimodal agents (GPT-5 with Agent Mode, Claude Artifacts), and even embodied AI for robotics and autonomous planning.

Their scalability, parallelism, and self-attention mechanism made them the first architecture to unify language, vision, and action into a single computational framework

In simple terms, a transformer allows the model to look at all the words in the input at once, using a mechanism called attention to decide which ones matter most for predicting the next word.

Each token is generated one at a time, but within that generation, the transformer uses a deep stack of feedforward operations and attention layers, looking at the entire previous context in parallel. There’s no memory loop like in RNNs; just a massive, structured, one-pass reasoning engine per token.

For example, when reading “The cat sat on the mat, but it ran away”, the model can figure out that “it” refers to “the cat”, not “the mat”.

The Attention Mechanism Explained

Self-attention allows the model to examine every word in a sentence, not just in order, but in relation to all the others, and assign a weight to how much each word should influence the interpretation of the others.

Imagine you’re trying to understand the meaning of the word “bank” in the sentence:

“He sat by the bank and watched the river flow.”

The word “bank” could mean a financial institution or the edge of a river. Attention mechanisms help the model weigh nearby words like “sat”, “watched”, and “river” more heavily, making it more likely to interpret “bank” as part of a landscape, not a place for money.

Technically, the model builds a matrix of relationships where every word gets a numeric score indicating how strongly it relates to every other word. This allows it to:

Capture context far beyond adjacent words
Resolve ambiguity more accurately
Keep track of relevant meaning over long sequences

Finally, because attention looks at all tokens simultaneously, it also enables parallel processing, allowing transformers to train faster and scale bigger than older architectures like RNNs or LSTMs.

While the original transformer architecture performed full self-attention over every token, today’s models, like GPT-4o and Gemini 1.5, use variants like grouped-query attention and sliding windows to reduce compute.

This allows them to handle context windows of up to 1 million tokens, enabling document-level and multi-session reasoning.

Multi-Head Attention

But attention is not applied just once. In transformer models like GPT-5, attention is multi-headed. This means that the model learns to look at the sequence through multiple perspectives at once.

Each “head” learns to focus on different types of relationships:

One might learn syntax (subject–verb agreement)
Another might capture coreference (who is “she”?)
Another might specialize in world knowledge (capital cities, famous names)

The outputs of all heads are then combined and transformed into a single, unified representation that the model can use to predict the next token. This, is the part that resembles “intelligence” the most (or defines it, this is up to subjective interpretation).

Picture credit: Pietro Mingotti, CEO & Head of Digital @ Fuel LAB® – Miro Sketches – Multi-head Attention

This mechanism allows the model to encode multiple layers of meaning simultaneously, making it incredibly powerful, and incredibly hard to interpret.

Multi-head attention lets the model look at the same sentence in different ways at the same time, like having several readers, each highlighting a different kind of meaning, then merging their notes into one understanding.

Positional Encoding: Giving Order to Chaos

There’s one challenge with transformer models: they don’t know word order by default.

Because self-attention treats all tokens simultaneously, like a “bag of words”, the model has no built-in concept of before or after. To solve this, transformer models add positional encodings, which are mathematical signals (often sine/cosine waves or learned vectors) that give each token a sense of position in the sequence

This allows the model to learn, for example, that:

“Paris is the capital of France” ≠ “France is the capital of Paris”

Even though the words are the same, the order, and thus the meaning, differs. Positional encodings restore this distinction.

Since transformers read all words at once, they need extra clues to know what came first. Positional Encoding gives each word a timestamp, so ‘Paris is the capital of France’ doesn’t sound like the reverse

Embedding Layers: From Words to Vectors

Before attention can operate, input text must be converted from language to numbers. This is done in two steps: first, the text is broken into tokens, and then each token is mapped to a unique vector, a list of numbers that captures its meaning:

Tokenization splits text into chunks (words, subwords, or characters)
Embedding layers assign each token a high-dimensional vector (e.g., 768 to 12,000+ numbers)

These vectors don’t just represent words. They represent relationships between words. Tokens with similar meanings (like dog and puppy) are located close together in this multi-dimensional space, while very different words (like dog and rocket) are farther apart.

For example, the model might learn a pattern like this: if you take the vector for “king”, subtract “man”, and add “woman”, the result is close to the vector for “queen”.

If that sounds abstract, here’s one of the most famous examples:

JavaScript

vec("king") - vec("man") + vec("woman") ≈ vec("queen")

vec("king") - vec("man") + vec("woman") ≈ vec("queen")

This property, known from word2vec and carried into modern LLMs, is what enables analogical reasoning, synonymy, and deep language modeling.

Embeddings turn each word into a cloud of numbers, where similar meanings land close together, so ‘dog’ and ‘puppy’ are neighbors, while ‘dog’ and ‘spaceship’ live worlds apart in a vectorial model.

In newer models like Gemini 1.5 and GPT-4o, embeddings also apply to images, audio, and video, enabling cross-modal reasoning. A picture of a dog and the word “dog” can land in similar regions of the same vector space.

How Attention Affects Content Visibility

Understanding attention and embeddings isn’t just academic; it has practical implications for visibility that you will have by now guessed, if you’ve connected the dots.

✓ Models pay attention to semantic clarity.

Well-structured, unambiguous content is more likely to be parsed and cited correctly. Marketing claims are worth nothing. Clear explanations are golden.

✓ Embeddings are used to match your content to user prompts.

If your article is semantically similar to what a user asked, it’s more likely to be retrieved or referenced.

✓ Positional clarity helps models assign weight during attention.

Thus headings, bullet points, answer-first structure are highly relevant.

In short: the better your content speaks to the model’s architecture, the better chance it might have to surface in AI-generated responses.

Enjoy Ad Free

Access the full Research Paper. For free.

This article is just an extract from the full 100 pages independent research I’ve written for Fuel LAB® Research over 2 years of analysis, studying LLMs models, and data collection.

Summary

Concept	Role in LLMs
Self-Attention	Weighs token relationships across the sequence
Multi-Head Attention	Allows parallel views of meaning and structure
Positional Encoding	Gives tokens sequential context
Embeddings	Transforms words into vectors for computation
Transformers	Replace recurrence with parallelized attention

Faqs

What are neural networks in large language models?

Neural networks are layered mathematical systems that process input text step by step, learning patterns and relationships in language.

What is the attention mechanism in LLMs?

Attention allows the model to assign different weights to words in a sequence, focusing on the most relevant context to generate accurate responses.

How do transformers use attention?

Transformers apply self‑attention, which processes all tokens in parallel and captures both short‑range and long‑range dependencies in text.

How are transformers different from older models like RNNs?

Unlike recurrent networks that process text sequentially, transformers handle entire sequences simultaneously, making them faster and more scalable.

Why should marketers understand neural networks and attention for GEO / AEO?

Because these mechanisms determine how content is processed, retained, and cited by AI models. Clear, structured writing increases visibility in LLM outputs.

Inside LLMs: RLHF, RLAIF & the Evolution of Model Alignment

Pietro Mingotti — Sun, 03 Aug 2025 14:53:26 +0000

TL;DR

RLHF (Reinforcement Learning with Human Feedback) uses human annotators to rank model outputs and train reward models that align responses with human expectations.
Steps in RLHF: Supervised fine‑tuning → reward modeling → reinforcement loop with algorithms like PPO.
RLAIF (Reinforcement Learning with AI Feedback) reduces reliance on humans by using AI systems to evaluate outputs against defined principles.
Anthropic’s Constitutional AI is a leading example of RLAIF, where a “constitution” of rules guides feedback instead of thousands of annotators.
Limitations: Both methods risk bias, scaling challenges, and “preference drift.”
Takeaway for marketers: Alignment explains why LLMs generate safe, brand‑friendly answers — and why citation probability depends on producing content that matches preferred, high‑scoring outputs.

While pre-training equips Large Language Models (LLMs) with a broad statistical understanding of language, it does not make them helpful, safe, or aligned with user expectations.

Left in their raw form, these models can be verbose, biased, evasive, or simply unhelpful, even when technically accurate.

To bridge the gap between linguistic fluency and user alignment, LLM developers introduced additional fine-tuning steps after pre-training. Until 2023, the dominant approach was Reinforcement Learning with Human Feedback (RLHF).

In 2024–2025, that pipeline is evolving into RLAIF (Reinforcement Learning with AI Feedback) and DPO (Direct Preference Optimization), which dramatically reduce cost and increase scalability.

But the goal remains the same: to tune the model toward behavior that humans rate as clear, cooperative, safe, and useful; and in doing so, shape which types of content are preferred, cited, and trusted in generative outputs.

This is where and how models stop being just token predictors and start becoming assistants.

Enjoy Ad Free

Access the full Research Paper. For free.

This article is just an extract from the full 100 pages independent research I’ve written for Fuel LAB® Research over 2 years of analysis, studying LLMs models, and data collection.

Index

Why model alignment: pre-training alone isn’t enough

Pre-training teaches a model to predict the next word, not to follow instructions, be helpful, or behave safely.

This results in several limitations:

Incoherent or verbose answers
Inappropriate, biased, or unsafe completions
Overconfidence in wrong answers
No natural tendency to cite or clarify

Thus, a second phase (often called alignment) is needed.

This phase tunes the model to behave in ways that humans prefer, through supervised fine-tuning, ranking, and reinforcement signals.

The original alignment pipeline (RLHF Now Legacy)

While no longer dominant, RLHF is foundational for understanding how modern alignment techniques work. It can be divided in three steps.

1. Supervised Fine-Tuning (SFT)

Before using reinforcement learning, the base model is first trained on example conversations written by human trainers. These demonstrations teach it to:

Respond helpfully and concisely
Use an appropriate tone
Avoid harmful or misleading content

This creates a supervised baseline, a first attempt at controlled output.

2. Reward Model Training

Next, multiple answers to the same prompt are generated, and human labelers rank them from best to worst.

These rankings are used to train a separate reward model, a neural network that learns to estimate human preferences.

For example:

Prompt: “Explain blockchain to a 6-year-old.”

Labelers rank: Response A > Response C > Response B

Reward model learns: clarity, simplicity, accuracy = higher reward

3. Reinforcement Learning (PPO)

The final step is actual reinforcement learning. The model generates responses, and the reward model scores them. Using a technique called Proximal Policy Optimization (PPO), the LLM updates its parameters to maximize the expected reward.

It’s like training a dog: the model tries something, gets a “treat” if humans like it, and learns to prefer similar behaviors.

Not to break the romantic aspects here, but for the sake of clarity, in this case, the “treat” isn’t dopamine (like it would be for a human brain); it’s a numerical reward score, usually between 0 and 1, computed based on how closely the output aligns with what humans rated highly in the past.

The model doesn’t “feel” this reward, but it mathematically adjusts its behavior to produce outputs that are more likely to receive higher scores.

Over thousands of iterations, this fine-tunes the model to respond in ways that align with human intent, giving us ChatGPT as we know it: helpful, conversational, and most of the time, cautious.

Note: While OpenAI continues to use PPO in many alignment pipelines, newer approaches such as Direct Preference Optimization (DPO) and KTO (Kullback–Leibler Training Objective) are gaining popularity for their efficiency and simplicity.

Modern Alignment Approaches: RLAIF and DPO

By late 2023 and into 2024, frontier labs recognized that RLHF was:

Expensive (requiring thousands of human labelers)
Inconsistent (different raters had different preferences)
Slow (weeks to months per iteration)

Two new approaches emerged to address this.

A. RLAIF – Reinforcement Learning with AI Feedback

Instead of hiring humans to rank outputs, labs like OpenAI and Anthropic now use other models to act as preference judges.

A trained model evaluates the output of the main model and scores or ranks responses.

Removes human labor bottleneck
Increases consistency of preference judgments
Enables daily or hourly alignment updates based on usage data

For example, OpenAI now uses specialized GPT-4o-mini or domain-tuned GPT-5 variants to score outputs from GPT-5, enabling continuous feedback loops without human raters. Anthropic applies similar pipelines with Claude Sonnet evaluators refining Claude Opus models.

We are already in an era where machines optimize machines, with alignment models evolving alongside their parent systems

B. DPO – Direct Preference Optimization

DPO simplifies the process even further:

No reward model is needed
The base model is trained directly to prefer output A over B when A was preferred
Uses a contrastive loss function to push the model toward better outputs

This allows faster training, fewer moving parts, and more efficient tuning cycles, especially when combined with synthetic data.

Why Model Alignment Matters for Content Visibility

Whether models use RLHF, RLAIF, or DPO, they are being tuned to favor certain traits in content:

Clear, concise phrasing
Answer-first formatting
Neutral, factual tone
Helpful framing over promotional language

This has direct consequences for marketers, SEOs, and content strategists:

Reward-trained models are less likely to cite ambiguous, unstructured, or marketing / sales-heavy pages
They prefer sources that appear cooperative and educational
Content that mimics the structure of “preferred answers” (definitions, FAQs, bullet points) is more likely to be included or paraphrased

This applies not only to OpenAI’s ChatGPT, but to

Claude

Gemini

Perplexity

Does “Thumbs Up” matter?

Today, user feedback like or on answers does not update the model in real time, but:

It is logged
It is aggregated
It may feed back into future SFT batches or reward model updates

So, while individual votes don’t shift behavior immediately, they still inform the direction of future fine-tuning, especially for tone, style, and safety.

In practice, this means models may favor answer formats and tone that have previously been rated helpful, even if your page is more informative.

If your content looks overly promotional, reads like an ad, or contains unclear formatting, it is less likely to pass these emergent reward filters, whether human or AI-learned.

This has direct implications for content strategy. If your page is difficult to parse or feels manipulative or sales-heavy, it may be actively disfavored by the model; even if the information is correct.

Enjoy Ad Free

Access the full Research Paper. For free.

This article is just an extract from the full 100 pages independent research I’ve written for Fuel LAB® Research over 2 years of analysis, studying LLMs models, and data collection.

Summary

Phase	Purpose
SFT (Supervised Fine-Tuning)	Teaches the model how to follow instructions and converse
Reward Model (RLHF only)	Scores outputs based on human rankings
PPO (RLHF)	Optimizes responses to match human-labeled rewards
RLAIF	Uses model-generated preferences to scale reward scoring
DPO	Directly optimizes for “preferred output” without reward model
User Feedback ( )	Logged for SFT and future tuning, but no live update
Marketing Relevance	Models favor clarity, structure, helpfulness and not sales-language

Alignment techniques (whether RLHF, RLAIF, or DPO) fundamentally shape how a model “decides” what to say, and what content to favor. These techniques are invisible to users, but they explain why some content gets cited and others ignored.

Next, we’ll explore what happens when this system fails: hallucinations, false confidence, and the limits of what LLMs can know.

Faqs

What is RLHF in large language models?

Reinforcement Learning with Human Feedback (RLHF) is a training process where humans rank outputs, and a reward model teaches the LLM which responses are preferred.

How does RLHF work in practice?

It starts with supervised fine‑tuning, adds a reward model based on human rankings, and then applies reinforcement learning to optimize the model toward those preferences.

What is RLAIF and how is it different?

RLAIF (Reinforcement Learning with AI Feedback) uses AI systems instead of humans to provide feedback, often guided by written rules or constitutions.

Why is alignment important for ChatGPT and other LLMs?

Alignment ensures models don’t just predict text, but generate answers that are safe, useful, and in line with human values.

What does model alignment mean for marketers?

It explains why LLMs won’t “rank” your site like Google. Instead, they generate outputs that fit their alignment rules. To increase citation probability, marketers need structured, factual, and preference‑friendly content.

Inside LLMs: why LLMs don’t really “know” things

Pietro Mingotti — Sun, 03 Aug 2025 14:53:25 +0000

TL;DR

LLMs don’t “know” facts. they predict tokens based on training data.
Their knowledge is limited to the pre‑training corpus and cut‑off dates.
Hallucinations occur when models generate fluent but incorrect answers.
Factual reliability improves with retrieval‑augmented generation (RAG).
Marketers must treat LLM outputs as probabilistic, there’s no ranking and no guarantee.
Strategy: structure content for higher citation probability rather than expecting factual “ranking.”

Despite their remarkable fluency, Large Language Models (LLMs) don’t “know” anything in the human sense of the word. They do not reason with will or identity. They do not retrieve. They do not store facts in a database. What they do is predict, based on statistical patterns.

There is no will. There is no “intelligence” the way we are used to define it.

This leads to one of the most misunderstood and critical limitations of modern AI: LLMs often get things wrong, and they sound very confident about the wrong things they output.

This section explains why.

Enjoy Ad Free

Access the full Research Paper. For free.

This article is just an extract from the full 100 pages independent research I’ve written for Fuel LAB® Research over 2 years of analysis, studying LLMs models, and data collection.

Index

Hallucination Is a Feature, Not a Bug

In AI, a hallucination is when a model generates text that is fluent but factually incorrect.

Examples include:

Inventing fake quotes or sources
Asserting wrong historical dates
Confabulating statistics, products, or company names
Coming up with data that never existed in the one you fed it

The reason LLMs hallucinate is fundamental to how they’re built:

They were trained to …

… predict the next token

This means:

If a false statement is statistically plausible based on the training data, the model may confidently generate it.
If a rare or nuanced fact was never seen during training, the model will fill in the gap with what seems “likely.”
If you ask for a specific answer (e.g. “Give me 15 famous transexual blonde astrophysicists”), the model will oblige, even if it has to fabricate names to meet your request.

The model isn’t lying. It has no concept of truth; only percentage-weighted probability of likelihood.

Newer models now attempt on-the-fly factual grounding using RAG pipelines or citation prioritization, especially in tools like Perplexity, Gemini, and ChatGPT w/ browsing.

Compression, Not Memorization

It’s tempting to think of LLMs as having read the internet and memorized it. Most of us, including myself, had this feeling. But that’s not how they work.

Modern LLMs are lossy compressors. They are trained to absorb billions of tokens into a finite number of parameters (e.g., 175B in GPT-3, possibly 1T+ in GPT-4). This process:

Discards low-frequency details
Prioritizes common, central representations
Blurs the edges of uncommon knowledge

This means:

✓ Facts that appear frequently and consistently are well-modeled

☞ Rare, subtle, or contradictory facts may be “averaged out” or lost entirely

☞ Models may paraphrase incorrectly or attribute facts to the wrong sources

So when a model seems to “know” something, what it really has is a statistical generalization, not an entry in a local Wikipedia.

With MoE (Mixture of Experts) architectures like GPT-5 and Gemini 2.5, the model can route different inputs to specialized subnetworks, improving retention of edge-case knowledge, but at the cost of even greater opacity in how and why some facts are prioritized over others.

Which can be an aggravating problem for the scope of this research.

Precision vs. Coverage: A Trade-Off

LLMs are trained to be generalists; to speak about medicine, politics, literature, software, relationships, and thousands of other domains and semantic entities.

But this breadth comes at the cost of precision. As the model tries to be competent across all topics, it becomes less reliable on the edge cases.

This creates a problem for industries that depend on accuracy:

Legal: hallucinated laws or court decisions
Medical: fabricated conditions or outdated guidelines
Finance: incorrect math or citation of non-existent rules

In our marketing context, this means the models might:

☞ Attribute products to the wrong company

☞ Confuse competitors

☞ Reference articles that don’t exist

That’s why fact-checking every AI-generated output is not optional, it’s mandatory. But also, that’s why currently every Marketing framework will never be fail-proof.

Not because of our best efforts as MarTech professionals, but because of how the models work.

Why LLMs Sound Confident Even When Wrong

Part of the confusion stems from how LLMs were trained to communicate. ChatGPT doesn’t hedge or express doubt unless prompted to, and that’s simply because hedging is statistically rare in confident writing.

There’s no conspiracy theory to be found here; as explained, the model doesn’t mean to lie. It’s just that in all the training data, everyone always sounded confident.

Consider:

“I’m not sure, but I think the capital of France is Paris.”
✓ “The capital of France is Paris.”

The second sentence is more probable in the training data, so the model prefers it, even when it’s unsure. This default confidence leads users to overtrust the model, a risk amplified by polished tone and rapid response.

In the case the readers of the SEO industry were wondering, that’s actually why it’s unethical for Google to offer SGE in SERP.

We all use A.I. models daily, and that’s why it’s even more important to have a reliable, independent and ranking-based search engine for our facts checking.

Implications for Marketers and SEOs

If a model can:

Misquote your brand
Confuse your product with a competitor’s
Link to your site but summarize it incorrectly
Invent facts with your name attached…

… then your visibility strategy must include AI fact hygiene:

✓ Monitor citations across ChatGPT, Bing Copilot, and SGE
✓ Regularly test how the model represents your company, products, and messaging
✓ Publish structured, explicit, unambiguous content reducing the chance of confabulation
✓ Create canonical sources of truth (e.g. FAQ pages, schema-enhanced content) that can be used as “grounding material”

Enjoy Ad Free

Access the full Research Paper. For free.

This article is just an extract from the full 100 pages independent research I’ve written for Fuel LAB® Research over 2 years of analysis, studying LLMs models, and data collection.

Summary

Limitation	Cause
Hallucination	Prediction over verification
Loss of rare facts	Compression during training
Incorrect citations	Lack of retrieval or grounding
Overconfidence	Imitation of confident writing in training data
Confused brand mentions	Semantic similarity, not identity tracking
Model collapse (future risk)	AI self-sampling & degraded truth anchoring

LLMs are powerful, but they are not fact engines. They are text simulators, and while they can express truth, they do not know it.

Understanding this is essential not just for prompt engineering, but for building trustworthy, AI-resilient digital brands.

Mojo

[Pre-Training Dataset] ──► [Token Prediction Engine (Transformer)]
                               │
                               ▼
       [Transformer Blocks: Attention + MLP (+ MoE) + Positional Info]
                               │
                               ▼
         [Base Model Weights (Frozen or LoRA-tunable)] ──► [Supervised Fine-Tuning]
                                                                │
                                                                ▼
                                      [Reward Model + RLHF (Human Feedback)]
                                                                │
                                                                ▼
                                 [ChatGPT / Claude / Gemini – Final Model]

                                                ▲
                                                │
                         ┌──────────────────────┘
                         │
                [Prompt Input]
                         │
         [Optional: RAG / Search / Tools Invocation]
                         │
                         ▼
          [Tokenization & Embedding] → [Token-by-Token Output]

[Pre-Training Dataset] ──► [Token Prediction Engine (Transformer)]
                               │
                               ▼
       [Transformer Blocks: Attention + MLP (+ MoE) + Positional Info]
                               │
                               ▼
         [Base Model Weights (Frozen or LoRA-tunable)] ──► [Supervised Fine-Tuning]
                                                                │
                                                                ▼
                                      [Reward Model + RLHF (Human Feedback)]
                                                                │
                                                                ▼
                                 [ChatGPT / Claude / Gemini – Final Model]

                                                ▲
                                                │
                         ┌──────────────────────┘
                         │
                [Prompt Input]
                         │
         [Optional: RAG / Search / Tools Invocation]
                         │
                         ▼
          [Tokenization & Embedding] → [Token-by-Token Output]

Faqs

Do LLMs like ChatGPT actually understand facts?

No. LLMs don’t store or reason over facts like humans. They predict likely words based on statistical patterns in training data.

Why do LLMs sometimes provide incorrect or “hallucinated” answers?

Because they prioritize fluency and probability, not truth. If the training data is limited or conflicting, they may generate plausible but false outputs.

Can retrieval‑augmented generation (RAG) solve LLM factuality issues?

Partially. RAG injects external verified data into prompts, improving factual grounding. But it still depends on model interpretation. Most importantly, these facts are persistent only in that one chat session. The model doesn’t “know” or “learn” anything new.

How should marketers approach LLM knowledge limits?

Marketers should not expect to “rank” on LLMs. Instead, they should create structured, trustworthy, citation‑ready content that LLMs can easily pull from.

What is the main risk of relying on LLM answers?

The risk is mistaking fluency for accuracy. LLMs sound authoritative but can be factually wrong, which can mislead decision‑making if unchecked.

Inside LLMs: Understanding Transformer Architecture – A Guide for Marketers

Pietro Mingotti — Sun, 03 Aug 2025 14:53:24 +0000

TL;DR

Transformers are the core architecture of modern LLMs like GPT‑4, Claude, and Gemini.
Each transformer block includes multi‑head attention, normalization, and MLP layers.
Stacking 90+ layers enables hierarchical abstraction: grammar → semantics → reasoning.
Mixture of Experts (MoE) increases scalability and specialization without extra compute.
This architecture is what allows LLMs to “reason” from words to complex concepts.

So far, we’ve explored the core building blocks that allow Large Language Models to process and predict language: self-attention, positional encoding, and embeddings.

Now, we’ll look at how these components are arranged inside a transformer model and how this architecture enables emergent capabilities like reasoning, abstraction, and memory-like (I emphasize, “like”) behavior.

This is where LLMs like GPT-4 begin to resemble thinking systems, not because they understand, but because of the depth and structure of their computation.

And from a neuroscience perspective, one might argue they begin to show primitive forms of plasticity and interconnectivity.

Enjoy Ad Free

Access the full Research Paper. For free.

This article is just an extract from the full 100 pages independent research I’ve written for Fuel LAB® Research over 2 years of analysis, studying LLMs models, and data collection.

Index

The Transformer Stack: Layers on Layers

A modern LLM like GPT‑5 or LLaMA 4 continues to rely on stacks of transformer blocks. However, increasingly common in the latest models, such as GPT‑5, LLaMA 4 (Scout, Maverick), DeepSeek‑V3, DBRX, and Claude 4.1, are Mixture-of-Experts (MoE) architectures. In these systems, only a subset of expert sub-networks (often specialized feed-forward layers) are activated per input, enabling massive capacity scaling with efficient compute use .

Each block performs the same sequence of operations:

Multi-head self-attention
Residual connection and layer normalization
Feedforward network (MLP, MultiLayer Perceptron)
Another residual connection and normalization

This repeating structure is simple, but when stacked deeply (96+ layers in GPT-4), it becomes extremely powerful.

Each layer builds upon the representations learned by the previous one, enabling hierarchical abstraction:

Early layers learn syntax and local phrase structure
Mid layers learn sentence-level semantics and entity resolution
Deep layers learn reasoning, logic, and instruction-following

This depth allows the model to “climb the abstraction ladder”, from characters and words to facts, concepts, and ideas.

Think of a transformer model like a multi-story building; each floor processes the input a little more deeply.

Early floors look at grammar, middle ones connect ideas, and top floors start “reasoning”. GPT-4 has over 90 of these stacked floors

Picture credit: Pietro Mingotti, CEO & Head of Digital @ Fuel LAB® – Miro Sketches – Transformer Stack and LLMs flowchart

1. Residual connections (shortcut memory paths)

Instead of passing information only through the complex transformations of each layer, residual connections add a shortcut. The input to a layer is added back to the output, like saying:

“Here’s the new version of the input… but don’t forget what we started with.”

This helps the model preserve low-level patterns (like word structure or local grammar) even as deeper layers start learning abstract concepts or logical relationships.

It also makes learning easier and faster, because the network doesn’t have to rebuild everything from scratch at each step. Which encompasses logical threats to the idea of “influencing learning”.

2. Layer normalization: keeping the signal clear

After every major operation (attention or MLP), the outputs are normalized (adjusted) to keep the values within a balanced range.

Most modern LLMs no longer use classic LayerNorm alone. Instead, they often apply variants like RMSNorm (Root Mean Square Normalization), which omit mean-centering, or ScaleNorm, which normalize only by scaling.

These variants are computationally cheaper and more stable at scale, especially in deep architectures like LLaMA 3 or Gemini 2.5.

In short: layer normalization has evolved from a simple stabilizer into a design choice that affects training convergence, scaling law efficiency, and model generalization.

This is like rebalancing a sound mixer after each effect is applied: it prevents any one signal from getting too loud (or too quiet), and ensures the model doesn’t “blow out” during training.

Without these two ingredients, models with more than a few layers would become unstable or forgetful. With them, LLMs can grow to hundreds of billions of parameters without collapsing.

MLPs as Fact Storage

Each transformer block contains not just attention heads but also a position-wise feedforward network, what is called a multilayer perceptron (MLP). Though simple in design, MLPs play a surprising role: they store and manipulate factual associations.

Recent interpretability research, including work by Anthropic, OpenAI, and Redwood Research, has shown:

MLPs encode structured knowledge, like country capitals or famous people. If you’ve been doing SEO right in the last 8 years, you know what Semantic Entities are. This is a very similar concept.
Specific neurons* activate when certain facts are invoked (e.g., “Barack Obama” → “President”.)
These MLPs often contain memorized fragments of training data, despite no explicit database.

This is how a model “remembers” that Paris is the capital of France, or that Tesla makes electric cars not because it retrieves this data from a source, but because it has encoded the pattern during pre-training.

*In neural networks, a neuron is just a mathematical unit; a function that transforms input into output based on learned parameters.

Inside each layer of the model are tiny mathematical switches (MLPs, or the neurons) that light up for facts it has seen enough to remember, like a brain cell whispering ‘Paris… capital… France”

This has led to the discovery of monosemantic neurons, individual units that reliably activate for specific concepts (e.g., “CEO”, “weapon”, or “Paris”), making LLMs more interpretable at a mechanistic level.

Sparse Computation with Mixture of Experts (MoE)

Most frontier models in 2025, including GPT-5, Claude 4.1 Opus, Gemini 2.5 Pro, and LLaMA 3 400B, use a new architectural technique called Mixture of Experts (MoE).

Unlike standard transformer blocks that use a single feedforward (MLP) layer per layer, MoE layers contain multiple parallel “experts”, mini-networks that specialize in different tasks or representations.

The model doesn’t use all experts at once. It uses a router to activate only the most relevant ones per token, usually 2 out of 16 or more. This allows the model to scale model capacity without increasing compute per token.

This has two key benefits:

Scalability: Models can grow to trillions of parameters without linear cost increases.
Specialization: Different experts learn different sub-skills, improving reasoning, math, or multilingual abilities.

Picture credit: Pietro Mingotti, CEO & Head of Digital @ Fuel LAB® – Miro Sketches – Mixture of Experts Model vs Triage

For marketers and content strategists, this has implications for prompt performance and content matching. Some queries may be routed to different experts depending on linguistic structure, tone, or subject matter, making semantic clarity more important than ever.

Directionality and Causal Masking

In models like GPT-4, the transformer architecture is unidirectional; it reads and predicts text from left to right, just like we do when writing a sentence.

To enforce this, the model uses something called a causal mask, which blocks it from “looking ahead” during training. This ensures the model can only base its predictions on the tokens that came before, never on future ones.

In other words, it has to guess the next word based on what it’s already seen; not what’s coming. This design mirrors human language production:

We write one word at a time, choosing the next based on the context we’ve built so far.

This is called autoregressive generation. The model generates text one token at a time, feeding its own output back in as input.

Because of causal masking and unidirectional attention, GPT produces language that flows fluently from start to finish, even when it’s inventing everything from scratch.

By contrast, other models like BERT are bidirectional: they look both left and right at once. That makes them great for understanding existing text, but unsuitable for generating it.

So GPT’s left-to-right, autoregressive structure isn’t a flaw; it’s what makes it such a powerful writer.

Interpretability and Latent Concepts

For years, LLMs were considered black boxes: powerful, but opaque. Today, thanks to interpretability research, we’re beginning to see inside the machine and what we’re finding is both fascinating and useful; yes, for marketing too.

Inside transformer models, we have discovered patterns like:

Groups of neurons that consistently activate when the model sees abstract themes, such as violence, humor, or gender.
Geometric relationships in embedding space, where word meanings form consistent directions, such as we have seen before: vec(king) – vec(man) + vec(woman) ≈ vec(queen)
Specialized zones in the model: some layers tend to handle syntax, others perform reasoning, and some respond to cultural or emotional tone.
Monosemantic neurons (a recent interpretability milestone) show that some individual units in models like GPT-4 and LLaMA 3 consistently represent a single human-interpretable concept, such as “CEO”, “spider”, or “accusation”. This enables much more transparent debugging.

In other words, LLMs aren’t just chaos under the hood. They build internal representations that mirror human concepts; not perfectly, but enough that we can often guess what part of the model is doing what.

This helps researchers (and marketers) in two ways:

Debugging behavior By locating which parts of the model respond to certain inputs, we can understand why it produces biased or inaccurate answers.
Fine-tuning more precisely Instead of retraining the whole model, developers can now edit or influence specific components, such as fact neurons or sentiment layers.

Tools like OpenAI’s Logit Lens and mechanistic interpretability frameworks developed by Anthropic and Redwood Research now allow scientists to trace token-level influence through the model’s depth.

Open-source efforts, especially on LLaMA 3, have used automated labeling techniques to assign semantic tags to latent directions, building large-scale maps of conceptual activation across the network.

Even if we can’t fully explain every step, we’re getting closer to tracing how a prediction is built, from concept activation to final output.

Enjoy Ad Free

Access the full Research Paper. For free.

This article is just an extract from the full 100 pages independent research I’ve written for Fuel LAB® Research over 2 years of analysis, studying LLMs models, and data collection.

Summary

Component	Function
Transformer blocks	Process input via attention + MLPs
Residual connections	Preserve information across depth
Layer normalization	Stabilize training and generalization
MLPs (feedforward nets)	Store associations and manipulate meaning
Causal masking	Enforce left-to-right prediction
Interpretability tools	Reveal latent directions and concept activation

The transformer architecture is not just a technical scaffold: it is the core engine of fluency and reasoning in LLMs.

The next chapter will explain how LLMs models are tuned after training, also using human feedback, and why that step matters enormously for marketing and visibility strategies.

Faqs

What is transformer architecture in AI?

Transformer architecture is a neural network design that uses self‑attention, residual connections, and feedforward layers to process sequences like language. It is the backbone of all modern LLMs.

How do transformers differ from traditional neural networks?

Unlike RNNs, transformers process all tokens in parallel, using attention to capture relationships between words at any distance.

Why are transformers important for marketers?

Understanding transformers explains why LLMs can generate human‑like answers. This helps marketers adapt strategies for AI visibility and citation. It also helps them sto trying to “rank” and making themselves ridicule online with false promises that depict lack of expertise on how these systems work.

What is a Mixture of Experts (MoE) in transformers?

MoE adds specialized sub‑networks (“experts”) that activate selectively, allowing larger models to scale efficiently and improve on reasoning, math, or multilingual tasks.

How many layers does a transformer have?

Nice question. Modern LLMs can stack dozens to hundreds of layers. GPT‑4 has 90+ layers, enabling abstraction from syntax to reasoning.

SEO for AI. Optimizing Your Website Content for Generative AI (ChatGPT & Co.)

Pietro Mingotti — Sat, 21 Jun 2025 10:16:11 +0000

In this research, I’ll try to address the SEO for AI topic by explaining how AI models find and select web content, and how you can optimize your site to become the source that AI references.

Generative AI models and LLMs like ChatGPT are becoming a new layer in content discovery, and have partially mangled some of the SEO market (as an example, publisher traffic): they answer user questions directly, thus extinguishing often the search intent and resulting in a zero-click-search, that is having a significant impact on Organic Search effots and goals for companies worldwide. I should know; all of our clients at Fuel LAB® have been asking for this… and we’ve been research for years.

These AI-driven results (be them from Google Gemini, ChatGPT, Claude, Perplexity…) are often citing and linking to sources. Many businesses are asking “How can we get our site cited or recommended by AI models?”.

While clear-cut rules are still evolving, and we can’t give a science-based framework for something that is, indeed, generative, early evidence suggests that once again, SEO is not dead: technical optimization and content quality remain key.

Index

How AI Models Find and Cite Web Content

Before optimizing, it helps to know how ChatGPT and similar AI systems fetch information. Modern generative models typically don’t have your website “memorized” unless it was in their training data – instead, since some time, they use a real-time search and retrieval process.

For example, ChatGPT’s browsing feature relies on web crawlers and search results fetched from Bing! (while Google Gemini, conversely, uses Google Search):

Search integration: ChatGPT (with browsing enabled) formulates search queries and retrieves top results via Bing’s search index. In essence, ChatGPT conducts a search behind the scenes – mostly long-tail queries – and then reads the content of the pages it finds. If your site isn’t showing up in those search results, the AI likely won’t see your content. I would argue, if your site isn’t ranking in the top first 3 results for those queries, the AI won’t see your content. We know this for certain now thanks to the OnCrawl efforts delivered in this PDF and also Jérôme Salomon brilliant and generous public divulgation on LinkedIn early in June ’25.
OpenAI’s web crawlers: OpenAI uses three primary crawlers (user agents) to access web content:
1) ChatGPT-User: a real-time crawler that fetches a page when a user’s prompt triggers it (i.e. ChatGPT “consults” your page to answer a question).
2) OAI-SearchBot: an indexing bot that asynchronously crawls the web to build an index for ChatGPT’s search functionality.
3) GPTBot: a crawler that collects content for training AI models (broad content ingestion).

Insight: The ChatGPT-User bot is the most exciting to see in your analytics – it means an end-user’s prompt caused ChatGPT to visit your page as a source. OAI-SearchBot’s activity indicates your site is being indexed in OpenAI’s “knowledge base” for answering questions, and GPTBot simply means your content may be used in model training (you can actually choose to allow or block this without affecting real-time answers, as discussed later).
Citation and answer construction: Once the AI has gathered relevant pages, the part of the process you can try and influence, is over; the model will now take over and will compose an answer. The model selects facts or text from those pages and cites the sources in the response. Early research indicates that content relevance to the query is the top “ranking factor” for which sources get cited.

In practice, ChatGPT will cite the pages that best answer the question or provide the clearest info, rather than basing it on traditional link-based page rank. This means even a lesser-known site can be cited if it precisely addresses the query, though being indexed and visible in search results is a prerequisite.
No JavaScript rendering: A crucial technical note – OpenAI’s bots do not execute JavaScript when crawling. They fetch the raw HTML. Again, good ol’ SEO on page best practices are still relevant. So any content that relies on client-side scripts (SPA content, lazy-loaded text, etc.) may be invisible to ChatGPT.

In other words, if it’s not in the static HTML, ChatGPT won’t see it. Ensuring your important text is server-rendered (or at least available in the initial HTML) is essential for AI and SEO bots alike.

Bottom line: To be cited, your content must first be found and understood by the AI. That means it should rank in the search results the AI consults, be accessible to crawlers, and be easy for the model to parse.

Technical SEO is fundamental for AI Visibility

From what we’ve observed in Fuel LAB®, technical SEO and site reputation play a foundational role in AI content selection. Many principles of traditional SEO (Search Engine Optimization) carry over into what some are calling “AEO” – Answer Engine Optimization – or LLM SEO.

We started calling this OSE (Organic Search Engineering) a while ago; it was already clear that while Technical and Semantic SEO are still the foundation, many other techniques and tools are required for a successful strategy.

Here are the must-do technical steps to help AI models find and favor your site:

Get indexed (special attention on Bing): ChatGPT’s search capability leans heavily on Bing’s index . Thus, ensuring your pages are indexed on Bing (and ranking well for relevant queries) is step one.

Use Bing Webmaster Tools to submit your sitemap and monitor indexation. Leverage the IndexNow protocol (supported by Bing) if your CMS offers it, to push new content to search instantly.

Fact: Without Bing indexation, your content might as well be invisible to ChatGPT.
Allow OpenAI’s crawlers: Make sure you’re not blocking OpenAI’s user agents (ChatGPT-User, OAI-SearchBot, GPTBot) in your robots.txt or firewall. If you’re part of a large enterprise, this is relevant to you. Many times “security” professionals will be blocking without your knowledge everything that they don’t understand.

In fact, including your XML sitemap in robots.txt is recommended, because ChatGPT’s indexing bot will crawl sitemaps if it finds them listed . This can accelerate discovery of all your important pages.

Note: If for some reason you want to opt out of training but still allow being cited in answers, you can disallow GPTBot while allowing OAI-SearchBot, since ChatGPT can still use your content via the search index even if it’s not in training data. But that’s kind of pointless then to hope to be cited. You gotta kill a cow to make a burger.
Ensure crawl accessibility: Treat ChatGPT’s bots like traditional search engine crawlers; they need to fetch content easily. That means fixing broken links (404s), avoiding fragile client-side rendering, and making sure your site doesn’t require special logins or cookies for core content. If certain pages are frequently crawled by ChatGPT-User or OAI-SearchBot (visible in your server logs), but returning errors, fix those issues promptly.

Tip: Monitor your log files for those user agents to see which pages are getting attention; these are likely candidates for appearing in AI answers.
Page speed and formatting: While we don’t have direct evidence that page load speed affects ChatGPT’s choices (as it would for human UX), it’s wise to ensure your pages are fast and lightweight for crawlers. More importantly, ensure the textual content is easily extractable – for example, avoid burying key info in images or complex HTML that might confuse parsers. A clean, semantic HTML structure (with proper headings, paragraphs, lists) helps AI models quickly identify the main points of your content.
No heavy client-side antics: As already said, don’t hide content behind JavaScript . If you use modern web frameworks, implement server-side rendering or hydrations that output meaningful HTML.

For instance, if you have an FAQ accordion written in React, make sure the FAQ text is present in the HTML (even if initially hidden via CSS) so crawlers can read it. Treat OpenAI bots similar to Googlebot in this regard – except even more restricted, since they never run scripts.
Schema Markup is fundamental, but not the way you think: Schema.org markup helps traditional search engines (like Google and Bing) understand the structure and context of your content. Since many AI models, including ChatGPT with browsing, rely on search engine indexes to find content, schema will indirectly help your content be found by improving how it ranks or gets featured in search results.
- Use FAQ schema (FAQPage) when possible — this helps in both search engine results and makes your content more likely to match question-answer prompts that LLMs handle.
- Use Article, Product, Service, and Organization schema to define what your pages are about and tie them to known semantic entities.
- Mark up authorship and dates to reinforce content freshness and attribution.
- Don’t rely on schema as a replacement for well-structured visible content: LLMs prioritize what’s in plain HTML.
- Don’t assume that adding schema will directly make ChatGPT cite you; it’s an indirect factor.

Ensuring the above will get your site into the AI’s “consideration set”. Think of it as indexability and crawlability for AI. Now, let’s talk about how to stand out among the considered sources.

Good SEO for AI – (Answer Engine Optimization) AEO is just a new name.

Once your site can be seen by AI models, the next challenge is to be selected and cited. That’s where the C-Suite and Technical Marketers will start fighting. The thing is, doesn’t matter how much they don’t like the answer, a Large Language Model doesn’t rank anything. As explained, and proved, it uses search engines, relies on their ranking, and then elaborates the answer.

So here, good old Organic Search Engineering is still key. Content quality, relevance, and structure become the battleground. Generative AI doesn’t “rank” pages by classic SEO metrics like backlinks; it’s trying to find the best answer for the user. So how can you craft content that an AI will judge as the best answer?

Here are strategies:

Cover the topic deeply and semantically: this means your content should thoroughly cover the user’s query, answer the main question and related follow-up questions, define important terms, and provide context.

LLMs are drawn to content that provides comprehensive, in-depth explanations because it gives them more to work with. For example, if the question is “How to optimize a site for ChatGPT citations?”, a shallow 200-word answer on your blog likely won’t be as useful to the AI as a 2000-word guide covering multiple angles (technical steps, content tips, examples, pitfalls).

But will this rank and be found by the model? That depends, as you know. Are you working for a huge brand with a ton of Domain Authority, or are you working with content that struggles to rank anyway? You know the answer to this. Often, even the best practices won’t give expected results, if several of the hundreds of rank factors are not matched.

Expand for Technical Explanation

Mechanistic interpretability of relevance: A recent study shows that LLMs use a multi-stage process—first extracting query/document info, then assessing relevance in layers, and finally using attention heads to rank documents for citation or response generation. This supports the idea that detailed and semantically rich content is more likely to be identified and used by LLMsen.wikipedia.org+1growthmarshal.io+1 arxiv.org.
Structured relevance assessment: Another publication comparing LLM relevance approaches found that models align closely with human judgments (Kendall correlation), indicating that they can accurately evaluate content when it’s structured and covers the query comprehensively.

Provide clear, immediate answers: Large Language Models tend to favor the first or clearest explanation of a concept in a page . So don’t bury them in complex, far away from the ATF (above the fold).

We have observed good results using an inverted pyramid approach: answer the core question in the opening paragraph or two, as clearly as possible, then elaborate further in the subsequent sections. This mirrors Google’s featured snippet optimization, but here it’s about giving the AI a quick grasp of your page’s relevance.

If your page has a succinct definition or answer right up front, ChatGPT might choose to latch onto that and cite you as a source of a clear definition.

Expand for Technical Explanation

Retrieval-augmented generation (RAG): RAG pipelines emphasize feeding relevant document passages into the LLM before answer generation; this means clarity in early answers improves the quality of AI output.
Citation accuracy challenges: Even with RAG, LLMs sometimes hallucinate or wrongly cite sources. Several studies show up to 50% of claims aren’t fully supported by the cited sources. Clear, upfront content reduces misalignment and promotes citation accuracy.

Structure your content for AI comprehension: Proper structure isn’t just for human readers – it also helps AI models understand what your content is and when to surface it . You need to realize that (again, just like with SEO) the time spent elaborating your content, finding it, crawling it, and so on, it’s all a cost for the technology that is operating. They will always count that as a factor, like with crawl budget. Use a good semantic html structure with descriptive headings (H2s, H3s) that outline the questions or subtopics you address. Utilize bullet points or numbered steps for procedural or list-based information.
A well-structured page allows the AI to navigate and extract the exact piece of information it needs. For instance, if you have a section titled “Technical SEO Tips for AI” and the user’s question is about AI crawling, the model can jump to that section. In contrast, a wall of unorganized text is harder for the AI to parse and might be overlooked in favor of a clearly organized competitor’s page.

Retrieval-augmented generation (RAG): RAG pipelines emphasize feeding relevant document passages into the LLM before answer generation; this means clarity in early answers improves the quality of AI output.
Citation accuracy challenges: Even with RAG, LLMs sometimes hallucinate or wrongly cite sources. Several studies show up to 50% of claims aren’t fully supported by the cited sources. Clear, upfront content reduces misalignment and promotes citation accuracy.

Expand for Technical Explanation

LLMs evaluate relevance in structured layers (query-document representation, instruction processing, attention heads). That means having clearly marked sections (e.g., H2/H3 for sub-& follow-up topics) aligns well with how LLMs “read” and prioritize text.

Demonstrate expertise and authority: While LLMs don’t directly measure E-A-T (Expertise, Authoritativeness, Trustworthiness) like Google might, they do analyze the content’s language and detail. Content that is persuasive and authoritative tends to “win” in AI answers.

This means writing with confidence, citing facts or data (yes, the AI can see if you reference statistics or reputable sources in your text), and providing insightful, original perspectives – not just generic fluff. Original research, unique case studies, or specific expert quotes on your pages can make them stand out to an AI looking for trustworthy information to share.

Expand for Technical Explanation

Citation practices in science: Research on how LLMs recommend academic citations reveals they display a bias toward highly cited, authoritative sources—indicating that perceived authority influences what gets referenced.

Enhancing transparency: A position paper advocates integrating citations into LLM output to bolster trust and accountability, suggesting that explicitly authoritative, well-sourced content could be more likely to be used

Original and human-friendly content: “Built for both human searchers and the models guiding them,” is a mantra to follow.

In practice – write for humans, but keep machines in mind. Once again, good old SEO rules. An engaging, well-explained article will naturally contain the elements an AI values (clarity, depth). Avoid overt “keyword stuffing” or awkward AI-targeted language. All that stuff is dead. Both ranking algorithms and LLMs use neural processing, in other words, these technologies are build to think as a human would.

Instead, focus on answering likely questions thoroughly. Remember that if a human finds your content valuable, there’s a good chance an AI model will find and use it as well, since human value often correlates with relevance and clarity.

Expand for Technical Explanation

Even models using RAG can hallucinate or misinterpret nuanced phrasing, reinforcing that plain-language clarity and human-centric writing reduce errors, editors improve model citations, and make the output more factual.

I could keep writing a dozen of other reccomendations, but that would all qualify as SEO optimizations. Although this is a research article and not just a blog post, for the sake of readability, let’s stick with LLM directly impacting practices. But first, one last very interesting topic.

Long Tail Keywords, or short nGrams?

I’ve been studying for a long time how ChatGPT and similar models form search queries especially when they use web access to find sources.

While some claim that LLMs favor long-tail queries, evidence suggests that (as always in science) the truth is more nuanced: both long-tail and short-phrase (n‑gram) queries play a role, depending on how the model processes the user prompt.

AI Search Queries: Long but Compressed

When a user asks ChatGPT a complex, multi-part question — like:

“How do I optimize my WordPress site to be recommended by ChatGPT when someone asks about privacy-friendly CRMs?”

The model doesn’t pass this full prompt to a search engine. Instead, it analyzes the intent, identifies core topics, and typically generates multiple shorter subqueries behind the scenes. For instance:

optimize site for ChatGPT
how ChatGPT recommend websites
privacy-friendly CRM wordpress

These are often 4–5 words long, which technically qualify as long-tail keywords, but are still condensed compared to the original prompt. This process is supported by emerging data:

A Semrush/ChatGPT search behavior analysis showed that actual search queries average ~4.2 words, even when user prompts were 20+ words long.
Research on Chain-of-Thought prompting and query rewriting reveals that LLMs often generate multiple short, semantically targeted queries to cover the full intent of a long user question.

So What Should You Optimize For?

Both short and long queries matter, but in different ways:

Query Type	Role in LLM Reasoning	How to Optimize
Short n‑gram queries (2–3 words)	Represent atomic subtopics. Often used for direct retrieval or indexing.	Ensure that H2s, image alts, meta titles, and URLs include clean, high-volume search terms (e.g. `chatgpt seo`, `llm optimization`)
Mid/Long-tail phrases (4–6+ words)	Match more specific intents. Often reflected in how the model frames questions internally.	Include conversational headers (e.g. “How to get my site cited by AI?”) and phrase-level variations in your paragraph content, FAQs, and intro text.

LLMs break down complex prompts into multiple focused queries, often in the 3–5 word range; technically long-tail, but still distilled. That means your content should:

Include short, high-signal phrases for relevance scoring,
Provide deep, semantically complete answers for topic coverage,
Reflect both intent-specific and broad anchor terms in structure and phrasing.

In other words, don’t pick between long-tail and short-tail. Stop thinking in terms of keywords; it’s like trying to build a car while focusing obsessively on screws and bolts. You need to understand and mirror the model’s dual logic: compress intent, then expand coverage.

Useful Tools and Standards for AI SEO

Because AI-driven search is so new, we’re also (finally) seeing new tools and standards designed to help site owners adapt. I myself wished these were available years ago when we started researching.

Here are a few I like and personally use, worth knowing:

LLM analytics & tracking: Since traditional SEO tools don’t tell you when you’ve been cited by an AI, specialized solutions are popping up. For example, Peec AI and similar platforms let you track prompts and see which sources are appearing in AI-generated answers.

Ahrefs has even added an “AI Overview” share-of-voice in their suite to see if your brand is mentioned in Google’s AI answers.

If you’re serious about AI optimization, consider using these to measure your progress; they can reveal, for instance, that a niche forum or competitor is being cited often for topics where you have content gaps.
Log analysis for AI bots: As mentioned, checking server logs is a more technical but effective way to gauge your visibility.

If you use a log analysis tool (like Oncrawl’s log analyzer), you can filter for user agents like ChatGPT-User or OAI-SearchBot to see how often they hit your pages, which pages, and when.

A spike in ChatGPT-User hits might correlate with trending questions in your space that your site is helping answer. You can treat those pages as high priority for maintenance and improvement.

Expand for in-depth Workflow

What AI Bot Log Analysis Reveals

Bot Visits (“Impressions”)
Logs help identify which pages AI bots crawl, a form of impression that’s invisible to standard analytics.
- OAI-SearchBot visits indicate real-time indexing and “search readiness.”
- ChatGPT-User visits show your content was displayed as a source in a ChatGPT answer.
Referral Monitoring (“Clicks”)
You can see if users clicked through from ChatGPT‑cited links; useful since GA4 often fails to track these properly.
Identify Crawl Patterns & Friction
Analyze crawl frequency, bot hit distribution, error codes (4xx/5xx), and redirect chains. AI bots don’t behave like Googlebot; they may skip JavaScript-heavy or error-filled pages.

LLMs.txt (proposed standard, not adopted as June ’25): You might have heard about llms.txt, a proposed text file standard similar to robots.txt, where site owners could list important content for AI to crawl. The idea is to provide a roadmap of your site’s best “AI-friendly” content (like documentation, product info, FAQs) in a simple format.

However – no major AI services currently use llms.txt. OpenAI, Google, Anthropic, etc. do not yet support it, so adding an llms.txt file today likely has no effect on your visibility. It’s a speculative idea at this stage, much like having a meta tag that no search engine recognizes.

Google’s John Mueller has dismissed llms.txt as ineffective and unused by AI bots so far. This is actually and indication that llms.txt is going to be useful, since John seems to always speak out to debunk myths that actually are not such.

That said, some companies (e.g. Anthropic) have published an llms.txt on their site as a forward-looking measure, and free generators exist if you want to create one. Our advice: don’t rely on llms.txt for now; focus on proven fundamentals, but keep an eye on this space. If adoption grows, it could become another tool in the AI SEO toolbox.
GTP Extensions for Keyword Research: gotta love these, actually gotta love them, the developers who make these little bookmarklettes and extensions available to everyone to enjoy.
- Mike Friedman’s GPT Search Reasoning and Query Extractor
- Julian Redlich BetterGPT Plugin
- You can generate your llms.txt if you want to try that road
Companies presenting themselves as A.I. Analytics: I will update this research when I have actually used and tested all of these, but some interesting proposals that I can’t skip seem to be:

Conclusion and Key Takeaways

Optimizing your website for generative AI models is an emerging discipline more than emerging science, but the early certain lessons are clear: a technically sound site + high-quality, well-structured content = the best chance of being cited by AI.

In practice, that means making your content easily discoverable (indexed on Bing, accessible to OpenAI’s crawlers) and making it genuinely useful (relevant, comprehensive, and clearly presented).

Let’s summarize the key points:

Indexing & Accessibility: If you’re not indexed in search (especially Bing, in the case of ChatGPT) or if your site blocks AI crawlers, you won’t even enter the race. Make your content visible and crawler-friendly (no heavy JS, no login walls).
Relevance is King: In AI answer selection, relevance and depth trumps fame. A lesser-known site that thoroughly answers a niche query can be cited over a top brand with thin content (and on this, long tail keyword optimization is still a winning strategy). Focus on answering questions completely and clearly; the AI will recognize that.
Content Structure Matters: Organize your content with headings, lists, and logical flow. A well-structured page is easier for an AI to digest and use. Think about the questions a user (or AI) might have and make those answers stand out in your text.
Keep it Human: Write in an authoritative but approachable tone, as you would for a savvy reader. Engaging, original content not only appeals to human readers (who ultimately are your customers), but it also tends to contain the nuance and detail that AI systems find valuable.
Monitor and Adapt: Since this field is new, continuously monitor how and when your content is appearing in AI responses. Use log analysis or AI SEO tools to get feedback. If you discover, for example, that ChatGPT is citing a competitor’s article on a topic you haven’t covered, that’s a golden opportunity to create new content and fill the gap.

Likewise, if you see ChatGPT citing you but paraphrasing incorrectly, you might need to clarify that section in your content.

Finally, a mindset note:

We are in the early days of AI-driven search. Best practices will evolve. Treat your optimization efforts as experiments.

What works for getting cited by ChatGPT today might shift as the models and algorithms improve. Stay informed with the latest research and community findings.

ai – Pietro Mingotti

How LLMs extract and quote snippets

What happens during RAG

What LLMs extract (and what they don’t)

When do LLMs show links to sources?

Why structure beats style (every time)

The tech behind what LLMs extract and quote

A quick AI extractability and citation potential checklist

Conclusions

How do LLMs decide which snippet to use?

Do LLMs always include a link when they quote me?

What parts of a page get extracted most often?

Does traditional SEO still matter for citations?

What page structures increase AI citation likelihood?

Do backlinks make content more quotable on AI?

How LLMs Work – Deep Technical Overview

How LLMs are disrupting Search Marketing

From search engines to answer engines

The rise of AI Snapshots and zero-click experiences

How ChatGPT, Bing Copilot, Google SGE / AIO / AIM and other LLMs differ

The Real-World Impacts on Marketing

Faqs

How are LLMs disrupting traditional SEO?

Do backlinks and CTR still matter in the age of LLMs?

What is the difference between ranking in Google and being cited by ChatGPT?

What should marketers do to adapt?

Does this mean SEO is dead?

Why you can’t rank on ChatGPT and other LLMs

Reason 1: LLMs don’t use a Ranking Algorithm like Search Engines do

Reason 2: Training Data isn’t a live index of the web

Reason 3: LLMs are not Search Engines. They are Generators.

So if you don’t rank on chatGPT, what’s actually happening?

And the core issue remains: LLMs don’t understand what they’re saying

Reason 4: your web or app property isn’t “In ChatGPT”unless it was in the Training Set or Search Results

In Summary:

FAQs

Can I make my website rank inside ChatGPT like on Google?

How is ChatGPT’s training data different from Google’s index?

Why doesn’t ChatGPT always cite my website when it uses my content?

Does link-building or keyword optimization help with AI visibility?

How can I improve my chances of being mentioned by LLMs?

Inside LLMs: How Pre‑Training Shapes What ChatGPT Knows

What is Pre-Training in LLMs for?

But what is a token?

Training Datasets: Scale and Source

Optimization: Loss Functions and Gradient Descent

Frozen Models vs Modular Systems

Summary

Faqs

What is pre-training in large language models?

Which datasets are used in LLM pre-training?

How do LLMs actually learn during pre-training?

What is the difference between pre-training and fine-tuning?

Why can’t LLMs update themselves after pre-training?

Why does pre-training matter for SEO and AI visibility?

Inside LLMs: Neural Networks & Attention

What Are Neural Networks in Large Language Models?

The Attention Mechanism Explained

Multi-Head Attention

Positional Encoding: Giving Order to Chaos

Embedding Layers: From Words to Vectors

How Attention Affects Content Visibility

Summary

Faqs

What are neural networks in large language models?

What is the attention mechanism in LLMs?

How do transformers use attention?

How are transformers different from older models like RNNs?

Why should marketers understand neural networks and attention for GEO / AEO?

Inside LLMs: RLHF, RLAIF & the Evolution of Model Alignment

Why model alignment: pre-training alone isn’t enough

The original alignment pipeline (RLHF Now Legacy)

1. Supervised Fine-Tuning (SFT)

2. Reward Model Training

3. Reinforcement Learning (PPO)

Modern Alignment Approaches: RLAIF and DPO

A. RLAIF – Reinforcement Learning with AI Feedback

B. DPO – Direct Preference Optimization

Why Model Alignment Matters for Content Visibility

Does “Thumbs Up” matter?