SEO – Pietro Mingotti (https://pietromingotti.com): Technical SEO, Advanced CPC and Digital Analytics Case Studies

How LLMs extract and quote snippets
https://pietromingotti.com/how-llms-extract-and-quote-snippets/ – Tue, 19 Aug 2025

TL;DR

  • Browsing models fan out multiple short queries, fetch top results, skim titles and intros, and compose a synthetic answer. Citations are added only when the system is confident about attribution.
  • The most reused fragments: page title, the first ~500–1000 characters, and any definition/answer block directly under a heading. Meta descriptions (your SERP snippet) matter more than you think.
  • Links are probabilistic. Clear structure, named entities, and “answer-first” copy raise your odds; blended sources and marketing fluff lower them.
  • Technical SEO still matters: fast HTML-first rendering, schema, SSR/static output. If retrievers can’t parse you quickly, you’re invisible.

If you’re still treating AI answers like “blue links with extra steps,” you’re going to miss where visibility actually happens. LLMs generate answers; they don’t rank or index anything. So how do LLMs extract content, and when do they quote and link it?

In browsing-enabled modes (ChatGPT w/ Bing, Bing Copilot, SGE, Perplexity, Claude), models don’t read your whole page like a human. They assemble answers from tiny, extractable fragments, and only sometimes attach a link.

Below I’ll show the pipeline, what gets lifted, when links appear, and how to format pages so they’re quote-friendly.


This article is an extract from the full 100-page independent research paper I wrote for Fuel LAB® Research, based on two years of analysis, LLM study, and data collection.

What happens during RAG

I’ve written a dedicated article extracted from the research paper on how LLMs work under the hood; however, what you need to know here is that when a user asks a complex question, the assistant:

  1. Rewrites the prompt into several short sub-queries (≈3–5 words).
  2. Calls search (usually Bing/Google) and gets back titles, URLs, and snippets.
  3. Scrapes partial content from a handful of top results (intros, definition blocks, sometimes FAQs).
  4. Composes an answer; may attach citations if a source fragment is used verbatim or near-verbatim and attribution confidence is high.

This aligns with how ChatGPT’s browsing mode and similar tools are described publicly: search first, skim, then synthesize; links appear depending on product heuristics.
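The four steps above can be sketched in a few lines of Python. Everything here is illustrative: real assistants use the model itself to rewrite queries and a search API to retrieve results; `fan_out` fakes the rewriter with a toy keyword heuristic, and the stop-word list and window size are my assumptions, not any vendor's actual values.

```python
# Toy sketch of the fan-out + skim stages of a browsing-style pipeline.

def fan_out(prompt, max_queries=4):
    """Fake query rewriter: reduce the prompt to short sub-queries.
    Real systems use the LLM itself for this step."""
    words = [w.strip("?.,").lower() for w in prompt.split()]
    stop = {"how", "do", "i", "my", "to", "be", "by", "when", "a", "the",
            "for", "about", "is", "and", "of"}
    keywords = [w for w in words if w not in stop]
    # Group the remaining keywords into ~3-word sub-queries.
    return [" ".join(keywords[i:i + 3])
            for i in range(0, min(len(keywords), 3 * max_queries), 3)]

def skim(page_text, window=1000):
    """Retrievers typically only read the first ~500-1000 characters."""
    return page_text[:window]

prompt = "How do I optimize my WordPress site to be recommended by ChatGPT?"
queries = fan_out(prompt)
print(queries)  # → ['optimize wordpress site', 'recommended chatgpt']
```

The compose-and-cite step is deliberately left out: that part happens inside the model and is exactly the part you cannot script.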

What LLMs extract (and what they don’t)

Models are on a tight budget: limited fetches, timeouts, and small “content windows” per page. That means they’ll often only lift:

  • Title (and H1 if distinct)
  • The first ~500–1000 characters of body copy
  • A tight definition or answer block immediately below a heading
  • FAQ/HowTo fragments (if clearly marked and near the top)

Practical consequences:

  • Front-load the definition or direct answer.
  • Keep early paragraphs short, declarative, and standalone.
  • Treat meta title + meta description like ad copy: these are sometimes the only words the model sees before deciding whether to fetch.

Think in “info windows”: one heading + 1–2 concise paragraphs + a bulleted list. This maps to how multi-vector retrieval compresses and ranks cohesive segments.
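One way to picture those “info windows” is to segment a page into heading-anchored chunks, roughly as a retriever might before embedding. A minimal sketch, assuming markdown-style headings; the splitting heuristic is mine, not any retriever's actual code:

```python
import re

def info_windows(markdown_text):
    """Split a document into heading-anchored windows: each window is
    a heading plus the copy directly beneath it, kept as one segment."""
    parts = re.split(r"(?m)^(#{1,3} .+)$", markdown_text)
    # re.split keeps the captured heading lines at odd indices.
    windows = []
    for i in range(1, len(parts), 2):
        heading = parts[i].lstrip("# ").strip()
        body = parts[i + 1].strip() if i + 1 < len(parts) else ""
        windows.append({"heading": heading, "body": body})
    return windows

doc = """# What is RAG?
RAG stands for retrieval-augmented generation.

## Why it matters
Answers are composed from retrieved fragments."""

print([w["heading"] for w in info_windows(doc)])
```

If a section only makes sense with its heading attached, this kind of chunking is exactly why: the heading travels with the paragraph, or not at all.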

When links appear

Linking is not the default; it’s an emergent behavior, triggered when internal rules agree that the source is relevant, extractable, and safely attributable:

More likely to link when

  • You provided a direct quote/definition the answer depends on
  • The domain is official/high-trust (gov, edu, Wikipedia, major trade sources)
  • The page shows clear authorship, date, and clean structure

Less likely when

  • The model blended multiple sources into one sentence
  • Your layout is messy, interactive, or slow to render
  • The text reads like “general knowledge” rather than a specific, attributable fact block

Observed platform patterns (abridged):

  • ChatGPT (Browsing): sometimes cites 1–3 sources; paraphrases heavily.
  • Bing Copilot: more visible links; favors clean lists/definitions.
  • SGE: mixes sources; often drops links in the primary summary.
  • Perplexity: aggressive inline citations; excellent for long-form attribution.
  • Claude: cites when docs are provided or web context is enabled.

Why structure beats style (every time)

Answer engines reward extractability, not flourish. To raise your quote probability:

  1. Answer first. Put the definition/conclusion in the first 2–3 sentences under each H2.
  2. Keep blocks self-contained. Each section should make sense if lifted in isolation.
  3. Prefer lists and tables. Step-by-steps and comparisons are regularly mirrored in AI output.
  4. Use schema. FAQPage/HowTo/Article/Organization raise machine legibility and attribution confidence.
  5. Brand early. Name, entity, and author metadata near the top helps the model name-drop correctly when it does cite.
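To make point 4 concrete, FAQPage markup can be generated from your Q&A pairs. This is a sketch of the schema.org shape by hand; on a real site the CMS or an SEO plugin would usually emit it for you:

```python
import json

def faq_jsonld(pairs):
    """Build schema.org FAQPage JSON-LD from (question, answer) pairs."""
    return json.dumps({
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": q,
                "acceptedAnswer": {"@type": "Answer", "text": a},
            }
            for q, a in pairs
        ],
    }, indent=2)

snippet = faq_jsonld([
    ("Do LLMs always link when they quote me?",
     "No. Links are probabilistic, not guaranteed."),
])
print(snippet)  # paste inside a <script type="application/ld+json"> tag
```

The markup should mirror Q&A text that is also visible on the page; schema that describes content the HTML doesn’t contain helps no one.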

The tech behind what LLMs extract and quote

You can’t be quoted if you can’t be fetched:

  • Robots & llms.txt. Allow GPTBot/ClaudeBot/Gemini/Perplexity unless you intend to be excluded from future retrieval/training.
  • HTML-first delivery. Avoid JS-gated copy, heavy modals, and client-side redirects.
  • SSR / static export. Guarantee retrievers get real text on first paint.
  • Speed + simplicity. Timeouts and fragile hydration mean skipped content.

A quick AI extractability and citation potential checklist

  • Every H2 opens with a one-sentence definition or answer
  • First 500–1000 chars read like a standalone snippet
  • FAQ/HowTo blocks exist and are marked up
  • Meta title/description state the answer, not just tease it
  • Tables for comparisons; lists for steps/principles
  • Org/Author/Article schema + clear dates/ownership
  • SSR/static build; no content behind modals/cookie walls
  • Robots.txt/llms.txt allow AI crawlers you want to influence

Conclusions

You don’t “rank” in an answer engine; you get selected in tiny pieces. Build pages as a series of clean, attributable information windows, and you’ll see your words show up where users actually read: inside the answer itself.

How do LLMs decide which snippet to use?

They fan the user prompt out into several short sub-queries, fetch top results, skim titles/intros/FAQs, then synthesize an answer using the most extractable fragments (definition-first, lists, short paragraphs). This is generation, not ranking. For a deeper primer on generation vs. retrieval, see my technical overview. → How LLMs Work – Deep Technical Overview.

Do LLMs always include a link when they quote me?

No. Links are not guaranteed. Even when your text influences the answer, the model may paraphrase without attribution, especially on zero-click surfaces. For context on why “ranking” expectations don’t apply, see “Why You Can’t Rank on ChatGPT”.

What parts of a page get extracted most often?

Page title, meta snippet, and the first ~500–1,000 characters, plus any clearly marked FAQ/definition blocks. Put the answer first. For the macro shift to “answer engines,” see “How LLMs are Disrupting Search Marketing.”

Does traditional SEO still matter for citations?

Yes, because retrieval-enabled LLMs pull from search indexes. If you don’t surface in Bing/Google for the fan-out queries, you’re effectively invisible at retrieval time. → See: “How LLMs are Disrupting Search Marketing.”

What page structures increase AI citation likelihood?

Definition-first paragraphs (“X is…”), bullet lists, short steps, Q&A sections, and clean semantic HTML. This aligns with how transformers attend to local structure and how retrieval pipelines skim. → See: “Understanding Transformer Architecture – A Guide for Marketers.”

Do backlinks make content more quotable on AI?

Indirectly at best. They may help you rank in SERPs (thus be seen by the retriever), but the selection is driven by clarity, extractability, and answer fit, not PageRank. → See: “Why You Can’t Rank on ChatGPT.”

SEO for AI: Optimizing Your Website Content for Generative AI (ChatGPT & Co.)
https://pietromingotti.com/seo-for-ai-optimizing-content-for-chatgpt/ – Sat, 21 Jun 2025

In this research, I’ll address the SEO-for-AI topic by explaining how AI models find and select web content, and how you can optimize your site to become the source that AI references.

Generative AI models and LLMs like ChatGPT are becoming a new layer in content discovery, and have already disrupted parts of the SEO market (publisher traffic, for example): they answer user questions directly, often extinguishing the search intent and producing a zero-click search. That is having a significant impact on organic-search efforts and goals for companies worldwide. I should know; all of our clients at Fuel LAB® have been asking about this… and we’ve been researching it for years.

These AI-driven results (be they from Google Gemini, ChatGPT, Claude, Perplexity…) often cite and link to sources. Many businesses are asking: “How can we get our site cited or recommended by AI models?”

While clear-cut rules are still evolving (we can’t give a science-based framework for something that is, indeed, generative), early evidence suggests that, once again, SEO is not dead: technical optimization and content quality remain key.

How AI Models Find and Cite Web Content

Before optimizing, it helps to know how ChatGPT and similar AI systems fetch information. Modern generative models typically don’t have your website “memorized” unless it was in their training data; instead, for some time now, they have used a real-time search and retrieval process.

For example, ChatGPT’s browsing feature relies on web crawlers and search results fetched from Bing (while Google Gemini, conversely, uses Google Search):

  • Search integration: ChatGPT (with browsing enabled) formulates search queries and retrieves top results via Bing’s search index. In essence, ChatGPT conducts a search behind the scenes – mostly long-tail queries – and then reads the content of the pages it finds. If your site isn’t showing up in those search results, the AI likely won’t see your content. I would argue that if your site isn’t ranking in the top 3 results for those queries, the AI won’t see your content. We know this for certain now thanks to the OnCrawl efforts delivered in this PDF, and also to Jérôme Salomon’s brilliant and generous public posts on LinkedIn in early June ’25.
  • OpenAI’s web crawlers: OpenAI uses three primary crawlers (user agents) to access web content:
    1) ChatGPT-User: a real-time crawler that fetches a page when a user’s prompt triggers it (i.e. ChatGPT “consults” your page to answer a question).
    2) OAI-SearchBot: an indexing bot that asynchronously crawls the web to build an index for ChatGPT’s search functionality.
    3) GPTBot: a crawler that collects content for training AI models (broad content ingestion).

    Insight: The ChatGPT-User bot is the most exciting to see in your analytics – it means an end-user’s prompt caused ChatGPT to visit your page as a source. OAI-SearchBot’s activity indicates your site is being indexed in OpenAI’s “knowledge base” for answering questions, and GPTBot simply means your content may be used in model training (you can actually choose to allow or block this without affecting real-time answers, as discussed later).
  • Citation and answer construction: Once the AI has gathered relevant pages, the part of the process you can try to influence is over; the model takes over and composes an answer. The model selects facts or text from those pages and cites the sources in the response. Early research indicates that content relevance to the query is the top “ranking factor” for which sources get cited.

    In practice, ChatGPT will cite the pages that best answer the question or provide the clearest info, rather than basing it on traditional link-based page rank. This means even a lesser-known site can be cited if it precisely addresses the query, though being indexed and visible in search results is a prerequisite.
  • No JavaScript rendering: A crucial technical note – OpenAI’s bots do not execute JavaScript when crawling. They fetch the raw HTML. Again, good ol’ on-page SEO best practices are still relevant. So any content that relies on client-side scripts (SPA content, lazy-loaded text, etc.) may be invisible to ChatGPT.

    In other words, if it’s not in the static HTML, ChatGPT won’t see it. Ensuring your important text is server-rendered (or at least available in the initial HTML) is essential for AI and SEO bots alike.
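A quick self-test follows from this: compare the unrendered HTML against the phrases you expect to be quoted. The helper and sample page below are mine for illustration; in practice you would fetch your real URL with any HTTP client (without executing scripts) and run the same check:

```python
def visible_without_js(raw_html, phrases):
    """Report which key phrases appear in the static HTML -- i.e.,
    what a non-JS crawler like GPTBot would actually see."""
    return {p: (p in raw_html) for p in phrases}

# A page whose definition is server-rendered, but whose FAQ is JS-injected.
raw_html = """<html><body>
<h1>What is AEO?</h1>
<p>Answer Engine Optimization (AEO) is SEO for AI answers.</p>
<div id="faq"></div><script src="faq-widget.js"></script>
</body></html>"""

report = visible_without_js(raw_html, [
    "Answer Engine Optimization (AEO) is SEO for AI answers.",
    "How do I get cited by ChatGPT?",   # lives only inside faq-widget.js
])
print(report)
```

Any phrase that comes back `False` here is text ChatGPT’s crawlers will never read, no matter how it renders in a browser.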
Bottom line: To be cited, your content must first be found and understood by the AI. That means it should rank in the search results the AI consults, be accessible to crawlers, and be easy for the model to parse.

Technical SEO is fundamental for AI Visibility

From what we’ve observed in Fuel LAB®, technical SEO and site reputation play a foundational role in AI content selection. Many principles of traditional SEO (Search Engine Optimization) carry over into what some are calling “AEO” – Answer Engine Optimization – or LLM SEO.

We started calling this OSE (Organic Search Engineering) a while ago; it was already clear that while Technical and Semantic SEO are still the foundation, many other techniques and tools are required for a successful strategy. 

Here are the must-do technical steps to help AI models find and favor your site:

  • Get indexed (special attention on Bing): ChatGPT’s search capability leans heavily on Bing’s index. Thus, ensuring your pages are indexed on Bing (and ranking well for relevant queries) is step one.

    Use Bing Webmaster Tools to submit your sitemap and monitor indexation. Leverage the IndexNow protocol (supported by Bing) if your CMS offers it, to push new content to search instantly.

    Fact: Without Bing indexation, your content might as well be invisible to ChatGPT.
  • Allow OpenAI’s crawlers: Make sure you’re not blocking OpenAI’s user agents (ChatGPT-User, OAI-SearchBot, GPTBot) in your robots.txt or firewall. If you’re part of a large enterprise, this is especially relevant: many times, “security” professionals will block everything they don’t understand, without your knowledge.

    In fact, including your XML sitemap in robots.txt is recommended, because ChatGPT’s indexing bot will crawl sitemaps if it finds them listed. This can accelerate discovery of all your important pages.

    Note: If for some reason you want to opt out of training but still allow being cited in answers, you can disallow GPTBot while allowing OAI-SearchBot, since ChatGPT can still use your content via the search index even if it’s not in training data. But then it’s kind of pointless to hope to be cited. You gotta kill a cow to make a burger.
  • Ensure crawl accessibility: Treat ChatGPT’s bots like traditional search engine crawlers; they need to fetch content easily. That means fixing broken links (404s), avoiding fragile client-side rendering, and making sure your site doesn’t require special logins or cookies for core content. If certain pages are frequently crawled by ChatGPT-User or OAI-SearchBot (visible in your server logs), but returning errors, fix those issues promptly.

    Tip: Monitor your log files for those user agents to see which pages are getting attention; these are likely candidates for appearing in AI answers.
  • Page speed and formatting: While we don’t have direct evidence that page load speed affects ChatGPT’s choices (as it would for human UX), it’s wise to ensure your pages are fast and lightweight for crawlers. More importantly, ensure the textual content is easily extractable – for example, avoid burying key info in images or complex HTML that might confuse parsers. A clean, semantic HTML structure (with proper headings, paragraphs, lists) helps AI models quickly identify the main points of your content.
  • No heavy client-side antics: As already said, don’t hide content behind JavaScript. If you use modern web frameworks, implement server-side rendering or hydration that outputs meaningful HTML.

    For instance, if you have an FAQ accordion written in React, make sure the FAQ text is present in the HTML (even if initially hidden via CSS) so crawlers can read it. Treat OpenAI bots similar to Googlebot in this regard – except even more restricted, since they never run scripts.
  • Schema Markup is fundamental, but not the way you think: Schema.org markup helps traditional search engines (like Google and Bing) understand the structure and context of your content. Since many AI models, including ChatGPT with browsing, rely on search engine indexes to find content, schema will indirectly help your content be found by improving how it ranks or gets featured in search results.
    • Use FAQ schema (FAQPage) when possible — this helps in both search engine results and makes your content more likely to match question-answer prompts that LLMs handle.
    • Use Article, Product, Service, and Organization schema to define what your pages are about and tie them to known semantic entities.
    • Mark up authorship and dates to reinforce content freshness and attribution.
    • Don’t rely on schema as a replacement for well-structured visible content: LLMs prioritize what’s in plain HTML.
    • Don’t assume that adding schema will directly make ChatGPT cite you; it’s an indirect factor.
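To make the crawler-access points above concrete, a robots.txt along these lines allows OpenAI’s three published user agents and lists the sitemap. The user-agent tokens are the ones OpenAI documents; the sitemap URL is a placeholder for your own:

```txt
# Allow OpenAI's three crawlers (real-time fetch, search index, training).
User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

# Change this block to "Disallow: /" if you opt out of training
# but still want to appear in ChatGPT's search index.
User-agent: GPTBot
Allow: /

Sitemap: https://example.com/sitemap.xml
```

Listing the sitemap here matters because, as noted above, ChatGPT’s indexing bot will crawl sitemaps it finds referenced in robots.txt.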

Ensuring the above will get your site into the AI’s “consideration set”. Think of it as indexability and crawlability for AI. Now, let’s talk about how to stand out among the considered sources.

Good SEO for AI: AEO (Answer Engine Optimization) is just a new name

Once your site can be seen by AI models, the next challenge is to be selected and cited. That’s where the C-suite and technical marketers start fighting. The thing is, no matter how much they dislike the answer, a Large Language Model doesn’t rank anything. As explained (and proved), it uses search engines, relies on their ranking, and then elaborates the answer.

So here, good old Organic Search Engineering is still key. Content quality, relevance, and structure become the battleground. Generative AI doesn’t “rank” pages by classic SEO metrics like backlinks; it’s trying to find the best answer for the user. So how can you craft content that an AI will judge as the best answer?

Here are strategies:

Cover the topic deeply and semantically: this means your content should thoroughly cover the user’s query, answer the main question and related follow-up questions, define important terms, and provide context.

LLMs are drawn to content that provides comprehensive, in-depth explanations because it gives them more to work with. For example, if the question is “How to optimize a site for ChatGPT citations?”, a shallow 200-word answer on your blog likely won’t be as useful to the AI as a 2000-word guide covering multiple angles (technical steps, content tips, examples, pitfalls).

But will this rank and be found by the model? That depends, as you know. Are you working for a huge brand with a ton of domain authority, or with content that struggles to rank anyway? You know the answer. Often, even the best practices won’t deliver the expected results if several of the hundreds of ranking factors aren’t matched.

Technical explanation:

  • Mechanistic interpretability of relevance: A recent study shows that LLMs use a multi-stage process: first extracting query/document info, then assessing relevance in layers, and finally using attention heads to rank documents for citation or response generation. This supports the idea that detailed and semantically rich content is more likely to be identified and used by LLMs.
  • Structured relevance assessment: Another publication comparing LLM relevance approaches found that models align closely with human judgments (Kendall correlation), indicating that they can accurately evaluate content when it’s structured and covers the query comprehensively.

Provide clear, immediate answers: Large language models tend to favor the first or clearest explanation of a concept on a page. So don’t bury your answers in complex layouts, far from the ATF (above the fold).

We have observed good results using an inverted pyramid approach: answer the core question in the opening paragraph or two, as clearly as possible, then elaborate further in the subsequent sections. This mirrors Google’s featured snippet optimization, but here it’s about giving the AI a quick grasp of your page’s relevance.

If your page has a succinct definition or answer right up front, ChatGPT might choose to latch onto that and cite you as a source of a clear definition.


Structure your content for AI comprehension: Proper structure isn’t just for human readers; it also helps AI models understand what your content is and when to surface it. Realize that (again, just like with SEO) the time spent finding, crawling, and processing your content is all a cost for the technology operating on it; they will always count that as a factor, much like crawl budget. Use a clean semantic HTML structure with descriptive headings (H2s, H3s) that outline the questions or subtopics you address. Use bullet points or numbered steps for procedural or list-based information.
A well-structured page allows the AI to navigate and extract the exact piece of information it needs. For instance, if you have a section titled “Technical SEO Tips for AI” and the user’s question is about AI crawling, the model can jump to that section. In contrast, a wall of unorganized text is harder for the AI to parse and might be overlooked in favor of a clearly organized competitor’s page.

Technical explanation:

  • Retrieval-augmented generation (RAG): RAG pipelines emphasize feeding relevant document passages into the LLM before answer generation; this means clarity in early answers improves the quality of AI output.
  • Citation accuracy challenges: Even with RAG, LLMs sometimes hallucinate or wrongly cite sources. Several studies show up to 50% of claims aren’t fully supported by the cited sources. Clear, upfront content reduces misalignment and promotes citation accuracy.

Technical explanation:

LLMs evaluate relevance in structured layers (query-document representation, instruction processing, attention heads). That means having clearly marked sections (e.g., H2/H3 for sub-& follow-up topics) aligns well with how LLMs “read” and prioritize text.

Demonstrate expertise and authority: While LLMs don’t directly measure E-A-T (Expertise, Authoritativeness, Trustworthiness) like Google might, they do analyze the content’s language and detail. Content that is persuasive and authoritative tends to “win” in AI answers.

This means writing with confidence, citing facts or data (yes, the AI can see if you reference statistics or reputable sources in your text), and providing insightful, original perspectives – not just generic fluff. Original research, unique case studies, or specific expert quotes on your pages can make them stand out to an AI looking for trustworthy information to share.

Technical explanation:

Citation practices in science: Research on how LLMs recommend academic citations reveals they display a bias toward highly cited, authoritative sources—indicating that perceived authority influences what gets referenced.

Enhancing transparency: A position paper advocates integrating citations into LLM output to bolster trust and accountability, suggesting that explicitly authoritative, well-sourced content could be more likely to be used.

Original and human-friendly content: “Built for both human searchers and the models guiding them” is a mantra to follow.

In practice: write for humans, but keep machines in mind. Once again, good old SEO rules apply. An engaging, well-explained article will naturally contain the elements an AI values (clarity, depth). Avoid overt “keyword stuffing” or awkward AI-targeted language; all that stuff is dead. Both ranking algorithms and LLMs use neural processing; in other words, these technologies are built to think as a human would.

Instead, focus on answering likely questions thoroughly. Remember that if a human finds your content valuable, there’s a good chance an AI model will find and use it as well, since human value often correlates with relevance and clarity.

Technical explanation:

Even models using RAG can hallucinate or misinterpret nuanced phrasing, reinforcing that plain-language clarity and human-centric writing reduce errors, improve model citations, and make the output more factual.

I could add a dozen other recommendations, but they would all qualify as SEO optimizations. Although this is a research article and not just a blog post, for the sake of readability, let’s stick with practices that directly impact LLMs. But first, one last very interesting topic.

Long Tail Keywords, or short nGrams?

I’ve been studying for a long time how ChatGPT and similar models form search queries especially when they use web access to find sources.

While some claim that LLMs favor long-tail queries, evidence suggests that (as always in science) the truth is more nuanced: both long-tail and short-phrase (n‑gram) queries play a role, depending on how the model processes the user prompt.

AI Search Queries: Long but Compressed

When a user asks ChatGPT a complex, multi-part question — like:

“How do I optimize my WordPress site to be recommended by ChatGPT when someone asks about privacy-friendly CRMs?”

The model doesn’t pass this full prompt to a search engine. Instead, it analyzes the intent, identifies core topics, and typically generates multiple shorter subqueries behind the scenes. For instance:

  • optimize site for ChatGPT
  • how ChatGPT recommend websites
  • privacy-friendly CRM wordpress

These are often 4–5 words long, which technically qualify as long-tail keywords, but are still condensed compared to the original prompt. This process is supported by emerging data.

So What Should You Optimize For?

Both short and long queries matter, but in different ways:

Query type: Short n-gram queries (2–3 words)
  Role in LLM reasoning: represent atomic subtopics; often used for direct retrieval or indexing.
  How to optimize: ensure that H2s, image alts, meta titles, and URLs include clean, high-volume search terms (e.g. chatgpt seo, llm optimization).

Query type: Mid/long-tail phrases (4–6+ words)
  Role in LLM reasoning: match more specific intents; often reflected in how the model frames questions internally.
  How to optimize: include conversational headers (e.g. “How to get my site cited by AI?”) and phrase-level variations in your paragraph content, FAQs, and intro text.

LLMs break down complex prompts into multiple focused queries, often in the 3–5 word range; technically long-tail, but still distilled. That means your content should:

  • Include short, high-signal phrases for relevance scoring,
  • Provide deep, semantically complete answers for topic coverage,
  • Reflect both intent-specific and broad anchor terms in structure and phrasing.

In other words, don’t pick between long-tail and short-tail. Stop thinking in terms of keywords; it’s like trying to build a car while focusing obsessively on screws and bolts. You need to understand and mirror the model’s dual logic: compress intent, then expand coverage.

Useful Tools and Standards for AI SEO

Because AI-driven search is so new, we’re also (finally) seeing new tools and standards designed to help site owners adapt. I wish these had been available years ago when we started researching.

Here are a few I like and personally use, worth knowing:

  • LLM analytics & tracking: Since traditional SEO tools don’t tell you when you’ve been cited by an AI, specialized solutions are popping up. For example, Peec AI and similar platforms let you track prompts and see which sources are appearing in AI-generated answers.

    Ahrefs has even added an “AI Overview” share-of-voice in their suite to see if your brand is mentioned in Google’s AI answers.

    If you’re serious about AI optimization, consider using these to measure your progress; they can reveal, for instance, that a niche forum or competitor is being cited often for topics where you have content gaps.
  • Log analysis for AI bots: As mentioned, checking server logs is a more technical but effective way to gauge your visibility.

    If you use a log analysis tool (like Oncrawl’s log analyzer), you can filter for user agents like ChatGPT-User or OAI-SearchBot to see how often they hit your pages, which pages, and when.

    A spike in ChatGPT-User hits might correlate with trending questions in your space that your site is helping answer. You can treat those pages as high priority for maintenance and improvement.


What AI Bot Log Analysis Reveals

  1. Bot Visits (“Impressions”)
    Logs help identify which pages AI bots crawl, a form of impression that’s invisible to standard analytics.
  2. Referral Monitoring (“Clicks”)
    You can see if users clicked through from ChatGPT‑cited links; useful since GA4 often fails to track these properly.
  3. Identify Crawl Patterns & Friction
    Analyze crawl frequency, bot hit distribution, error codes (4xx/5xx), and redirect chains. AI bots don’t behave like Googlebot; they may skip JavaScript-heavy or error-filled pages.
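The log-filtering step above can be approximated in a few lines of Python over your combined access log. The log lines here are made-up samples in common-log style; real logs vary by server, but the user-agent tokens are the real ones to grep for:

```python
from collections import Counter

AI_AGENTS = ("ChatGPT-User", "OAI-SearchBot", "GPTBot")

def ai_bot_hits(log_lines):
    """Count hits per (bot, path) for OpenAI user agents in an access log."""
    hits = Counter()
    for line in log_lines:
        for agent in AI_AGENTS:
            if agent in line:
                # Crude path grab: the token after the HTTP method.
                path = line.split('"')[1].split()[1]
                hits[(agent, path)] += 1
                break
    return hits

sample = [
    '1.2.3.4 - - [19/Aug/2025] "GET /seo-for-ai/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 ChatGPT-User/1.0"',
    '1.2.3.5 - - [19/Aug/2025] "GET /seo-for-ai/ HTTP/1.1" 200 5120 "-" "OAI-SearchBot/1.0"',
    '1.2.3.6 - - [19/Aug/2025] "GET /about/ HTTP/1.1" 404 0 "-" "GPTBot/1.1"',
]
print(ai_bot_hits(sample))
```

Pages that accumulate ChatGPT-User hits are the ones being consulted to answer live prompts; a 404 under GPTBot (as in the third sample line) is exactly the kind of friction worth fixing first.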
  • LLMs.txt (proposed standard, not adopted as of June ’25): You might have heard about llms.txt, a proposed text file standard similar to robots.txt, where site owners could list important content for AI to crawl. The idea is to provide a roadmap of your site’s best “AI-friendly” content (like documentation, product info, FAQs) in a simple format.

    However – no major AI services currently use llms.txt. OpenAI, Google, Anthropic, etc. do not yet support it, so adding an llms.txt file today likely has no effect on your visibility. It’s a speculative idea at this stage, much like having a meta tag that no search engine recognizes.

    Google’s John Mueller has dismissed llms.txt as ineffective and unused by AI bots so far. This is actually an indication that llms.txt is going to be useful, since John seems to always speak out to debunk myths that aren’t actually such.

    That said, some companies (e.g. Anthropic) have published an llms.txt on their site as a forward-looking measure, and free generators exist if you want to create one. Our advice: don’t rely on llms.txt for now; focus on proven fundamentals, but keep an eye on this space. If adoption grows, it could become another tool in the AI SEO toolbox.
  • GPT extensions for keyword research: gotta love these; actually, gotta love the developers who make these little bookmarklets and extensions available for everyone to enjoy.
  • Companies presenting themselves as A.I. Analytics: I will update this research when I have actually used and tested all of these, but some interesting proposals that I can’t skip seem to be:

Conclusion and Key Takeaways

Optimizing your website for generative AI models is an emerging discipline more than an emerging science, but the early lessons are already clear: a technically sound site + high-quality, well-structured content = the best chance of being cited by AI.

In practice, that means making your content easily discoverable (indexed on Bing, accessible to OpenAI’s crawlers) and making it genuinely useful (relevant, comprehensive, and clearly presented).

Let’s summarize the key points:

  • Indexing & Accessibility: If you’re not indexed in search (especially Bing, in the case of ChatGPT) or if your site blocks AI crawlers, you won’t even enter the race. Make your content visible and crawler-friendly (no heavy JS, no login walls).
  • Relevance is King: In AI answer selection, relevance and depth trump fame. A lesser-known site that thoroughly answers a niche query can be cited over a top brand with thin content (and here, long-tail keyword optimization is still a winning strategy). Focus on answering questions completely and clearly; the AI will recognize that.
  • Content Structure Matters: Organize your content with headings, lists, and logical flow. A well-structured page is easier for an AI to digest and use. Think about the questions a user (or AI) might have and make those answers stand out in your text.
  • Keep it Human: Write in an authoritative but approachable tone, as you would for a savvy reader. Engaging, original content not only appeals to human readers (who ultimately are your customers), but it also tends to contain the nuance and detail that AI systems find valuable.
  • Monitor and Adapt: Since this field is new, continuously monitor how and when your content is appearing in AI responses. Use log analysis or AI SEO tools to get feedback. If you discover, for example, that ChatGPT is citing a competitor’s article on a topic you haven’t covered, that’s a golden opportunity to create new content and fill the gap.

    Likewise, if you see ChatGPT citing you but paraphrasing incorrectly, you might need to clarify that section in your content.
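On the crawler-accessibility point above, the first concrete check is robots.txt. A sketch (GPTBot, OAI-SearchBot, ClaudeBot and PerplexityBot are real AI user agents at the time of writing; the disallowed path is a hypothetical example of what you might keep out):

```
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Disallow: /checkout/
```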
Finally, a mindset note: 

We are in the early days of AI-driven search. Best practices will evolve. Treat your optimization efforts as experiments.

What works for getting cited by ChatGPT today might shift as the models and algorithms improve. Stay informed with the latest research and community findings.
How will SGE impact SEO for businesses and agencies https://pietromingotti.com/how-will-sge-impact-seo-for-businesses/ Wed, 15 May 2024

On May 14th, 2024, Google finally dropped on the public the news we industry specialists had seen coming for a couple of years: the arrival of SGE (Search Generative Experience). In a nutshell, this means that the above-the-fold of most searches will, in the future, be served with AI-generated answers rich in context, media, videos, explainers, carousels, maps, charts, and more.

Understandably, this will change the future of search forever, and of SEO too. Or not? Here are my predictions based on what we know so far.

What is Google SGE / AIO / AIM

Google has taken a significant leap forward with its Search capabilities, introducing the new Gemini model, tailored specifically for AI-powered search. The Gemini model merges advanced functionalities like multi-step reasoning and multimodality with Google’s robust search systems.

Now, AI overviews are being integrated into general search results for U.S. users, with the rollout expected to reach over 1 billion users by the end of the year. These overviews can be fine-tuned in terms of language and detail, making them more user-friendly and personalized.

Additionally, the Gemini model excels at managing complex queries. For instance, you can ask for the best restaurants in Italy, including their specialty dishes and proximity to major landmarks, and get a detailed response. It also provides practical planning assistance for everyday needs, such as generating customized meal plans with recipes sourced from the web.

Google is also launching AI-organized results pages, which group useful information under unique, AI-generated headings, offering diverse perspectives and content types. Initially, this will cover dining and recipes, with plans to expand to other categories like movies, books, and shopping. Moreover, Google’s new visual search feature allows users to utilize video content for their queries, streamlining the search process and saving time.

For more details on Google’s innovative search features and the Gemini model, check out Google’s official blog here.

Video courtesy of the Google Blog

What kind of searches are served through SGE?

As Google’s Search Generative Experience (SGE) evolves, its impact will become increasingly apparent across various business verticals, with significant implications for SEO strategies. According to studies and insights gathered from various sources, industries such as healthcare, ecommerce, B2B tech, and education are seeing varied levels of AI integration in search results, which significantly influences organic traffic patterns​ (Search Engine Land)​.

Interestingly, SGE’s effect is not uniform across all sectors. For instance, while healthcare queries show a high percentage of AI-generated answers, the finance sector sees much lower integration, suggesting a cautious approach in areas dealing with sensitive information​ (Search Engine Land)​. This makes me giggle a bit (and worry a lot) when you consider what Lily Ray shared on her LinkedIn:

All in all, from the data we have today, the introduction of Search Generative Experience (SGE) disrupted the ecommerce, electronics, and fashion sectors most notably, although it affected all business verticals to some extent.

Which businesses are going to be affected worse?

So, in my opinion we shouldn’t focus too much on how Google is going to show the results, but on how people search. This has always been the winning strategy. Focus on people, and how to adapt your inbound strategy to get the right people, at the right time, for the right search intent.

Media & Publishers

As an example, I believe that SGE is going to serve very well a 1 click query like:

“who won yesterday’s match between Napoli and Roma?”

“how do you calculate cost per click?”

“How to make a Piña Colada?”

You see, to better serve user experience (and Google is all about that), you want to give an immediate answer to that search, without any click-through to sites full of ads and whatnot. That’s why I think the first and foremost vertical to really take a deep hit is going to be publishing and media (magazines, news websites, and so on).

These kinds of businesses rely entirely on programmatic advertising, and it’s surely a grim time for their existing content.
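To see why a model can satisfy the second example query above without a single click, note that the entire “answer” is a one-line formula:

```python
def cost_per_click(total_spend: float, clicks: int) -> float:
    """CPC = total ad spend / total clicks."""
    if clicks == 0:
        raise ValueError("CPC is undefined with zero clicks")
    return total_spend / clicks

print(cost_per_click(250.0, 500))  # 0.5
```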

Part of E-Commerce, Wellness and Fitness websites

Other queries that SGE will serve well are ones on fitness, wellness, and some online shopping, but not all of it (meaning discovery search, not bottom-of-the-funnel queries):

“What’s the best workout routine for a 6 months period to develop calves? Add some nutrition tips too”

“I want to eliminate meat from my diet, but I don’t know how to respect all of my nutrition macros”

“What are the best brands for natural non toxic skincare?”

“Best Christmas Gifts ideas for 2024”

So yeah, understandably these kinds of queries can only benefit from SGE. Bravo, Google. So discovery for e-commerce, and likely fitness and wellness sites, are going to be impacted significantly.

Online free tools

I don’t know about you, but I love online tools. Calculators, chart generators, image-video-audio converters, you name it. These nifty utilities are there at our disposal to help throughout the day with our job and tasks. In exchange, they render advertising (that we all ignore). Well, this is going to change.

In fact, SGE is perfectly capable of serving all of these results straight in the above-the-fold of the SERP, and it can do so very quickly (and it will only get quicker in the future).

Will Google entirely replace Search Engine Results with SGE content?

No, it won’t for 2 reasons:

  1. As I mentioned earlier, when we look at large numbers of queries, SGE serves mainly one-click queries. This is informational intent that wants to be satisfied in a split second. No, in my opinion it won’t take away your conversions in terms of lead generation or e-commerce revenue. Other things might, though, like the rearranging of rankings in the SERP that we recently saw. It may also cause the number of results pages to decrease significantly.

    Just think, as an example, of this website. If you needed a hand configuring and testing Consent Mode v2 in GTM, you couldn’t get that from SGE. You’d still click the generated result to inspect the article and go through it step by step.
  2. The other reason is that Google bases a lot of its revenue on search ads. Self-explanatory (sorry, I’m trying to keep this short).

Not all queries can be answered in a second; especially when a user is actively looking for a product or a service, they won’t just be satisfied with pretty graphs and charts and some pictures and cards. No, they want to buy the product, or they want to get the service that solves their issue.

That kind of traffic will still exist, and it’s the traffic that performance marketers and SEO’s are, after all, after.

Will this disrupt search? Yes, for sure.

Will it disrupt many online businesses, projects and things that were working great before? A good percentage, yes.

Is it ethical towards small and medium businesses? No, not a bit.

Will technical marketing survive this? It’s actually an age of opportunity, in my opinion, but there’s going to be a lot of turmoil.

What were the clues that this was about to happen

Well, the clues were all there, both from a technology-industry standpoint and from the themes of Google’s core updates.

Obviously AI has been at the core of research in the past years, but it’s been an ongoing field of research since 1956. It was just a matter of time until it reached Google’s B2C products. The same applies to cookies, by the way.

Then, when OpenAI released its ChatGPT model to the public, the race began, and the push to show Gemini was very strong.

Look at Google’s Core System Updates rollout, and think about them in the context of what Generative Search needs in order to work properly (and the environment it creates, with it).

To me, it’s clear there has been an active demotion of websites that exist to generate cheap traffic to render ads, and an uplift of websites that present fresh content only human experts can produce. And that’s what AI needs: human expertise. It needs our fresh ideas, ideas that come from real life, a life AI doesn’t have, both to train the models and to serve humans contextual content based on aspects AI just can’t make out quite yet.

Will blue links disappear?

I honestly don’t think so. Everyone who knows me knows how much I am “against” AI, in terms of what it’ll do to our youth when it comes to the very idea of what “thinking” means. I am not a fan of machines when we are not completely in control, and when those in control are unreliable.

Having said that, I think AI will need to keep feeding on data, to gain fresh insight, and keep its answers top-notch. How can it do so, if it stops to sample from direct experiences of authoritative humans? Also, the other reason why I don’t think blue links are going to disappear, is that many people want to find information and pieces of information which are contextual to their need of the moment. There’s no way AI can address all of this; it’s science, not magic.

This is true at least as we don’t all have a chip in our brain. Then, I think SEO will have bigger problems 😉

Case Study: E-commerce Organic Search Engineering https://pietromingotti.com/case-study-e-commerce-organic-search-engineering/ Thu, 27 Apr 2023

This Case Study focuses on the results obtained with Organic Search Engineering over two years, in terms of Search Market Penetration, Organic Traffic Value, and volume.

Abstract

The client is an emerging entity in the Digital Transaction Management and Technology industry. They covered both b2c (digital signature, certified communication…) and b2b (banking, insurance, and digitalization of secured processes) services. Fuel LAB was assigned the goal of scaling up e-commerce revenue exponentially, both with performance advertising and with the sustainability of organic search.

Two years later, the client is today recognized as a “Large Tech Provider” by the Forrester Report, won the Aragon Research Innovation Award in 2022, is nominated for the same award in 2023, and is recognized for business Excellence by Forbes.

This Case Study focuses on our approach and business impact in scaling up E-commerce Revenue and Value Traffic acquisition through Organic Search.

When we started getting involved in late 2020, the Client was facing several challenges in meeting their goals, such as:

  • Existing strong competitors with full Market Saturation (InfoCert, Poste Italiane, Aruba…)
  • Virtual products with no visual or physical appeal, of which the target audience has little to no knowledge in terms of use cases and functionality
  • Little to no pre-existing Performance Marketing experience or Search Traffic strategy, with over 20 websites to optimize and monetize.

Here’s how, at Fuel LAB, we have driven the scale-up of e-commerce digital sales and value traffic acquisition.

Why Organic Search Engineering (O.S.E.)

Organic Search Engineering is the name of the practice designed by Pietro at Fuel LAB for developing Organic Traffic projects on a large scale. Click here for more information on Organic Search Engineering.

The power of Organic Conversion Engineering and strategic volume growth through Technical SEO projects has met and exceeded the goals set by the Client.

While at Fuel LAB we have entirely designed, organized, and managed the Performance Search Engine Advertising (largely responsible for the whole e-commerce scale-up, by generating significant search demand), focusing on Organic Search is a pivotal point of digital strategy for several reasons, including but not limited to:

  • Strategic SEO and Technical SEO contribute to driving a consistent volume of highly converting traffic to the website, instead of merely focusing on volume, leading to a significant revenue impact.
  • Topical Authority and Top and Middle of the funnel search volume help to boost brand awareness, recognition, retention and coverage.
  • Increased search share saturation increases users’ confidence and trust in the brand, scaling Conversion Rate for other Traffic Acquisition Channels as well.

Our Organic Search acquisition strategy focused on acquiring a constantly increasing volume of highly converting traffic to the website, instead of merely focusing on impression volume.

This approach led to a yearly increasing volume of traffic from Organic Search with a significantly high revenue impact, while boosting the Brand Awareness and coverage thanks to Top of the funnel and Middle of the funnel queries.

Business Impact

Case Study data: February 2021 to February 2023.

KPI | Starting point | Goal | Result
Organic Search Saturation | 384.000/mo | 700.000/mo | 1.410.000/mo
Organic Search Volume | 40.410/mo | 100.000/mo | 133.000/mo
Organic Search Value | €24.314/mo | €50.000/mo | €79.350/mo
Organic Share of Voice | 2% | >5% | 7%
Top 3 positions for non-branded queries | 20 | 50 | 988

Through Organic Search Engineering, the client has become a top competitor on Search Networks, scaling performance as follows:

  • Organic traffic acquisition +241.4%
  • Organic Search Value (e-commerce Revenue) +226.36 %
  • Search market saturation +297.2 %
  • First position keywords +4840.00 %

The client met and exceeded the goals set for Organic Search as a performance channel.

Furthermore, in the last two years it moved from a starting point of 78 Organic Keywords to over 6.000 Organic Keywords. Moreover, the client moved from ranking in the top 3 positions of page 1 for 20 search terms, to ranking in the top 3 positions for 988 search terms.

Worth noticing is the fact that the highest value and volume from Organic Traffic is almost entirely based on generic queries, not branded queries. This is of extreme strategic value.

Organic Traffic Volume growth: +231.4%

Organic Search Saturation and Volume

Metric | Start point | Today
Daily Search Impressions | 7.701 | 59.799
Daily Traffic Volume | 1.285 | 6.486

[Chart: organic traffic growth over 16 months]

Organic Traffic Value growth & Ecommerce Impact: +226.36%

Organic Search Traffic e-commerce Value

Metric | Start point | Today
Organic e-commerce Revenue/mo | €24.314/mo | €79.350/mo

[Chart: organic e-commerce traffic value compared]

First page, first positions ranking Keywords +4840.00%

Ranking of pages for number of indexed keywords / search queries

Ranking bracket | Start point | Today
Ranking 1-3 | 20 | 988
Ranking 4-10 | 20 | 2.472
Ranking 11-20 | 38 | 2.619

[Charts: organic 1st-position comparison, 2021 vs 2023]

Share of Voice and Search Market Share through Technical SEO

Moved to top ranking entity, excluding Poste Italiane & Aruba

Metric | Start point | Today
Share of Voice | 0,3% | 7,3%
Traffic Share | 0,6% | 13%

In the last 2 years, isolating the share of voice and the traffic share for the 9 top competitors, the Client has achieved the largest scale-up in Average Position, growing its Search Share of Voice to 7,30% and its Traffic Share to 13%. This positions the client above all realistic competition, where the top 2 positions are occupied by “Poste” (the Italian governmental entity) and “pec.it”, Aruba’s main domain (a leading, solid player with the highest investment in the market for over 10 years).

Here we can see how the Client has scaled its Share of Voice over all the competitors, excluding the brand giants “Poste Italiane” and “Aruba” (pec.it).

Share of Voice amongst competitors

[Charts: Share of Voice growth summary; Avg Position and Share of Voice growth over competition]

Traffic Share amongst competitors

[Chart: client search traffic share over competitors]

Strategy and Approach

The first step in our workflow was to take a census of the active websites, understanding their role in the Company’s online presence and ecosystem, and segmenting the monetization websites from the informative and lead-generation ones (b2b and b2c).

The second step was the analysis of competitors’ websites and strategies, crawling, and identification of content gaps as well as content overlap. This gave us a clear map of what the competition looked like, and where the opportunities lay to rapidly scale the client’s key metrics.

This allowed us to identify which competitors we could outrank within 2 years (InfoCert, Register, Lepida, Sielte), versus which competitors were positioned beyond our historical brand positioning and would require large investments in mass media and more time to outrank (Poste Italiane and Aruba).

This helped us identify the major pain points and weak links in the chain of the competition’s tactics, such as:

  • Average Technical SEO implementation
  • Poor investment in internal and external linking
  • Weak tagging and technical analytics implementation
  • Slow and inconsistent content updates and Sitemap updates

While the client has been briefed and informed on the need to invest in significant Entity-based Topical Mapping and Content Clustering (leveraging Semantic Search Engine Content Engineering), a solid technical SEO plan, joined with a redesign of semantic page content and better linking logic, has in 2 years significantly scaled the company’s Organic Search results, as the numbers show.

In depth Data Documentation

We focused on a Technical SEO approach to offer maximum Search Engine crawlability, optimize Crawl Budget, and redesign the entire internal and external linking policies.

Once the policies were in place, we selectively optimized all the pages of the e-commerce site, including product pages, category pages, and dedicated Schema Markups and local+xml sitemaps.
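As an illustration of the kind of markup involved, a Product schema in JSON-LD looks like this (product name, description, and price are invented for illustration, not the client’s actual data):

```json
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Certified Email Mailbox (1 year)",
  "description": "Legally valid certified email (PEC) mailbox for business use.",
  "offers": {
    "@type": "Offer",
    "price": "25.00",
    "priceCurrency": "EUR",
    "availability": "https://schema.org/InStock"
  }
}
```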

Product Pages

1.72M impressions, 53K sessions

Metric | Start point | Today
Daily Impressions | 1.768 | 16.700
Daily Sessions | 165 | 786

Thanks to Semantic Optimization, technical on-page SEO (primarily a rich linking structure, link-title best practices, proper heading-tag policies, and Meta Tags) and Schema Markup, product pages have scaled to the following results:

[Chart: product pages impressions and clicks over 16 months]

Category Pages

381K impressions, 24.3K sessions

Metric | Start point | Today
Daily Impressions | 95 | 4.119
Daily Sessions | 7 | 297

The same approach has been run for Category pages, which offer the opportunity to showcase more related products, and direct traffic to better search-intent-matching pages.

[Chart: category pages impressions and clicks over 16 months]

Non Branded Searches

7.96M impressions, 190K sessions

Metric | Start point | Today
Daily Impressions | 421 | 38.685
Daily Sessions | 21 | 1.478

The vertical growth of the client’s share of search and organic traffic acquisition was scaled up especially for non-branded keywords, thus opening the gates to an enormous amount of new opportunities to turn visitors into new customers.

In this client’s specific business vertical, branded keywords rarely carry transactional purchase intent, and are more often oriented towards support needs.

[Chart: non-branded search terms over 16 months]

E-commerce Product Vertical Queries volume growth

Product Family: Digital Identity

13.7M impressions, 1.09M sessions

Metric | Start point | Today
Daily Impressions | 6.440 | 50.914
Daily Sessions | 1.153 | 5.866

The data shows the traffic volume over time for queries containing the word “SPID” and its misspellings, through the regex .(spid|speed).

[Chart: query matches for “spid”]

Product Family: Digital Signature

1.46M impressions, 67.8K sessions

Metric | Start point | Today
Daily Impressions | 446 | 5.465
Daily Sessions | 17 | 240

The data shows the traffic volume over time for queries containing the word “Firma”, through the regex .firma..

[Chart: query matches for “firma digitale”]

Product Family: Certified E Mail

4.81M impressions, 250K sessions

Metric | Start point | Today
Daily Impressions | 144 | 23.409
Daily Sessions | 0 | 1.636

The data shows the traffic volume over time for queries containing the word “PEC”, through the regex .(pec|posta|mail|pecmail)..

[Chart: query matches for “pec”]

Informational Queries growth

374K impressions, 6.35K sessions

Metric | Start point | Today
Daily Impressions | 53 | 2.373
Daily Sessions | 4 | 82

In growing the organic traffic volume, we scaled up the client’s presence and reliability for Top and Middle of the funnel queries; here is a breakdown of informational (top of the funnel) search queries, isolated by who|what|when|how|why.

[Charts: informational queries and informational query schema]

Commercial Queries growth

56.5K impressions, 1.89K sessions

Metric | Start point | Today
Daily Impressions | 22 | 211
Daily Sessions | 0 | 8

This chart shows the growth for commercial queries (Middle of the Funnel), isolated by best|top|vs|review

[Charts: commercial queries and commercial query schema]

Transactional Queries growth

1.86M impressions, 81.7K sessions

Metric | Start point | Today
Daily Impressions | 182 | 8.021
Daily Sessions | 19 | 686

This chart shows the growth for transactional queries (Bottom of the Funnel), isolated by buy|cheap|price|purchase|order

[Charts: transactional queries and transactional query schema]

Long Tail Keywords growth

362K impressions, 18.4K sessions

Metric | Start point | Today
Daily Impressions | 446 | 5.465
Daily Sessions | 17 | 240

This chart filters search queries containing more than 4 words, thanks to the following regex applied to Google Search Console: (\w+\s){4,}\w+

[Chart: 4+ word long-tail keywords]
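The intent-segmentation regexes used throughout this section can be reproduced locally on a Google Search Console query export; a sketch in Python (the function and bucket names are mine, the patterns are the ones quoted in the text):

```python
import re

# Intent buckets, using the patterns quoted in this case study.
BUCKETS = {
    "informational": re.compile(r"\b(who|what|when|how|why)\b"),
    "commercial":    re.compile(r"\b(best|top|vs|review)\b"),
    "transactional": re.compile(r"\b(buy|cheap|price|purchase|order)\b"),
    "long_tail":     re.compile(r"(\w+\s){4,}\w+"),  # more than 4 words
}

def classify(query: str) -> list[str]:
    """Return every intent bucket a query falls into."""
    q = query.lower()
    return [name for name, rx in BUCKETS.items() if rx.search(q)]

print(classify("how to buy a certified email box at the best price"))
# ['informational', 'commercial', 'transactional', 'long_tail']
```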


Total Domain Growth

53.7M impressions, 2.69M sessions

Metric | Start point | Today
Daily Impressions | 63.120 | 269.873
Daily Sessions | 2.822 | 15.987
While the rest of this Case Study focused on www.domain.it and only on the e-commerce website (while comparing against all subdomains for competitors), we have been working on several websites that are part of the client’s web ecosystem. Here’s the Search volume growth for the whole domain.
[Chart: total domain growth over 16 months]

Organic Traffic Value

€ 1.5M

Metric | Start point | Today
Monthly Organic Revenue | € 24.197 | € 68.776

The directly attributed Organic Traffic Value on the e-commerce site has exceeded expectations quickly, ramping up the average monthly organic traffic value from € 24.197 to € 68.776.

[Chart: monthly SEO traffic value compared, 2021 vs 2023]

This way, we brought Organic Traffic to represent 1/4 of the total E-commerce Revenue for the tracked period (2 years), almost matching the value of paid traffic and reaching a directly attributed value of € 1.000.718. When considering indirect conversions and mixed-channel paths, Organic Traffic has surpassed the revenue of Paid Advertising, proving SEO and Organic Conversion Engineering to be a primary and fundamental business asset for e-commerce websites.

Metric | 2 Years result
Paid Traffic Revenue | € 1.422.045
Organic Traffic Revenue | € 1.000.718
Organic Traffic Indirect Revenue | € 1.515.348

Direct Attribution (last click)

[Chart: impact of SEO on e-commerce revenue]

Organic Traffic and Paid Traffic Overlap in Conversion share

[Chart: last-click attribution and cross-channel overlap]

Assisted Conversions Organic Traffic added Value

[Chart: assisted conversions, organic traffic added value]

Conclusions

While Fuel LAB has been responsible for the client’s Digital, Technical, and Strategic Performance marketing (involving also Data Intelligence, Pay-per-Click Advertising, Social Media Advertising, and Conversion Rate Optimization), Organic Search has represented one of the most interesting challenges, proving that even a small player (in terms of digital presence) in an already established market can significantly scale over the competition and match paid traffic e-commerce revenue.

The main difference and value of Organic Search from an investment standpoint is that traffic and results keep growing over time, offering a far larger ROI when compared to Paid Traffic, even in the field of Performance Marketing.

While traditional SEO (Search Engine Optimization) is often insufficient from an organic traffic value standpoint in today’s AI-driven ranking and Search Engine evolution, the holistic approach of O.S.E. proved capable of reaching these results with no backlink or digital PR strategy.

Additional Key Metrics

KPI | Starting Point | Result | % Delta
Monthly Organic Search Value | € 24.197 | € 68.776 | 184.23%
Share of voice | 0,3 % | 7,3 % | 600.00%
Monthly Organic Search Saturation | 1,84 M | 6,04 M | 255,56%
Organic Product Pages Traffic | 3,15k | 12,8k | 306.35%
Monthly Organic Search Volume | 47k | 131k | 178.2%
Organic Non branded search | 14k | 73k | 371,23%

KPI | Starting Point | Result | % Delta
Daily Organic Search Value | € 305,16 | € 5.551 | 1720.00%
Share of voice | 0,3 % | 7,3 % | 600.00%
Daily Organic Search Saturation | 18,7k | 72,5k | 300.00%
Daily Organic Product Pages Traffic | 170 | 686 | 303.53%
Daily Organic Search Volume | 3.5k | 15,9k | 354.29%
Daily Organic Non branded search | 640 | 6.7k | 946.88%
Organic Search Engineering https://pietromingotti.com/organic-search-engineering/ Thu, 27 Apr 2023

Organic Search Engineering (OSE) is a comprehensive and innovative approach to Search Engine Marketing, developed at Fuel LAB under my guidance and design.

What is O.S.E. (Organic Search Engineering)

Organic Search Engineering (OSE) is an innovative approach to Search Engine Marketing I have refined and created at Fuel LAB®. The methodology combines technical SEO tactics and proprietary technologies, such as Conversion Engines, with a content strategy based on Semantic Entities and Topical Clustering, plus O.P.N. (Other People’s Network) Digital PR and traditional Digital PR for backlink generation. This approach enhances four key areas for the client:

Topical Authority

By leveraging Semantic Entities and Topical Clustering, OSE aims to establish the client’s website as an authoritative and reliable source of information on their topic, satisfying Google’s E-E-A-T guidelines and complying with the Helpful Content Update. Projects are developed following the I.A.D. protocol (Isolate, Amplify, Deepen), created at Fuel LAB® and focusing on these fundamental steps:

  • Isolate the website semantic entities
  • Amplify the topical authority with topical clustering
  • Deepen experience proofing with long form content + rich schema markups.
  • Powerful Social Proofing through verified reviews
[Table: Organic Search Engineering topical clustering and mapping, by Pietro Mingotti]
Part of an Organic Search Engineering Project (80 lines of content total) displaying the way we organize content clustering.

Search Share Saturation

OSE aims to increase the client’s visibility in their niche market through a content plan based on Topical Authority and whitehat SERP bombing through Conversion Engines (conversion engines are technical SEO products we have been using on SERP since 2014, which can sustain up to 4.999 indexed pages per website).

It’s important to consider not only money pages and single indexed articles or content, even if their metrics are outstanding; sooner or later, the project will lose its rank. Every experienced SEO knows there’s rarely such a thing as ranking first for a traffic-worthy keyword for more than a couple of quarters.

By amplifying the number of pages ranking for a wider coverage of main keywords and long-tail keywords, outranking competitors is easier, and the higher SERP saturation delivers a strong Brand Awareness effect. All of this contributes holistically to an increased Conversion Rate through Organic Search.

Search Market Penetration

OSE has proven to help clients expand into new markets by outranking well-established competitors with poor Topical Authority. Whether because the market is a niche one without many skilled competitors, or because of sleeping giants in the industry leveraging their brand authority while overlooking the nitty-gritty work on the SERP, OSE pushes the website’s authority beyond that of other well-positioned websites.

We have observed this in several cases, including the one in the Case Study on Organic Search Engineering we published in 2023. The Case Study featured the success of the new kid on the block in the world of Digital Transaction Management, who became an industry leader in 2 years, surpassing stakeholders’ expectations and predictions.

[Chart: traffic share and search market penetration results through OSE (Organic Search Engineering)]
Screenshot of our Organic Search Engineering Case Study, showing our project surpassing all of the competitors in top ranks, and most of the Share of Voice in search.

Increased Social Proofing and Trust

Trustpilot® is used as a CRO technique to strengthen user trust and confidence while improving Domain Rank through its Domain Authority. While Trustpilot®’s quality as a SaaS is undoubted and widely recognized, there are some technical aspects of it (SEO-wise) that not everybody is aware of.

While there are several Case Studies proving how Trustpilot® is able to enhance conversion rate with impressive numbers thanks to its wonderful widgets and other features, it often goes unmentioned that while Google reviews are not verified, Trustpilot®’s are, and Google knows it. In fact, Google officially recognizes Trustpilot’s Domain Rating, and displays rich snippets of the review stars directly in the SERP, both in organic search and in RSAs (Responsive Search Ads).

By introducing direct links and Trustpilot-owned widgets into the project, we have observed enhanced and faster ranking for commercial and transactional queries, as trust isn’t just increased in users (which is fundamental for CRO), but also in the ranking system.
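The review stars themselves travel as standard schema.org AggregateRating markup; one common shape looks like this (all values invented for illustration; Trustpilot’s widgets populate the real data for you):

```json
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Example Product",
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": "4.6",
    "reviewCount": "1873",
    "bestRating": "5"
  }
}
```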

The A.P.A. Logic

Moreover, OSE is designed to help the client get more high-quality traffic to their website, resulting in increased conversions and organic revenue. The approach is a long-term strategy that has been shown to provide sustainable results over time, based on Fuel LAB's A.P.A. Logic:

  • Assessment: Conducting an assessment of the business’s capabilities, product marketability, competitor landscape, and budget.
  • Analysis: Analyzing historical data, competitor strategies, and conducting topical and keyword analysis.
  • Planning: Developing a Traffic Funnel Strategy, Semantic Entities and Topical Mapping, Editorial and Backlink Profile Planning, Budgeting and Resource Scouting, Conversion Engine LTK Project Planning, and Analytics and Rank Tracking Planning.
  • Production: Producing pages design and creation, schema markups design and creation, on-site and on-page technical SEO, digital PR for backlink profile, analytics and rank tracking production, and dashboarding and reporting production.
  • Analysis: Analyzing results, content nurturing, new content production, isolating new topics and niches to cover further, and isolating optimizations and CRO to report.
  • Assessment: Reporting results every 3 months, measuring KPIs every 3 months, and producing a case study within 2 years.

O.S.E. is a comprehensive and effective approach to Holistic SEO that we have refined over time to help businesses establish themselves as authoritative and reliable sources of information, increase their visibility in niche markets, and drive high-quality traffic to their website with a powerful, ROI-driven approach.

To learn more about Organic Search Engineering, please visit Fuel LAB or request a meeting.

Hiding H1 for SEO Homepage https://pietromingotti.com/hiding-h1-for-seo-homepage/ Fri, 11 Nov 2022 12:13:52 +0000

Case Study: improved Organic Traffic, with no ranking deficit or penalties, when hiding the H1 on a brand's homepage.

SEO Case Study: Hiding H1 for SEO Homepage

The client is a large multinational corporation working in software and technology services, operating in the B2C, B2B, and government markets. We are working on SEO for several web properties for this group through Fuel LAB.

Meeting SEO needs and design / branding needs is sometimes challenging, especially when it comes to the brand’s homepage.

This client's homepage had as its H1 a marketing-oriented sentence that said something like "All that you need for your day-to-day life, etc." I expressed the need to have the main keyword in the H1 (old SEO, I know, but… keep reading), and the best result I achieved was editing that sentence to "All that you need for your day-to-day life, thanks to [Brand] innovation". Something like that; don't take the example as an exact match to what we actually did.

The improvement in the homepage's ranking and impressions was visible, yet this new title made me even more uncomfortable. First of all, as an SEO you need to be able to do what you have to do without compromising user experience, while actually improving it. Secondly, I'm stubborn.

seo apple website h1
Check out Apple's website. Right now they do have the word "Apple" on the homepage, because it's Black Friday promotion time; but notice, that sentence is not the H1. It's actually an H2.

I mean, this is my point: let's assume the brand was "Apple". You wouldn't expect to head over to Apple's website and stumble upon a big hero image above the fold that states <h1>Apple</h1>, right?

It would be counterproductive from every standpoint, and surely a known Brand doesn't need to put its name on its homepage. Especially a Brand that is so popular, probably one of the most popular brands around. We would all agree that Apple doesn't need an H1 on their homepage stating "Apple", especially with that whopping Domain Authority and the general ecosystem and search volume Apple has.

Right?

Well, think again.

After noticing how widespread this technique was among the major brands online, we decided to implement it on two websites.

  1. Our Client's website, which we won't disclose for NDA reasons
  2. Fuel LAB's website, the company I founded, which is a much smaller, insignificant website in the SEO ecosystem.

H1 and main keyword

Any SEO in the world knows that, despite John Mueller's misleading words on the topic, Heading Tags (and especially the H1 tag) are a necessary and important part of page structure when it comes to telling the crawler what the content of a page is about. The H1, specifically, should directly state the main topic of the page.
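As a quick sketch of that principle on an internal page (the topic and subtopics here are made-up examples, not a real client page):

```html
<!-- The H1 states the page's main topic directly;
     H2s break it down into subtopics. -->
<h1>Wireless Keyboards for Video Conferencing</h1>
<h2>Best Compact Models</h2>
<h2>Connectivity and Compatibility</h2>
```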

The play is pretty straightforward when it comes to internal pages and subpages, but what about the Homepage? Normally the homepage has no topic other than being the "homepage" of, in these cases, a business.

So, how come not only Apple, but also Logitech, Zappos and many others don't show the brand keyword on the homepage? Where is that H1 that seems nowhere to be found?

The H1 Tag is there, but it’s hidden.

Hiding the H1 Tag

The H1 tag on all of these websites is visually hidden via CSS, in some cases in a very telling way (with a class named SEO_Title and such). This allows the brands to tell the Search Engine that this is indeed the homepage of that brand's website, while not showing it to the users.

This avoids design compromises and huge copywriting headaches for the homepage, which is often a hub of high-relevancy links that easily produce sitelink extensions.

See for yourself these hidden H1 tags in all of these websites.

<h1 class="visuallyhidden">Apple</h1>
.visuallyhidden {
    position: absolute;
    clip: rect(1px, 1px, 1px, 1px);
    -webkit-clip-path: inset(0px 0px 99.9% 99.9%);
    clip-path: inset(0px 0px 99.9% 99.9%);
    overflow: hidden;
    height: 1px;
    width: 1px;
    padding: 0;
    border: 0;
 }
<h1 class="seo-pagetitle">Logitech: Mouse, tastiere, cuffie con microfono wireless e accessori per videoconferenze</h1>
.seo-pagetitle {
    position: absolute;
    clip: rect(1px,1px,1px,1px);
    clip-path: inset(0px 0px 99.9% 99.9%);
    overflow: hidden;
    height: 1px;
    width: 1px;
    padding: 0;
    border: 0;}
<h1 class="rb-z">Zappos Homepage</h1>
.rb-z {
    clip: rect(1px 1px 1px 1px);
    clip: rect(1px,1px,1px,1px);
    height: 1px;
    overflow: hidden;
    position: absolute;
    width: 1px;
}

One might think to achieve this purpose by simply setting this rule:

  • h1 { display: none; }

I suspect this would trigger a cloaking signal. Instead, aside from the telling CSS class names, there is a clear trend: all of these sites share the same way of hiding the H1 tag:

  • position is set to absolute
  • clipping is implemented
  • the size of the element is set to 1px × 1px
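Putting those shared traits together, a generic version of the pattern looks like this. The class name and H1 text are placeholders, not any specific brand's actual stylesheet:

```html
<!-- Generic version of the shared hiding pattern. -->
<style>
  /* Hidden from users, still readable by crawlers:
     absolute positioning + clipping + a 1px × 1px box.
     Crucially, not display:none. */
  .visually-hidden {
    position: absolute;
    clip: rect(1px, 1px, 1px, 1px);
    clip-path: inset(0 0 99.9% 99.9%);
    overflow: hidden;
    height: 1px;
    width: 1px;
    padding: 0;
    border: 0;
  }
</style>
<h1 class="visually-hidden">[Brand] Homepage</h1>
```

Note that the element stays in the DOM and the accessibility tree, unlike display:none, which removes it from both.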

Is hiding the H1 Tag considered cloaking?

My first thought was to consider whether this hidden-H1 technique was in some way considered cloaking. The definition of cloaking is pretty straightforward: cloaking is a black-hat practice in which the SEO serves the Search Engine different content than the content actually shown to users.

It's a serious red flag, which Search Engines will read as a signal of:

  • a potentially dangerous, compromised or hacked website,
  • hosting hazardous or prohibited content,
  • a violation of Google's rules contained in the spam updates.

Therefore cloaking is an absolutely discouraged technique, like all black-hat SEO techniques (does "black hat" even exist as a term anymore?). And yet, how come brands such as Apple use this technique on their homepage?

Is it permitted because of the exceptional Domain Rating, history, and reliability of the brand? Or is Google intelligent enough to understand that if you're doing it only on the homepage, you know exactly what you're doing, and therefore you're not using SEO practices against Google Search Central's terms and conditions?

Is hiding other pages content considered cloaking?

Hiding heading tags to deliberately show users different content than the content we are actively suggesting to the Search Engine is definitely considered cloaking and will negatively affect your rankings.

Here is Google's "Cloak of Visibility" research, conducted with North Carolina State University and published in 2016 by Luca Invernizzi, Kurt Thomas, Alexandros Kapravelos, Oxana Comanescu, Jean-Michel Picod, and Elie Bursztein.

With all these technologies in place, you are guaranteed to receive a penalty as soon as Google realizes what you're trying to achieve.

Results of the SEO Experiment

The main objective of this experiment was to avoid having drops in rankings, CTR and impressions, while implementing the correct heading tag for the homepage.

Not only did the website not receive any sort of penalty, but while the average number of impressions for the site's homepage was stable, CTR and traffic volume scaled significantly and have stayed stable ever since (with a positive delta due to further SEO work in the months after September 2022).

Google Search Console homepage and brand name queries results after hiding h1 for SEO homepage
Google Search Console CTR and Clicks ramped after the deployment of hidden h1 on homepage, for branded searches on the homepage.
Google Search Console homepage and non branded queries results after hiding h1 for SEO homepage
The same phenomenon is observable in the non-branded searches that led traffic to the homepage; CTR and click volume grew exponentially.

So, this means that:

  • Hiding the H1 on your homepage, where the H1 is the sitename, doesn’t harm your SEO.
  • Providing a correct H1 tag for your homepage is capable of improving your organic traffic also for non-branded terms.

Final thoughts:

Despite what Google likes to tell the public, Heading tags and traditional technical SEO are still fundamental for strategic ranking. When John Mueller, as a spokesman for Google, tells you that you don't need heading tags (or the H1, for that matter) and that they are simply useful, what he is really saying is:

“Here at Google we are investing immense amounts of money in AI, and we don't want SEOs trying to manipulate rankings by sending technical signals to our machines; we want our machines to freely index and rank things on their own, so just don't worry about it.”

While on one hand I absolutely support the need to let Machine Learning and AI do their work, because that is what the future of digital and search is going to be all about, that doesn't mean you don't need Tech SEO. Actually, you need it even more.

Providing the correct information and context to the machines that concur to determine ranking signals, and ultimately SEO performance, is even more critical now than it was before, because we have less freedom in affecting how AI interprets our content and its value.

Specifically, H1 tags are very important in determining how the crawler understands the topic of a page, and consequently your Heading Tag strategy and the relevance of the links and content within the paragraphs under your heading tags, especially on your homepage.
