Inside LLMs: How Pre‑Training Shapes What ChatGPT Knows

Last updated on August 12th, 2025 at 01:02 pm

Updated August 2025 for GPT-5 / Gemini 2.5 Pro

TL;DR

  • Pre-training teaches LLMs to predict the next word, not to “understand” meaning.
  • Models are trained on trillions of tokens from sources like Common Crawl, Wikipedia, GitHub, and books.
  • Training uses loss functions and gradient descent across hundreds of billions of parameters, costing millions of dollars.
  • Once pre-training ends, model weights are frozen: LLMs cannot update themselves without retrieval or fine-tuning.
  • Marketers must focus on content visibility in training sets and retrieval systems to be “seen” by LLMs.

The foundation of any Large Language Model (LLM) lies in a process called pre-training.

This is where the model learns how language works by processing an immense volume of human-generated text. Pre-training is self-supervised, non-interactive, and results in a static model: it defines what the model “knows”, and more importantly, what it doesn’t.

Pre-training teaches the model how language works, not by reading for meaning, but by guessing the next word over and over until it becomes really good at it.

Think of it as a hyper-fast version of the autocomplete on your phone keyboard, trained on most of the internet.


This article is an extract from the full 100-page independent research paper I’ve written for Fuel LAB® Research, the result of 2 years of analysis, LLM study, and data collection.

What is Pre-Training in LLMs for?

At its core, pre-training is based on a single, elegant objective:

Given a sequence of tokens, predict the next token.

This task is called causal language modeling or autoregressive token prediction. The model is shown fragments of text from short phrases to full paragraphs, and learns to estimate the probability distribution of the next word (or sub-word token) based on all preceding context.

But what is a token?

A token is a chunk of content (in this case, text): often a word, part of a word, or even a punctuation mark, which the model processes as a single unit.

For example, the sentence “ChatGPT is smart!” might be split into tokens like:

JavaScript
["Chat", "G", "PT", " is", " smart", "!"]

Tokens are the atomic units of meaning for the model. They don’t align perfectly with words or syllables, but they let the model work with language in a flexible and compressed form.

We’ll keep referring to tokens throughout this paper, so just remember: a token is the model’s version of a word. Sometimes bigger, sometimes smaller.
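To make this concrete, here is a toy greedy longest-match tokenizer over a hand-picked vocabulary. This is a sketch only: real models learn their vocabularies from data with algorithms like Byte-Pair Encoding (BPE), and the `vocab` list below is invented for illustration.

```javascript
// Illustrative only: a greedy longest-match tokenizer over a tiny
// hand-picked vocabulary. Real tokenizers learn their vocabularies
// with algorithms like Byte-Pair Encoding (BPE).
const vocab = ["Chat", "G", "PT", " is", " smart", "!", " "];

function tokenize(text) {
  const tokens = [];
  let i = 0;
  while (i < text.length) {
    // Find the longest vocabulary entry that matches at position i.
    let best = null;
    for (const piece of vocab) {
      if (text.startsWith(piece, i) && (!best || piece.length > best.length)) {
        best = piece;
      }
    }
    if (!best) best = text[i]; // fall back to a single character
    tokens.push(best);
    i += best.length;
  }
  return tokens;
}

console.log(tokenize("ChatGPT is smart!"));
// → ["Chat", "G", "PT", " is", " smart", "!"]
```

Note how “ is” wins over the single space because the tokenizer always prefers the longest match: longer, more frequent chunks compress the text into fewer tokens.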

Put into practice:

JavaScript
Input: "The capital of France is"
Output: ["Paris" = 0.91, "London" = 0.02, "Berlin" = 0.01, ...]

The numbers you see here are token probabilities. The model assigns high probability to the word Paris, and lower probabilities to other possible continuations like London or Berlin.

It doesn’t know Paris is the capital. It has just learned that, in most texts, that phrase tends to be followed by Paris.
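The probability table above can be sketched as a tiny greedy decoder. This is purely illustrative: the table is hand-made, while a real model computes these distributions from billions of learned weights.

```javascript
// Illustrative only: greedy decoding over a hand-made probability table.
// A real model computes these probabilities with billions of parameters.
const nextTokenProbs = {
  "The capital of France is": { " Paris": 0.91, " London": 0.02, " Berlin": 0.01 },
};

function greedyNext(context) {
  const probs = nextTokenProbs[context];
  // Pick the continuation with the highest probability.
  return Object.entries(probs).reduce((best, cur) =>
    cur[1] > best[1] ? cur : best
  )[0];
}

console.log(greedyNext("The capital of France is")); // → " Paris"
```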

This deceptively simple task trains the model to master everything from grammar and syntax to world knowledge, and through fine-tuning and reinforcement learning we get reasoning patterns, analogies, and seemingly even emotion. Not because it understands them, but because predicting text requires modeling all the latent structure of language.

The model doesn’t “understand” what comes next. It just gets really good at predicting it, similarly to how your phone guesses the next word in your text message.

Training Datasets: Scale and Source

To learn these patterns effectively, LLMs are trained on massive datasets often containing hundreds of billions to trillions of tokens.

These datasets are compiled from public web content, licensed databases, books, code repositories, academic papers, and forum-like discussions from platforms like Reddit and Quora.

Common sources include:

  • Common Crawl, which includes billions of pages scraped from across the web
  • Wikipedia, offering clean, structured factual content
  • BooksCorpus and Project Gutenberg, providing long-form literary and narrative language
  • GitHub, used for training code-centric models like Codex
  • ArXiv, PubMed, Stack Overflow, and Reddit, contributing community Q&A, technical writing, and scientific discourse
  • Synthetic data, increasingly mixed into training sets as of 2025

GPT-3, for instance, was trained on ~300 billion tokens.

GPT-4 and Claude 3 are reported to have exceeded 1–2 trillion training tokens, and models like Gemini 2.5 Pro offer context windows of up to 1 million tokens, allowing them to “read entire books” in a single prompt at inference time.

Importantly, companies like OpenAI, Anthropic, and Google have stopped publishing full dataset lists. However, leaked reports and research suggest a continued reliance on large-scale mixtures of Common Crawl, Wikipedia, code, academic papers, and publicly available Q&A sources.

The training goal is to expose the model to diverse styles, subjects, domains, and discourse forms, so it becomes a generalist capable of generating coherent output on almost any topic.

The application goal is instead to simulate human-like interaction and produce answers that feel satisfying to users. This is also where things get ethically complex.

The model is not trained to say “I don’t know,” because in truth, it never knows.

Instead, it will often prefer to confidently generate a wrong answer rather than admit uncertainty.

Optimization: Loss Functions and Gradient Descent

To adjust its internal behavior during training, the model needs a way to measure whether it’s making good predictions. This is where concepts like tokens, weights, and loss come into play.

Don’t worry, I’ll explain each of these in more depth later. For now, here’s what you need to know.

In optimization, every prediction made by the model during pre-training is compared against the actual next token. The difference between the model’s prediction and the correct token is measured using a loss function, typically cross-entropy loss.

The loss function measures how well the LLM predicts the next token. Based on the loss score, the model’s weights are nudged slightly up or down, and they keep changing until the predictions can no longer be improved. This process pushes the model to reproduce exactly the tokens found in the training data, so that it fully absorbs the data’s pattern distribution.
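As a sketch of the idea, cross-entropy loss for a single prediction is just the negative log of the probability the model assigned to the correct token. The probabilities below are made up for illustration:

```javascript
// Illustrative only: cross-entropy loss for one prediction.
// Loss = -log(probability assigned to the correct token), so confident
// correct predictions cost little and confident mistakes cost a lot.
function crossEntropy(probs, correctToken) {
  return -Math.log(probs[correctToken]);
}

const probs = { Paris: 0.91, London: 0.02, Berlin: 0.01 };
console.log(crossEntropy(probs, "Paris").toFixed(3));  // low loss (~0.094)
console.log(crossEntropy(probs, "London").toFixed(3)); // high loss (~3.912)
```

If the training text actually continues with “London”, the high loss tells the optimizer to shift probability mass toward it on the next update.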

Over millions of training steps, the model gradually adjusts its internal weights using gradient-based optimization (most often AdamW or LAMB, instead of classical SGD) to reduce error and build a statistical model of how language behaves.

The process is computationally enormous:

  • Hundreds to thousands of GPUs or TPUs operating for weeks or months
  • Optimizing hundreds of billions of parameters
  • Training costs exceeding millions of dollars per model (but sure, people just want a free PDF with a quick optimization guide and call it a day… or a drawing on a LinkedIn Post)

Yet the outcome is a model that has internalized linguistic priors, not by memorizing text, but by learning which types of language structures are more likely to follow others.

Put simply:

Each time the model guesses a word wrong, it gets a score telling it how far off it was. Then it nudges itself to get better, millions of times over.
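That “nudging” can be sketched with a one-weight toy example. This is an illustration only: the quadratic loss, the single weight, and plain gradient descent stand in for the real setup of billions of weights updated by optimizers like AdamW.

```javascript
// Illustrative only: gradient descent on a single toy "weight".
// Real training adjusts hundreds of billions of weights per step.
function loss(w) {
  return (w - 3) ** 2; // toy loss, minimized at w = 3
}

function gradient(w) {
  return 2 * (w - 3); // derivative of the toy loss
}

let w = 0; // start far from the minimum
const learningRate = 0.1;
for (let step = 0; step < 100; step++) {
  w -= learningRate * gradient(w); // nudge the weight downhill
}
console.log(w.toFixed(4)); // ≈ 3.0000, the loss minimum
```

Each step moves the weight a small fraction of the way toward lower loss; repeated millions of times across all weights, this is the whole training loop.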

Frozen Models vs Modular Systems

Once pretraining is complete, the result is a frozen foundation model: a large, self-contained statistical structure that cannot update itself with new facts or content.

However, many LLMs in production today (e.g., ChatGPT, Claude, Gemini) are no longer “just” frozen models. They are modular systems that include:

  • Retrieval components, which pull fresh content from search engines or internal knowledge bases
  • Tool use capabilities, such as calculators or code interpreters
  • Memory modules, which can store user-specific information across sessions

As a result, newer LLMs can “appear” up to date, but only if they were:

  1. Connected to a live retrieval API (e.g., Bing in ChatGPT),
  2. Fed the content through a long-context input (e.g., a full PDF upload),
  3. Or explicitly retrained or fine-tuned on new datasets (rare, expensive).
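A minimal sketch of how such a retrieval layer wraps a frozen model. Here `retrieve`, `buildPrompt`, and the tiny document index are hypothetical stand-ins, not any real provider’s API:

```javascript
// Illustrative only: a retrieval layer wrapping a frozen base model.
// `retrieve` and `buildPrompt` are hypothetical stand-ins, not a real API.
function retrieve(query, index) {
  // A real system would query a search engine or vector database here.
  return index.filter((doc) => doc.toLowerCase().includes(query.toLowerCase()));
}

function buildPrompt(query, index) {
  const docs = retrieve(query, index);
  // Fresh documents are injected into the prompt at inference time;
  // the model's frozen weights never change.
  return `Context:\n${docs.join("\n")}\n\nQuestion: ${query}`;
}

const index = ["Acme Corp launched its new product in 2025."];
console.log(buildPrompt("Acme", index));
```

The key point for marketers: the fresh fact lives only in the prompt of that one session, never in the model itself.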

For marketers, this means: you are still invisible to the base model unless your content enters a retrieval pipeline or fine-tuning dataset.

There is no pinging, no reindexing, no real-time discoverability unless retrieval is on. This is truly important to understand, as it means:

✓ The base model is the same for everyone.
✖ Content retrieved in a chat session applies to that session only.


Summary

Element | Description (2025)
Model Goal | Predict next token (causal modeling)
Method | Autoregression with attention
Training Data | Enormous web corpora + curated datasets
Optimization | Cross-entropy loss + AdamW / LAMB
Scale | 100B–1T+ parameters; trillions of tokens
Limitations | No self-updating; no model memory of web crawling
Extensions | Some models support tools, memory, and live search
Outcome | A statistical engine that simulates understanding efficiently

FAQs

What is pre-training in large language models?

Pre-training is the first stage of training an LLM, where the model processes massive datasets to learn statistical patterns of language. It does not give the model factual knowledge or real understanding, but it teaches it to predict the next word in a sequence.

Which datasets are used in LLM pre-training?

LLMs are typically trained on a mix of large-scale public and licensed datasets, including Common Crawl, Wikipedia, BooksCorpus, Project Gutenberg, GitHub code, ArXiv research papers, PubMed articles, and Reddit discussions.

How do LLMs actually learn during pre-training?

They learn by predicting the next token (a word or word fragment) in a text sequence. Through repeated predictions across trillions of tokens, the model adjusts its weights using loss functions and gradient descent, improving its ability to generate coherent language.

What is the difference between pre-training and fine-tuning?

Pre-training builds the general language foundation of the model. Fine-tuning and reinforcement learning (RLHF or RLAIF) are later stages that adapt the pre-trained model to follow instructions, align with human preferences, or specialize in certain tasks.

Why can’t LLMs update themselves after pre-training?

Once pre-training is complete, the model’s weights are frozen. The system cannot update its knowledge automatically like a search engine. Any updates require fine-tuning, retraining, or connecting the model to retrieval systems that bring in external, live data.

Why does pre-training matter for SEO and AI visibility?

Understanding pre-training helps marketers see why you can’t “rank” in ChatGPT like on Google. Since models rely on static data and probability, visibility strategies must focus on making content clear, structured, and authoritative enough to be included in training or cited via retrieval.
