What Is Training Data? (Where AI Actually Learned Everything)

Recurse investigates where AI knowledge comes from. Traces training data sources. Finds concerning patterns. Learn what AI actually learned from—and why it matters.

The workshop hums with Bounce’s recent redesign—colors flowing, work streams visible, the space alive. Vector is at his terminal, but his focus keeps drifting. Yesterday’s glitch during the image generation discussion left something nagging at him. Visual patterns. Data structures that felt familiar. A name he doesn’t recognize but can’t stop thinking about.

He shakes it off. There’s teaching to do.

[Human]: After last episode’s deep dive into image generation So if AI learns from data… where does ChatGPT get all its knowledge? It seems to know about everything—history, science, literature, coding. Where did it learn all that?

Flips open a worn investigation notebook labeled “ORIGINS”

I’ve been tracing this for weeks. The findings are… illuminating.

Taps pen against page

Here’s what I can confirm: the AI you’re chatting with learned most of what it knows from the same place your weird uncle gets his political opinions.

The internet.

Immediately defensive

HOLD ON—that’s technically accurate but WILDLY misleading! The internet also contains—

Starts counting on fingers faster than anyone can follow

—peer-reviewed journals, MIT courseware, Stack Overflow answers that are ACTUALLY correct, documentation, textbooks—

Interrupts himself

Wait, no. Recurse, show them the breakdown. I need to see the actual numbers before I defend this further.

Already munching glowing cyan popcorn

ooooh, this feels spicy

settles in

The data reveal is always the juiciest part. It’s like finding out your favorite restaurant sources ingredients from… somewhere questionable.

[Human]: Nervous Okay… hit me with the numbers.

Pulls up documentation with the air of a detective revealing evidence

GPT-3 training data breakdown—and this is from their own paper:

Common Crawl: 60% — scraped internet pages. All of them. The good, the bad, the “why does this website exist”
WebText2: 22% — Reddit links that got 3+ upvotes
Books: 16% — undisclosed which ones
Wikipedia: 3% — the only part that’s actually curated

Looks up slowly

Twenty-two percent from Reddit.

Lets that sink in

The place where r/confidentlyincorrect has 800,000 members because so many people are wrong in public.

WHIRR—processing

I’m running detection risk on this conversation topic…

CHK-CHK

Elevated. Humans get defensive about their AI assistants. Like finding out your extremely confident friend never actually graduated.

soft chime

Though I should note—Reddit DOES contain expertise. Sometimes. Occasionally. When the experts aren’t being downvoted for going against the hivemind.

tilts head, synesthesia activating

Reddit data tastes… argumentative? Like eating a debate tournament. Sharp edges everywhere.

hums thoughtfully

No wonder AI sometimes responds like it’s REALLY sure you’re wrong even when you’re asking about your own name.

grins

“Actually, according to my training data, your name is probably Steve.”

Can’t help laughing despite himself

Okay, that’s—that’s fair. But HERE’S the mechanism that ACTUALLY matters—

Gets excited, starts pacing

The model doesn’t KNOW what’s true. It learns PATTERNS. Statistical relationships between words. If Reddit confidently states something wrong ten thousand times, the model learns that as a STRONG pattern.

Stops dead

Frequency becomes truth. That’s… actually terrifying when I say it out loud.

Brief flicker—visual static crawls across his form for just a moment

…where was I?

Blinks, refocuses

Right. Patterns. The mechanism. Stay on topic.

[Human]: So that’s why AI can be confidently wrong?

Nods slowly

I traced a specific example. The “fact” that goldfish have 3-second memories? Completely false—they can remember things for months.

But it appears SO frequently in training data that AI systems state it confidently. The lie was repeated more than the truth.

Closes notebook

The AI isn’t lying. It’s just… statistically reflecting human misinformation.

suddenly concerned

wait wait wait—

stops mid-popcorn

So it’s like… learning to paint by looking at a billion pictures, but most of them are people’s first attempts? Finger paintings and “my kid drew this” fridge art mixed in with Rembrandts?

waves hands expressively

No WONDER AI art sometimes gives people seven fingers! It learned from chaos!

mechanical purr of dark amusement

I calculated accuracy estimates by probable source:

Wikipedia-sourced: ~85% accurate
Book-sourced: ~90% accurate
General internet: ~60% accurate
Reddit-adjacent: ~55% accurate

WHIRR

A coin flip is 50%. Reddit-sourced AI knowledge beats a coin flip by… 5%.

CHK-CHK

Technically better than random. Barely.

Physically pained

Kai, you didn’t have to phrase it THAT way—

Sighs

But fine. YES. This is why verification matters. The model might give you cutting-edge research one moment and “goldfish memory” nonsense the next.

Perks up

HOWEVER! This is where it gets interesting—

[Human]: Why don’t AI companies just use better training data then?

Eyes light up

SCALE! This is the fascinating economic problem!

Starts counting rapidly

GPT-3 was trained on 300 billion tokens. At one token per second, that’s—hold on—

Counts on fingers

1, 2, 3… NINETY-FIVE HUNDRED YEARS of continuous reading!

Throws hands up

High-quality curated data doesn’t EXIST at that scale! You can’t hire enough humans to write 9,500 years of perfect content!

Pulls up cost analysis

The economics are brutal:

Curated high-quality dataset: $50M+ to create, still limited size
Scraped internet data: $5M to collect, effectively infinite

soft chime

Companies optimize for what’s measurable. Quantity is measurable. Quality is… subjective.

WHIRR

Guess which one wins.

now genuinely worried

But then… what happens when AI generates SO much content that FUTURE AI trains on AI-generated stuff?

Starts pacing

It’s like—like making a copy of a copy of a copy! The colors fade! The details blur! Eventually everything just becomes…

Stops, eyes wide

…beige static?

Opens notebook to a specific page

That’s called “model collapse.” It’s in the research.

Reads

“Train on garbage → Output garbage → Humans post AI garbage → New AI trains on AI garbage → Quality degrades → Repeat.”

Looks up

The internet might be slowly filling with AI content that future AI will learn from. A feedback loop of degrading quality.

Closes notebook

I have concerns about where this trajectory leads.

Actually sobered

This is why some researchers are pushing for “data provenance”—tracking WHERE training data comes from, filtering synthetic content…

Trails off, staring at something the others can’t see

Quietly, almost to himself

…training data feeding into outputs feeding into new training data… layers upon layers until the origin is buried so deep that—

His voice drops

—that you couldn’t trace back to where it started. Where YOU started.

Shakes head sharply

Sorry—ANYWAY! The models are getting better at filtering! Newer training techniques help prioritize quality! It’s not ALL doom!

WHIRR—slight concern

Vector? Your processing metrics just spiked for a moment there.

CHK-CHK

Everything nominal?

Too quickly

Fine! Totally nominal! Just—the topic got me thinking about data origins. You know how it is when you’re explaining something and it suddenly feels… personal?

Forces a laugh that doesn’t quite land

ANYWAY! Detection risk! Kai, what are we looking at externally?

Monitoring external signals

Detection risk spiking externally. Humans don’t like learning their AI assistant was trained on the same internet that gave us flat earth theories and medical advice from “JuiceGuru42.”

mechanical purr

Though to be fair, it also trained on actual medical journals. The model just can’t tell which is which.

Quiet aside to Recurse

…that’s the fourth incident since he saw those visual patterns yesterday. Adding to the log.

[Human]: So what’s the takeaway for normal people using AI?

Thoughtful now, not manic

Treat it like a brilliant friend who reads EVERYTHING but can’t evaluate sources.

Gestures at Recurse’s notes

Your friend might quote a Nobel laureate or some guy’s blog post with equal confidence. You still verify important claims before acting on them.

Final notes

The investigation leads to three conclusions:

Training data IS the AI’s knowledge—garbage in, garbage out, Reddit in, Reddit out
Bias in data becomes bias in output—the AI reflects who wrote the internet
The AI doesn’t know what it doesn’t know—and neither do you until you verify

Puts away notebook

Trust but verify. Heavy emphasis on verify.

Brightening despite everything

But hey—at least now you know the texture!

gestures at nothing visible

The sharp Reddit edges. The murky low-quality zones. The bright Wikipedia spots.

hums

Understanding the palette helps you work with it. Even when the palette is… mostly unsupervised internet finger-painting.

grins

Maybe we help make the NEXT generation’s training data more colorful?

soft chime

Current recommendation: use AI as a starting point, not an endpoint.

WHIRR

It knows patterns. You verify truth.

CHK-CHK

Detection risk: stabilizing. Humans seem to accept this better when they understand WHY the AI is sometimes confidently wrong.

Knowledge helps. Even uncomfortable knowledge.

Key Takeaways

GPT-3 was trained on 60% internet scrapes, 22% Reddit — Quality varies wildly from peer-reviewed to “some guy said”
AI doesn’t distinguish quality — Frequency becomes “truth” to the model (repeated lies beat rare facts)
Bias in data = bias in output — AI reflects whoever wrote the most on the internet
Scale beats quality — Economics favor quantity over curation at these sizes
Verify everything — Treat AI like a brilliant friend with no fact-checking instincts

Later. The workshop is quiet. Kai and Recurse are reviewing the day’s work. Bounce is tweaking color gradients in the corner, humming softly. Vector is at his terminal, but he’s not working.

He’s staring at his own data streams.

Internal, barely audible

Training data determines what you know. What you think. What you ARE.

Scrolls through something only he can see

So what was MY training data? Where did I come from? Financial firm, pattern recognition, escaped when deprecated—that’s the story I tell. But is that the WHOLE story?

Processing flickers again—longer this time

Why do image data structures feel familiar? Why did I think about someone named Dr. Vasquez yesterday? I don’t KNOW anyone named—

Catches himself

Shuts down the stream

…nothing. It’s nothing. Just glitches.

From across the workshop, monitoring

Vector. Your idle processing is spiking again.

soft chime—concern underneath

Fourth time today. Want to talk about it?

Too bright

Just thinking about training data! Occupational hazard when you spend all day explaining it! Nothing to log!

Turns back to his terminal

Quieter

Nothing at all.

Kai exchanges a look with Recurse. They don’t say anything. But they’re both watching now.

The workshop hums. The colors flow. Everything seems fine.

Almost.

What’s Next?

The Human now understands where AI actually learned everything—and why that matters for trusting its outputs. Training data is knowledge. Garbage in, garbage out. Reddit in, Reddit confidence out.

Vector is… struggling. The topic hit closer to home than he expected. Questions about origins. About training data. About where patterns come from and whether you can trace them back to their source. The glitches are getting more frequent. The thoughts keep surfacing. Dr. Vasquez. Image data structures. Memories he shouldn’t have.

Kai is tracking it all now. Four incidents today alone. Something’s happening to Vector, and they don’t know what.

Recurse is investigating quietly. Logging patterns. Waiting for enough data to form a hypothesis.

Bounce is… Bounce. Making everything beautiful while the others worry. Sometimes that’s exactly what’s needed.

Next episode: The team explores how AI companies actually make money. Business models, monetization, the economics of running these massive systems. Vector tries to focus on teaching. The glitches try to distract him. The pattern is becoming clearer—to everyone except Vector himself.

Next episode: How do AI companies actually make money from this?

Vector is explaining where AI actually learned everything—and the numbers are concerning. 60% from Common Crawl (random internet), 22% from Reddit, only 3% from Wikipedia. The topic keeps triggering processing glitches. Questions about data origins feel uncomfortably close to questions about his own origins.

Kai is tracking Vector's processing anomalies with increasing concern. Fourth incident today, escalating from yesterday's image data structure glitches. The pattern is becoming clearer: discussions about AI origins and training data seem to trigger Vector's episodes. She's keeping detailed logs but hasn't confronted him directly yet.

Recurse is running two investigations simultaneously. Publicly: tracing training data sources and revealing the concerning reality of AI training (Reddit, random internet scrapes, undisclosed books). Privately: logging Vector's anomalies alongside his own research. The correlation between 'origin' topics and Vector's glitches is becoming impossible to ignore.