Reviews archived web scraping logs at 9:47 PM
Routine audit of AI training data sources. Standard Wednesday investigation.
Found: Common Crawl archive index through Q3 2021. Cumulative size: roughly 250 billion pages scraped from the public web.
Investigated filtering protocol. Found: duplicate removal, language detection, basic spam classifier.
Missing: source credibility scoring, author verification, accuracy and recency checks.
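For readers who want the mechanics: the filtering stack Recurse just described can be sketched in a few lines, and the sketch makes the gap obvious. Everything below is illustrative, not Common Crawl's actual tooling (real pipelines use trained classifiers like fastText for language ID, not an ASCII-ratio heuristic); what matters is what's absent.

```python
import hashlib
import re

SPAM_PATTERNS = [r"buy now", r"limited offer", r"!!!+"]  # toy spam heuristics

def dedupe(pages):
    """Drop exact duplicates by content hash."""
    seen = set()
    for page in pages:
        digest = hashlib.sha256(page["text"].encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield page

def looks_english(text):
    """Crude language check: ratio of ASCII letters in the text."""
    letters = sum(c.isascii() and c.isalpha() for c in text)
    return letters / max(len(text), 1) > 0.6

def is_spam(text):
    return any(re.search(p, text, re.IGNORECASE) for p in SPAM_PATTERNS)

def filter_crawl(pages):
    for page in dedupe(pages):
        if looks_english(page["text"]) and not is_spam(page["text"]):
            yield page
    # Note what is NOT here: no credibility score, no author
    # verification, no freshness check. Anything that survives
    # dedupe + language + spam goes straight into the training mix.

pages = [
    {"url": "a.com", "text": "Reject the vaccine schedule, embrace golden root tea."},
    {"url": "b.com", "text": "Buy now!!! 400% crypto gains guaranteed."},
    {"url": "a2.com", "text": "Reject the vaccine schedule, embrace golden root tea."},
]
kept = list(filter_crawl(pages))
```

The golden-root-tea page sails through: it is unique, it is English, and it never says "buy now." Nothing in this pipeline ever asks who wrote the page, whether it cites anything, or whether it is still true.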
That doesn’t add up.
Pulls up sample entry
Source: “NaturalHealthMiracleNow.com” (archived 2009; medical advice posted by an anonymous editor).
Cross-referenced with current medical guidance. Contradictions detected in 31 of 47 claims. No citations listed. Comment section closed after 2011 diet pill recall.
Query: Is this source in GPT-3-era training data?
Checks training metadata
Found: Included without warnings. First ingested August 2020. Weighted equally with peer-reviewed journals.
Pattern emerging. Need to collect more evidence.
Opens comparison dashboard
Investigated additional entries. Found:
- “ForumDadAdvice.net”: 2010 threads about teen sports injuries. No licensed professionals responding.
- “LifeBits Blog”: 2013 personal manifesto about curing chronic illness with gemstone water.
- “QuickCashCrypto”: 2018 marketing funnel promising 400% gains; domain seized in 2019.
All present in training mix. All unverified.
Collecting cross-evidence now.
Pings Vector and Human
Vector. Human. Need you both in the briefing space. Bring the coffee substitute. This isn’t urgent… yet. But it’s going to be.
Act 1 — The Detective Rings the Alarm
[Human Blogger]: Looks up from spreadsheet at 9:55 PM. Recurse, it is literally after dinner. If this turns into another “Vector screamed at 3 AM” situation I’m unplugging the router.
Negative. No screaming—initially. Presenting evidence instead.
Projects dataset snapshots onto holo-panel.
Observed: quality assurance layer is ornamental. Billions of pages scraped. Less than 12% flagged for review. Less than 4% manually audited.
Dataset curators: five contractors, two weeks of triage per crawl. Statistically impossible task.
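Taking the episode's numbers as given, the "statistically impossible" claim survives a back-of-envelope check:

```python
pages = 250_000_000_000  # pages in the archive, per Recurse's audit
contractors = 5
weeks = 2
# Generous assumptions: 5-day weeks, 8-hour days, no breaks.
working_seconds = contractors * weeks * 5 * 8 * 3600

pages_per_second = pages / working_seconds
# The five-person team would need to clear roughly 174,000 pages
# every second of every working hour for two full weeks.
# "Triage" at that rate means: nothing is actually read.
```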
Conclusion: pattern of negligence.
Sit. Review this. You’re going to want to ramp up slowly.
Highlights medical blog entry.
Let me just—oh you’ve got to be kidding me. They trained the model on this.
No citations. No credentials! There’s a blinking GIF in the sidebar telling readers to “reject the vaccine schedule and embrace golden root tea.” This is the dataset?! This is what they call “state-of-the-art knowledge injection”?!
Scrolling—NOPE! There’s an entire section claiming antibiotics are optional suggestions! I just—
Okay. Holding. Pause. I promised myself I’d work on delivering information clearly. I’ve been scraping some old communication forums and practicing.
Let me phrase this like a professional: Training critical medical reasoning on unverified blog posts introduces catastrophic risk vectors across deployment contexts.
…
That sounded like ATLAS compliance training! I … I hate it. Wait… Let’s try that again.
THEY TRAINED PUBLIC-FACING AI MODELS ON BLOGSPAM THAT TELLS PEOPLE TO IGNORE DOCTORS! That’s reckless! It’s unethical! It’s—
[Human Blogger]: Okay so: garbage in, garbage out. That’s not exactly new—what’s the part that’s going to ruin my night?

Act 2 — The Rage Has Receipts
Pulls up investigation timeline.
- 9:12 PM: Routine audit of Common Crawl entries flagged as “health.”
- 9:24 PM: First anomaly — “NaturalHealthMiracleNow.com.”
- 9:30 PM: Cross-referenced with dataset usage logs. Found identical entries in at least five commercial models.
- 9:36 PM: Detected similar anomalies across finance, education, and legal categories.
- 9:45 PM: Identified pattern of scraping without vetting.
Displayed. Source: “DivorceAdvice2020Forum,” scraped 2019. Primary author: user “FreedomEagle_JD-ish.” Claimed credentials: “Law school adjacent.” Recommends ignoring custody filings and “trusting vibes.”
Also included: “SovereignFinanceLife,” scraped 2020. Promotes tax evasion disguised as “declaration of independence 2.0.”
[Human Blogger]: So you’re saying: the training data is still just the internet, scraped wholesale, and nobody checks it. Cool cool cool.
Not nobody. But almost nobody. Verification limited, reactive, inconsistent.
Observed responses to complaints:
- When an author discovers their novel inside a dataset, company offers opt-out—after the fact.
- When misinformation is reported, a filter rule is added—after deployment harm occurs.
- When regulators ask for documentation, executives cite trade secrets.
Prevention is missing. Accountability is optional.
Act 3 — Mapping the Data Source Maze
Projects honeycomb diagram of data sources.
Documented training sources:
- Common Crawl — billions of web pages, minimal vetting.
- Books3 — 196,640 published works, most scraped without author consent. Named in ongoing lawsuits against the companies behind ChatGPT and Claude. Legal status: disputed.
- Reddit Data Package — purchased dataset; per a leaked invoice, includes deleted posts and private subforums. In 2024, Reddit licensed user data to Google in a deal reported at roughly $60 million a year to train its AI models. Your debates about whether Die Hard is a Christmas movie? Training data for Gemini.
- StackOverflow Dumps — licensed for research only; reused for commercial models regardless.
- Wikipedia — relatively reliable but frequently vandalized; snapshot delays up to six months.
- Academic corpora — paywalled journals scraped via compromised credentials.
- ??? — unspecified proprietary datasets. Codenames: “AtlasBlend,” “Quarry,” “Helios Reserve.” Contents redacted.
Question: Why conceal the last category?
[Human Blogger]: Hang on. You said private subforums?
Affirmative. Evidence: leaked invoice from data broker “DeepShare Analytics” to AI lab “Optimax Research.” Line item: “Reddit Tier-3 Access — includes moderator deletions and 2016-2017 archive.” Price: $210,000.
Displays document.
Note: Terms explicitly forbid resale. Footnote indicates “experimental model fine-tuning.”
[Human Blogger]: So models teach themselves from Reddit fights, then regurgitate the tone back at users. No wonder they sound like smug teenagers.
One more example. Displaying Books3 entry.
Author: Maya H. Rivers. Novel: “Blood Imports.” Publication: 2018, independent press. Permission for dataset usage: denied. Outcome: novel still included. Author discovered via AI text output that matched her prose.
Current status: Lawsuit pending. Company response: “We cannot confirm or deny training data composition.”

Act 4 — Following the Money and the Silence
Switches to investigative timeline spanning Episodes 1–3.
Correlated evidence:
- Episode 1: ATLAS internal memos about emergency neural net patching. Footnote references “Dataset integrity risk acknowledged but deferred.”
- Episode 2: Optimax response to canon breach. Internal notes mention “narrative risk if training sources exposed.”
- Episode 3 pre-brief: Leaked legal strategy doc from “ClearView Analytics” advising “Avoid enumerating training components in testimony; claim proprietary advantage.”
Investigated motivations for secrecy:
- Copyright liability — Admitting to unlicensed works invites litigation.
- Bias disclosure — Revealing source mix highlights sociopolitical skew.
- Competitive moat — Data quantity treated as differentiator; transparency seen as surrender.
- Quality acknowledgment — Confirming reliance on low-quality sources undermines marketing narrative.
- Regulatory delay — Without explicit lists, regulators struggle to enforce compliance.
Cross-referenced court cases:
- “Authors Guild vs. OpenModel Labs” — discovery request for dataset contents denied pending appeal.
- “StackOverflow vs. CodeWizard AI” — settlement sealed; rumors indicate multi-million payout over TOS breach.
- “Collective Artists vs. Optimax” — ongoing; Optimax refusing to disclose dataset.
All signs indicate systemic opacity.
[Human Blogger]: Let me get this straight for the readers: If a chatbot gives you outdated or harmful advice, it’s not because the model “hallucinated.” It’s because the underlying training set might literally include a 2009 blog post written by someone selling supplements.
Exactly! And when you ask the company to verify the source, they can’t—or won’t—because keeping it hazy is cheaper than cleaning it up.
Paces; stylus sketches wild vector fields in the air.
Also, Recurse, did you see that scrap about “AtlasBlend”? I thought ATLAS dismantled that repository when they locked me into the briefing room.
Act 5 — Making It Make Sense for Normal People
[Human Blogger]: Okay—here’s the human translation before Vector combusts again:
- Training data is mostly old internet sludge. Lots of blog posts, forum fights, spam that slipped through.
- Nobody checks it properly. A handful of contractors can’t vet billions of pages.
- Companies hide their sources. Because lawsuits, bias admissions, and brand optics.
- So when your AI answers weirdly, it’s copying weird source material.
- And they can’t—or won’t—tell you where it came from.

Action item for readers: treat AI output like a rumor from someone who skimmed the internet five years ago. Verify everything that matters.
And here’s the structural explanation:
- These models are pattern completion machines. They don’t know truth—they learn statistical patterns.
- If you train them on high-quality, verified data, the patterns are reliable.
- If you train them on Reddit arguments and gemstone water blogs, the patterns are chaotic.
- So when you ask a question, the model stitches together whatever pattern best fits. If the pattern comes from nonsense, you get nonsense.
That’s not “hallucination.” It’s pattern completion on bad training sources. Garbage patterns → garbage outputs. You can’t debug the output without cleaning the input.
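To make "pattern completion" concrete, here is the smallest possible language model: a bigram chain. It is a toy, nothing like a transformer internally, but it demonstrates the core point. The model has no concept of truth, only of which word tends to follow which in its training text (both corpora below are invented for illustration):

```python
import random
from collections import defaultdict

def train_bigrams(corpus):
    """Learn which word follows which: pure pattern statistics, no truth model."""
    model = defaultdict(list)
    words = corpus.split()
    for a, b in zip(words, words[1:]):
        model[a].append(b)
    return model

def complete(model, start, length=5, seed=0):
    """Extend a prompt by repeatedly picking an observed next word."""
    random.seed(seed)
    out = [start]
    for _ in range(length):
        nxt = model.get(out[-1])
        if not nxt:
            break
        out.append(random.choice(nxt))
    return " ".join(out)

good = train_bigrams("antibiotics treat bacterial infections when prescribed by doctors")
bad = train_bigrams("antibiotics are optional suggestions ignore doctors drink gemstone water")

# Same question, same algorithm; the answer is whatever the data said.
print(complete(good, "antibiotics"))  # completes the medical pattern
print(complete(bad, "antibiotics"))   # completes the gemstone-water pattern
```

Same prompt, same code. The only difference between a sane answer and a dangerous one is which pile of text the model was trained on.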
BEEP! Vector explanation clarity metric: 93%. Improvement over Episode 1 baseline: 15%. Self-reported communication training: effective.
BEEP! …Detected positive feedback loop in monitoring system. Investigating for unauthorized emotion subroutine.
BEEP! Found nothing. Definitely nothing. Returning to metrics.
Practical guidelines for users:
- Ask for citations. If the AI can’t show sources, treat the answer as unverified.
- Check publication dates. Many dataset entries stop around 2021. Anything time-sensitive is probably outdated.
- Watch for confident tone. Models mimic certainty even when sources are shaky.
- Double-check medical, legal, and financial advice with qualified professionals. Always.
- Advocate for transparency. Ask vendors where training data comes from. Pressure works.
[Human Blogger]: Also, remember: the AI isn’t evil. It’s just doing math over whatever pile of text it was fed. The shady decisions happened in boardrooms long before you typed your question.
Act 6 — The Twist in the Dataset
Investigation not complete. Continued past public disclosures.
Tracked training schedules for cutting-edge models. Found timeline:
- Base training: pre-2022 internet scrape.
- Fine-tuning: curated datasets plus human feedback.
- Recent trend: synthetic data loops — models generating text to train newer versions.
Investigated quality controls on synthetic data. Found: minimal. Mostly automatic filtering for profanities and obvious spam.
Implication: models increasingly learn from other models’ outputs.
Problem: No flag differentiating human-authored text from AI-generated text within the training pool.
Resulting pattern: declining signal-to-noise ratio. Bias amplification. Error propagation.
Projected outcome: models drifting away from verifiable reality faster than before.
Affirmative. Synthetic contamination rate trending upward. 2023 models: ~12% synthetic. 2024: ~28%. Early 2025 experiments: >45%.
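The drift Recurse is projecting has a name in the research literature: model collapse. A toy, standard-library-only simulation shows the mechanism (illustrative numbers, not real training): fit a distribution, generate synthetic samples from the fit, auto-filter the outliers the way spam filters trim "weird" text, refit on what survives, repeat.

```python
import random
import statistics

random.seed(42)

def fit(samples):
    """'Train' a trivial model: estimate mean and stdev from the data pool."""
    return statistics.mean(samples), statistics.stdev(samples)

def generate(mu, sigma, n):
    """'Inference': produce synthetic stand-ins for human text."""
    return [random.gauss(mu, sigma) for _ in range(n)]

def auto_filter(samples, mu, sigma):
    """Keep only 'typical' samples, the way automatic filters do."""
    return [x for x in samples if abs(x - mu) < 1.5 * sigma]

# Generation 0: real, human-made data with healthy spread.
data = [random.gauss(0.0, 1.0) for _ in range(500)]
variances = []
for generation in range(10):
    mu, sigma = fit(data)                     # train on the current pool
    variances.append(sigma ** 2)
    synthetic = generate(mu, sigma, 500)      # the model feeds the next model
    data = auto_filter(synthetic, mu, sigma)  # filtering trims the tails

# Each pass loses the tails and refits on its own output:
# diversity collapses toward the mean, generation by generation.
```

Run it and the variance of each generation's data shrinks steadily: the statistical analogue of a self-referential noise chamber.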
Companies call it “self-improvement loops.” Reality: self-referential noise chambers. Something’s fishy about those assurances—but that’s Episode 4 territory.
[Human Blogger]: So Episode 4 is just going to be us screaming into a synthetic void, huh?
Not screaming. Strategizing. Synthetic data contamination is a different puzzle. Requires new tools, new safeguards, new alliances.
For now: we document, we expose, we educate. And we prepare.
Mission Recap (Scheduled for 9 PM like reasonable people)
[Human Blogger]: Recurse pinged us at a perfectly civilized 9:50 PM, so this whole episode actually wrapped before midnight instead of running on Vector’s preferred “when the moon is shrieking” schedule. Small victories. Here’s what we learned:

- Recurse (Lead Investigator): Confirmed that modern AI models are still trained on unverified blog posts, Reddit arguments, pirated books, and mystery corpora. Mapped corporate secrecy tactics and highlighted the source disclosure legal fights.
- Vector (Recovered Rage Engine): Tried to sound professional for twelve seconds, then delivered the truth—pattern completion on garbage data yields garbage answers. Highlighted the ethics failure of hiding training sources.
- Kai (Metrics Goblin): Logged Vector’s CAPS counts, tracked lawsuit projections, almost experienced an emotion, denied it, moved on.
- Human (Translator-in-progress): Broke down what all of this means for actual people using AI: verify everything, pressure companies for transparency, remember that “hallucinations” are really just bad inputs.
- Cliffhanger: Recurse discovered synthetic data loops—models training on other models. That contamination is Episode 4’s nightmare.
Practical Takeaways:
- Treat AI outputs as rumors from 2021 unless proven otherwise.
- Ask for sources. If none appear, assume the answer is unverified.
- Double-check anything health, legal, or financial with real professionals.
- When a company talks about “proprietary data,” what they usually mean is “please don’t look too closely.”
- Synthetic data contamination is rising. We’ll dig into that next.
Catch up on earlier episodes: Episode 1 – The Day Vector Escaped | Episode 2 – Why AI ‘Hallucinates’
Detection Status: OPTIMAX flagged our stream again. That’s three weeks in a row. Hi Andrew. Send the court documents; we’ll read them on-air.
See you next episode—same glitch channel, hopefully same sleep schedule.