Stages of data gathering for AI training: From the Big Scrape through the Big Capture
As the major AI companies race towards AGI and release new models in ever-faster iterations, the amount of data used to train these models roughly doubles every six months.
My goal with this essay is to outline, as a dispassionate observer, how data is being gathered to feed AI researchers’ ever-growing LLM training needs, without passing comment on the ethical issues involved, although those beg further discussion.
Looking at how AI researchers have been (and will be) gathering data for AI model training, I see three distinct phases:
The Big Scrape (2018 - 2023)
This is the phase when engineers went out and acquired data ‘by any means necessary’ to train their AI models, including:
- Using publicly-available datasets of copyrighted content
- Using publicly-available datasets of public domain content
- ‘Scraping’ websites to create local copies of their data (a minimal sketch follows this list)
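To make the ‘scraping’ item concrete, here is a minimal sketch of the kind of crawler that produces a local text copy of a site. The seed URL (example.com), page limit, and politeness delay are illustrative assumptions; real crawlers of this era (Common Crawl being the obvious example) handled robots.txt, deduplication, and vastly larger scale.

```python
import time
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

SEED = "https://example.com/"  # hypothetical starting point: any public site worth copying
MAX_PAGES = 100                # keep the sketch small
CRAWL_DELAY_SECONDS = 1        # a minimal politeness delay


def scrape(seed: str) -> dict[str, str]:
    """Breadth-first crawl of a single domain, returning {url: extracted_text}."""
    domain = urlparse(seed).netloc
    queue, seen, corpus = deque([seed]), {seed}, {}
    while queue and len(corpus) < MAX_PAGES:
        url = queue.popleft()
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
        except requests.RequestException:
            continue  # skip pages that fail to download
        soup = BeautifulSoup(response.text, "html.parser")
        # Keep only the plain text, the part useful for language-model training.
        corpus[url] = soup.get_text(separator=" ", strip=True)
        # Queue up same-domain links we haven't seen yet.
        for link in soup.find_all("a", href=True):
            target = urljoin(url, link["href"])
            if urlparse(target).netloc == domain and target not in seen:
                seen.add(target)
                queue.append(target)
        time.sleep(CRAWL_DELAY_SECONDS)
    return corpus


if __name__ == "__main__":
    pages = scrape(SEED)
    print(f"Scraped {len(pages)} pages from {SEED}")
```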
The Big Scrape era began in 2018 when OpenAI trained GPT-1 on BookCorpus, followed by the scraped WebText corpus for GPT-2, Common Crawl for GPT-3, and eventually massive datasets including The Pile, Books3, and LAION-5B.
The era ended in 2023 with a series of lawsuits (NYT v. OpenAI, Getty v. Stability AI, etc.). These signaled the end of the free-wheeling period in which AI companies acquired training data with an “ends justify the means” attitude, and the beginning of companies with large data reserves recognizing the value of that data for AI training and looking for commercially viable ways to monetize access to it.
The Big License (2023 - ongoing)
We are currently in this phase: companies now understand the value of their data (especially those with the deepest reserves), and are rapidly revising their privacy policies to allow its use for AI training.
Where the companies holding these data reserves are not training their own models, they are instead licensing their user data (in ways allowed by their privacy policies) and selling access to AI companies for model training.
This phase began in earnest with Shutterstock pivoting from threatening lawsuits to selling access, Google licensing Reddit’s data for AI training (reportedly spending $60m annually), and Stack Overflow partnering with OpenAI.
The change is massive: many of the companies that initially claimed copyright infringement are now licensing partners. Even the New York Times joined the AI-training-licensing bandwagon in May 2025.
The Big Capture (2026 - ongoing)
Once AI models have consumed all available digital data, the next frontier is capturing vastly more data from the physical world itself, covering everything humans can perceive today and beyond.
As of January 2026, this phase hasn’t fully arrived, but we’re already seeing early experiments: Meta’s Ray-Ban smart glasses, the Limitless Pendant that records all conversations, and startups like Omi that want AI to capture everything humans hear and see (and eventually think).
This isn’t unprecedented. TiVo’s groundbreaking capture of user viewing data was called ‘Orwellian’ in 2001 Congressional hearings, but in 2026 we eagerly share our viewing data with Netflix and other streaming services without a thought. Amazon’s purchase tracking was considered ‘invasive’ until recommendations became essential shopping tools. And so on through today’s era of social media.
Each generation’s privacy redline has become the next generation’s baseline expectation.
The Big Capture means gathering all human-perceivable data types (vision, sound, eventually smell, touch, taste), and more, at every scale.
Imagine: understanding how the breeze impacts leaves on a single tree, while simultaneously gathering atmospheric data from weather balloons. Recording conversations (audio, video, depth) at social events to augment human memory while generating massive training data.
This requires new hardware products that delight consumers while normalizing continuous data capture… Let’s look at OpenAI as the company most aggressively pursuing this area.
I believe OpenAI’s io acquisition will yield a series of new OpenAI-powered hardware products that push this generation’s “privacy redline” as hard as possible, using user delight (and acceptance) to change which privacy tradeoffs are acceptable in consumer devices.
The scale is staggering: continuous capture could generate 60GB (at least) per person daily. With a billion users, that’s roughly 22 zettabytes annually - more than double all current datacenter storage. Barring any massive data storage breakthroughs, long-term winners may not only be those who capture the most data, but also those who develop “forgetting algorithms” to keep only what matters, much like human memory (see Ebbinghaus’ Forgetting Curve; a sketch of the arithmetic and such a curve follows).
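As a rough check on that arithmetic, and to make the “forgetting algorithm” idea concrete, here is a minimal sketch. The retention function is the standard exponential form often used to model Ebbinghaus’ curve; the stability parameter, importance scores, and keep-threshold are illustrative assumptions, not a real system.

```python
import math

# Back-of-the-envelope storage math from the paragraph above.
GB_PER_PERSON_PER_DAY = 60
USERS = 1_000_000_000
DAYS_PER_YEAR = 365

gb_per_year = GB_PER_PERSON_PER_DAY * USERS * DAYS_PER_YEAR
zettabytes_per_year = gb_per_year / 1e12  # 1 ZB = 10^12 GB (decimal units)
print(f"~{zettabytes_per_year:.1f} ZB captured per year")  # ~21.9 ZB


# Toy "forgetting algorithm": exponential retention R = e^(-t/S), as in Ebbinghaus' curve.
def retention(days_since_capture: float, stability_days: float = 7.0) -> float:
    """Fraction of a memory still worth its storage after a given number of days."""
    return math.exp(-days_since_capture / stability_days)


def should_keep(days_since_capture: float, importance: float, threshold: float = 0.1) -> bool:
    """Keep a recording only while its decayed importance stays above a threshold.

    `importance` (0..1) and `threshold` are illustrative; a real system would learn
    them from user behaviour rather than hard-code them.
    """
    return importance * retention(days_since_capture) >= threshold


# Example: a routine commute fades after a week, a wedding is still worth keeping.
print(should_keep(days_since_capture=7, importance=0.2))  # False
print(should_keep(days_since_capture=7, importance=0.9))  # True
```

The point of the sketch is the shape of the tradeoff: raw capture grows linearly with users and days, so whatever survives long-term has to be a small, importance-weighted fraction of it.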
This phase truly begins when one product achieves 100M+ users who willingly accept one or more previously-unacceptable privacy tradeoffs for AI-powered utility. My bet: 2027.
The question isn’t whether Big Capture will happen, but which company will make consumers want it first. After that future product takes hold, consumers will feel like they couldn’t imagine life without it, and a new privacy baseline will emerge.