Does Anyone Care that AI Bots are Stealing Everyone’s Data?
As artificial intelligence continues to reshape our digital landscape, website owners, content creators, and businesses are grappling with a fundamental question: Who controls how your online content is used to train AI systems? The answer, unfortunately, is more complicated than you might expect.
The Current Legal Landscape: A Patchwork of Uncertainty
The legality of AI data scraping exists in a legal gray area that courts are still working to define. While traditional web scraping for search indexing has generally been accepted under fair use principles, AI training presents new challenges that existing law struggles to address.
However, a recent landmark ruling has provided the first major judicial guidance on this issue. In June 2025, Federal Judge William Alsup ruled in favor of Anthropic in a copyright lawsuit brought by authors Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson, finding that AI training on copyrighted books constitutes “fair use.” Commentators described the decision as “a significant win for AI developers,” one that established that “the use of copyrighted works to train generative AI models was fair use—even when those works were obtained from unauthorized piracy websites.”
The inconsistency and chaos in current data rights enforcement is perhaps best illustrated by a recent ironic twist: in July 2025, Google was discovered indexing thousands of user-shared ChatGPT conversations, making those AI chats searchable on the web. OpenAI quickly removed the sharing feature after privacy concerns emerged. The irony is striking—while website owners struggle to prevent AI companies from freely scraping their content for training, here we had Google indexing content generated by AI systems, and suddenly OpenAI became protective of “their” content. This incident perfectly demonstrates the selective and inconsistent approach to data rights that currently governs our digital landscape, where the same companies that freely appropriate others’ content become defensive when their own AI outputs are harvested.
What We Now Know:
- At least one federal court has ruled that AI training on copyrighted material constitutes fair use
- The Computer Fraud and Abuse Act (CFAA) may apply to unauthorized scraping, but enforcement is inconsistent
- Terms of service can provide some protection, but their effectiveness varies
- While Judge Alsup’s ruling is not binding precedent, it may well prove persuasive to other courts, possibly laying a foundation for siding with trillion-dollar tech companies over creatives
What Remains Unclear:
- Whether other courts will follow this ruling
- How to balance public benefit claims against creators’ rights in future cases
- What constitutes “reasonable” technical measures to prevent scraping
- The scope of fair use protection for different types of AI training
The Promise and Problem of Emerging “Standards”
In response to these legal uncertainties, various organizations and companies have proposed technical standards to give website owners more control over AI scraping. However, these proposals highlight a critical gap between intention and implementation.
AI.txt: The Proposed Solution
The ai.txt concept aims to extend the familiar robots.txt protocol specifically for AI systems. Unlike robots.txt, which simply says “crawl” or “don’t crawl,” ai.txt would theoretically allow more granular control:
- Separate permissions for AI training versus AI inference
- Specific licensing terms for AI use
- Detailed attribution requirements
- Commercial versus non-commercial use distinctions
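Because ai.txt is still only a proposal, there is no settled syntax. A hypothetical file following the spirit of the bullets above might look like this (every directive name here is illustrative, not part of any adopted standard):

```text
# Hypothetical ai.txt — a sketch of the proposed concept; no crawler is
# currently obligated to read or honor any of these directives
User-agent: *
AI-Training: disallow          # content may not be used to train models
AI-Inference: allow            # retrieval/answering at query time permitted
Attribution: required          # any AI output must credit and link the source
Commercial-Use: disallow       # non-commercial use only
License: CC-BY-NC-4.0
Contact: licensing@example.com
```

Until a standard like this gains adoption and enforcement, such a file functions as a statement of intent rather than a technical control.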
llms.txt and Google’s Reality Check
Another proposed standard is llms.txt, which aims to help websites communicate with AI systems about how their content should be used. However, Google recently delivered a reality check that underscores the voluntary nature of these standards.
At a recent Google Search event, Google explicitly stated that it won’t be crawling llms.txt files and that “normal SEO works for ranking in AI Overviews.” This announcement effectively dismissed one of the proposed AI content standards, making it clear that Google will continue using its existing crawling and indexing methods rather than adopting new AI-specific protocols.
This demonstrates a fundamental problem with these emerging standards: they only work if major players actually adopt them. When Google—arguably the most influential company in web content discovery—publicly states it won’t use a proposed standard, that standard’s effectiveness is severely undermined.
The Reality Check: Standards Without Enforcement
Here’s the problem that website owners need to understand: these “standards” are entirely voluntary and largely unenforceable. Google’s recent announcement that it will ignore llms.txt perfectly illustrates the challenge.
Even the well-established robots.txt protocol is merely a suggestion. While most legitimate search engines respect it, there’s no legal requirement to do so. The situation with AI-specific standards is even more precarious:
- No Universal Adoption: Unlike robots.txt, which has decades of industry acceptance, AI-specific standards are fragmented and competing
- No Legal Backing: These standards carry no legal weight beyond what copyright and contract law already provide
- Easy to Ignore: Bad actors can simply disregard these files with no meaningful consequences—as Google has explicitly done with llms.txt
- Retroactive Problem: Much AI training has already occurred using previously scraped data
Practical Steps for Website Owners
Given this uncertain landscape, what should website owners actually do to protect their content?
Is Cloudflare our John Connor?
Before diving into traditional protective measures, it’s worth highlighting a major recent development: Cloudflare has become perhaps the most powerful tool in a website owner’s anti-scraping arsenal.
Cloudflare now offers a “brand new one-click to block all AI bots” that’s “available for all customers, including those on the free tier”. Even more significantly, starting in July 2025, Cloudflare began “blocking artificial intelligence crawlers from accessing content without website owners’ permission or compensation by default” for all new domains.
This is huge because Cloudflare’s network sits in front of roughly 20 percent of the web, and the move is widely seen as a win for the publishing industry. More than one million customers have chosen this option since it was introduced.
What makes Cloudflare’s approach particularly sophisticated is its granular control: clients can now “allow or disallow crawling for each stage of the AI life cycle (in particular, training, fine-tuning, and inference) and white-list specific verified crawlers. Clients can also set a rate for how much it will cost AI bots to crawl their website”.
The company has even launched a “Pay Per Crawl” marketplace where content creators can “charge AI crawlers for access to their content”. This represents a fundamental shift from free, unrestricted scraping to a permission-and-compensation model.
Traditional Protective Measures
Immediate Actions
1. Consider Cloudflare Protection If you’re not already using Cloudflare, it may now be worth the switch purely for AI bot protection. Their free tier includes AI scraper blocking, and their granular controls are more sophisticated than anything you can implement manually.
2. Update Your Robots.txt While not foolproof, blocking known AI crawlers in robots.txt remains your first line of defense for non-Cloudflare users:
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
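These directives are honored only voluntarily, but you can at least verify that they say what you intend before deploying them. Here is a minimal sketch using Python’s standard-library robots.txt parser; the domain and URLs are placeholders:

```python
from urllib.robotparser import RobotFileParser

# Mirrors the robots.txt entries suggested above; example.com is a placeholder.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# The listed AI crawlers are blocked everywhere on the site...
print(rp.can_fetch("GPTBot", "https://example.com/article"))     # → False
print(rp.can_fetch("CCBot", "https://example.com/"))             # → False
# ...while agents not listed (e.g., an ordinary search crawler) remain allowed.
print(rp.can_fetch("Googlebot", "https://example.com/article"))  # → True
```

A check like this catches typos in user-agent names or rule paths, though it obviously cannot make a non-compliant bot obey the file.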
3. Strengthen Your Terms of Service Explicitly prohibit AI training use of your content. While enforcement may be challenging, clear terms provide a stronger legal foundation.
4. Consider Technical Measures Rate limiting, user-agent blocking, and other technical barriers can deter casual scraping, though determined actors may circumvent them.
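As one example of user-agent blocking, the refusal can be enforced at the web-server level rather than merely requested via robots.txt. A minimal nginx sketch (the agent list is illustrative and needs updating as new crawlers appear; bots can spoof their user agent, so treat this as a deterrent, not a guarantee):

```text
# Hypothetical nginx config: refuse requests from known AI crawler user agents.
# The map block belongs in the http context.
map $http_user_agent $is_ai_bot {
    default              0;
    "~*GPTBot"           1;
    "~*Google-Extended"  1;
    "~*CCBot"            1;
    "~*anthropic-ai"     1;
}

server {
    listen 80;
    server_name example.com;

    if ($is_ai_bot) {
        return 403;   # forbidden for matched AI crawlers
    }

    # Rate limiting would be configured separately via limit_req_zone/limit_req.
    location / {
        root /var/www/html;
    }
}
```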
Long-term Considerations
Document Everything: Keep records of your protective measures, terms of service, and technical implementations. This documentation may prove valuable in future legal proceedings.
Monitor Your Content: Consider using services that can detect when your content appears in AI training datasets or outputs.
Stay Informed: This area of law is evolving rapidly. What’s uncertain today may be settled law tomorrow.
The Need for Legislative Action: Protecting Content Creators from Corporate Overreach
The current patchwork of voluntary standards, inconsistent court rulings, and corporate foot-dragging makes one thing clear: we need comprehensive legislation to protect content creators, website owners, and businesses from unchecked AI data harvesting.
The recent Anthropic ruling, while providing some clarity, actually demonstrates the problem. A single federal judge’s interpretation of decades-old fair use doctrine shouldn’t determine whether entire industries can freely appropriate the creative work of millions without consent or compensation. We need clear, modern legislation that addresses the unique challenges of AI training data acquisition.
The Federal vs. State Approach
While federal legislation would provide the most comprehensive one-and-done solution, the current administration has shown no interest in protecting us from our automated overlords. Because of this, individual states can and should lead the charge. History shows us that when large progressive states like California implement strong digital privacy protections, companies often find it easier to comply nationwide rather than maintain different standards for different jurisdictions.
This state-led approach recently survived a serious threat. The ironically named “One Big Beautiful Bill,” as originally advanced by Trump, included a provision that would have prevented individual states from passing AI regulations for 10 years (I guess it’s only good to leave it up to the states sometimes?). This corporate-friendly moratorium would have left content creators defenseless against AI scraping while the federal government remained paralyzed by partisan gridlock.
Thankfully, the proposal to prohibit states from regulating artificial intelligence for a decade was soundly defeated in the U.S. Senate. In a glimmer of hope for humanity, this victory preserves states’ ability to protect their residents and businesses from AI overreach.
The fact that this prohibition was rejected in a blow to Big Tech shows there’s growing recognition that we can’t simply let AI companies self-regulate while they systematically harvest the work product of entire industries.
The Bigger Picture: Where We’re Headed
The tension between AI development and content creators’ rights is far from resolved, despite the recent Anthropic ruling providing some initial clarity. The removal of the AI regulation moratorium from the “Big Beautiful Bill” opens the door for much-needed state-level action. We’re likely to see:
- State-Led Innovation: With federal preemption off the table, states like California, New York, and others can now develop comprehensive AI data protection laws
- More Court Decisions: Other federal courts will need to weigh in on whether they’ll follow the Anthropic precedent or chart different courses
- Corporate Compliance Pressure: Once major states implement strong protections, companies may find it easier to comply nationwide
- Legislative Action: Congress may eventually provide clearer guidance on AI training and copyright, particularly in response to creator advocacy
- Appeal Process: The Anthropic decision may be appealed, and higher courts could overturn or refine the fair use analysis
- International Complications: Different jurisdictions are taking varying approaches, complicating global compliance
Recommendations for Website Owners
So what’s a website owner to do? Below are some options worth considering, depending on your situation and use case:
- Consider Cloudflare: Their AI blocking tools are currently the most comprehensive and effective protection available
- Don’t Rely Solely on Technical Standards: AI.txt and similar proposals are helpful but insufficient protection
- Layer Defenses: Combine Cloudflare protection, technical measures, clear terms of service, and proper documentation
- Understand the Limitations: Even Cloudflare’s protection isn’t foolproof against determined scrapers using novel techniques
- Support Legislative Action: Contact your state representatives about the need for AI data protection laws
- Consult Legal Counsel: If your content has significant commercial value, consider personalized legal strategies
- Stay Engaged: This area of law will continue evolving, and staying informed is crucial
The removal of the federal AI regulation moratorium means states can now act. Website owners, content creators, and business owners should make their voices heard in state capitals before AI companies consolidate their current advantage into permanent legal precedent.
Conclusion
The promise of AI.txt and similar standards reflects a genuine need for better tools to control how our digital content is used. However, website owners should understand that these emerging standards are more aspirational than protective at this point.
The legal landscape around AI data scraping remains unsettled, with courts, legislatures, and industry players still working to establish clear boundaries. The recent Anthropic ruling and the defeat of the federal AI regulation moratorium in the “Big Beautiful Bill” show that this area is rapidly evolving—sometimes in favor of AI companies, sometimes preserving opportunities for stronger protections.
While we’re not quite at Skynet levels of AI autonomy (yet), the speed of AI development and the lack of meaningful oversight should give us all pause. Every news cycle brings new AI capabilities, and with each advancement, the window for establishing proper legal frameworks and ethical boundaries grows smaller. We don’t need AI systems to become self-aware to cause massive harm to creators, writers, and content owners—they’re already systematically harvesting the work product of entire industries without consent or compensation.
While technical standards may eventually provide meaningful protection, today’s website owners need to rely on a combination of traditional legal protections, technical measures, realistic expectations about their limitations, and active advocacy for stronger legal protections.
The digital content we create has value, and we deserve to have a say in how it’s used. But protecting that content in the age of AI requires more than just uploading an ai.txt file—it requires a comprehensive approach that acknowledges both the potential and the limitations of our current legal and technical tools, combined with active engagement in the legislative process.
The window for meaningful state-level action is now open. The question is whether content creators and website owners will seize this opportunity or allow AI companies to continue operating in a legal vacuum that heavily favors their interests over ours. Because if we don’t act soon, we may find ourselves living in a world where human creativity is reduced to mere training data for machines—and by then, resistance may be futile.