Cloudflare blocks scraping bots by default on pages with ads

TL;DR: Cloudflare will, by default, block crawlers that combine search indexing with AI training on pages with ads. The measure, effective from September 2026, aims to give publishers control and foster a fairer ecosystem. Crawler operators such as Google, Microsoft (Bing), and Apple must adapt or face blocks.

What happened?

Cloudflare has announced significant changes to its platform to protect website content from scraping bots, especially those used to train artificial intelligence models. Starting September 15, 2026, new customers and new sites of existing customers will get a default configuration that allows crawling for search but blocks AI training and AI agents on pages with ads. This configuration will also apply to free plan customers who have not modified their settings. The decision comes at a time when, according to The Register, most internet traffic is already non-human, and publishers have been struggling to control the use of their content by AI companies. Cloudflare, which handles approximately 20% of global web traffic, positions itself as an arbiter in this conflict.

Additionally, Cloudflare is renaming its 'Pay Per Crawl' service to 'Pay Per Use' and partnering with Ceramic.ai and You.com so that publishers receive compensation when their content generates value, not just when it is crawled. It is also launching a new Business Insights Dashboard that provides granular visibility into bot consumption, including which bots access a site, how often, and what data they extract. This dashboard aligns with the 'data sovereignty' trend that has gained traction since the GDPR took effect in 2018 and with recent EU AI regulations.

Why is this important?

This move matters because it addresses one of the biggest current conflicts on the web: the uncompensated use of editorial content to train AI models. Until now, publishers were caught between allowing Googlebot crawling (necessary to appear in search results) and preventing their data from being used for AI training. Cloudflare offers a technical solution that separates the two uses, giving publishers granular control. The decision builds on existing robots.txt tokens such as Google-Extended (announced in September 2023) and Applebot-Extended, which let publishers opt out of AI training. However, robots.txt is voluntary and many bots ignore it, so Cloudflare aims to enforce blocking at the network level.
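To make the robots.txt opt-out concrete, here is a minimal sketch in Python using the standard library's `urllib.robotparser`. The robots.txt content, the `allowed_agents` helper, and the example URL are assumptions made for illustration; they are not taken from Cloudflare's announcement. The parser treats Google-Extended and Applebot-Extended as ordinary user-agent names, which is enough to show how a publisher can allow search crawling while opting out of AI training.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt out of the AI-training tokens,
# keep everything else (including search crawlers) allowed.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt: str, url: str, agents: list[str]) -> dict[str, bool]:
    """Return, for each agent name, whether robots_txt permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ROBOTS_TXT,
    "https://example.com/article",
    ["Googlebot", "Google-Extended", "Applebot", "Applebot-Extended"],
)
for agent, allowed in result.items():
    print(f"{agent}: {'allowed' if allowed else 'disallowed'}")
```

A check like this only describes what the file asks for. It does nothing to stop a crawler that ignores robots.txt, which is the gap network-level enforcement is meant to close.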

The decision directly affects major search engines: Google, Microsoft Bing, and Apple, whose crawlers (Googlebot, Bingbot, and Applebot) have mixed uses. If they do not adhere to exclusion directives, they could see their access to content with ads blocked. According to The Register, Googlebot combines crawling for search and data collection for AI training, and publishers have tolerated this for fear of disappearing from search results. Bingbot and Applebot have similar behaviors. The implementation date, September 2026, gives these giants time to adjust their crawlers, but if they do not, they could lose visibility on a significant portion of the web.

Consequences for the digital ecosystem

For publishers: Greater control over their content and potential revenue from data licensing. However, they could lose search traffic if crawlers cannot access pages with ads. The new analytics dashboard will allow them to monitor bot behavior and make informed decisions. The partnership with Ceramic.ai and You.com offers a compensation model based on generated value, similar to content licensing agreements that News Corp and Axel Springer have signed with OpenAI.
For AI companies: Less free access to training data, which could slow model development or increase costs if they opt to pay for content. This could accelerate consolidation in the AI market, where only companies with large budgets can access high-quality data. It could also incentivize the use of synthetic data or improvements in unsupervised learning techniques.
For search engines: Google, Bing, and Apple must adapt their crawlers to comply with the new rules or risk being blocked on sites with ads. Historically, Google has resisted changes affecting its crawling ability, though it already honors publisher controls such as the 'noindex' directive. The three companies are likely to follow a similar path here, separating their search and AI bots. Apple, with its focus on privacy, may be more receptive.
For users: Possible improvement in content quality if publishers can monetize it better, but also possible reduction in search coverage if bots cannot access certain pages. However, search engines might prioritize ad-free content, leading to a cleaner but less diverse web. Additionally, users may see more results from sites that do not use Cloudflare or allow full crawling.

What should readers know?

If you are a website owner, you should review your Cloudflare settings starting in September 2026. If your pages carry ads, training bots will be blocked by default, but you can adjust this manually. The new analytics dashboard will help you better understand bot traffic, showing which bots access your site and how often. You can also explore the 'Pay Per Use' option to earn revenue from your content's use in AI, though this is in early stages with specific partners.
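For site owners who want a rough picture of bot traffic from their own logs, the following Python sketch counts requests per crawler based on the User-Agent header. It is an illustration only, not a description of how Cloudflare's dashboard works. The `BOT_TOKENS` list, the `count_bot_hits` helper, and the sample user-agent strings are assumptions, and User-Agent values can be spoofed, so treat the counts as approximate.

```python
from collections import Counter

# Illustrative crawler tokens to look for; extend this list as needed.
BOT_TOKENS = ("Googlebot", "bingbot", "Applebot", "GPTBot")


def count_bot_hits(user_agents):
    """Count requests per known crawler token (case-insensitive substring match)."""
    counts = Counter()
    for ua in user_agents:
        lowered = ua.lower()
        for token in BOT_TOKENS:
            if token.lower() in lowered:
                counts[token] += 1
                break  # attribute each request to at most one crawler
    return counts


# Hypothetical User-Agent values, one per logged request.
SAMPLE = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
]

hits = count_bot_hits(SAMPLE)
for token, n in hits.most_common():
    print(f"{token}: {n}")
```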

For internet users, this change could affect the availability of certain sites in search results, especially those relying on ad revenue. However, it could also foster a fairer ecosystem where content creators are compensated for the use of their data in AI. More broadly, Cloudflare's measure could set a precedent for other web infrastructure providers (like Akamai or Fastly) to implement similar controls, accelerating the transition toward a more regulated web regarding data use.

"Now that most internet traffic is non-human, we must go further and act faster to allow a sustainable ecosystem to emerge," said Matthew Prince, CEO of Cloudflare.

This statement reflects the urgency of a problem that has grown rapidly since the launch of ChatGPT in 2022. Cloudflare's decision not only protects publishers but also sends a clear signal to the industry: the era of free scraping for AI is coming to an end.
