Cloudflare has introduced new tools that let website owners control whether AI bots are allowed to access their content for model training.
First, customers can let Cloudflare create and manage a robots.txt file on their behalf, with entries that tell crawlers not to use the site's content for AI training.
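A robots.txt file declares per-crawler access rules using `User-agent` and `Disallow` directives. The entries below are an illustrative sketch of what such a file might contain; the exact list of user agents in Cloudflare's managed file is an assumption here, though GPTBot (OpenAI), CCBot (Common Crawl), and Google-Extended (Google's AI-training opt-out token) are real, documented AI-related crawler identifiers.

```text
# Illustrative entries blocking known AI training crawlers site-wide.
# The actual user-agent list managed by Cloudflare may differ.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

Note that robots.txt is advisory: well-behaved crawlers honor it, but it does not technically prevent access, which is why a network-level block is a useful complement.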
Second, all customers can choose a new option to block AI bots only on the portions of their site that are monetized through ads.
Creators who monetize their content by showing ads depend on traffic volume: their livelihood is directly linked to the number of views their content receives. These creators have traditionally allowed crawlers like Google's onto their sites to make them more discoverable and drive traffic. Google benefits by delivering better search results, and site owners benefit through increased views.
But recently, a new generation of crawlers has appeared: bots that crawl sites to gather data for training AI models. While these crawlers operate in the same technical way as search crawlers, the relationship is no longer symbiotic.
AI training crawlers use the data they ingest from content sites to answer questions for their own customers directly, within their own apps. They typically send much less traffic back to the site they crawled.
A Cloudflare study found that, while Google crawls a site about 14 times for each referral it passes on, some AI crawlers crawl anywhere from 1,700 to 73,000 times for each referral.
Cloudflare has supported robots.txt configuration for some time, but few customers have implemented it. Now it offers a fully managed robots.txt service that makes it much easier to control crawler access.