Redirects for AI training: edge-enforcement of canonical content

Presented by: Cam Whiteside, Craig Dennis

Originally aired on April 17 @ 9:30 AM - 10:00 AM EDT

Join Cam Whiteside, Product Manager, and Craig Dennis, Senior Developer Educator for AI, as they discuss how Cloudflare prevents AI models from training on outdated content.

Tune in to learn about these three major updates:

Edge-Enforced Redirects: Automatically redirect AI training crawlers from deprecated pages to authoritative content using a single toggle.
Verified Crawler Detection: Cloudflare identifies bots declared for training to ensure they follow canonical links without manual rule-making.
Zero User Impact: Clean up AI training data at the edge without disrupting human visitors, SEO rankings, or AI assistants.

Read the blog post:

Redirects for AI training: edge-enforcement of canonical content

Visit the Agents Week Hub for every announcement and CFTV episode — check back all week for more!


Transcript (Beta)

Hello and welcome, everybody. I am here at Agents Week. I am very excited to be talking about this post that I think we've all probably been thinking about and wondering about, and I'm so happy to have one of the authors here. I have Cam Whiteside here with me. Cam, why don't you introduce yourself?

Yeah, hi. I'm Cam Whiteside. I am a product manager on AI Crawl Control at Cloudflare, and I'm here to talk about Redirects for AI Training, which we're launching today.

Yeah. So redirects for AI training, because the Internet has some old stuff on there, Cam, doesn't it?

Sure does. Yep.

Yep. So how do we stop that from happening? I know that sometimes, and I'm going to be very specific here for the developers on here, I'll run something on Cloudflare and it will pull up an old TOML file, the way we used to do things, right? So that's a problem. I don't want people doing that. I want people using the new stuff.

Yeah. I'll call it a "build me an app" problem. It's how everybody's using agents today. It's what we're building these incredible tools for. But let's say I'm Claude, and you say, build me an app on Cloudflare. It turns out Claude and all these foundational models, they know Cloudflare. They don't always live-check these docs, so what happens is that you spin up with, for example, a Workers compatibility date in 2025, or it will occasionally use these other syntaxes. And as much as we manage to evolve our product and manage this backwards compatibility, ultimately these really large models that are still powering the newest, latest, greatest agents still have data that is basically growing stale. And that presents this real risk. At the same time, the paradox is, well, we've got to keep these docs available. And there's folks all the time that are migrating to the newer versions.
We have these human eyes that are here to perform this really specific task. But that world kind of falls apart when we talk about this AI-training-based world that feeds our latest agentic tools.

Yeah. So I know that we have old docs, right? We kind of deprecate content, right? Let's talk about Cloudflare docs, which people watching this have probably thought most about. Yeah. I mean, that TOML stuff is probably in the docs someplace that we deprecated, but we've made it so it doesn't show up in the search results when that happens. What's different about the AI world?

Well, to talk about the AI world, first we should talk about the world pre-AI and who you make it for, I'd say two users. One is going to be your human visiting it. Let's say that I'm a user on Wrangler v1, and I'm trying to upgrade to three or four at this point. I'm going to go to a page that says legacy. I'm going to see this yellow warning banner saying, no, this is not the updated content. Go to the v4 docs here to see the latest. This is for migration purposes. If I'm a search engine crawler, I'm looking at noindex tags, looking at nofollows. Obviously, your Bings and your Googles have really built their entire models around representing page structure in this very hierarchical way. This all falls apart with AI training. So if I'm building an AI training bot, I'm there for volume of content. So I'm crawling any page multiple times, and I might treat a paragraph that says deprecation just like another paragraph right next to the ones that have the content that I'm after. And it's not really clear exactly how that's weighted once it's in these relatively opaque training pipelines.
So that's a little bit opaque, but the consequences are really visible, because when you're actually prompting and building on some of these, it's clear that some information that was meant with some nuance and some controlled presentation just isn't being represented the same way when it's being built into apps or built into answers.

They're, like, not following robots.txt, right?

Some are and some aren't. But we have a difficulty with robots.txt. A lot of crawlers follow it, and it's a great tool. Cloudflare offers a lot of tools to help maintain it. But ultimately, we might want, for example, an AI agent who's working on some of this to visit it live and ingest it and notice the deprecation and help somebody with a migration. Or you want a RAG model to go in and parse it. So we have this idea that even bots within themselves have a bunch of different use cases. So we rolled out Content Signals, for example, as a way to say, hey, in my robots.txt, these are the permitted uses, whether it be for search or input or training, and users can specify yes or no. Cloudflare's managed robots.txt helps you do this. But ultimately, maintaining that at the scale that you're producing content, mirroring all of your intent across all of your content and keeping up as you produce more, is hard. As we dug in a little bit closer, we think that we found a better way to help a lot of users out.

That's what this week's all about. Finding better ways, right? What did we find? What's the plan? What are we doing?

Well, we're releasing Redirects for AI Training. It's simple. It's a toggle. It's going to be available to paid plans in AI Crawl Control. And it's just a toggle that you can flip on. When you do that, Cloudflare starts looking for two things. One is going to be a verified AI training crawler.
So these are crawlers that have declared their purpose and that are known and verifiable to Cloudflare. The other thing that we're going to look for is your canonical links in your HTML responses. Now traditionally, these are an SEO-world artifact. The idea of a canonical link is that you'll visit one page, and if there's an updated or more current, or I'd say authoritative, source of that page, a canonical link will include a reference to the more up-to-date source of information. And this is a really common pattern. I mean, canonical links, even in the agentic Internet, are growing. Year over year, the number of sites with canonical links has increased by 4%. We're close to 70% now in 2026. And even tools like WordPress or Emdash just have canonical links built in. And that is a signal that a lot of site owners currently have the infrastructure to create and maintain. And it already sort of solves this problem that says, where's the authoritative source of this information? So when you have that combined with AI training crawlers, Cloudflare can enforce that at the edge. And rather than even serving that and risking ingestion into these training pipelines that we can't control, now we can control it by redirecting them to the authoritative content.

Nice. And what status code are we using when we redirect?

We're doing a 301.

Okay. Why a 301?

Well, 301 is a permanent redirect. So that's sort of standard for our redirects. And I think the first thing that everybody asks about with this is the cache. So, if Cloudflare is caching my content, what are the downsides of this? And it's important to note that Redirects for AI Training actually happens after the cache. So what we're caching is going to be the 200s, which will be cached with this canonical link, and then we'll redirect in the response phase.
So that way, you're not risking human users or agentic users who can actually handle some of this nuance. Instead, we're really intending to let the cache still work as a cache should, while still getting these bulk ingestion crawlers where they need to go.

Super cool. So, could you do this with single redirect rules, like in the past? This is different, though. How is this different?

I think the outcome is the same. I mean, really, it's a redirect.

Yeah.

But we actually tried that. So this was a problem that we noticed on a developer docs site. The first thing we did is we sat down and tried to say, OK, what redirect rules do we need to set up? And we started making a list. And then we said, OK, we've got those lists. These are our deprecated pages. These are the URLs that have the word legacy in them. These are the URLs where we think we're shipping something that's a little bit better and different next. And then once we got that long list, we said, OK, here are the crawlers we need to match. And we started trying to figure out what their user agent strings were or their different action categories. And when you're maintaining these deprecated page lists and you're even maintaining these crawler lists, any time that a new one comes up, it's going to be another change you have to make. And I have to say, as you know, in AI Crawl Control and especially with our bots, when we're really working on expanding these lists further and further and now we're producing content faster and faster, it's a losing battle to go with that method.

So yeah, and that would take a ton of time. But what is the current time? What do we do currently? What does it feel like?

Sure. So this is logging on to AI Crawl Control and flipping a toggle. It's just like a button. Boop! And you're done.

OK. Awesome.

Yeah. So it'll be 10 seconds. So you click into your domain.
It's one toggle right there in AI Crawl Control at the top. And we'll start sending these 301s.

All right, Cam, but how do we do it? What is the technical implementation here? I've got a button that I can press, but what does that look like on our side?

Sure. So here's what happens when the toggle is on. We, with minimal latency, I'm talking less than a millisecond, look at the first content of the head in the HTML response. If it contains a link element with a rel attribute of canonical, and it has an href attribute that is both same-origin, so we're not redirecting outside of your site, and points to a different page, so we're not going to be doing any circular redirects, then that's where we'll send the AI training crawler, rather than serving them the body of the HTML, which may contain these big warning banners or this outdated content.

So Cam, are there any other limitations I should be aware of?

Yeah. So one is that we exclude cross-origin canonicals. These are cases where you're directing to another site on a completely different hostname. So this is all about redirecting to authoritative versions of your own content, first and foremost. And then, of course, we want you to be able to control both those endpoints. So we're not looking at crawler traffic thereafter. This isn't training data poisoning. This is all about really improving how you're being ingested in these foundational models.

Awesome. Any other limitations?

I would say it's important to know that this is for AI training crawlers. Especially when we're talking about Agents Week, we have AI assistants, we have great new search functionality. Right now, we're actually still seeing that about 80% of the traffic related to AI is for training.
So as much as we talk about these assistants that are helping humans complete tasks, and we see that floating more in the single-digit percentages, this leaves those be, similar to the human eyeballs, so they can help with those migrations and understand the nuance of everything that's on the page. This is for those AI training crawlers that are really just scraping content and then feeding it directly into models.

Awesome. Awesome. And we're seeing some of those. And have we put this on our docs? I'm hoping we did, because this sounds awesome. Is this on the docs?

This is on our docs. This is on Cloudflare.com. This is on our blog. So yeah, that was actually one of the original use cases we built it for: how do we improve how Cloudflare shows up in these AI answers? So we turned it on on our docs and we saw 100% redirects on canonical links from AI trainers.

Is it 100%? That's amazing.

So any verified AI training crawler that hits one of these canonical links on our docs no longer has the option to ingest this outdated content. And that means zero deprecated content being served to these verified training crawlers.

That is awesome. How do you know if the training actually improves? How do we know that?

Yeah, well, this is a hypothesis that we're working on with time. What we know for sure is that it prevents the problem from getting worse today. Models will improve as they retrain. And over months, I expect to see that we have some measurable answer improvement. But the important thing, too, is that with the system in place, as we publish new docs and we deprecate something else, we actually see the crawlers' access to this training data drop immediately. And that's the most important part. Before this, when we had a deprecation, they kept crawling at the same rates. And it would take months of potentially recrawling before they got enough feedback to actually drop off.
But now this is the unambiguous signal, and we can cut it off as soon as it's deprecated. So we expect these iteration times to improve answers to become much shorter.

That's awesome. It's awesome that, of course, you could literally see that the pages are no longer getting hit. That's awesome. Were there any other surprises that happened during that?

I don't think so, which we're thankful for. It's a reasonable tool. And you know what? Maybe the indicator here is it's zero human impact. Folks are still worried about SEO, rightfully so. Folks are worried about human impact. Can they still get the information that they need? So as we build these tools that change entire site behavior, the response type for a large percentage of your traffic, you need to make sure that your core business is still stable, whether it be for human visitors, search engine optimization, or even AI agents using your site. And that's the nice thing about no surprises: we haven't had any disruptions on those.

That's awesome. So another cool thing that I saw on the blog, because I'm sure people are wondering, is the Radar integration, right? So you can actually go and look at stuff. There are some pretty awesome numbers. People should check out this blog. You didn't write that part, though, did you?

No, that was David Belson. Yeah, he wrote the part about understanding what status codes are being sent to AI crawlers across Cloudflare's network.

Awesome. So, Cam, I think we are at that part where we need to say, hey, everybody, there's a blog already about this. You need to come read this blog and consume it. And I am so excited about what people are able to do now. Right. So thank you. First of all, thank you, Cam, for giving us a button to press. It's so cool. And for the blog post and all that. Thank you for coming.
Any last things you want to shout out?

I don't think so. It's available. It's in AI Crawl Control: one button, 10 seconds to turn on, and you improve your ingestion and AI answers.

Awesome. Thank you so much, Cam, for being here. And thank you, everybody watching at home. Make sure that you check out this blog post and all of the blog posts. I know that there's a lot. You can read them next week. We won't tell. All right. Thanks, everybody. Thanks for hanging out. Thank you so much, Cam, for being here. And we'll see you real soon. Bye.
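
The redirect decision described in the transcript (verified AI training crawler, canonical link in the head, same origin, different page, 301) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Cloudflare's implementation: the names `redirect_target`, `REDIRECT_STATUS`, and `is_verified_training_crawler` are hypothetical, crawler verification is assumed to have already happened and is passed in as a boolean, and the whole HTML document is parsed rather than only the start of the head.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin, urlsplit

REDIRECT_STATUS = 301  # permanent redirect, as stated in the transcript
_DEFAULT_PORTS = {"http": 80, "https": 443}


class _CanonicalFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> found inside <head>."""

    def __init__(self) -> None:
        super().__init__()
        self.in_head = False
        self.href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "link" and self.in_head and self.href is None:
            attr = dict(attrs)
            rel_tokens = (attr.get("rel") or "").lower().split()
            if "canonical" in rel_tokens and attr.get("href"):
                self.href = attr["href"]

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def _origin(url: str):
    """Return (scheme, host, port) with default ports filled in."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    return (scheme, (parts.hostname or "").lower(), parts.port or _DEFAULT_PORTS.get(scheme))


def _same_page(a: str, b: str) -> bool:
    """Compare path and query only; callers check the origin separately."""
    pa, pb = urlsplit(a), urlsplit(b)
    return (pa.path or "/", pa.query) == (pb.path or "/", pb.query)


def redirect_target(request_url: str, html: str,
                    is_verified_training_crawler: bool) -> Optional[str]:
    """Return the URL to redirect to, or None to serve the page unchanged."""
    if not is_verified_training_crawler:
        return None  # humans, search crawlers, and AI assistants are untouched
    finder = _CanonicalFinder()
    finder.feed(html)
    if finder.href is None:
        return None  # no canonical link in the head
    target = urljoin(request_url, finder.href)  # resolve relative hrefs
    if _origin(target) != _origin(request_url):
        return None  # cross-origin canonicals are excluded
    if _same_page(target, request_url):
        return None  # self-referencing canonical, avoid a circular redirect
    return target


legacy_page = '<html><head><link rel="canonical" href="/docs/v4/install/"></head><body>old</body></html>'
print(redirect_target("https://example.com/docs/legacy/install/", legacy_page, True))
```

In this sketch a caller would answer with `REDIRECT_STATUS` and a `Location` header whenever `redirect_target` returns a URL, and serve the normal response otherwise. Because the check reads the response HTML, it fits the ordering described in the transcript, where the 200 response is cached and the redirect is applied afterwards in the response phase.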
