🎂 Auditing and controlling AI models
Presented by: Sam Rhea
Originally aired on October 17 @ 8:30 PM - 9:00 PM EDT
Welcome to Cloudflare Birthday Week 2024!
2024 marks Cloudflare's 14th birthday, and each day this week we will announce new things that further our mission: to help build a better Internet.
Tune in all week for more news, announcements, and thought-provoking discussions!
Visit the Birthday Week Hub for every announcement and CFTV episode — check back all week for more!
Transcript (Beta)
All right. Hello, everyone, and welcome to Cloudflare Birthday Week. In case you are unfamiliar with what that means, every year we here at Cloudflare celebrate the anniversary of our first launch.
And we do so by giving back to the Internet. Because if you think about the role we're fortunate enough to play in helping build a better Internet, a lot of that depends on the kind of collaboration and trust that we build with other providers in the space and, most importantly, with our customers.
And so today, we're going to be talking about one of the several features that launched this week during Birthday Week.
And that's our new AI audit and control experience.
So I'm going to go through a bit more about what that means, the problem it's solving, how you can use it, and what's available to every single customer today at no additional cost.
But first, what makes a Birthday Week announcement?
Why do we launch anything during Birthday Week? And what qualifies as a Birthday Week announcement?
That's a really interesting discussion, because a lot of people might think: oh, this is Cloudflare's big annual conference, so they're going to be launching large, enterprise-focused announcements, the kinds of things that only the largest customers might be interested in.
And that's really the opposite of what makes a Birthday Week announcement.
It's not that our largest customers aren't interested in these announcements.
Many of these announcements are going to be really exciting for customers, large and small.
But it's that the goal of this week is not necessarily to just focus on the biggest Internet properties.
The goal is to help build a better Internet by giving back to the Internet.
And that often means taking tools that were historically available only to the largest Internet properties in the world and making them available to everybody.
Because again, helping build a better Internet means one that is faster and safer for everyone who's participating on that Internet.
And so we're excited today to talk about this new AI audit and control feature, because we think it lines up really closely with this goal that we have for Birthday Week.
It starts with a pretty consistent problem that customers, again, large and small, experience.
And that's this challenge of, wait a second, I have a website on the Internet.
And for the last decade and longer, there were good bots and there were bad bots hitting that site on the Internet.
The bad bots, of course, were the kinds that wanted to ruin my customers' experience by cutting in line if I was selling tickets, for example, or to take down my website.
And the good bots are the kinds of bots that would index my website so that my website would appear in a search engine, for example.
And I want that appearance in that search engine because that drives traffic to my website.
And so if you think about Cloudflare's bot management solution, what we've historically tailored it around is how can we give people the tools to block bad bots and to allow good bots.
And that was a pretty easy distinction, not an easy implementation or technology, but an easy distinction, good versus bad, for years and years.
But over the last couple of years, a new type of bot has popped up on the Internet and it kind of sits in a weird middle ground.
And those are the bots associated with large language models, these artificial intelligence tools that consume vast quantities of data to train models that can do things like write an email or write code or draw a picture of myself surfing as a Lego cartoon character here in Lisbon, Portugal, where I'm based.
And those tools rely on vast inputs of data and information.
And they typically or historically have gotten that by crawling the Internet and essentially just reading the Internet.
And that's something that in some senses is powerful because what these tools can do is help all of us create more and do more.
But at the same time, it poses a risk for the kind of traditional exchange of value between publishers and sites on the Internet and the kinds of services that aggregate or crawl these kinds of publishers and sites on the Internet.
Because historically, if you think about a good bot, it's taking my website, it's indexing it, seeing what it's about.
And then a user who's going to search for the kinds of content that is on my website will see my website as a link somewhere, click on it, and suddenly they arrive, I get traffic.
And whether that traffic drives direct revenue through ads on my website or just contributes to people maybe coming to my website and signing up for a subscription or generally coming to my website as part of building a brand because I want to sell other goods or services, that's valuable to me.
And so I want to be indexed, I want to be crawled. But with these AI tools, it takes away some of that motivation.
In fact, it might even discourage websites and Internet properties from participating on the open Internet.
Because again, if you're thinking about an LLM-based tool instead of a traditional search engine, what these LLMs might do is take the content of my website, of hundreds or thousands of other websites on the Internet, put it all into a blender and then produce new output for a user.
So say I have a personal blog that talks about Lisbon travel; the LLM tool might read it, along with all these other travel websites on the Internet.
And if somebody is searching for tips about things to do in Lisbon, they will just search inside of that LLM, they'll ask the LLM, and it will produce a response based on my content and the content of everybody else, but without necessarily attributing the content to, hey, you know, this was Sam's blog and this other blog about Lisbon.
If you want to learn more, go to those blogs. Instead, a user who's looking for tips of what to do in Lisbon might never go to any of our websites.
Instead, they're just going to live in that LLM experience, and that poses a risk.
Because if I no longer have motivation to publish content on the Internet about what to do in Lisbon, I might just stop.
Because again, my traffic is decreasing because it's going towards these new intermediaries.
And so we want to give customers with this kind of problem a new tool set to regain control, to first have some transparent visibility about these bots and these tools.
And then second, give them some levers to either stop this kind of scanning and crawling or stop specific types of scanning and crawling and allow others.
And then finally, something that we're previewing today is a new place where customers can set a value on their content.
And model providers who want to scan it can decide whether or not they want to scan it and pay for it.
And that way, if I have, again, that Lisbon travel website, I'm less concerned if that content comes through an LLM or a traditional search engine.
Because if it's coming through an LLM, I've set a price for that kind of access.
And so at the end of the day, I continue to be motivated to publish great content on the open Internet because there's a new way for me to create value from that content in this new era of LLM tools.
And it all starts with this new tab. So what you see on the screen is live for all customers everywhere on all plan types today.
And this is the tab inside of your zone dashboard.
So if you go into the Cloudflare dashboard and pick one of your websites, you'll see this on the left-hand toolbar, AI audit.
And what it is, without any additional work from any customer, is a comprehensive analytics suite showing what AI bots are doing with your content.
So if you see here, I've got my personal blog at samrhea.com, and you can see all the AI activity over the last 14 days.
We could also expand this out to the past month. Let's go ahead and do that. And you'll see pretty common, popular AI model providers or folks who operate in this space like Anthropic and ByteDance and OpenAI and Perplexity.
And you'll see some activity over the last month.
And I can also see who's been the busiest. And frankly, the first time I looked at this data on my own personal blog, I thought, holy smoke, that is way more crawling than I ever expected on a personal blog that hosts goofy stories about Texas or Portugal.
This is pretty overwhelming to some extent.
And so I can see who's the busiest and not just who's the busiest, but also what is the type of scanning that they're doing?
And this is an important distinction because some of these bots, they just want to read the Internet like this ByteSpider AI data scraper bot.
They want to consume as much of the Internet as possible to train or improve foundational LLM tools.
Whereas these AI search crawlers like these here, they're a little bit different.
These are the kinds of tools where I go and I search for something in one of these tools, and it returns a response with links, with attribution in the form of these links.
So I might be a little more comfortable with these search crawlers because in that case, there's still a chance that those users are going to click through to my website.
Whereas user action means a user is inside of one of these LLMs and they are asking a question like, hey, can you go search for content about this Lisbon subject?
And that might then go look for my blog somewhere. So it's a kind of subtle distinction between user action and search crawler.
But again, these are slightly different from just pure data scrapers, which are just reading as much content as they can.
And then we can also see what is the most popular. So robots.txt, which is a pretty common way for websites to signal what they do and do not want to allow, is, no surprise, pretty popular.
But you can also see the different posts, different images, different content on my site that is popular.
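For illustration, not something shown in the episode: a robots.txt that asks AI training crawlers to stay away, using GPTBot and CCBot as examples of published crawler names, while still welcoming a traditional search crawler, might look something like this. Keep in mind that robots.txt is a request, not an enforcement mechanism.

```
# Ask AI training crawlers to stay out of the whole site.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Traditional search engine crawlers remain welcome.
User-agent: Googlebot
Allow: /
```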
But one thing we do want to give customers here is the ability to filter down to just some of these use cases or types of bots.
So in this case, I just want to look at search crawlers.
I want to see who's been the busiest, how frequently have they been busy, and what are these search crawlers looking for?
Or maybe I want all bot types from a popular model provider like OpenAI. I can see that here.
They've got three different bots that they operate and publish, and this is the activity.
And this is pretty powerful because again, we want to give customers transparent visibility in ways that would have taken months of effort historically with a lot of log data.
But OK, if I'm a customer, I'm a user, I have a website, I see all this data and I think, oh, wow, this is a lot.
I do not necessarily feel comfortable with this just yet. I need a timeout. I need to pause to kind of make an internal decision about what is my policy going forward with these tools.
Well, if that's the step that these users want to take, they can click this block all button.
And with a single click, they can block all AI scrapers and crawlers, which again, gives them a timeout, gives them a way to say, look, this is too much.
I'm not ready for this. I need a break. I need a pause before I can decide what to do next.
But maybe they take that pause and they come back here and they say, all right, I've made a decision.
Either maybe I've entered a direct agreement or direct contract with one of these model providers, and so I only want to allow them, but nobody else, or I only want to allow user action and search crawler, but not data scrapers.
They can click this button and come build those rules with a lot of granularity inside of our WAF dashboard, which is a really powerful way for our customers and users, again on any plan, to take control over what happens with their content.
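For illustration, a sketch rather than the exact rule from the demo: custom rules like these are written in Cloudflare's Rules language, and an expression that blocks a few well-known AI data scrapers by their published user-agent strings might look like this:

```
(http.user_agent contains "GPTBot")
or (http.user_agent contains "Bytespider")
or (http.user_agent contains "CCBot")
```

With the rule action set to Block, only those scrapers are stopped; search crawlers and regular visitors pass through untouched, and you can adjust the list of user agents to match whatever policy you decide on.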
One other thing that we've previewed today is the ability to set a value for their pages, or their entire websites, and then have model providers who are interested in crawling and accessing that data agree to pay that value, that price, in order to access that data.
Cloudflare is in a really great position to do exactly that because unlike a robots.txt file, which is really just kind of a warning sign without a lot of enforcement, we can enforce that.
We are the reverse proxy and the WAF for these websites, and so we can make sure that if these bots are going to access that content, they do so only as long as you allow it, or, with this new value tool, only once the value you've set has been agreed upon.
That's a really powerful place that we're going to take this over the next few months.
But again, this demo is pretty quick, because I would much rather you go look inside of your own Cloudflare dashboard and experience this than just see mine. In general, it's something that we're really excited to bring to our customers, because it starts to shift some of that power back into the hands of the site owners on the open Internet. We're excited to see what this allows our customers to do, and hopefully it encourages customers and Internet properties everywhere to continue to publish great, high-quality content, and to feel like they now have visibility and control, and soon value, in this new world of LLMs.
We're really excited. Again, available to all customers, all plan types in the dashboard today.
Go ahead and get started. If you have questions, let us know.
Send us a support ticket, or let us know on Twitter or wherever else you contact us.
We would love your feedback, and we're really looking forward to you and customers of all sizes using this new tool.
All right, thank you so much.