Originally aired on October 20 @ 11:00 AM - 11:30 AM EDT
The online content landscape is changing rapidly, and with the rise of AI crawlers, creators face new challenges in controlling and monetizing their work. Join David Liu, Sr. Product Marketing Manager of AI, and Will Allen, VP, Product Management at Cloudflare, as they discuss the groundbreaking "pay per crawl" initiative. This innovative solution empowers site owners to set terms for AI access, ensuring fair compensation and greater control over their digital assets.
David and Will will explain how Cloudflare is building a more equitable internet by enabling publishers to monetize AI access to their content. They'll demonstrate how site owners can audit what AI crawlers are accessing their content and choose to block, allow for free, or set a price for access. They'll also highlight Web Bot Auth, an open standard for cryptographic verification of crawlers, which is crucial for these new payment mechanisms.
Don't miss this opportunity to learn more about pay per crawl.
All right, hey everybody. I'm David from Cloudflare, and today we are going to be talking about pay per crawl and AI Audit, this big launch that we had. I'm joined here by Will. Will, do you mind giving everyone a quick intro of what you do here at Cloudflare?
David, awesome to be here with you. Thanks for setting all this up. I'm Will Allen. I oversee product for a couple of areas, in particular the ones we're talking about today: how do content creators, publishers, news organizations, and anyone who has content on the web intersect and interact with the world of AI, and how do we bridge that gap and build great products for both sides?
Yeah, I'm really excited to dive into the pay per crawl aspect, but I think it would be good if we start from the beginning. Last September we launched AI Audit. Can you give a little background on that first?
Yeah. When you put your content on the web, and this is true from the smallest of small websites, like my personal websites, to the largest of large enterprises, it gets crawled and, you know, attacked by bots across the board instantaneously.
It's pretty wild to see how that happens. And, you know, we are obviously quite good at understanding what's going on there.
But there was a particular type of bot that was having a much bigger impact on how we consume content, how we think about content, and how content is ingested.
And that's largely these crawlers from both foundation model companies and new AI-powered search across the board.
What we realized and what we heard from publishers and website operators and news organizations was that they just didn't know what was going on, right?
What is happening to my content? Who was coming to access this?
And so the first stage is to just get a lay of the land. And so we launched AI Audit, which is a name that only accountants would love.
And I love accountants. So it's nothing against them.
We need a better name for it. But just the idea is that how can you quickly get the lay of the land?
Which crawlers from which AI companies and which foundation model companies are accessing my content?
How frequently are they accessing the content? Are they obeying my robots.txt?
And just understand what's going on. And so that was sort of stage one of what we did.
And we think that was a really important stage because, just as in any sort of OODA loop, you have to observe first and foremost.
And so giving every website, and this is true on our free plans, again, all the way up to the largest of enterprise deals, the ability to observe what's happening is really important.
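To make the audit questions above concrete (which crawlers, how often, and are they obeying robots.txt), here is a minimal sketch in Python. It is not how AI Audit is implemented; the crawler names, the request log, and the robots.txt rules are all made up for illustration. It only uses the standard library's `urllib.robotparser` to check each logged request against the site's rules.

```python
from collections import Counter
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for the site being audited.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /private/
"""

# Hypothetical request log: (crawler user agent, path requested).
REQUESTS = [
    ("ExampleAIBot", "/posts/1"),
    ("ExampleAIBot", "/private/drafts"),
    ("OtherBot", "/posts/1"),
]


def audit(requests, robots_txt):
    """Count requests per crawler and how many of them ignored robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    hits = Counter()
    violations = Counter()
    for agent, path in requests:
        hits[agent] += 1
        if not parser.can_fetch(agent, path):
            violations[agent] += 1
    return hits, violations


hits, violations = audit(REQUESTS, ROBOTS_TXT)
for agent in hits:
    print(agent, "requests:", hits[agent], "robots.txt violations:", violations[agent])
```

A real audit would read actual access logs and would also need to confirm that a user agent string is genuine, which is a separate problem from counting requests.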
Yeah. So, you know, one point of clarification, Will, is I thought I wanted bots to crawl my website so I rank higher in, let's say, Google. So what was happening with this new age of crawlers?
The truth is you might want that.
Again, think about anyone who creates content. Maybe you're building dev docs, maybe you're in a news organization.
Maybe you're a publishing company. Maybe you're a graphic designer. Maybe you're a photographer and you put your work online.
People have different perspectives and different business objectives of what they want to have happen to their content.
It all starts with understanding, again, the audit, the orienting: what's going on with my content?
So let me audit and understand what's happening.
Which crawlers are accessing my content, and how frequently are they accessing it?
And that's stage one. The second step is really important. And that's where you, as the individual or as the organization or as the company, need to decide what you want to have happen.
Maybe you're like, I want everything to be written for the AI.
I want to write for the age of superintelligence.
Great. That's amazing. You should be able to do that. Maybe you're like, I want to write, but I only want it to be ingested by these AI companies and I don't want it to be ingested by these AI companies.
Great. You should be able to do that. Or you could say, you know what?
I don't want any of this right now. I just want to hit pause on all of these things.
Great. You should also be able to do that. The core concept is that in that second step, once you've got the audit, you get to define your business objectives and your priorities. That's the second step, and it's really important that you are in control. And the last step is to enforce those. Once you've decided what you want to allow or not allow, you should be in control of being able to block, allow, or selectively block various crawlers from accessing your content. The key thing that we worked on announcing last week, and that we're pushing towards now, is putting you, as a website operator, a content creator, a publisher, or an AI company yourself, back in the driver's seat for what happens to your content when you put it out on the web.
How does this exchange now work?
If I'm a publisher, or let's say I write a travel blog, and I see crawlers through AI Audit, how does pay per crawl get introduced into the mix?
So imagine you've got a travel blog, or I've got a blog about Will's guitars, and my website is the spot in the world to talk about guitars.
Again, first and foremost, what's going on?
Which crawlers from which companies are accessing my content? So sort of orient yourself and audit sort of what's going on.
The second one is, again, defining your policies. And you could say, hey, you know what?
I'm getting a lot of referral traffic from these particular crawlers.
I want to let them through for free.
Or maybe I've struck a deal with one of them and I have a licensing agreement and that's a side agreement I have.
Great. You should let them through for free. In the AI Audit dashboard you have literally one-click toggles for every well-known crawler that's out there, and you can choose this binary option to block or allow through for free. If we did nothing else and just did that, that is really powerful, right? It's just giving you the metrics of what's going on and a one-click toggle to either allow or block. Before we go to pay per crawl, one thing that's really important is that we launched, or I should say proposed, a new open standard. We call it Web Bot Auth, which is cryptographic verification of crawlers and, maybe most interestingly, of agents. The internet is powered by cryptography. Think about DKIM for email. Think about TLS/SSL certs for websites. Web Bot Auth is that same concept of using public key cryptography to sign a request on behalf of a crawler or bot. Now, why that matters is so that you can know when a crawler from a particular company is actually coming from that company.
You know, the first step to being able to actually enforce either blocking or allowing is knowing that, yes, this is who they say they are.
So getting that sort of identity layer out into the world was a really important foundational step that we worked on.
We've got an incredible team of researchers who sort of pushed for that open standard.
It's an open standard. If people want to see it changed or improved, you know, they can and should join the working groups to make that better.
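As a rough illustration of the signing idea described above, the sketch below assembles the text that a crawler would sign, loosely following the signature base format of HTTP Message Signatures (RFC 9421). The covered components, the tag value, the key ID, and the URLs are all assumptions for illustration and are not taken from the talk. The actual signing step, which would use the crawler's private key through a cryptography library, is deliberately left out.

```python
def build_signature_base(authority, signature_agent, key_id, created, expires):
    """Assemble the text to be signed, in the style of RFC 9421.

    Which components are covered, and the tag value, are assumptions
    made for this sketch.
    """
    params = (
        '("@authority" "signature-agent")'
        f";created={created};expires={expires}"
        f';keyid="{key_id}";tag="web-bot-auth"'
    )
    lines = [
        f'"@authority": {authority}',
        f'"signature-agent": "{signature_agent}"',
        f'"@signature-params": {params}',
    ]
    return "\n".join(lines)


# All values below are hypothetical.
base = build_signature_base(
    authority="blog.example.com",
    signature_agent="https://crawler.example.com",
    key_id="example-key-id",
    created=1700000000,
    expires=1700000300,
)
print(base)
```

The crawler would sign this text with its private key and send the result in request headers. The site, or the network in front of it, would rebuild the same text and check the signature against the crawler's published public key, which is what lets it trust the identity without relying on IP addresses.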
That was a really important innovation that then unlocks the ability to say, great, I want to block or allow this particular one, and in particular to do that at scale as agents come along that go beyond the well-known crawlers. But we also realized that the binary option of blocking or allowing is kind of limiting. There are a lot of scenarios where you're saying, yeah, I don't want to give this stuff away for free, but I also don't want to block it forever. That feels like not fun or not interesting. I actually think these amazing interfaces are part of the future.
And I want to lean into those and be part of them, but I'd like to get compensated for them. Right.
And so I'd love for my content to be used in search or, you know, test-time compute or inference or RAG, whatever you might call it, but I want to be paid in some way, shape or form.
And we heard that loud and clear from so many different folks.
And so we said, how do we start to build out an experiment here?
You know, really our first step to get this out there and to begin to learn.
And so that's where we launched pay per crawl. The idea is that it goes beyond this binary option of blocking or allowing. What if you had a third option? For any particular crawler, instead of blocking or allowing, what if you could charge it every time that crawler came and accessed a web page on your domain? Could you charge it at a price you set? Maybe it's five cents, fifty cents, five dollars, five hundred dollars. You as a publisher set the price, and it's done at the network level. So anytime that crawler comes and accesses your site, you could get paid for it. And then that crawler has what we think is a great experience of being able to crawl the internet: when somebody says, hey, you're blocked from accessing this content until you pay, you as the crawler get an easy way to pay at scale and license content across the board.
Again, it's an early experiment, but the level of feedback we're getting so far from across the market, from people on both sides of that equation, is phenomenal.
And we're just excited to keep building here.
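The three options described above (block, allow for free, or charge a publisher-set price) can be pictured as a small per-crawler policy table. This is only a sketch of the idea, not Cloudflare's configuration format; the crawler names, the prices, and the choice to block unknown crawlers by default are all assumptions.

```python
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum
from typing import Optional


class Action(Enum):
    BLOCK = "block"
    ALLOW = "allow"
    CHARGE = "charge"


@dataclass(frozen=True)
class CrawlerPolicy:
    action: Action
    price_usd: Optional[Decimal] = None  # only meaningful when action is CHARGE


# Hypothetical crawler names and prices, chosen by the publisher.
POLICIES = {
    "ReferralBot": CrawlerPolicy(Action.ALLOW),
    "ScraperBot": CrawlerPolicy(Action.BLOCK),
    "TrainingBot": CrawlerPolicy(Action.CHARGE, Decimal("0.05")),
}

# Assumption for this sketch: crawlers without an explicit policy are blocked.
DEFAULT_POLICY = CrawlerPolicy(Action.BLOCK)


def policy_for(crawler_name):
    """Look up the publisher's policy for a verified crawler."""
    return POLICIES.get(crawler_name, DEFAULT_POLICY)


print(policy_for("TrainingBot"))
print(policy_for("UnknownBot"))
```

Prices are kept as `Decimal` rather than floats so that amounts such as five cents are represented exactly when they are tallied later.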
If I'm the AI company and I have a crawler, where do I see this information that I need to pay to crawl a website?
Yeah. So we had to unlock a couple of things to make this work. First was the authentication side. Can we know that this crawler is who they say they are? A lot of that has historically been done through IP ranges and reverse DNS lookups. That is fine if you're a very big company with a big crawler. It doesn't scale to agents, who might be on shared infra, and even smaller up-and-coming companies had similar problems where they didn't always have a dedicated IP. So they needed a way to say, yes, this is me, this is my crawler. The authentication via Web Bot Auth was the really important first step that I think is critical. The second was this: imagine I'm an agent. I've set up a new agent and it's going off into the world to get content and find things. It will get an HTTP 402 response from a publisher or content creator who has set this up and said, I want to charge this crawler. And this is a part of the internet stack that was always reserved for future use.
And it basically says, you cannot get this content unless you pay for it, like payment required as part of this.
And we felt like there's something powerful about this core network-level protocol that hadn't been used, but that we could dust off and bring back. So a crawler goes out and says, great, I want to get the latest post from David's blog. But you said, I want to charge for this, which is great.
You had set the price.
Me, as a crawler, I get that HTTP 402 response back from your blog.
It says, payment required to access this post.
And it's going to cost whatever the amount is that you would set as sort of the proprietor or the publisher of that post.
And then me, as the crawler, I can then decide, great, I want to pay for that.
And if so, I make another request and include in my header, like, here's the amount I'm willing to pay.
And if I do that, I basically get the content.
We, as Cloudflare, then help tally that up at the end of the day. Or I could say, you know what, I don't want to pay for it, and that's fine as well. But the idea is that there's a payment mechanism and a price discovery mechanism at the HTTP level. As a crawler going out looking to ingest content, you can see: do I get the 200 response, where I get the actual content back? Do I get the HTTP 403, the forbidden response, where I can't get access to this? Or do I get the 402 response, which says you could get access to this if you pay? And then it's about making it easy for you as a crawler or agent to pay, and easy for you as a publisher to set up the payment terms.
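From the crawler's point of view, the flow just described comes down to branching on the status code and, for a 402, comparing the quoted price with a budget. The sketch below shows that decision logic only. The header name in `build_retry_headers` is a hypothetical placeholder, since the talk does not name the real headers, and the budget figure is made up.

```python
from decimal import Decimal


def next_step(status, quoted_price, budget):
    """Decide what a crawler does with a response.

    status: HTTP status code returned by the site.
    quoted_price: price from a 402 response, or None if absent.
    budget: the most this crawler will pay for one page.
    """
    if status == 200:
        return "use_content"
    if status == 402 and quoted_price is not None and quoted_price <= budget:
        return "retry_with_payment"
    # 403 Forbidden, a 402 that is too expensive, or anything else.
    return "skip"


def build_retry_headers(price):
    """Headers for the second request. The header name is hypothetical."""
    return {"x-example-agreed-price": f"USD {price}"}


budget = Decimal("0.10")
print(next_step(200, None, budget))
print(next_step(403, None, budget))
print(next_step(402, Decimal("0.05"), budget))
print(next_step(402, Decimal("5.00"), budget))
print(build_retry_headers(Decimal("0.05")))
```

Keeping the decision in one pure function makes the price discovery step easy to test without any network access; a real crawler would wrap it around its HTTP client and its Web Bot Auth signing.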
Got it.
And you mentioned since launching, you've gotten positive feedback. I'm wondering if you could share some more details about that, both from a publisher's side as well as a crawler's side.
On the publisher's side, folks want control.
I think what it comes down to is that they want to do what they are best at, which varies if you're running a news organization, or if you're running a big publishing business, or if you're a small indie publisher. Whatever it is, you want to do what you do, right?
So if you're writing your travel blog, you're writing your guitar blog, if you're breaking the local news, you want to get back to focus on that.
And you don't want to have to spend all the time trying to identify what are these 50,000 different bots coming to access my content.
And so the outreach and interest from them has been phenomenal. Because we want to make it very, very simple for them to do this, where we do all of the heavy lifting, and they can set up AI Audit in a couple of clicks and get this ability to understand what's going on and then take action.
So that's been phenomenal and really great. What's been amazing to me, maybe not surprising, but just really great is the level of interest from different AI companies across the board at all different stages, which I think is interesting.
A lot of up-and-coming startups I've spoken to are saying, hey, this is great, because I don't have a business development team that I can task with brokering these one-off partnerships, because I'm a startup.
And so the ability to have a programmatic way to get access to content without having to know the right person to call at this particular publisher opens up the playing field for them, I think, in a really powerful way.
And that was a really important concept that, you know, I certainly held and a lot of us held as we were building this is how do you make this a level playing field across the board?
How do you encourage innovation and encourage new business models to come out of this?
And we think, again, pay per crawl is one model. There will be many models that come out of this, and many open standards, and we're excited to experiment with and support many of those. But the interest from people on the demand side, looking to access content, has been really fascinating. People are building different types of agents and different ways to operate remote browsers, and people are looking to ingest content at large scale for training or foundation model training. We want to, again, make that an incredible experience for them and give the publishers control over what happens.
And so if I'm interested in joining this movement that Cloudflare has created, how do I do this?
How do I get more information?
How do I start charging for my content?
Or if I'm a crawler, how do I start paying for content?
If you have a website or a property, again, at whatever scale you are today, you can log in and see AI Audit.
It's in the dashboard.
And that's true for our free plans, all the way up to the largest of large enterprises.
So that's available for you today, regardless of your plan. I would actually love to hear your feedback, what works, what doesn't work.
Send me a note on Twitter or LinkedIn. I would love to hear that feedback directly because we have a lot of plans to make that better and to make that work.
On the pay per crawl side, we're starting small with a small private beta, and that's intentional.
I want to make this great. I want to really learn from the small group of publishers we're working with and the small group of crawlers that we're working with to make it a fantastic experience for both sides as we start to scale this.
So if you're interested, we've got a landing page.
We'll link to it where you can sign up and get more information from there.
You'll be added to our waitlist.
We are rapidly working through that list to bring people on board as fast as we can.
We're keeping our standards very high and making it a great experience.
So look for the landing page and sign up and let us know if you're interested.
Will, I know this is still early days.
I'm wondering, any thoughts on where this is headed next, what the team wants to build next?
There's so much. I mean, my laundry list of ideas is nearly infinite right now. When I talk about product development, I often say that your wish list is infinite and your resources are finite, so a lot of it is just finding the right things to prioritize next. Look, on the publisher and content creator side, there's so much we want to do. We introduced this managed robots.txt feature that allows you, if you don't have your own robots.txt, to have us manage one on your behalf. That's an incredibly helpful feature, we've heard from so many customers, and they want to do more. So how do we take that and customize it on a per-user basis? Basic stuff. How do we bring many more metrics into your view of what's going on? What's the referral traffic coming from each one of these particular crawlers? How frequently are they coming through? Fleshing out the metrics side of that is really important and interesting. On the crawler side, how do we continue to make it easier to sign up and be verified through the new Web Bot Auth verification? It's been, again, phenomenal seeing that interest so far.
I'm really excited to push that forward and really lean into how this works for an agent working on your behalf, right?
So it's not just important for the world's largest crawlers who are ingesting lots of content.
It's really interesting and impactful when you think about agentic workflows and agentic paywalls. Again, imagine your deep research going out into the world on your behalf.
How do you carry that authentication through, and then carry easy-to-use payment mechanisms through?
So there's stuff that we're pushing forward on the agentic side that I think is going to unlock a lot of new use cases and business models. But also, tell us. That's probably the biggest thing. If you're watching this and you're like, oh, I wish Cloudflare would do X, Y, and Z, I would love to know so we can make it happen.
Yeah. And the one thing that I haven't asked you, one of the fundamental things, is why is Cloudflare uniquely positioned to be able to provide this type of service to publishers and crawlers? Can you go into a little bit about that?
A lot of it is like part of just what we uniquely do.
We do a lot of different things, you know, better than anyone.
We have the core, you know, the cache side of the business and the bot management side.
We have phenomenal zero trust security work.
We have an amazing developer platform.
We can build real-time applications at scale and, you know, really cool stuff with R2 and Durable Objects.
So, the amazing developer platform.
So all these different pieces, but like, there's like a core interesting, like networking piece of like you...
Like, if you have a piece of content and you're a content creator or publisher, you want to get that to your users across the globe as quickly and securely as possible in a way that's, you know, free from DDoS attacks and malicious bots that are out there.
We do that really, really, really well.
And that intersection of getting this piece of content to a particular request across the globe is kind of how a lot of this stuff works, right?
It's how crawlers today are ingesting content, and it's how agents tomorrow will be ingesting this content. So just where we sit in that tech stack gives us a lot of ability to innovate here. I think that's probably the first and foremost thing. The second one is the commitment to open standards and to building these open verification standards. I keep talking about it, but it's really critical: the ability for anyone to sign the requests from their crawler, agent, or bot, and then for any website, publisher, content creator, or site operator to understand what's coming to them using these open standards, I think is fundamental. That's an absolutely key part that hadn't been solved before, and we've got really good solutions live in production today that we're going to keep pushing on. So I think the combination of those two areas gives us a lot of flexibility to push forward here and to build something great.
Last couple of questions here, Will. What are some of the frequently asked questions that you get from publishers and from the crawler side that we haven't covered yet?
A lot of folks are interested in variable licensing, which I find pretty fascinating. If you think about different uses for content, you could imagine licensing it for training versus inference or RAG versus search versus something different altogether. There's a lot of interest in how we think about variable types of licensing, and in more true marketplace dynamics, with a real bid-ask out there.
Our version today, like we fully recognize, like we're just starting on this.
It's where like the publisher sets the price and the crawler can sort of take it or leave it.
But the ability to actually have a bidding system, we think, is pretty fascinating.
And we're starting to really get excited about thinking about that.
There's sort of new areas to sort of push into.
And then I think we just continue to think about other open mechanisms.
Like, what does it look like to think about smaller payments via stablecoins?
How do we think about other open standards that are out there that we could begin to push into? I am excited about all of those different areas.
Wow, that's really interesting. So yeah, Will, I think this was a really nice overview of AI Audit and the big launch with pay per crawl. And like Will said, make sure you go to the pay per crawl landing page. We'll make sure to link that. We also have a blog post about Web Bot Auth that you can read. It's very detailed, so you can learn all about that. Thank you so much, Will. Really appreciate you coming on here and talking about all of this.
Amazing, thank