AI Gateway’s next evolution: an inference layer designed for agents
Presenters: Ming Lu, Craig Dennis
Originally aired: April 16 @ 9:00 AM–9:30 AM GMT-4
Join Ming Lu, Product Manager for AI Gateway, and Craig Dennis, Senior Developer Educator, AI, as they discuss the transformation of Cloudflare’s AI Gateway into a unified inference layer specifically architected for the complexities of AI agents.
Tune in to learn about these three major updates:
- Unified Inference Layer: Access a global model catalog via a single API to easily switch between providers like OpenAI, Anthropic, and Google.
- Agent-Reliability Features: Prevent workflow failures with automatic fallbacks and streaming resilience that allows agents to resume interrupted requests.
- Unified Billing & Observability: Manage multiple provider costs through one wallet and use metadata to track the exact ROI and spend for specific agents.
Read the blog post:
Visit the Agents Week Hub for every announcement and CFTV episode — check back all week for more!
Transcript (Beta)
Hello, everybody. Welcome to Agents Week. I hope that you read this blog that we are about to talk about right now.
It is awesome. And I am here with the author of the blog, Ming Lu, and it happens to be her first blog post.
Can you please introduce yourself and tell us what you do here at Cloudflare?
Yeah. Hi, everybody. I'm Ming. I am a product manager on our developer platform team specifically for AI Gateway.
I am relatively new to Cloudflare.
I joined Cloudflare back in December through Cloudflare's acquisition of Replicate.
So I was leading product at Replicate and now I'm here.
We are so glad to have you here, Ming. You have been building all sorts of incredible stuff, some of which is talked about here in the blog.
And let's get into it.
Let's get into what we're talking about here. Overall, what problem are we solving?
What problem is going on right now? Yeah. So what we're trying to do is lean into AI Gateway more as this unified inference layer.
We have a lot of customers who are currently using AI Gateway as a way to proxy requests to model providers to get that observability.
But we're now leaning more into: this is one API that you can use to access a bunch of different models.
If you're trying to build for a real-life use case and solve a real-life problem, you're often having to use a bunch of different models, not just one model from one provider.
And we want to make that as easy as possible for people. We also want to make it really easy for people to switch out models as these models change, right?
The model landscape is getting better and changing so quickly.
The best coding model today might not be the same model or even from the same provider in three months.
And so we want to make that experience for developers as easy as possible.
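The switching experience Ming describes can be sketched in code. This is an illustrative sketch, not the documented API surface: the gateway URL shape and the `provider/model` ID strings below are assumptions.

```typescript
// Illustrative sketch: one OpenAI-style request shape for every provider,
// where switching providers is just a different model string. The gateway
// URL and "provider/model" ID format are assumptions for this sketch.

interface ChatRequest {
  model: string;
  messages: { role: "system" | "user" | "assistant"; content: string }[];
}

function buildChatRequest(model: string, prompt: string): ChatRequest {
  return { model, messages: [{ role: "user", content: prompt }] };
}

// Same request body, same endpoint; only the model string changes.
const viaOpenAI = buildChatRequest(
  "openai/gpt-4o",
  "What is the capital of the United States?",
);
const viaAnthropic = buildChatRequest(
  "anthropic/claude-sonnet-4",
  "What is the capital of the United States?",
);

// Both would be POSTed to a single gateway endpoint, for example:
// await fetch(
//   `https://gateway.ai.cloudflare.com/v1/${accountId}/${gatewayId}/compat/chat/completions`,
//   { method: "POST", headers, body: JSON.stringify(viaOpenAI) },
// );
```

Swapping the best coding model in three months would then be a one-string change, under these assumptions.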
Awesome. Are we getting specific customer feedback that's kind of driving that?
I think what we saw this, we worked with so many customers, especially at Replicate, that were building out these real world use cases.
I think especially if you look at the, you know, both for like the generative media, like people using image models and video model space, like oftentimes what we'll see is if someone's trying to get to a video output, they don't go, very rarely are they going from like text to video directly.
Oftentimes, people are generating an image or taking an image from something preexisting and then feeding that image into a video model so that they can really control what is that first frame and what is that last frame of their output.
So I think especially in this like generative media use case, there are a lot of models that you need to kind of create one workflow.
And then even if you look at the more like agentic use cases that people are building, you know, if you kind of take something like a coding agent, you know, people will often use a very, very large model to do planning, but then they'll hand off execution to smaller models that might be cheaper or, you know, maybe certain models are better for coding versus better for like writing.
And so I think like all the use cases that you kind of see, like it often requires multiple, multiple models.
Nice. Having it through one place, and thinking about it through one place, solves a lot of those problems.
In the blog post, we talked a little bit about cascading failures in an agent workflow.
So using this unified model, explain that a little bit.
Can we explain what that is? Yeah, sure. So if you think about a very simple use case, if you're building a very simple chatbot, when the user gives a prompt or asks a question, that generally gets translated into one inference request. Like, "what is the capital of the United States?" That just gets sent to the LLM and you get a response back.
If you're building an agent, it's very rare that one inference request will get you what you need.
A typical request to solve a customer support question might involve looking at the support question, then calling an MCP server to look at the documentation for that part of the product.
It might involve looking up that customer and what the current setup of their account is.
And then you would look at both of those things.
Maybe it would involve looking up a third thing, like how much money they've paid you.
And basically the agentic workflow, by its nature, often involves a multi-step approach where the output of a previous step feeds in as the input of the next step.
And so if one step fails early on, it prevents the rest of the flow from going. But also, if you have to restart that loop, all the steps you've already done are somewhat wasted, because you have to remake those requests.
So that's what we mean by cascading failures.
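The multi-step shape described above can be sketched as a chain where each step consumes the previous step's output, so a failure at any step discards all earlier work. The step names here are hypothetical stand-ins for the support-agent example.

```typescript
// Sketch of a multi-step agent workflow: each step's output is the next
// step's input, so a throw at step k wastes the work of steps 1..k-1
// unless those results are preserved somewhere. Step names are hypothetical.

type Step = (input: string) => string;

function runWorkflow(steps: Step[], initialInput: string): string {
  let value = initialInput;
  for (const step of steps) {
    value = step(value); // if this throws, every earlier step's work is lost
  }
  return value;
}

const steps: Step[] = [
  (q) => `docs-for(${q})`,     // e.g. call an MCP server for documentation
  (d) => `account-plus(${d})`, // e.g. look up the customer's account setup
  (a) => `answer-from(${a})`,  // e.g. draft the final reply
];

// runWorkflow(steps, "ticket-123")
// → "answer-from(account-plus(docs-for(ticket-123)))"
```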
Okay, cool. And so this helps fix that. Yeah, I think it adds another layer of reliability to that inference layer, so that if there's a problem at the model provider level, you are not dealing with it directly.
One thing we're trying to build into AI Gateway is this concept of automatic fallbacks, or automatic failovers.
So, for a particular model that is served by different providers: for example, if you look at some of the Claude models, they're served through Anthropic, of course, but they're also served through, I believe, Bedrock or Google Vertex.
And so for a model that is served on multiple providers, we can try your preferred provider and then fall back to some of these other providers. So that if Anthropic is having an issue, you're not out of luck; you can still get a response, because we'll try those secondary providers.
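The fallback pattern just described can be sketched as follows. AI Gateway does this server-side; this client-side version only illustrates the idea, and the provider names and the `callModel` helper in the usage comment are hypothetical.

```typescript
// Minimal sketch of automatic fallback: try the preferred provider first,
// then fall back to secondary providers serving the same model. Provider
// names are illustrative; AI Gateway performs this logic for you.

async function withFallback<T>(
  providers: string[],
  run: (provider: string) => Promise<T>,
): Promise<T> {
  let lastError: unknown;
  for (const provider of providers) {
    try {
      return await run(provider); // first success wins
    } catch (err) {
      lastError = err; // this provider is having an issue; try the next one
    }
  }
  throw lastError;
}

// Usage sketch: the same Claude model, three possible providers.
// const reply = await withFallback(
//   ["anthropic", "bedrock", "vertex"],
//   (p) => callModel(p, "claude-sonnet-4", prompt), // hypothetical helper
// );
```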
And that's awesome. And that's super important in that agent workflow, like you were saying, because if one of them fails, you don't want to have to go manually switch to another one.
Super cool.
Awesome. One thing that I'm super excited about is this one catalog, this one endpoint.
Let's talk a little bit about what that's bringing.
What is that bringing to us as developers? Yeah, so we've had AI Gateway out for a while, and AI Gateway has never had its own model catalog.
You just kind of had to know what models were available.
And we've heard a lot of customer feedback wanting to know: what models can I call? What models are supported through unified billing?
When a new model comes out, how do I know that it's out now and that I can use it?
And so we're releasing what we're calling a unified catalog for all of the inference that you can run through Cloudflare.
So there'll be one place you can go to see what models are available through Workers AI, as in hosted on Cloudflare's GPUs, and what models are available as third-party proxies.
And I think that'll just be a really great way to help people discover what models are available.
Because you might not know the exact model that you want to use, and you might not know all the models Cloudflare offers, or even what models a given provider offers.
And then you can also use them all within one interface, right?
So the big thing that we're launching here for AI Gateway is integration with Workers AI bindings.
So in a Worker, before, you could call Workers AI's Cloudflare-hosted models, but now you can also call third-party proxied models through the bindings interface.
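A sketch of what that looks like from a Worker: the model IDs and the gateway name below are illustrative assumptions, and the binding interface is narrowed to just the shape this sketch needs. The point is that Cloudflare-hosted and proxied models go through the same call.

```typescript
// Sketch of calling models through the Workers AI binding. The model IDs
// and gateway name are illustrative; the AiBinding interface is a minimal
// stand-in for the real binding type.

interface AiBinding {
  run(model: string, input: unknown, options?: unknown): Promise<unknown>;
}

async function askModel(
  ai: AiBinding,
  model: string,
  prompt: string,
): Promise<unknown> {
  return ai.run(
    model,
    { messages: [{ role: "user", content: prompt }] },
    { gateway: { id: "my-gateway" } }, // hypothetical gateway name
  );
}

// Inside a Worker's fetch handler, both of these use the same interface:
// await askModel(env.AI, "@cf/meta/llama-3.1-8b-instruct", prompt); // Cloudflare-hosted
// await askModel(env.AI, "anthropic/claude-sonnet-4", prompt);      // third-party proxy
```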
That's awesome. Just in case somebody hasn't seen that or thought about it, let's take a moment here, because this is something that's rad that I think might've snuck through a little bit.
So you can run models that aren't just on Workers AI with the binding now. That is what we're getting at.
So what kind of models are we talking about there?
Yeah. So basically, we've tried to make this initial launch set of models the state-of-the-art models that people are using for their real-world use cases.
So we'll have all the models from Anthropic, OpenAI, Google, all the main LLMs, but we're also really expanding our model catalog to other multimodal models.
So across image, video, and voice: speech-to-text, text-to-speech, and I think we've got a music model in there.
So it's really this full-featured model catalog. Whether you want to work with LLMs and agents, or do generative media, it's all here.
Awesome. And then, developer-wise, how long does it take to switch between those things?
What does that flow look like? Yeah.
If you're calling an image model and you want to switch to a different image model, it's now super easy to do.
You just have to change the model ID. So going from, say, Google's Nano Banana to Black Forest Labs is just changing that string.
And then basically all, or nearly all, of the other parameters just work.
That's awesome. So I think there's a bigger conversation here about what that does, what that feels like.
I mean, I know that in the past, if I'm paying for OpenAI and I'm paying for Anthropic, I've got to submit my expense for this and my expense for that. Which gets us into unified billing, which I think is a great place to jump.
So how does that change the conversation, right?
For a developer like myself, how does that change the conversation for developers and their billing teams?
Yeah. So before, oftentimes people start out with, okay, I'm just going to use OpenAI, and they're doing their thing. But then you realize you need to add on other models, or you want to use a different model for a different part of your workflow.
That would involve opening an account with this other model provider, getting an API key, putting in a credit card, or setting up invoicing.
If you're at a larger company, it might involve going through procurement to get that new vendor approved, and that can be quite a lengthy process.
So instead, now we've greatly expanded the number of providers that AI Gateway supports through unified billing.
And so you don't have to do any of that. You don't have to juggle API keys, save the API key in a secret store, forward the invoices on to your finance team, whatever that might be.
You just have to load money into your AI Gateway wallet.
And then any spend that you do across a bunch of different model providers will just draw down from that wallet.
And so all of it will be on your Cloudflare invoice.
So it's one vendor that you have to manage and get approved, rather than the three to five that you might have to otherwise, or possibly even more.
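The wallet model described above can be sketched as a single shared balance that every provider's spend draws down. The numbers and provider names here are made up for illustration only.

```typescript
// Toy sketch of the wallet model: one balance, drawn down by spend across
// any provider, instead of one invoice per provider. All figures and
// provider names are invented for illustration.

class Wallet {
  constructor(private balanceCents: number) {}

  charge(provider: string, costCents: number): void {
    if (costCents > this.balanceCents) {
      throw new Error(`insufficient funds for ${provider}`);
    }
    this.balanceCents -= costCents; // every provider draws from one balance
  }

  get balance(): number {
    return this.balanceCents;
  }
}

const wallet = new Wallet(10_000); // load $100 once
wallet.charge("openai", 1_250);
wallet.charge("anthropic", 800);
wallet.charge("google", 300);
// wallet.balance → 7650, and everything lands on one invoice
```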
That's super nice. And I guess if you can get everybody using that internally, you get better visibility, right?
So what have we seen from people who have done that? What are some of the insights from people actually running through this unified system?
Yeah, that's exactly it. Obviously it's a great simplification for your billing operations, but also for your observability.
You're not juggling between different consoles and dashboards to look at the behavior, see what errors you're getting on OpenAI versus Anthropic, and try to trace a particular session through a dashboard.
You can really see the entirety of your inference traffic.
And that's just such a superpower, having everything in one place.
So the things that you can see now include your overall inference spend.
For this particular product that I've shipped and launched, how much money is it actually costing me across all of my inference providers?
And then with our analytics, we have this feature where you can pass in metadata alongside a request.
That lets you pass in arbitrary data that you can later filter on.
So one example of a use case here: if you have an application, you might use different models to support different tasks for your free-tier users versus your paid-tier users.
And now you can cut your data by that metadata, free versus paid, and see, okay, how much am I spending on my free users?
How much am I spending on my paid users?
You can even go as far as sending a particular customer's ID through all of their requests.
So you can see how much a particular customer costs at an inference level.
Or if you're running a couple of agents that do certain tasks, you can see that this agent is costing me this amount of money.
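Attaching that metadata can be sketched as follows. AI Gateway's request-metadata mechanism is the `cf-aig-metadata` header; the field names, values, and `gatewayUrl` in the usage comment are arbitrary examples for this sketch.

```typescript
// Sketch of tagging gateway requests with metadata so spend can later be
// sliced by tier, customer, or agent. The cf-aig-metadata header carries
// arbitrary JSON; the field names and values here are examples only.

function metadataHeaders(fields: Record<string, string>): Record<string, string> {
  return {
    "Content-Type": "application/json",
    "cf-aig-metadata": JSON.stringify(fields),
  };
}

const headers = metadataHeaders({
  tier: "free",          // cut spend by free vs. paid users
  customerId: "cus_123", // per-customer inference cost (example ID)
  agent: "support-bot",  // per-agent spend tracking (example name)
});

// await fetch(gatewayUrl, { method: "POST", headers, body }); // gatewayUrl assumed
```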
So yeah, I think it's really powerful to understand how people are using your product, or how your internal users are using your product. Where is your spend creeping up? Maybe that can lead you to swap out different models for a particular task, stuff like that.
So cool. I can only imagine how much people are going to be using that, to exactly that point, right?
You see people saying, hey, it's an expense, and we're not sure exactly how expensive it's going to be. So being able to predict, look at that, and track, oh, the growth is happening in my free tier.
And very easily, like you said earlier, we could change the model to match the free users.
When the free users use this, we want them to use this model; otherwise, we want to use that one. And you can use that data to make those decisions, which is huge.
Yeah. And if you think about AI deployments within a company, and internal use cases, everyone is telling their employees to use AI, and people are building agents and doing all these things.
But the next question that naturally arises is: what is the ROI of all these agents that we've got running around?
Yeah. And I think this is the first step toward answering that question, right?
A company might build a code reviewer bot, and it's good to see: okay, this is costing us X amount of money, but maybe it's prevented this many incidents, or potential incidents, over the last quarter.
But it's hard to have that conversation without knowing the costs of these things.
Absolutely.
Otherwise, to bring that data together yourself, you'd probably use AI to write the report pulling from all the different places.
So having it all in one place, with the same metadata, is huge.
It's super cool. So, this being Cloudflare, we wouldn't have a conversation if we didn't talk about latency.
Let's talk about latency. What's going on in this blog post regarding latency?
Yeah. So, like we said earlier, one of the big things that we're launching is this Workers AI binding where you can call third-party models, but obviously you can still use the binding to call Workers AI models.
And if you're trying to call Workers AI models through AI Gateway, that's a particularly great way to get even lower latency, right?
Because these Workers AI models are within the Cloudflare network. There's no extra hop over the public Internet when you're calling, for example, Kimi K2 through Workers AI.
And so your inference runs on the same network.
And so your agents have even lower latency than if you were to call an external model.
Great. That's awesome. Are there any other hidden gems in this blog post that we should call out?
Yeah. So one thing that I think is notable: if you're building agents using Cloudflare's Agents SDK and you're using AI Gateway, we've made it so that if you're making a streaming inference call, it's resilient to disconnects.
So basically, AI Gateway will buffer the responses as they're generated.
And that happens independently of the agent. So if your agent gets interrupted for some reason, it can reconnect to AI Gateway, find that request that it's already made, and basically read it back, rather than having to make that inference again, which takes time and also costs you money.
You can just restart, and it's as seamless as possible.
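The resume pattern just described can be sketched conceptually: the gateway buffers streamed chunks under a request ID, so a reconnecting agent re-reads from its last offset instead of re-running (and re-paying for) the inference. The class and method names here are illustrative, not the actual API.

```typescript
// Conceptual sketch of stream-resume: chunks are buffered per request ID,
// and a reconnecting client reads back everything after the offset it
// already consumed. Names are illustrative, not AI Gateway's real API.

class StreamBuffer {
  private chunks = new Map<string, string[]>();

  append(requestId: string, chunk: string): void {
    const list = this.chunks.get(requestId) ?? [];
    list.push(chunk);
    this.chunks.set(requestId, list);
  }

  // A reconnecting client asks for everything after the offset it already
  // consumed; no new inference is needed.
  readFrom(requestId: string, offset: number): string[] {
    return (this.chunks.get(requestId) ?? []).slice(offset);
  }
}

const buf = new StreamBuffer();
buf.append("req-1", "The capital ");
buf.append("req-1", "of the United States ");
buf.append("req-1", "is Washington, D.C.");

// An agent that saw the first chunk, then disconnected, resumes with:
// buf.readFrom("req-1", 1)
// → ["of the United States ", "is Washington, D.C."]
```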
Well, that is so awesome. So I guess what we're saying is: use AI Gateway.
Use AI Gateway, use the Agents SDK, use Cloudflare generally.
Yeah, exactly. Exactly. Ming, thank you so much for doing this.
Congratulations on this first blog post. It's amazing. It's really great. I hope that everybody goes and reads that.
Any last words of wisdom, Ming, before we go? Anything you want them to do?
I mean, you just wrapped that up really nicely.
Any last thoughts? Yeah. I mean, so many great things are coming out during Agents Week.
This is actually my first Innovation Week at Cloudflare, too.
So it's so cool seeing all the new stuff that we're launching.
There are so many great, awesome things. So read the blog, try the new stuff.
We've got some great things cooking.
Awesome. Awesome. Thank you everybody. And we will see you next time.
For AI Builders
Access 70+ models from top providers through a single API, built for speed and reliability.

Agents Week
Join us for Agents Week 2026, where we celebrate the power of AI agents and explore how they're transforming the way we build, secure, and scale the Internet. Be sure to head to the Cloudflare Agents Week Hub for every announcement, blog post, and...
Watch more episodes