Observatory and Smart Shield: See your site globally, and make it faster
Presented by: Tim Kadlec, Brian Batraski, Noah Kennedy
Originally aired on September 26 @ 1:30 PM - 2:00 PM EDT
Welcome to Cloudflare Birthday Week 2025!
This week marks Cloudflare’s 15th birthday, and each day this week we will announce new things that further our mission: To help build a better Internet.
Tune in all week for more news, announcements, and thought-provoking discussions!
Read the blog posts:
Visit the Birthday Week Hub for every announcement and CFTV episode — check back all week for more!
Transcript
Hey everyone, thanks for tuning in. Very excited to be talking about the launch of Observatory and Smart Shield today. My name is Tim Kadlec, I'm a principal product manager at Cloudflare working on observability, and with me are Brian... Hello everybody, my name is Brian Batraski. I'm a principal product manager as well here at Cloudflare, responsible for our inter-colo connectivity and egress connectivity here at Cloudflare.
And we also have Noah with us.
I'm Noah Kennedy.
I'm a senior systems engineer here at Cloudflare.
I'm one of the engineers responsible for Origin Fetch and CDN Cache, and the lead on a project called Snowball, which we have used to deliver a lot of connection reuse improvements and will be using to deliver a lot of additional internal improvements, both to the resiliency of our own infrastructure and also to customer performance, over the next number of months and years.
Now, I'll start off with the first question here.
The blog post emphasizes that modern users expect instant and reliable web experiences.
So why are observability and performance tools like Observatory and Smart Shield more important now than they were before?
How have user expectations shifted and what are the consequences for businesses that fail to meet them?
Performance and reliability have always been important for anybody who's had a digital presence. It's table stakes.
Right.
Like you have to have a site that performs well. You have to have a site that's available. And we've seen over the years just a multitude of studies that shows that if either of those things are out of whack, everything about the business suffers.
Bounce rate goes up, conversions go down, you know, brand reputation goes down.
So there's, you know, it's critical to get that digital experience nailed. And, you know, over time, my favorite way to talk about this, I guess, is like, you know, users have had rising expectations.
Like we all go onto a website, and we're not expecting just the text-only pages that we used to get in the 90s, right? We want rich, high-fidelity designs, and to do that requires more code, more resources on the pages, which means heavier pages. But at the same time, we're not expecting those sites to be any slower. We're expecting them to be faster than they ever were. We used to be able to fire up the dial-up modem, go get a cup of coffee, come back, and the page would be loaded and I'd be happy. That's not the way it works now. We want better experiences, and we want them faster than ever.

And so the pressure's on, I think, for every company that's got a digital presence to nail that, and to do that means you have to be able to understand the current state of things and how you make things better, and that's still complicated. And that's one of the things, Brian, you and I chatted about this quite a bit as we were working on these two products: that one-two punch of insight and remediation that we're really trying to tackle here. What is the difference there? How does it change for a developer in terms of having both diagnostics and solution together in that single view?
As you just mentioned, the expectations of end users today are growing every single day, month over month, year over year.
When's the last time you've seen Google go down, right?
And we're actually seeing the expectation, the amount of performance and availability that applications need to deliver, reach a higher benchmark every single year.
And so when we think about tools, large categories especially for developers, in terms of observability into their applications, what that means for being able to solve problems that impact their users, or incident response: you need to have rich data that helps you identify the problem quickly, but then be able to solve that problem without having to go through hours of additional config that could potentially be fraught with risk and lead to another issue. And we do see many providers today that are giving analytics to try and solve this problem.
But what we don't see is the third part of the feedback loop: being able to say, I've been able to find a problem.
I've been able to make a quick update and implement something that will solve or mitigate that problem.
But then what about validating the impact of it, and making sure that the metrics that are important to us are actually moving in the right direction?
And so we're really proud to bring to market this new one-two punch and add that third bit of the feedback loop for all developers. Whether it's solving a problem that you've just gotten a ticket for or responding to an incident, you can quickly identify what the problem is through Internet Observatory with those rich analytics, go to Smart Shield with recommendations that point you to where the solution likely should be, based on the information that we already have, to help guide you towards that resolution and drive down the time to resolution, and then get that extra bit of assurance to be able to say, in real time, I've turned on a traffic accelerator to make sure my time to first byte goes back down.
Now I can see that happening, and it's time to show my other engineers, my peers, my management, my customers that this is working and going in the right direction, and what the time is to see full propagation or whatever else.
And so I think it's very powerful.
I hope, you know, others start to follow this pattern, because having that rich observability and one-click solutions, and making sure you're seeing that impact right away, is so valuable and important.
But when we think about the types of data that we need to work with, you know, the blog post that we wrote makes a really strong case for using RUM, or real user data, as the source of truth.
Why is RUM considered more valuable than synthetic data?
Even though synthetic data can provide a consistent baseline, what are the unique insights that RUM, or real user monitoring, provides that synthetic testing simply cannot?
My experience has been that most companies need both, right?
Like synthetic data and RUM data, they are both necessary, for different reasons, right?
So synthetic data, it is that baseline that you mentioned, right?
Like it's the cleanliness of the data because we have a defined way of testing.
We're testing under the same scenario.
You know, we have a lot less variability, and because there's not a real user involved, we can go very, very deep in terms of what we're actually collecting about the page load or the network or whatever, without having to worry about infringing on any sort of user journey or having some sort of an observer effect or anything like that. But if we're overly reliant on synthetic testing, we're going to miss a lot of things, because synthetic testing is only as good as the environment it's tested in, and how well that represents reality.
And it's only testing the exact endpoints and the exact scenarios that you set up, right?
There are all these countless scenarios. When people are accessing our sites or applications, they're coming from everywhere, under every different scenario, every network, every location possible, different devices.
There's no way we could ever set up a synthetic test suite to capture all of those different variabilities, right?
It just doesn't happen.
So with real user data, we're seeing what happens when reality hits, when the rubber meets the road, right?
Like it is going to capture everything, the good, the bad, and in between, even stuff that we're not expecting.
So that's why we have real user data as that source of truth, because synthetic data, even in the best scenarios, is just a hypothetical.
It's a lab-conditioned environment that we're hoping simulates reality.
But when we're looking at real user data, that's where we find out actual traffic patterns.
It's where we start to identify that, oh, you know, in one scenario, when the screen size is within a certain range, our LCP suffers. If we're coming from a certain geography, maybe the network telemetry is telling us that we're actually having a lot of performance issues there too. And that's the only way that we're going to surface those kinds of insights. So that's why we're putting that front and center inside of Observatory. And again, the synthetic data is still there. We still think it's important to be able to run those tests, to do that active monitoring, and to give you sort of that clean slate of comparison. But yeah, it's got to start with the RUM data. That's the source of truth.

We're also using those metrics, the RUM metrics, prominently in how we're laying out the dashboard on that overview page. A lot of thought actually went into what signals we're surfacing and where. So we start at the top of the dashboard with Core Web Vitals, because that's coming from real user data and it is user-facing. Those are the things that most directly reflect what the user is experiencing. Now again, if you're testing API traffic and stuff, that data won't show.
You're going to be focused on your network telemetry.
But for anybody who's got sites or applications that users are directly accessing, Core Web Vitals, you know, Largest Contentful Paint, Cumulative Layout Shift, Interaction to Next Paint, those are the metrics that you want to be focusing on, because they have a direct correlation to what the user is actually doing on your page.
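As a rough sketch (not Observatory's actual implementation), Google's published "good" and "needs improvement" boundaries for these three Core Web Vitals can be encoded and applied like this:

```python
# Google's published Core Web Vitals thresholds: (good, needs-improvement)
# boundaries. LCP and INP are in milliseconds; CLS is unitless.
THRESHOLDS = {
    "LCP": (2500, 4000),
    "INP": (200, 500),
    "CLS": (0.1, 0.25),
}

def rate(metric: str, value: float) -> str:
    """Classify a single metric sample as good / needs-improvement / poor."""
    good, poor = THRESHOLDS[metric]
    if value <= good:
        return "good"
    if value <= poor:
        return "needs-improvement"
    return "poor"

print(rate("LCP", 1800))  # good
print(rate("INP", 350))   # needs-improvement
print(rate("CLS", 0.3))   # poor
```

In practice these metrics are assessed at the 75th percentile of real-user samples rather than per page load, which is exactly why RUM data is the natural input here.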
Tim, actually, another question I have thinking about what you just said.
This blog highlights a lot the connection between TTFB, which is Time to First Byte, and Largest Contentful Paint, or LCP as we tend to call it.
Can you explain this in a bit simpler terms for folks and why it's important to break down these sorts of black box metrics into simpler subparts?
Okay, if we're starting with what the user experience is, and again, we're going towards, as Brian mentioned, we want remediation.
We want you to understand not just what the challenge is, but how do I fix it?
It's not enough to know that like, oh, LCP is slow. We want you to know why. Well, one of the strongest correlations to LCP is time to first byte.
We actually did some data analysis on this, where we looked at about 9 billion data points collected by our web analytics RUM solution.
And what we discovered is that if you have a good TTFB, like using the thresholds that were defined by Google, a good time to first byte is 800 milliseconds or less.
A good Largest Contentful Paint is 2,500 milliseconds or less. If you hit that good TTFB threshold, you are about 70 percentage points more likely to have a good LCP. Basically, there's nothing better that you can do to help guarantee that LCP is going to be good than making sure that your back-end telemetry is giving you good numbers and that you're getting good results there. So that's why that's the next thing in line. Then below that, we're giving you some information about the health of how you're actually using Cloudflare, the origin, the edge: so cache hit ratio, error rates for edge and origin.
And again, it's a very deliberate choice. These are things that have very direct impact and strong correlation to your time to first byte metrics.
So if you're coming down the chain and you're saying, oh, LCP is a little slow, well, I can see time to first byte is struggling.
Now you're seeing the cache hit ratio, for example, and I can see that we have a lot of requests that are being served from the origin. That's a very strong indication of why the time to first byte is as slow as it is.
About 92%, 91.7% to be exact, of the 9 billion requests that we looked at, when they're served from Cloudflare's cache without having to talk to the origin, are under that 800-millisecond time to first byte threshold.
And it's about 79.7% when we have to talk to the origin.
So it's not perfect, but it's okay.
But you can see that there's just a dramatic impact if we can guarantee that we're going to hit the cache.
So there's a lot of thought that went into laying out that page and the order of those metrics, to make sure that we are giving you diagnostic information, that we are giving you a way to understand what's going on and remediate those issues.
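To make that kind of breakdown concrete, here's an illustrative sketch. The field names and sample values are hypothetical, not Observatory's schema: given a list of RUM samples, you can compute the share of page loads with a good LCP among those with a good TTFB, and the share of cache-served requests that land under the TTFB threshold.

```python
# Illustrative only: conditional breakdowns over hypothetical RUM samples.
GOOD_TTFB_MS = 800   # Google's "good" TTFB threshold
GOOD_LCP_MS = 2500   # Google's "good" LCP threshold

samples = [
    {"ttfb": 120,  "lcp": 1400, "cache_hit": True},
    {"ttfb": 640,  "lcp": 2600, "cache_hit": True},
    {"ttfb": 1500, "lcp": 3900, "cache_hit": False},
    {"ttfb": 700,  "lcp": 2100, "cache_hit": False},
]

def share(rows, pred):
    """Fraction of rows matching a predicate (0.0 if rows is empty)."""
    rows = list(rows)
    return sum(pred(r) for r in rows) / len(rows) if rows else 0.0

good_ttfb = [r for r in samples if r["ttfb"] <= GOOD_TTFB_MS]
print("P(good LCP | good TTFB):",
      share(good_ttfb, lambda r: r["lcp"] <= GOOD_LCP_MS))
print("P(good TTFB | cache hit):",
      share((r for r in samples if r["cache_hit"]),
            lambda r: r["ttfb"] <= GOOD_TTFB_MS))
```

The real analysis ran over billions of samples, of course; the point is only that each dashboard tier (Core Web Vitals, then TTFB, then cache hit ratio) conditions naturally on the one above it.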
Cloudflare has a lot of machines in our global network, and oftentimes those machines may not be serving stuff that's even cacheable in the first place.
API traffic, for example, where you are hitting your internal API, that's not usually cacheable.
That is usually going in and querying a database, finding some data, returning it back, and then using that to help render a web page. Or stuff that is service to service, where one API client, which might be one production service you have, is hitting Cloudflare and requesting data from another internal API that you have.
That sort of stuff isn't cacheable, and that's stuff where basically every time you make a request, we have to go fetch it from the origin. And so when you break down where that time is spent, it's oftentimes at the tail, the worst couple percentage points of requests, which tend to actually be the ones that matter the most for loading a web page. It tends to be on connecting to the origin, because we try to reuse connections when we can, but if there isn't an available connection to an origin, then we have to create a new one. And that involves multiple round trips between a customer's servers and ours, which might be a fair bit of time. If we are connecting to an origin that is in Ashburn, Virginia from a site that we have in Chicago, that's a double-digit number of milliseconds. And if you have to do one, two, three round trips before you can even make the request, that adds up.
And that adds up to slow experiences for users, which, as we've already talked about before, is not very good for customer businesses as sites start to feel sluggish and slow.
So connection reuse ends up being a really, really important thing.
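The round-trip arithmetic Noah walks through can be sketched as a back-of-the-envelope model. The 18 ms RTT is an assumed Chicago-to-Ashburn figure for illustration, not a measured value:

```python
# Back-of-the-envelope cost of a cold HTTPS connection: the TCP handshake
# plus the TLS handshake each burn a round trip before the request can even
# be sent; a reused (warm) connection pays neither.
RTT_MS = 18          # hypothetical edge-to-origin round-trip time
TCP_RTTS = 1         # TCP three-way handshake
TLS13_RTTS = 1       # TLS 1.3 full handshake (TLS 1.2 would be 2)

def time_to_first_request_byte(reused: bool, rtt_ms: float = RTT_MS) -> float:
    """Milliseconds of pure network delay before the origin sees the request."""
    handshakes = 0 if reused else (TCP_RTTS + TLS13_RTTS)
    return (handshakes + 1) * rtt_ms  # +1 RTT to actually send the request

print(time_to_first_request_byte(reused=False))  # 54.0 ms cold
print(time_to_first_request_byte(reused=True))   # 18.0 ms warm
```

Even in this best case (TLS 1.3, no TCP retransmits), a cold connection triples the network delay before the first byte of the request arrives, which is exactly the tail-latency effect described above.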
And historically, we've actually tried to address this by having shared connection pools across sites. The trouble with that is the way we would do this: we would basically pick a subset of all of the machines that we had within a site, and we would start fanning traffic for any given customer site into those machines for the final origin fetch. This does improve connection reuse. However, there's a bit of a trade-off here. You might have a large social media or e-commerce site who has a ton of traffic, where we have to actually scale these out across a whole bunch of machines in order to support their load. And the number of machines that's going to work for serving that customer is going to be very, very different from a small mom-and-pop shop, or honestly even most enterprise websites. It turns out most enterprise websites, to our customers who don't specialize in handling all of the world's Internet traffic, they sit there and think, oh, I have a lot of traffic that I'm handling. But to us, a machine or two in every data center that we have is easily enough to serve them, and probably a lot of other customers too. And so we don't actually need, if you think about it, to send them to as many machines. With connection reuse, the math is very much that you have to hit a certain critical mass before things start to look good. It really matters quite a bit how much traffic you're able to send to any given connection pool until you hit that mass.
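That critical-mass effect can be illustrated with a toy model, which is my own simplification rather than anything from Snowball's internals: if requests arrive as a Poisson process and idle connections stay open for T seconds, then concentrating the same traffic onto fewer machines sharply raises the odds that a request finds a warm connection. All the numbers here are illustrative assumptions.

```python
import math

# Toy "critical mass" model: with total_rps requests/sec spread evenly over
# n machines and connections kept alive for idle_timeout_s seconds, the
# chance a request finds a warm connection on its machine is
# 1 - exp(-rate_per_machine * idle_timeout_s) under Poisson arrivals.
def warm_hit_rate(total_rps: float, machines: int, idle_timeout_s: float) -> float:
    per_machine = total_rps / machines
    return 1 - math.exp(-per_machine * idle_timeout_s)

# Same origin traffic, three pool sizes: fewer machines, warmer pools.
for n in (100, 10, 1):
    print(n, round(warm_hit_rate(total_rps=5.0, machines=n, idle_timeout_s=10.0), 3))
```

With 5 requests per second, fanning traffic across 100 machines leaves most requests paying a cold handshake, while concentrating it on one machine makes reuse nearly certain, which is the intuition behind sizing pools to the origin's actual load.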
And so one of the big changes that we've made over the past few months is we have made all of these connection pools, these pools of machines, dynamically sized.
So we'll basically look at how much load is actually going to any given origin, and we will figure out how many machines are needed to serve that origin's traffic based on how much traffic that actually is. So most customers now have all of their origin traffic directed to one machine at a time within a site, and so they get really warm connection pools and really great connection reuse performance. We're finding places where we have warm paths to serve their traffic from, and we've seen pretty substantial improvements in reuse. As we've turned this on in individual sites, we've seen drops in the number of connections needed to serve customer traffic by as much as 60%, which actually kind of surprised me when I saw it. I was not expecting numbers quite that high, but when I looked further at how much dynamic traffic was going through that site and the traffic patterns, it actually made a lot of sense. What was going on was basically all API traffic to us-east-1, which, okay, figures. So yeah, we've done a lot of improvements on the back end with basically how we route traffic and how we decide where to send it within the site, and that has led to these pretty substantial improvements in customer performance.

So on that note, actually, Tim, I have a question to go back to you. This post talks a lot about future plans for Observatory and Smart Shield, particularly, of note, benchmarking against other products and collecting deeper, and honestly just more in general, diagnostic data. What's our ultimate vision for this? How do we plan to make these single sources of truth for application health, and what does that really mean for the web performance industry as a whole?

I think there are two things that we're really excited about with this. The fact that we are the platform in this situation: we have metrics and analytics from darn near every angle about our users' applications, right? We can tell you how things are
performing. We can tell you the availability issues that might be happening. We can get into deep telemetry and tracing and logs. There's a large amount of data, and it's all under one umbrella, where we can actually start connecting the dots. That's pretty atypical. I think most companies struggle with this, right? Because it's one tool for this, one tool for that, this data is over here, and this is another silo there. And then you try to connect the dots, and you're always operating with, it's a bit of the blind men and the elephant analogy, right? Depending on whether you see the trunk or the tail or the body, you think it's something different. That's the situation that companies have to fight with. We can fix that. We have the data to be able to put that complete picture together.

And that's one of the things that we've challenged ourselves on with this product: everything that we add to it, every data point, we don't want to get into a situation where we're adding a metric just because we have it. We want to know why that matters, and we want to be able to give you a way to remediate it. Because that's the other piece: data is easy. Pretty charts, that's easy. The hard part is, what the heck do I do about this? And again, because we are the platform, we have the ability to not just tell you where these problems lie and help you connect those dots and see the complete picture. We have the ability to fix it. We have the ability to let you, through different features that we may have or product offerings, or maybe it's deploying a Worker here to patch something.
There's so many options that we have for how we can help you actually remediate.
Like we can give you that full circle in one spot.
And that to me is, that is the, that's the holy grail of performance monitoring and observability, you know, and that's what we're after here.
And that's what we can provide.
And we're in a unique situation to do it.
And so that's what we'll be working on as this product grows. As we continue to add to it, we'll be looking at, again, what other diagnostics can we give you, what other fast solutions can we give you, and how do we make sure that when you are using one of those solutions, you have confidence that it's having the impact you would expect. Those are the things that we're going to be building towards with this product.

And I am so excited to see this all come to life. Thank you, everybody, for taking the time today to tune in with us. We're so excited to bring the rich capabilities of Internet Observatory and Smart Shield to you, and I'm so excited to see what you're going to build with it and how it can impact your end users. Thank you so much, and have a great rest of your Birthday Week!