Monitoring at Scale: A Conversation with Cloudflare and Chronosphere
Presented by: Mia Wang, Colin Douch, Alexander Huynh, Rob Skillington
Originally aired on May 26, 2022 @ 5:00 PM - 5:30 PM EDT
Join Cloudflare's Mia Wang, Colin Douch, Alexander Huynh and Chronosphere's co-founder, Rob Skillington for a discussion on monitoring and observability at scale.
Transcript (Beta)
Hi everyone. We have a fun half an hour coming up. We've got a few of us from Cloudflare as well as a special guest with us.
I guess we'll quickly just introduce everyone and then get right into it.
We have the next half an hour to dig into monitoring and observability at scale.
So from the Cloudflare side, I'm Mia Wang. I'm on our special projects team.
We have Colin Douch, a systems reliability engineer here.
And then we have Alex Huynh, who's an engineer at Cloudflare. And then we have Rob Skillington, who's the co-founder and CTO of Chronosphere.
I won't be able to do this justice, but Chronosphere is a cloud native monitoring platform powered by M3.
And some of you watching might recognize the name and know that it's Uber's metrics platform.
And so Rob and his co-founder Martin had built and open sourced M3 during their time at Uber before striking out on their own and starting Chronosphere in the last year or two.
So we have a nice mix of people here who have experience building and implementing monitoring capabilities as well as sort of the end consumer end user of those tools.
So should make for hopefully a fun conversation.
Before we really dive into it, I think a good place to start is to talk about how each of us kind of approach building our monitoring environments in general.
So maybe Colin, one place that would be great to start is as you're thinking about monitoring at Cloudflare and making sure we have visibility into our network globally and reliably, how do you think about architecture?
What are sort of the key capabilities that are top of mind for you? And then what are some maybe limitations that we just kind of have to work with?
Sure.
So the main sort of architectural problem we have is that Cloudflare can be effectively split into two types of environments.
We have our edge data centers and our core data centers.
And our edge data centers are a lot more constrained around what capacity they have and what connections they have, because a data center in San Jose is just naturally going to have more connectivity than our data center in Djibouti, for example.
So the way we generally think about our edge is that they operate as sort of bastions, as individual silos.
So all of our infrastructure has to operate within one location, which generally lends itself to a certain level of failure when that one location goes down.
So we have to design with that in mind, really.
On our core side, we have a lot more flexibility. And because we operate multiple core data centers, we can have a lot more reliability, because we have the ability to have, for example, a Prometheus server in both locations.
And we can federate between those and federate our metrics up from our edge data centers, as long as they have connection to one of those two locations.
And Rob, for you, when you were at Uber, did you think about it sort of similarly?
Obviously, a very, very different business from Cloudflare, but certain similarities in terms of just sheer volume of data and the kind of global nature of it.
But what were the things from an architectural perspective that were top of mind for you at Uber?
Yeah, that's a great question. We were just, before kicking off here, exchanging some war stories about how we both run or have dealt with on-premise data centers and managing the fleet of servers and all that's involved there.
A lot of talk of IPMI.
So definitely just monitoring these physical servers was a large thing.
And also the fact that it's not just the servers themselves, it's the racks, it's the network infrastructure.
We were talking about the temperatures of all the hardware and them being within safe operating limits.
So first, it's just how many things that are very diverse in nature and how to centralize all that monitoring in one place so that you're not going crazy with 30 different systems to monitor the different layers of everything that you're running.
So I think that was honestly the biggest thing we had to get correct was one way to collect everything, one way to store everything.
And then once we had that, it was about, okay, teams are onboarding at an unprecedented rate.
And how do we manage that without having to go and put the burden on the teams themselves into mapping their metrics to a Prometheus server or something of that nature.
So we ran Prometheus at Uber, but it was always backed by M3.
And so M3 is an open source Prometheus remote storage, and that can scale out horizontally by just adding servers.
And so for us, it was like, centralize everything, put everything in one place.
And then also how do you make that one place scale horizontally across the globe, obviously.
And if you look at what we did there is we federated the queries out so every local zone would be its own isolated little sphere.
But you could globally federate the data across that at query time, similar to how Cloudflare is federating that data from multiple locations into a single place so they can query it in one place, but architecturally a little bit different with how we achieve that end goal.
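The query-time federation described here can be pictured with a small sketch. It is only an illustration, not M3's actual query engine: the zones are in-memory dictionaries, and the names `ZONES`, `query_zone` and `federated_sum` are made up for the example. Each zone keeps its own data, and the global answer is assembled only when someone asks for it.

```python
from collections import defaultdict

# Hypothetical in-memory zones: each maps a series key to {timestamp: value}.
ZONES = {
    "zone-a": {("http_requests", "svc=api"): {0: 5.0, 60: 7.0}},
    "zone-b": {("http_requests", "svc=api"): {0: 3.0, 60: 4.0}},
}


def query_zone(zone_data, metric):
    """Return every series in one zone whose metric name matches."""
    return {key: points for key, points in zone_data.items() if key[0] == metric}


def federated_sum(zones, metric):
    """Fan the query out to every zone and sum matching points per timestamp."""
    merged = defaultdict(float)
    for zone_data in zones.values():
        for points in query_zone(zone_data, metric).values():
            for ts, value in points.items():
                merged[ts] += value
    return dict(sorted(merged.items()))


result = federated_sum(ZONES, "http_requests")
print(result)  # {0: 8.0, 60: 11.0}
```

Nothing is copied between zones ahead of time, which is the difference from federating samples up into one central store.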
Can you give us a sense of actually one thing I sort of mentioned that's similar between maybe where Uber was back when you were there and where Cloudflare is today.
So the size and scale of our network just increases at sort of a blinding rate.
And so when you're planning things out at the beginning and by the time you're implementing, the scale problem might be really different.
Can you give us a little bit of a sense of at Uber, some sense of just what scale looks like, whether it's from an amount of data or latency perspective or global nature perspective, how giant was the undertaking?
Yeah. I would love to hear some numbers on the Cloudflare side as well, just to kind of get a sense of, I'm sure similar journeys and similar war stories can be shared just due to the very probably similar nature and growth.
But to put some concrete numbers to it, a lot of the time I think of it in samples or metrics collected per second is a great way to think about it.
Just because anyone who runs a Prometheus usually knows how many samples per second they're collecting.
So at Uber, when we basically had the first version of M3, which was backed by Cassandra and Elasticsearch in a central deployment, that was collecting roughly 100,000 to 300,000 metrics per second.
And that was across all the data centers globally.
And then by the time I left, we were doing 40 million samples per second.
So that was a pretty massive increase across the three, four years there.
I would say going from 100K to 1 million was one journey on which Cassandra and Elasticsearch kind of just got there.
And then the wheels were falling off as M3DB went into production.
And then getting M3DB from 1 million to 10 million writes was its own journey.
And then getting, especially since some of the longer term data was kept in Cassandra and Elasticsearch still, once the real time layer had been done.
And then, so getting from 10 million upwards was really the next journey on that.
And we had hundreds of millions of time series.
And by the end of it, we had 10 billion unique time series and more than a petabyte of raw disk space being used by the monitoring cluster.
So that's kind of a sense of the journey and the leapfrogging that we had to do by building M3DB, as Cassandra and Elasticsearch kept us going up until about a million metrics per second.
Colin, similar overarching experience, I'm guessing?
Yeah. So when I first joined at Cloudflare, we were using OpenTSDB to store most of our samples and we were just sort of starting to push things into Prometheus.
And at that point, we were doing in the order of hundreds of thousands of data points per second in our largest data centers.
And there was no real thought of how we could possibly aggregate that up.
In our largest data centers now, we're pushing several million.
And that's really starting to push the limits of our Prometheus and we're sort of scaling it up by adding more RAM and things like this.
But it's definitely hitting those limits, I think. I did see an upstream ticket about Imgur running a one terabyte node recently.
Actually, I caught up with Austin. He's actually at Robinhood now, previously at Imgur.
We actually had a conversation about mmap and monitoring memory in a process and knowing when it's out of memory, which is actually a very complicated problem because Linux memory has so many different categories, and how do you account for it all, and how does it actually impact your system when you're using it.
And that was actually a really interesting conversation that came out of that.
But yeah, interesting tidbit.
It's a small world. Pardon?
Yes, go ahead. Go for it. Oh yeah, yeah. So what I wanted to add was that some of the architectural challenges that we hit here at Cloudflare also as I work with current existing teams pushing metrics in and also new teams wanting to deploy things.
So when we talk about architecture for those specific instances, it's like, well, they want to expose everything, right?
But in reality, you can expose everything locally but you can't bubble that up globally because the cardinality would just be way too high.
It would overload a central point. So then how we've designed it here is a little more like a tree structure where you have, as Colin mentioned, many edge PoPs and all those get bubbled up locally.
And then from there you have a view of all of the worker machines right then and there.
And then some of those select metrics, which we have to go back and forth on with both Colin on the observability side and the engineering teams who want their visibility, get aggregated and filtered up such that you get a high level overview at the global level and then many different slices on a data center level on our end.
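The tree structure described here comes down to dropping labels as data moves up each level. The sketch below is a minimal illustration under stated assumptions: the label names (`colo`, `host`, `status`), the location codes and the `aggregate` helper are hypothetical and are not Cloudflare's actual pipeline or Prometheus recording rules.

```python
from collections import defaultdict


def aggregate(samples, keep_labels):
    """Sum samples after dropping every label not listed in keep_labels."""
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple(sorted((k, v) for k, v in labels.items() if k in keep_labels))
        totals[key] += value
    return dict(totals)


# Hypothetical per-machine samples as seen inside the edge locations.
edge_samples = [
    ({"colo": "sjc", "host": "m1", "status": "200"}, 10.0),
    ({"colo": "sjc", "host": "m2", "status": "200"}, 12.0),
    ({"colo": "sjc", "host": "m2", "status": "500"}, 1.0),
    ({"colo": "jib", "host": "m1", "status": "200"}, 2.0),
]

# Data center level: keep the location and status, drop the per-host label.
colo_view = aggregate(edge_samples, {"colo", "status"})

# Global level: keep only the status, so cardinality stays small centrally.
global_view = aggregate(edge_samples, {"status"})

print(len(edge_samples), len(colo_view), len(global_view))  # 4 3 2
```

Each level keeps fewer labels than the one below it, so the central store only ever sees the coarse view while the per-machine detail stays local.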
You actually, Alex, touched on a point that I think might be really interesting to a lot of our audience, which is the kind of role you're in, right?
You probably end up being the conduit or facilitator or middleman between product, engineering, Colin's team.
And so if you think about your time here, five years or so, right?
Cloudflare was a very different company back in the day.
And is there anything you would have done differently in terms of, are there things you would have encouraged our product or engineering team to think about more proactively when it comes to monitoring?
Are there things you wish you maybe relayed to folks on Colin's team more clearly?
What would someone, if they were kind of in your shoes at a super high growth startup, what should they sort of think about now?
Yeah, that's a very good question. To be fair, though, I think the steps that we have taken were pretty good.
Like it's like, of course, hindsight is 20/20 and you don't know, like it's like looking back, you're like, of course, like we should have done this better.
We should have done that better. But given that we didn't hit those operational constraints, like OpenTSDB failing to ingest all of the things that we wanted to push into it, that was born out of a natural progression, right?
So before it's like we only needed one central store and that one central store could handle all of these metrics.
And then as we moved from even before like OpenTSDB, we had Nagios that was used for like alerting.
That was feeding everything in and like the sheer volume was too high.
So we needed something more, and Prometheus fit that description better.
So as we migrated everything over, those Nagios problems were now OpenTSDB problems.
Like we've just moved the problem.
And during that second time, when it's like, okay, what else can we do?
OpenTSDB into longer term storage, Prometheus, et cetera. That became the point I'm like, all right, how do we avoid that mistake?
And how can we do it better?
So I guess like for it, I wouldn't really change the way we've done things.
It's just like we lean on the shoulders of giants. Like when we looked for examples of better orchestration and whatnot, we looked at Google and other large companies because they've hit these problems and they've scaled through.
So we should also like borrow that and then scale through.
And hopefully like what we show here, like can also help other people that when they have hopes of scaling up their metrics can do so.
Yeah, go ahead Colin. I think the key point there is to just not over-engineer too early.
Yeah. Like we got a working solution and it worked until it suddenly didn't.
And then we built a new one. There's a lot that can be done, particularly in an early stage startup.
There's a lot that can be done with the time you would otherwise waste over-planning something.
Yeah. I just want to reiterate that building on the shoulders of giants is a hundred percent what needs to be done.
Every time I talk to any monitoring team at different companies, like what we did at Uber is an anomaly and should be an anomaly because the most value you bring to the company as a monitoring team is translating exactly what your systems look like into the data model of some existing technology ideally and making sure that you can explore and monitor and track all of that system that you are running in production to the monitoring systems you're using.
And to be a bit more explicit about that, the last mile of hooking up these little data pipelines to your systems is really what derives the most value.
Because as Alex alluded to, just mapping Nagios data in a really naive way one-to-one into OpenTSDB or wherever else, Prometheus for instance, leaves you with the same problem.
It's really like how do you actually capture what's going on in the best way that's understandable and make that actionable and float up the right data at the right time to people.
That's really what honestly you're solving. And I think that Colin and Alex and myself are just lucky enough to have also had to work on some systems level stuff to get there due to the scale of the companies we're at.
But yeah, I would say that building on the shoulders of giants is 100% necessary.
And Prometheus obviously, isn't it?
I know that a lot of people think Kubernetes is overkill, but things that are staples, like Kubernetes, like Prometheus, that are providing really great building blocks for people to do great things out there, I think are really exciting.
My granddad's a civil engineer and they have way more processes and they know how to build a bridge really reliably.
And I feel like software is still getting there.
And I'm really excited about these building blocks and pillars that are kind of now accessible to everyone.
This kind of agreement between the three of you that you shouldn't worry about over-engineering your metric system is an interesting one because in my role, I spend a lot of my time talking to early and growth stage startups, and there's often this temptation to sort of future-proof their infrastructure.
And most of them after six months of pain or whatever amount of time, they come to kind of a similar conclusion, but it's tempting.
I mean, of course, you should think about the future. And that's not to say only look like 10 feet ahead, but it's also like, of course, you should have the dual purpose of like, I see the long-term problems, but I have the immediate problems in front of me.
It's just like, how do I stop the bleeding here?
Or it's like, if these machines are running slow, I need to report on like, is the CPU being throttled?
Is it something else? Maybe there's power spikes, right? Like you need the immediate fix and not really worry about like, how do I do more like mesh service orchestration?
Like, yes, that is important for sure, but also keep in mind like immediate tasks.
And going back on like the leaning on shoulders of giants is more so Google and them have developed Kubernetes already because they've been through these problems.
And then like, thankfully it's, we can just like pluck it out, use it and whatnot, such that we get to reap those benefits without having to go through all that pain, like to rediscover that.
Most definitely.
I would say that, I mean, case in point, the amount of Google papers that are published and turned into systems like Hadoop and whatnot, it is super useful to use.
And the community that even just, especially in the monitoring world, that has helped each other, I think has been really productive.
Actually, one of the big reasons why M3 was open sourced at Uber was the hope that there would be less reinventing the same wheel at different companies.
Like in-memory time series databases tend to be actually fairly popular at big tech companies.
There's Atlas at Netflix, there's Gorilla at Facebook, there's Monarch at Google, and they're all proprietary.
Well, that's not a hundred percent correct actually. Gorilla was open sourced as Beringei and then later deprecated in open source land.
And then Atlas is open source, but only has in-memory storage, which I think is just one of many that Netflix may use along with a few other things they use on top of that.
But yes, anyway, that's kind of a tangent. The point of looking at what's out there, is what you're doing sane?
Are you going too far? And having those discussions early on is very healthy because there's a lot of prior art.
It's actually really hard to find, I would say.
I think your Googling skills and your community, Twitter networking, and just kind of like exploring the spider universe of like actually what's happening in this space is not a trivial thing to do.
It takes time and talking with people to really fully get the space.
I think, I don't know if you would agree or disagree.
Oh yeah. I feel like finding the right resources is a skill on its own.
And there's something to be said about that. But then once people do find a community that's willing to not only like bounce ideas off of, for like, I have this metrics problem, whatnot, that could turn into potential like bug reports on GitHub, which then could be turned into pull requests.
And at the end of the day, like your feature is now mainline. So it's shifting it forward for everyone.
And throughout, I think like the evolution of software engineering in the industry, it's more so we're now moving more towards like open plans and designs or like the Open Container Initiative.
You have open hardware specs, all the things that we want to push out for everyone to use such that we have one unified platform.
Yeah. Yeah. I know that we talked about OpenMetrics just before this call and yeah, I contribute to that project and I'm very excited about that.
The HTTP for metrics being a thing I would love to see. Yeah. Standardization is what really moves forward everything.
Well, one of the things you guys all sort of touched on is in your roles, you can't do what is what makes sense only in sort of the perfect world, right?
You have to balance a million different things among them, you know, maintaining availability, enough redundancy, but not too much balancing cost, making sure it actually serves sort of whatever kind of business case there is.
So how, I guess we can start with how do you think about managing all of these different parameters?
Yeah.
To some extent you've always got to sort of customize any solution to a given problem, right?
But there is also a lot of not invented here syndrome in our industry, which is a bit of a problem in that I think a lot of companies don't realize that their problems are probably not unique.
And this goes back to what we were saying around sort of Google and things going through these sort of issues.
Too few people realize this, that the solutions are out there.
Yeah. Yeah. So in other words, to Alex's point about standing on the shoulders of giants, there's a lot more giants out there that we could all sort of learn from.
Yeah. I'd wager that very few companies these days have unique problems, at least when they're starting up.
No one is pushing the scale of Prometheus when they're just starting as a startup, for example.
Right. Right. Rob, what about in your role now with Chronosphere where I'm sure a lot of your time is, you know, part of your job, obviously, is to help Chronosphere do well, but I'm sure a lot of your time is spent kind of just advising your customers on how to build and scale their ecosystem.
So, you know, what do you tell them in terms of balancing all of these different factors?
Yeah, it's a great question. I think that, you know, as Colin alludes to, there are usually ways in which people deal both with allowing their developers to emit too many metrics and limiting that.
There's ways to deal with how long and at what granularity do you want to keep some of your instrumentation data.
And so a lot of the time, it's really about talking with folks about, like, how do they think about this?
Because honestly, there's different models as well.
So at Netflix and consequently Uber as well, you know, there's a lot of talk out there about we just want our developers to run as fast as possible and do things that kind of make it seem like a free lunch to them, even though that comes with a whole bunch of pain on the infrastructure side to actually provide that back to them.
For instance, like we had at Uber, there was an extremely high cardinality monitoring case of testing different surge algorithms on a per hexagon basis.
So you're talking about tens of millions of different hexagons running, say, 10 different algorithms.
And so you do 10 million times 10, you've got 100 million time series just for that one monitoring use case, right?
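The arithmetic behind that figure is just a product of how many distinct values each label can take. A quick sketch, assuming "tens of millions" means 10 million hexagons and using hypothetical label names and a hypothetical `series_count` helper:

```python
from math import prod


def series_count(label_cardinalities):
    """Worst-case unique series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Assumed figures: 10 million hexagons, each evaluated under 10 algorithms.
surge_metric = {"hexagon": 10_000_000, "algorithm": 10}

total = series_count(surge_metric)
print(f"{total:,}")  # 100,000,000
```

Every extra label multiplies the total again, which is why a single use case can dominate a whole monitoring cluster.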
So that's, you know, I think, like, you really got to work out, is that the model you want to run in?
Or do you want to run more like Google that have, you know, and it's different inside Google from team to team, but like, fundamentally, they came from a world where if you want to add anything to the core search product, in terms of metrics, you need that signed off by a central SRE team.
And so they gate a lot of the cardinality and the metrics at the very front end of the pipeline.
And so, you know, I have this discussion with people about how do they actually want to operate.
And then at Chronosphere, a lot of the time, it's not the actual raw underlying infrastructure that people come to Chronosphere for.
Yes, that's attractive because of the cost versus some other vendors and all that.
But it's more actually the fact that we have rate limiting controls that allow you to per team, like dynamically, let them have a slice of the overall infrastructure.
And then if they go over that, it drops like new metrics before it drops old ones.
So that way, it's not like a Friday and someone's gone over their cap, and then suddenly all of the company's being woken up by alerts, because we had to drop random data across like all metrics.
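One way to picture "drop new metrics before old ones" is a per-team cap on active series. This is a minimal sketch under stated assumptions, not Chronosphere's implementation: the `TeamQuota` class, the `ingest` function, the team name and the cap of 2 are all made up for illustration.

```python
class TeamQuota:
    """Per-team cap on active series.

    Samples for series the team already owns are always accepted. Once the
    cap is reached, only brand-new series are rejected, so existing
    dashboards and alerts keep receiving data.
    """

    def __init__(self, max_series):
        self.max_series = max_series
        self.known = set()

    def admit(self, series_id):
        if series_id in self.known:
            return True  # existing series: never dropped by the quota
        if len(self.known) < self.max_series:
            self.known.add(series_id)
            return True  # new series that still fits under the cap
        return False  # new series over the cap: dropped


# Hypothetical team with a deliberately tiny cap.
quotas = {"payments": TeamQuota(max_series=2)}


def ingest(team, series_id):
    """Return True if the sample for this series is accepted."""
    return quotas[team].admit(series_id)


decisions = [ingest("payments", s) for s in ["a", "b", "c", "a"]]
print(decisions)  # [True, True, False, True]
```

Because the series a team already has are never the ones dropped, a team that goes over its cap loses only its newest additions and nobody else's alerts are affected.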
So I usually say like, think about philosophically, how you want to do things, then see for your philosophy, what people out there are doing, and then choose the best tool to kind of like go after that philosophy, rather than starting from like, hey, this is the world we're in.
And maybe we should just throw a few developers at this thing to like do some custom stuff here.
That makes sense. Right, right, right. If I could add, so then being a consumer of the metrics, and the user of the pipeline that Colin and his team has built, like, I've been on the receiving end of throttling as well, a few times where, like, really, it's up to the engineering teams and the monitoring pipeline teams to essentially form a contract to say, like, this is a shared resource, like, please, let's not reach toward this tragedy of the commons, but rather settle upon agreed terms, like, and the features that you've mentioned are, like, in one form or another, negotiating that back and forth.
So I would say, like, reach out to the monitoring team, first of all, try to see what they're comfortable with, what you're comfortable with pushing, or if you're on the monitoring platform itself, like, invest in those products and features that could allow you to either like through explicit agreements or through selective throttling, bring that down and keep everything in check.
Yeah, just quickly, I think that idea of the accounting of who's producing what is definitely one of the most valuable things that you can have around being able to find who is overwhelming your system at any given time.
Yeah, introspection. Yeah, who's monitoring the monitor?
And also, what is the monitor doing with their things? Yeah, I mean, you know, we actually apply a weighted random sampling to the metrics, like the firehose that comes into the Chronosphere platform.
And that actually allows you to see, like, oh, for 1% of data weighted randomly, like, where are the high cardinality parts coming in, like right now, and tools like that really deliver a ton of value.
And usually, those can be layered on top of whatever systems you're using in a fairly straightforward manner, depending on what you're using, obviously.
But yeah, I fundamentally agree that those are very high leverage and high value things to give your organization.
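The idea of sampling the firehose to see who is producing cardinality can be sketched in a few lines. The weighting scheme used in practice is not described in the conversation, so this example swaps in plain uniform sampling at 1%; the team names, the synthetic stream and the `sample_cardinality` helper are all hypothetical.

```python
import random
from collections import defaultdict


def sample_cardinality(stream, rate, rng):
    """Keep roughly `rate` of the stream and count distinct series per team."""
    seen = defaultdict(set)
    for team, series_id in stream:
        if rng.random() < rate:
            seen[team].add(series_id)
    counts = ((team, len(ids)) for team, ids in seen.items())
    return sorted(counts, key=lambda item: item[1], reverse=True)


# Synthetic firehose: one team emits 9,000 unique series, the other reuses 10.
stream = [("surge", f"hex-{i}") for i in range(9000)]
stream += [("billing", f"invoice-{i % 10}") for i in range(1000)]

# Seeded generator so repeated runs sample the same points.
top = sample_cardinality(stream, rate=0.01, rng=random.Random(0))
print(top[0][0])
```

Even at 1%, the team creating thousands of unique series stands out, because a sampled point from a high cardinality source is almost always a series that has not been seen before.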
I think we're coming up close on time, which is my least favorite thing I have to say whenever I do one of these.
This is a lot of fun. Thank you guys for making time for this.
Thank you, Colin. Thank you, Alex. And this is a lot of fun.
And we'll see you soon. Great. Thanks for having us. Bye.