Evolution of the Cloudflare D&R Team
Join members of Cloudflare's Detection & Response Team to discuss the journey and challenges of building a detection engineering program from the ground up.
Hello. Thanks for joining us today in our segment. My name is James Espinosa. I'm a member of the detection and response team here at Cloudflare, and I'm joined by Anjum, who leads the detection and response team.
We're going to talk today about our journey and challenges of building a detection and response program from the ground up.
Anjum, do you want to give yourself a quick introduction?
Sure. Hey, everyone.
I'm Anjum. I'm the director for detection and response here at Cloudflare. James and I joined the detection and response team back in 2019 as some of the early members of the team.
So we have sort of seen and worked through the evolution of our team, and this felt like the best time to talk about how the journey has been and what's next for us.
Thank you. I'm James, as I mentioned. I joined as well, same day as Anjum, and I still remember day one when we first joined. I remember walking into a room, and the first thing, somewhat of an incident, was getting access to our platform and being told to try to investigate what had happened in a specific scenario that we had seen.
And for us, it was all brand new.
How do we get visibility into the right things? What's missing from logs that we already have access to?
And so I want to talk a little bit about some of those challenges that we've had and the foundations of building, going from lack of visibility to rolling out technologies all the way down to where we're at today and what we're looking forward to next.
So maybe we can talk a little bit about, I don't know, for me, I think the first thing that comes to mind is Access logs.
And I remember using our own product and working with the product team to help us get better visibility in our environment.
One thing that I remember specifically with Access was on the identity side: we wanted to be able to investigate what users did in our environment, where they visited, you know, who the person was.
I still remember vividly not having that "who", who the person was that visited something.
So imagine we saw HTTP logs: we could see when they visited something and what they did, but we didn't know who the person was.
Do you have any recollections of, you know, working with those teams on adding additional features for that?
Yeah, I think there are two aspects of it which come back to my memory.
One, those logs, as we got the HTTP logs with the identity data attached to the Access logs, this was all, you know, very valuable to us before Zero Trust became the big thing that it is now.
Like, this was Cloudflare Access, this was Gateway resolver logs.
That was so useful to us from a security standpoint, but, you know, we didn't come into this with the idea of this is Zero Trust.
This was just what made a lot of sense to us. And the other challenge that we had at that point was the commercial SIEM that we used at the time wouldn't ingest those logs.
And so, like, this is the most valuable log that we have, but our SIEM doesn't support ingestion of those logs.
And so that was, I guess, in my mind, problem number one that we were facing: here's the crown jewels, and logs from there.
And how do we manage that? And so that brings up that whole idea of, you know, what are the first pieces you move, right?
I've been teaching my son chess, and I think about the opening of a chess game and sort of, like, what are the minor pieces. Like, you know, getting logs from the host, getting logs from the network, getting logs from applications, and then getting it all to the SIEM were the four main things that we were touching in the first six months as we landed on the team.
And the SIEM became sort of the next question of how do we get these logs to the right place where we can use them?
Yeah, actually, the SIEM part would be super interesting to talk a little bit more about, because I'm originally from Chicago, and I have worked at a lot of other companies where we used other types of technologies, and I had not seen anything like what we did here at Cloudflare, which is really, really cool.
We took a different approach with building out our own detection system that was essentially built to scale.
We have that problem of having so many data centers and users and endpoints across the world, and it makes it super difficult and challenging to, like, get everything somewhere centralized where you can actually scale and do proper detection.
So do you want to talk a little bit about why we decided to go, you know, the route that we did, and maybe some of the challenges that we've had with our SIEM and what that is? That would be pretty cool, I think, to share with people.
Yeah, I think that we have gotten to that crossroad twice since then, I think.
So as we started out, everybody had used commercial SIEMs, and one of the challenges everybody had had was the high operational cost of it.
You know, you set it up, it looks good, but then you start to scale it up, and maintaining the SIEM was a big challenge.
Everyone had had that challenge in the past.
The commercial SIEM that we did go with was, you know, a new product that would be self-managed, self-scaling, and the whole goal was, can we build a SIEM and maintain it at a low operational cost?
We should be users of the SIEM, not just spending a lot of time maintaining the SIEM and scaling it out as we add more log sources and such.
So at that point, the big driver was that problem.
And the other problem was, like I said earlier, finding a SIEM that would ingest the Cloudflare logs as is, without us having to work to parse them and do a lot of that work.
And so that drove the first time we were looking at, you know, let's build a SIEM ourselves that can do the things we want to do.
But then a year from then, we revisited that decision and thought about, all right, let's look at commercial SIEMs that now support a lot of the features that we were looking for.
But I think the few things that still stood out were, A, you know, we have a huge production network across the world.
We did not really trust having a closed-source agent running on those machines and shipping logs to us.
That was inherently a risk that we didn't want to take.
And these were specifically written in unsafe languages.
We were looking at Ruby-based Linux agents to ship logs.
I didn't feel very comfortable deploying those on a production network. And the other aspect: my background has been on the backend engineering side of things.
And I have always liked the way software development processes are built out.
And I was hoping that we could replicate that in detection engineering as well.
So from day one, I really wanted us to have detection rules as code, as something that we can code review, something that gets deployed automatically through a CI/CD pipeline.
And all of those features were very, very hard to, you know, add on to any commercial SIEM.
And so that became a second factor. We were looking at, you know, multiple commercial SIEMs.
And every one had, all right, there's this hack that you can do to potentially link your source control repository to the SIEM.
But it never worked as seamlessly as we wanted it to.
So in our SIEM, you know, detections were code from day one. And we just did not want to lose that capability.
So every time we have come to that, it's, you know, all right, we're going to be scaling 10x from here.
Should we not consider going for a commercial SIEM?
And we looked at what we had and what was on the table. And we have always chosen to, you know, commit ourselves to building a SIEM ourselves.
And the last part was, you know, the high operational cost, right? The way our SIEM is built out, you know, it sort of automatically scales for us.
We don't really have to think about adding more, so to speak, nodes to a cluster.
It does that in a serverless manner, so we don't really have to think about how many more logs we have to ingest every year.
Yeah, that's actually pretty neat. And for me, it was very new. My background is actually much more like the incident response side.
So when I came to Cloudflare, I think one thing also early on that I remember thinking about, like, well, cool, we have this system.
We can detect things. But, like, what do we do when an alert fires or when we find an incident?
Like, how do we respond and how do we, like, block things?
So one of our philosophies here at Cloudflare, at least within our team, has always been, like, to use Cloudflare to secure Cloudflare.
And to me, that was pretty cool to see as part of our program, because we use things like Cloudflare Gateway to block domains programmatically.
So, like, I don't know if you want to talk about, like, integrating into our SIEM, with this little CI/CD pipeline there, we built, like, bots that essentially could block domains programmatically whenever something was detected, and stuff like that made it much more exciting to build, you know, a detection and response program from the ground up here at Cloudflare, specifically because we're able to use our own products to protect us.
And that makes it exciting. And then the other part was, like, also working with those teams.
I remember working with the Gateway team on features that we didn't have that we wanted, to support, like, our specific use cases for our team.
And then we'd just go to them and say, hey, like, we're missing this piece.
We wish we could do this. And, like, how open and willing they were to help us prioritize those features, which ultimately translates to a better product for our customers as well.
So, that was pretty cool.
That's the most exciting part of doing what we do at Cloudflare, you know, being a security company, building enterprise security products and being part of securing that infrastructure.
It's just being on both sides of the equation. It's been really, really great.
So, if we were to fast forward a year from that point, we had, you know, visibility from the network.
We had visibility from hosts. We had visibility from the IAM and access layer, and some of the applications as well.
And we had a SIEM.
And at that point, we were still very much reliant on, you know, the detections that these tools provide us, that, you know, EDR tools provide us.
And our network visibility tools, so, very much, you know, reliant on our vendors and very signature-based.
And that became a challenge for us.
A, volume, and, of course, how many alerts from, you know, network-based tools can we go triage in a day?
There's a limit on how much we can triage. So, that drove us into this idea of, all right, how are we going to build out our detection engineering SDLC?
And James, you know, I remember you wrote a blog long ago about, you know, how it should be structured. Could you speak to that side of things, how you thought about the SDLC and how it has sort of grown over the years?
Yeah, absolutely. Yeah.
So, this might be a little bit embarrassing, but I remember, like, when we were first trying to figure out what detections we wanted to start adding that were custom, outside of our, you know, existing tooling, the first thing we did, because we didn't really know where to capture them, was, like, putting things together in spreadsheets.
Sometimes we had them in JIRA tickets.
And it was just kind of all disconnected, and it felt like sometimes we didn't really prioritize certain things that we probably should have prioritized over others.
And so, it just felt a little bit disconnected. So, a lot of this was inspired by some work that a few folks have put together out in the industry.
And I thought it would be awesome for us to have a detection SDLC, a process around, like, how we get from an idea all the way down to a detection that gets deployed.
So, I guess I'll talk through some of the things that we tried there. One was using, like, JIRA.
That's our main ticketing platform, essentially, to capture any idea that comes from either, what do you call it, like, something that's happening in the world today, or a past incident, or something that pops up from one of our threat models of systems that we're looking into for any, like, threats that we want to detect there.
Like, we capture all those as tickets into a backlog, a detection backlog that we have.
And that already helped organize things a lot better for us.
And then using MITRE ATT&CK as a framework to kind of map how we're going to write our detections to those TTPs, or to the tactics, it helped a lot to, like, think about how we're going to build our detections.
So, if we were thinking of a system and we're worried about how somebody might get in, it makes it easier to look at, like, initial access and think of all the different ways that somebody could potentially access the system.
And then we can create tickets for that and make sure our coverage is easier to talk about, or easier to measure, because we have the framework already in place.
Yeah, so, like, I think at a high level that has helped a ton to keep us a lot more organized as a team and have a better workflow.
One thing we did adopt was, I'm drawing a blank, I think it's called the ADS framework from Palantir, which they have put out, to help organize how we structure our detection content and our playbooks for response, and, yeah, we've been using it since 2019.
It's something we've iterated over and it continues to work well for us.
One really cool thing on that side of things: every rule had metadata that talked about the tactic and the technique that it mapped to in MITRE ATT&CK.
So, even if it's, you know, an internal application, we were thinking about it from the perspective of MITRE ATT&CK tactics and were able to think about which techniques apply.
And so, if you were to take all the metadata of all the rules, we can easily map it up to say, all right, on this system, what's our coverage against the MITRE ATT&CK framework?
Now, by no means is that ideal, you know, and it's not sufficient, but that's a really good starting point to understand how well covered we are on, say, initial access.
How does, you know, what does persistence mean on a SQL database?
What does this, you know, what does exfiltration mean on a SQL database?
And being able to think through that and have an understanding of what an end-to-end attack on that system would mean for us.
That was really cool.
And still, you know, we have a good mapping of saying, this system, this asset, we can see how well protected we are against that, against all the tactics.
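As a rough illustration of the rule-metadata idea described here, the following is a minimal Python sketch, not the team's actual implementation. The rule IDs, field names, asset names, and the short list of tactics are all assumptions made up for the example; only the ATT&CK technique IDs (T1078 Valid Accounts, T1136 Create Account, T1110 Brute Force) are real.

```python
# Hypothetical rule metadata; in a detections-as-code repository this would
# live next to each rule. Field names and rule IDs are illustrative only.
RULES = [
    {"id": "sql-new-admin-login", "asset": "sql-database",
     "tactic": "initial-access", "technique": "T1078"},
    {"id": "sql-new-db-user", "asset": "sql-database",
     "tactic": "persistence", "technique": "T1136"},
    {"id": "host-ssh-bruteforce", "asset": "linux-host",
     "tactic": "credential-access", "technique": "T1110"},
]

# An illustrative subset of ATT&CK tactics to measure coverage against.
TACTICS = ["initial-access", "persistence", "credential-access", "exfiltration"]


def coverage_by_tactic(rules, asset):
    """Return {tactic: [rule ids]} for one asset, keeping uncovered tactics."""
    covered = {tactic: [] for tactic in TACTICS}
    for rule in rules:
        if rule["asset"] == asset and rule["tactic"] in covered:
            covered[rule["tactic"]].append(rule["id"])
    return covered


def gaps(rules, asset):
    """List the tactics for which the asset has no rule at all."""
    return [t for t, ids in coverage_by_tactic(rules, asset).items() if not ids]


print(gaps(RULES, "sql-database"))
```

With this made-up data, the SQL database has rules for initial access and persistence but none for credential access or exfiltration, which is the kind of gap the coverage view is meant to surface.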
Yeah, yeah, that's a good point. And I should mention, like, our rules, we do the whole detection as code model, so, you know, basically all our rules are in a repository that we source control, and people do the proper code review and approval.
So, we follow that whole chain. And we've also built out, I believe we have, like, a Cloudflare Worker that we use to read some of the metadata tags that you're talking about.
So, we can display metrics around that and overlay it over, like, I'm forgetting the name of the tool, but essentially it's a MITRE ATT&CK overlay that allows you to feed in data and, like, show kind of your coverage.
So, when we're doing, like, reviews with the broader team and they want to understand, like, how much progress you've made in a certain section, you're able to kind of communicate that a little bit better that way.
So, yeah, for sure, that's been something that's been super valuable for us.
Maybe I'm jumping a few steps ahead, but I think that whole idea of being able to take those metrics and say, when was the rule last updated?
When was the rule last triggered?
Did we break something? Being able to sort of maintain that body of work has been really valuable, because with detections as code, the rules and metadata were all, you know, machine readable.
And so, we can ask these questions of the data, both from the alerts that the rules trigger and from the rules' Git history over time and the metadata that we have managed.
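The "when was the rule last triggered?" question can be sketched in a few lines of Python once that data is machine readable. This is only an illustration under assumptions: the record layout, the rule IDs, and the 180-day threshold are invented for the example, and in practice the timestamps would come from Git history and alert data rather than a hard-coded list.

```python
from datetime import datetime, timedelta

# Hypothetical per-rule status records (illustrative values only).
RULE_STATUS = [
    {"id": "rule-a", "last_updated": datetime(2021, 1, 10),
     "last_triggered": datetime(2021, 11, 2)},
    {"id": "rule-b", "last_updated": datetime(2020, 3, 5),
     "last_triggered": None},  # never fired
]


def stale_rules(status, now, max_quiet=timedelta(days=180)):
    """Return IDs of rules that never fired or have been quiet too long."""
    stale = []
    for rule in status:
        last = rule["last_triggered"]
        if last is None or now - last > max_quiet:
            stale.append(rule["id"])
    return stale


print(stale_rules(RULE_STATUS, now=datetime(2021, 12, 1)))
```

A rule showing up here is not necessarily broken; it is a prompt to check whether the log source changed or the rule still matches anything.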
Yeah, yeah, absolutely.
And I think the other aspect, I'm trying to think of, like, from the automation standpoint, too, I mean, everything we've built has, for the most part, been done with trying to add a level of automation to it.
So, there are still things we can't fully automate, but a lot of the stuff that we have was also built using Airflow, with, like, some automation that will automatically take action, you know, whenever something is detected.
So, stuff like that.
We'll continue expanding on that, but I think it's made possible by having this extra amount of data that we can get on our rules from, like, the moment that we create them.
Well, this is sort of, you know, we're getting to the recent times, right?
2021, end of 2021. We were at this point where we had a lot of that going.
We tried tracking our, you know, rules, our triggers, our alerts as, you know, what's the disposition of every alert that triggered?
We have, like, three classifications right now.
We have false positive, true positive benign, and true positive malicious.
And one thing we realized was that a big chunk of the alerts end up being true positive benign.
So, a bad thing happened, but there's a good business justification for that.
Like, there's a reason, you know, the machine may have been, you know, automated and there's some action that was taken to prepare for it.
And things like those, like, they're destructive actions, but there is some business justification for that.
So, we were in the realm of managing a large volume of alerts.
And I think, like, 80% of them were true positive, benign.
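The disposition breakdown mentioned here is simple to compute once every alert carries one of the three classifications. The sketch below is an assumption-laden Python illustration: the label strings and the sample alerts are made up, with the sample chosen only so that it mirrors the roughly 80% figure from the conversation.

```python
from collections import Counter

# The three dispositions named above; the exact label strings are assumptions.
DISPOSITIONS = ("false_positive", "true_positive_benign", "true_positive_malicious")

# Made-up sample of ten closed alerts, for illustration only.
ALERTS = (
    [{"disposition": "true_positive_benign"}] * 8
    + [{"disposition": "false_positive"}]
    + [{"disposition": "true_positive_malicious"}]
)


def disposition_share(alerts):
    """Return the fraction of alerts in each disposition (empty dict if none)."""
    counts = Counter(alert["disposition"] for alert in alerts)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: counts[name] / total for name in DISPOSITIONS}


print(disposition_share(ALERTS))
```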
So, we were at that stage and started thinking about, all right, how do we address that challenge?
And one thing that we are working towards now, and sort of incrementally getting better at, is what we are calling, you know, a two-step approach.
One, every rule we're trying to link to an asset and an identity.
An identity could be a user or a service; an asset could be a machine or could be a service account and such.
And the whole idea is that once we have asset and identity as sort of first-class citizens attached to every rule, then we can attach risk-based scoring on top of that.
And, again, this is something that's developing right now. We're very much in the process of building that out.
But the way we've been thinking about it is every asset and identity has a static and a dynamic risk.
So, for example, static risk would be, you know, dev environment, production environment.
You know, there's a static risk, different risk levels to each of these environments.
But a dynamic risk would be something that's changing much more rapidly than static metadata that's attached to a host.
So, for example, if my account logs in from a VPN, maybe that's, like, one thing that is, you know, concerning.
But then maybe I modify my MFA tokens, or maybe I create an API key, things that tack on over time, and my user account's risk score goes up because of the actions that are taken over time.
And all of these different events bump up my risk factor.
So that's where we are headed, in an attempt to reduce the amount of work incident response is doing on the sort of true positive benign set of things: how quickly can we respond to those, and how quickly can we bubble up things that potentially should be malicious but are still in the benign category at this point?
Individually, all these events are true positive benign. Together, they are probably true positive malicious.
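The static-plus-dynamic risk idea can be sketched as follows. This is a hypothetical Python illustration of the concept as described, not the system being built: the environment weights, event weights, event names, and escalation threshold are all invented numbers.

```python
# Static risk: slow-changing metadata about where the asset lives (assumed weights).
STATIC_RISK = {"dev": 1, "production": 5}

# Dynamic risk: points added by recent events on the identity (assumed weights).
EVENT_RISK = {"vpn_login": 2, "mfa_token_modified": 3, "api_key_created": 3}

# Hypothetical score at which a human should take a look.
ESCALATION_THRESHOLD = 10


def risk_score(environment, events):
    """Static risk of the environment plus the sum of recent event risks."""
    return STATIC_RISK[environment] + sum(EVENT_RISK.get(e, 0) for e in events)


def should_escalate(environment, events):
    """True when the accumulated score reaches the escalation threshold."""
    return risk_score(environment, events) >= ESCALATION_THRESHOLD


# One event alone stays below the threshold; the same events together cross it.
print(should_escalate("production", ["vpn_login"]))
print(should_escalate("production", ["vpn_login", "mfa_token_modified", "api_key_created"]))
```

The point of the design is the last two lines: each event on its own would be closed as benign, while the accumulated score for the same identity is what gets surfaced.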
Yeah, that is super exciting. I was just picturing, like, a funnel of, like, the detections that we have, how much they've evolved from, like, the first ones that we had to just find things that were, like, suspicious.
We go look into them and try to see and figure out if they're bad or not.
Then, like you're talking about, now we're getting closer to this, like, end of the funnel, where eventually what's output is going to be, hopefully, just the bad stuff that we really need to worry about, and for everything else, we'll start lowering the toil, I guess, that people have to deal with every now and then with some of the alerts that we're getting.
So, yeah, super exciting.
Are there, like, any, like, I guess, I don't know if you can think of any, like, major challenges that we have right now or anything that you can think of that you want to talk about?
Yeah, I think the one thing that you called out earlier is how much automation we can bring into our response.
Ideally, we want humans to be only the decision makers, but there are aspects of things that we haven't fully been able to automate.
For example, you know, an action happened on a host. Who was logged in at that particular moment?
That data isn't really readily available to the alert; you know, it needs a secondary query.
It needs us to sort of change the time window, you know, what range to query for and all of that.
So, a lot of the automation has been a big challenge, where we are trying to get better at how much of that we can automate and bring in, you know, humans just as, all right, I agree with the assessment of the data I'm seeing.
This is what the next action should be. It's just the, you know, the diamond of a flowchart, and we want to get there rather than having to execute some of the queries manually, or even just modify the parameters of a query.
How much can we automate there?
I mean, that's one area that is still something that we are, you know, trying to figure out how we do better.
The other thing we have been sort of thinking about is this whole idea of, you know, capturing business justification for a lot of things.
Like one aspect is risk, right? Like I said, this machine in the current state is maybe in repair mode, or maybe it's going to be automated or something, right?
There's some inherent change that happened to a host that we can probably figure out.
But there's also this idea of, you know, this action was taken.
Is there a reasonable business justification for that?
You know, this comes from being a company that is very, very privacy aware and conscious, inside and outside, and wanting a good understanding of why an action is taken.
That has been a challenge. You know, I don't want to bug the whole organization again and again, including me, saying, is there a reason for this particular action?
It looks pretty odd to do that. But being able to make that loop faster, you know, not having to ask someone for a business justification, has been something that we've been working on.
And something that has been really useful in that, James, is the Cloudflare Access business justification part of it.
Yeah, that's been really cool: you know, before someone gets access to something, there's a field where you can provide a business justification, and that sort of short circuits our incident response.
There's something right away in front of us: here's a ticket, or ideally a ticket, but here's a text-based understanding of why this action was taken.
And that sort of short circuits the incident response very quickly.
Yeah. Yeah, I was just going to mention that as well.
I think that's definitely a nice feature, I think.
I know we're also looking at, like, ways that we can get a little bit more structure to that as well, so that we can hopefully do some automation better with it, because, like, when you have open text and no structure, sometimes it's a little bit tricky, but that's, like, I guess, step two after that.
So being able to capture it is good, and then structure it so we can take some action.
And actually, I was going to go back to the other thing, which is, like, the data, or the quality of data, and, like, structuring data, all that stuff that we store. I feel like that is another area, too, that we're continuing to make improvements in, because, like, the better structured the data is, the better we can make automated decisions that don't require a human all the time.
So, yeah, super excited.
There's that aspect of, you know, pros and cons of going with a commercial SIEM that probably has log normalization.
So there is someone who has written a parser for a log and will convert that into a normalized format.
We wanted to be log format agnostic from day one, but on the flip side of that, the con is this problem that you mentioned, data quality, right?
There are log types that will have, you know, a nondeterministic number of array fields or will have key-value pairs, and it just becomes harder to write queries and have, you know, a predictable way of figuring out an answer to a question from data that's not always going to have a very structured format.
So that has been a challenge and something that we've been working on quite a bit.
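To make the data-quality problem concrete, here is a small Python sketch of what normalizing a key-value log line into a fixed set of fields might look like. It is an illustration only: the log format, the field names, and the choice to drop unknown keys are assumptions, not a description of how the team handles this.

```python
# Fields that downstream queries expect to exist on every record (assumed names).
EXPECTED_FIELDS = ("user", "action", "src_ip")


def parse_kv(line):
    """Parse 'k1=v1 k2=v2' into a dict; tokens without '=' are ignored."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def normalize(line):
    """Project a key-value log line onto the expected fields, None if missing."""
    fields = parse_kv(line)
    return {name: fields.get(name) for name in EXPECTED_FIELDS}


print(normalize("user=alice action=login extra=1"))
```

Every record comes out with the same keys, so a query can rely on `src_ip` being present (possibly as `None`) instead of having to handle each log type's shape separately. The trade-off is that someone has to maintain the list of expected fields, which is exactly the parser work a log-format-agnostic approach tries to avoid.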
Yeah, and I know we're getting really close to time here, but I'm going to throw in a shameless plug: the detection and response team is hiring.
So if you want to, like, deal with some of the challenges that we have across our environment, I think it's super fun.
I'm also a little bit biased, but whenever we're interviewing people, I'm always telling candidates that one thing that makes it super exciting to do detection and response at Cloudflare is, you know, there are a lot of companies where you can do a similar type of work, but it's usually at a smaller scale or in a smaller environment, cloud-based or something.
And here we have, like, actual data centers across the world. That's one risk area, right?
One area you can focus on. And then we also have the infrastructure that we have on our production side for the systems that run Cloudflare, as well as our corporate side.
So it's like you get a little bit more to worry about.
I think that's pretty fun. The size of our network is pretty fun.
So yeah, we're hiring on our team, so definitely apply.
And as our company is growing and building more and more interesting products, I think it's always been amazing to see how those systems are built out and how we can be part of that process early on, to think through all the attack scenarios on the systems and proactively look out for those.
And working, like we said when we started the conversation earlier, closely with engineering teams and product teams to understand how we improve this from the ground up, you know, building this into the product rather than coming back into the space afterwards where we are only monitoring.
Being able to solve that problem has just been very, very, you know, interesting for us.
Yeah, absolutely. Okay, we have about three minutes.
Do you have any last-minute things you want to chat about, or anything that's on your mind?
Yeah, I think the one thing we touched on, I think we could use the last two minutes to talk about that.
We were talking earlier about being able to pull metrics from our detection rules and our alerts, but I still wanted to sort of bring up this last topic of how do we measure our detection engineering, right?
How do we measure how detection is doing both in terms of rules that we have running, the rules that we want to add and how we're overall lowering the risks for our organization.
And so that's something that we've been working on, sort of the latest set of problems that we're solving.
You know, there are some metrics that are easy to come by.
We can easily calculate mean time to detect, or we can calculate mean time to respond, or disposition.
But there are things like, how do we measure this against a risk?
It's something that we have been working on quite a bit, you know, how do we over time continuously improve ourselves and the efficiency of our operations?
So that's the last thing that just came to my mind, a problem that's sort of still unsolved.
We're getting better at it.
But hopefully in a few months we'll have a better answer to those questions as well.
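The "easy to come by" metrics mentioned above, mean time to detect and mean time to respond, can be sketched in Python like this. The incident records and field names are made up, and measuring time to respond from detection to resolution is one common convention assumed here, not something stated in the conversation.

```python
from datetime import datetime
from statistics import mean

# Made-up incident timeline records, for illustration only.
INCIDENTS = [
    {"occurred": datetime(2021, 11, 1, 10, 0),
     "detected": datetime(2021, 11, 1, 10, 30),
     "resolved": datetime(2021, 11, 1, 12, 30)},
    {"occurred": datetime(2021, 11, 5, 9, 0),
     "detected": datetime(2021, 11, 5, 9, 10),
     "resolved": datetime(2021, 11, 5, 10, 10)},
]


def mean_minutes(incidents, start, end):
    """Average number of minutes between two timestamps across incidents."""
    return mean((i[end] - i[start]).total_seconds() / 60 for i in incidents)


mttd = mean_minutes(INCIDENTS, "occurred", "detected")  # mean time to detect
mttr = mean_minutes(INCIDENTS, "detected", "resolved")  # mean time to respond
print(mttd, mttr)
```

These numbers say how fast the team is, not how much risk has been reduced, which is the harder measurement problem described in this part of the conversation.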
Well, especially with, like, the coverage piece, which I know is a big topic: like, how do you actually measure that when, you know, it's like a moving goalpost?
You know, we put out so many products very fast, and it, like, keeps growing, and how do you measure and communicate that we're making progress if the post keeps moving?
So that's definitely been tricky for sure.
But that's something that we'll definitely have to figure out.
Yeah. By the way, it's always fun to go back and reminisce about the journey you've had, and it's sort of both of us who joined on exactly the same day.
It is, you know, fun to relive some of the moments we've crossed over time.
Yeah. So yeah, it's definitely a really cool story.
I mean, I know we couldn't really capture everything in 30 minutes, but hopefully in the future we can do another session and we can talk more about it.
Yeah. Awesome. Well, I just want to thank everybody who's watching.
So thank you so much for joining us today and enjoy the rest of the show.
Thank you, everyone.