Tales from an SE: Under Attack Onboardings
Presented by: Weston Eakman, Ankur Aggarwal, Kabir Sikand
Originally aired on April 6, 2022 @ 3:30 PM - 4:00 PM EDT
Tune in to hear Cloudflare Solutions Engineers discuss their experiences onboarding a customer that was Under Attack from malicious online threats.
English
Security
DDoS
Transcript (Beta)
Thank you everybody for tuning in to Tales from an SE: Under Attack. My name is Weston Eakman.
I'm a solutions engineer here at Cloudflare. I have been here for two and a half years.
Fun fact about myself, during quarantine, I just started growing this beard.
What do you think? Is that believable, Ankur? Yeah, totally. My name is Ankur Aggarwal.
I'm also a solutions engineer at Cloudflare. And today we kind of wanted to go through some of what we see as customers come in when they're under attack.
So we actually invited Kabir here with us. He's also an SE at Cloudflare, and he'll take us through what we see when customers come in like that.
So Kabir, if you want to introduce yourself. Yeah. Hi guys, I'm Kabir Sikand.
I'm a solutions engineer here on the Cloudflare team. I've been here for just north of one year now and definitely have been through, in that time, a large number of under attack scenarios.
So we'll definitely be talking a little bit about some of the more interesting ones that we've had over the past year.
Yeah. Awesome. Thank you, Kabir. And we really appreciate you joining us, kind of telling us your story and everything.
But before we really jump into that, I just want to give a little bit about the purpose of this segment.
You know, Cloudflare, we're here to help, you know, our motto is to help build a better Internet.
And in part of that, one of the things that we help with is under attacks. So to kind of define that a little bit, Ankur, do you mind talking about what an under attack is?
Sure. Yeah. So an under attack is basically just anytime a customer's infrastructure or, you know, your infrastructure is getting hammered by something malicious, or sometimes even not malicious, like, you know, you released something new and your infrastructure is just not able to handle it, or you even just need help filtering out the bad DDoS traffic that's hitting your edge.
So the under attack option is available to really anyone. Just go ahead and click on "I'm under attack" on our homepage.
And, you know, from there, we'll take it away internally.
Yeah. And so, you know, talking about that internal side, who typically is involved internally at Cloudflare? Who's supporting these under attacks when they come in?
Yeah. So once those under attacks come in, obviously, you know, we assign an account team.
So we'll work with both the account manager and also an SE.
So the SE's responsibility during those scenarios is basically just to talk to the customer and try to figure out, you know, what is actually occurring.
We'll work with the customer to identify the issue based on the sort of logs, documentation, or information that they can provide us, eventually, you know, get to a solution, and then also help with implementing that solution.
And the key thing with the under attack is like, we do all of this in a pretty compressed timeline, or just, you know, right when that customer comes in.
Great, great explanation.
And, you know, with that, with the SEs coming in and helping bridge that gap between our out-of-the-box enterprise solution and the customer's infrastructure, who gets picked?
Is it round robin?
For the SE's to come in and help? What does that look like? So on the SE side, it's actually kind of amazing.
It's, so it's basically by volunteer. So, in a typical under attack scenario, you'll actually have one SE or two raise their hands pretty quickly to join that under attack.
But as time goes on, as other people check the messages, you'll actually have additional SE's either wanting to shadow or to just add or take over if the first SE becomes busy.
So it's a bit of like a badge on the SE team to actually support an under attack.
Yeah, awesome. Awesome.
Cool. So, you know, to start actually jumping in, you know, after this introduction, I want to give a little bit of background more for Kabir, because I think the under attack that he had experienced was kind of within Cloudflare, a little bit of a, it's kind of legendary, because this one took a very, very, very long time.
Pretty much what we call the under attack that happened around the world.
I think it was between what, like 36 to 48 hours or something like that, Kabir, right?
Yeah, yeah, it was somewhere in that timeframe. At that time, you know, it was a few months ago.
So the amount of exact time definitely gets blurry past hour 36.
But certainly, I wasn't the only person helping out on onboarding this customer and getting them through the attack that they were experiencing.
We actually got to hand this off between folks around the world.
Oh, see here, I thought this was kind of a superhuman feat on your side.
I kind of want to keep that image of you like that.
Yeah, I don't know, maybe. With sleep deficiency, I did everything.
For sure, for sure.
Well, cool. Well, yeah, let's get into a little bit. Can you kind of describe the industry of the customer, as well as, you know, kind of a semi location, I want to make sure we're keeping privacy and everything, but just like a kind of simple.
Yeah, without obviously, giving away too much information about the customer in question.
This was an attack scenario that was occurring in the EMEA region.
And it came in, in Pacific time, roughly in the afternoon.
So it's pretty late in the day for the folks in EMEA to be working on it. And as you can imagine, you know, when we got the under attack scenario, coming into our pipeline, the customer themselves had already likely been up trying to deal with this on their own for a long period of time.
So we'll definitely jump into a little bit about, you know, what that might look like and what that relationship looks like.
But for us, it started at around like 3 or 4pm Pacific. I handled it until we had some folks in Singapore wake up and get to their desks.
And then, you know, obviously, overnight, that kind of went on.
And I got it back in my morning from some folks in EMEA.
Amazing. It basically just followed the sun. With that, when the customer came in, like, did they readily know what issues they were facing?
Or did you guys have to kind of work through, kind of work through some material to get there?
Yeah. Oh.
Kabir, I believe we lost your audio. Yeah. Are you still there, Kabir? I wasn't sure if that was just on my end.
I am here. Can you hear me? There we go. Yeah, there we go.
Okay, perfect. Sorry, a little blip. So in any case, any time we have these scenarios come up.
Oh, I think we lost you again, Kabir. We lost you again.
The joys of working from home, everybody. Exactly. Awesome, awesome.
All right. Can you hear me now? Yes, yes. Great. Switched Wi-Fi. We should be good now.
So in any case, whenever we get these coming in, I generally see that there's like a series of kind of symptoms that almost come into play to understand exactly what the attack is and how we might be able to help it.
So it's kind of like calling into an advice nurse line, really kind of read what the symptoms are and what types of attacks those might line up with.
And also, you know, on the flip side of that, not just on the technical side, we need to really empathize with the customer.
They've likely been dealing with this. They're frustrated.
They're hard at work. They're reaching out to a third party to come in and try to help them implement a solution.
And so that really helps guide the conversation and also helps guide the pace of things, too.
Yeah, I think that's... Sorry, Ankur, just real quick.
I think that's very important to note, too, because while we have, as you guys kind of mentioned, we have a follow the sun model where we can hand things off.
These teams aren't very big all the time. So it's... Before they even came to Cloudflare, they've been up trying to understand the problem, diagnose it, right?
And then trying to fix it as well. So they've probably been up for, I assume, maybe a day or two trying to fix this.
And then coming to you, this is 36 to 48 hours, right?
That's a lot of time to be up. And that's a superhuman feat.
Joking about with you, but I mean, the team that came to us, they were having to do that.
Yeah, exactly. And that's one of the key pieces to kind of make sure that the relationship is good.
And so as we get into an attack scenario like this, it's important to realize we need to get as much done as possible to get them to a point where that person can go home and go feed their family, get some sleep, come back and get back to the problem and kind of start to move on from there.
Yeah, absolutely. So when they got to you, it was late at night, you were able to kind of tee off on some information or signals that it was a certain type of attack.
Did you guys initially from the get-go establish some success criteria or a game plan on how you're going to mitigate it?
Basically, could you walk us through some of that actual solution architecting?
Yeah, absolutely.
So one of the big problems here in this particular scenario was that they saw a sudden jump in traffic.
They didn't expect that jump in traffic based on any marketing or releases or anything like that that they were running, no promotions.
And as a result, they hadn't capacity planned for that increase in connections to their servers.
And so connection limits were being hit. They weren't able to serve all the traffic coming in.
And so one of the immediate goals is let's try to mitigate some of this.
A, is it attack traffic? And that was very quickly identified as a yes.
But B, since it was, can we mitigate a lot of this and free up some of the connections that they have available on their origins so that they can actually serve their sites, at least in even a limited capacity such that they're not affecting their bottom line too much, or at least minimizing the effect on their bottom line.
And again, goal number one was to get this guy to be able to go home and go to sleep and come back to it with a fresh pair of eyes in the morning.
So basically, when you guys first started then, what was that first step to get that set up so you could kind of relieve that person so then they can come back and pick it up with, say, one of our other teams?
What was that?
Just walk me through that. Yeah. So one of the out-of-the-box services that Cloudflare provides once you enable our proxy is DDoS attack mitigation.
So we were able to provide that DDoS mitigation on layer seven.
And all the customer had to do was really switch over to our name servers, enable the proxy on the records that he was experiencing the attack on.
He was then able to utilize our DDoS mitigation out-of-the-box.
And we enabled the web application firewall as well, really just a click of a button.
And it also enables you to get in our IP reputation database so you can really take out any malicious threat actors that are known out there.
So that really helped solve some of the immediate problems. And we saw an increase in free connections on his origin servers as a result.
Awesome. Awesome.
And this was just one website, or was it several websites? What did this look like?
Yeah. So the attack itself was occurring across a number of his domains. So the company themselves managed multiple domains and some subset of them were being...
Oh, I think you're breaking up again, Kabir. Sorry about that. No worries.
So some subset of those websites were being attacked. Gotcha. Gotcha. And so this, again, I keep bringing it back because it's crazy that it took so long.
So did that...
Yeah, go ahead. I was going to say. So they're able to get the proxy up.
They're able to start processing traffic through us for the specific, say, DNS records that were impacted.
Did that solve all the issues for the customer? Was that it?
Were we able to end it? Did that take a full 36 hours? Yeah. So in an ideal world, that's all we need to do.
Just turn on the proxy. But we're in the real world here.
And sometimes these attacks will happen because threat actors out there are just port scanning the Internet, or they're looking up DNS names, and they are attacking those and looking for places where they can do attacks like credential stuffing, like look at...
I'm sure you're all aware, there are many places on the dark web where you can go and just identify leaked password and username combinations, buy those lists, and test them against sites.
And kind of look at those across the web and see if they're reused.
And so sometimes when we see an attack, we'll see this customer is not really a target of a specific targeted attack.
They're just being hit because they don't have the mitigations in place that they could have.
In this case, a few hours later, we saw that the attack started again, this time not on our edge.
What had happened on the second step was the client in question previously had another DNS service.
That DNS service did not mask his origin IP address.
There was no proxy. Those DNS records were cached and the customer, or the attacker rather, was able to look at those previously cached DNS records and attack the IP addresses directly.
And what that does is it bypasses Cloudflare's proxy.
So in that case, what needs to be done? How does a customer... How do we resolve that?
Yeah. So what we want to do is really lock down traffic to force it to go through our proxy.
There's a few ways we can do that. We could tell the customer to lock down the IP addresses that can make requests over to their origin server to only use Cloudflare IPs.
We could install daemons or agents on their machines as well.
The other thing that we often do is when you do onboard and you are under an attack, an active attack especially, you can go over to your ISP or service provider and say, hey, can I rotate those IP addresses out and just host my service on a different set of IPs.
Cloudflare on our side, we can just change the IP addresses that we're pointing over to.
The old IPs can get black holed at your ISP or whatever provider you're using so that it doesn't affect the rest of the network over there.
And then any of the attack traffic that was previously coming into those IPs is no longer going to a valid server.
So you're now able to force all outside traffic to go through Cloudflare, apply the mitigations that we have in place, and send that traffic back to origin cleanly.
So that was part of the process as to why this attack kept coming back in different forms.
And that was really step two.
It wasn't necessarily the end of the attack. There's definitely more attack patterns that we can see.
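[Editor's illustration] The first lockdown option described above, only accepting origin traffic that arrives from the proxy's IP ranges, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not Cloudflare's implementation: the function name is hypothetical, and the ranges are placeholder documentation blocks (RFC 5737), not real Cloudflare ranges. In practice this filtering is usually done at a firewall or security group rather than in application code, using the provider's published list of ranges.

```python
import ipaddress

# ASSUMPTION: placeholder ranges from the RFC 5737 documentation blocks.
# These are NOT real Cloudflare ranges; a real deployment would load the
# proxy provider's currently published list instead.
ALLOWED_PROXY_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_from_proxy(source_ip: str) -> bool:
    """Return True if source_ip falls inside one of the allowed proxy ranges."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in network for network in ALLOWED_PROXY_RANGES)


if __name__ == "__main__":
    # A connection relayed by the (placeholder) proxy range is accepted.
    print(is_from_proxy("192.0.2.10"))
    # A direct-to-IP connection from anywhere else is rejected.
    print(is_from_proxy("203.0.113.5"))
```

With a rule like this in place, an attacker who learned the origin address from old DNS records can still send packets to it, but the origin refuses anything that did not pass through the proxy first. Rotating the origin IPs, as described above, additionally removes the old target.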
So with this one specifically that you were working on, it sounds like it was changing and evolving and attempting to adapt as it got into Cloudflare.
Was that the case? Am I seeing that clearly or did they just happen to be lucky?
Yeah. No, I think what was happening is, obviously, I can only speculate as to what the threat actor was looking to do and what their motivations were.
But based on the attack traffic that we saw, it looked to myself like someone was specifically targeting these domains and these sites.
And they were using more and more difficult and complex methods to attack the site.
And they started at really basic DoS attacks.
They moved on to direct-to-IP attacks. They started issuing credential-stuffing attacks.
You're breaking up there, Kabir.
Just a moment. Can you guys hear me again?
Yeah. Okay, great. So I think you may have lost me somewhere in the attack.
Just to kind of wrap that up, the last attack pattern that we saw was a really bot-like credential-stuffing attack.
And what that is, is it's a method of trying any sort of username-password combinations in some form of an attack pattern that allows you to identify valid username-password combinations.
And then if that's something that was used on another site, and now it's used on this site, it's very likely that you can go to a higher target site, higher value target, like a bank or a social media account, and log in with the same credentials, and some subset of those will work.
So you're whittling it down. And obviously, those higher value targets have higher forms of mitigations in place.
So you want to, as a threat actor, generally do that kind of threat recon on smaller sites that maybe don't have those mitigations, and maybe only have one guy on the IT side, or instead of a full security team.
So when this customer, essentially, you guys got them proxying traffic, they rolled their ISP IPs, and then they're experiencing these credential-stuffing attacks, did we roll out any additional products to help mitigate these?
Or how did we roll these out? Were they during that under-attack scenario, that 36 hours, or did we come back later to address these with the customer?
Yeah. So this third part of the attack, we considered it to be part of the under-attack scenario because it was very clearly coming from a similar threat actor.
What we did is we added on... So maybe just to dig into what our previous mitigations were, they were against things like distributed denial of service attacks, where it's very clearly volumetric attack patterns that can knock a site over, looking for specific threats that are known on the Internet, looking for vulnerabilities that you might find in a common vulnerabilities database, and preventing those types of vulnerabilities from allowing a threat actor to get access to your infrastructure or take it down.
But with a credential-stuffing attack, often that looks like a normal user coming in.
And so you do need a lot more advanced mitigations to put in place.
They could be in the form of a brute force where there's lots of traffic from single IPs or specific sets of IPs, and often that's a threat model where you can kind of just play like a cat and mouse game.
You plug the hole as it comes in. In this case, it was distributed.
It looked like it was a fairly advanced script that was being used to kind of mask the traffic and make it look like it was human.
Cloudflare itself over the past few years has been developing bot mitigation solutions.
And so what those are designed to do is really learn from the collective intelligence of the traffic that's already flowing through our network and identify what is automated traffic and not automated traffic.
So in this case, we applied that bot mitigation solution to the customer, and that traffic was able to be mitigated.
We saw a much lower rate of folks logging in and definitely mitigated a lot of the connection pool limits that he was hitting.
So it went back to normal kind of traffic patterns afterwards.
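[Editor's illustration] The contrast drawn above, between brute force from a few IPs and a distributed credential-stuffing attack, can be made concrete with a simple per-IP sliding-window limiter on login attempts. This is a minimal sketch with hypothetical names and thresholds, not the bot mitigation product discussed here. It catches many attempts from one address, and it also shows why a distributed attack, where each address makes only a handful of attempts, slips past this kind of rule.

```python
from collections import defaultdict, deque


class LoginRateLimiter:
    """Allow at most max_attempts login attempts per IP within a time window."""

    def __init__(self, max_attempts: int = 5, window_seconds: float = 60.0):
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        # Maps each source IP to the timestamps of its recent allowed attempts.
        self._attempts = defaultdict(deque)

    def allow(self, source_ip: str, now: float) -> bool:
        """Record an attempt at time `now` and report whether it is allowed."""
        window = self._attempts[source_ip]
        # Drop timestamps that have aged out of the window.
        while window and now - window[0] >= self.window_seconds:
            window.popleft()
        if len(window) >= self.max_attempts:
            return False
        window.append(now)
        return True


if __name__ == "__main__":
    limiter = LoginRateLimiter(max_attempts=3, window_seconds=60.0)

    # Brute force from a single IP: the fourth attempt in the window is blocked.
    single_ip = [limiter.allow("203.0.113.7", t) for t in (0, 1, 2, 3)]
    print(single_ip)

    # Distributed attack: ten different IPs, one attempt each, none blocked.
    distributed = [limiter.allow(f"198.51.100.{i}", 5.0) for i in range(10)]
    print(distributed)
```

The second case is the one described in the transcript: no single address crosses the threshold, so telling automated traffic from human traffic needs signals beyond per-IP request counts.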
Awesome. So eventually we were able to find a way to not just react by kind of putting boards over the holes in the ship as they came in, but to be proactive about mitigating these things in the future.
Yeah, exactly. I think you're getting to like a really good point in that in many attack scenarios, like, you know, as I mentioned early on, you want to kind of empathize with the person on the other side and get them to go to sleep and then come back the next morning with a fresh pair of eyes and figure out what's the plan to prevent this from happening in the future.
Like this was only one type of attack pattern.
In this case, that planning phase of what does the future look like, it had to be accelerated because the threat actor kept morphing the attack and coming back and coming back and coming back.
So we accelerated that certainly.
And we implemented our bot management solution, which is going to help in the long term and has been helping them over the course of almost the past year.
And this is a pattern that we'll see often where folks will come in and we will mitigate the initial attack and then we can apply solutions and strategies to prevent future attacks.
Gotcha. Awesome. And then at the end of this, it sounds like we were able to solve the problem.
And, you know, they've been on Cloudflare for, like you're saying, almost a year now.
So after this, did you see the customer coming back often for, say, tunings or anything like that?
Or was this handled by you or someone else on the team? How did that work out?
Yeah. So good question there. So kind of going back to the model that we described early on, for the period of the attack itself, we do have a volunteer model for a lot of these customers coming in because it's going to happen in off hours.
It can happen when, you know, the region that generally serves the customer maybe is offline, could be in the middle of the night.
You know, in the case of the particular attack that I'm talking about, it came in in the afternoon in the Pacific region, which means it was the middle of the night over in EMEA where the customer was.
So after the attack subsided and we mitigated it immediately, this actually went over to a solutions team in the EMEA region so that we could better serve them and down the line get meetings that are in the right time zone for both parties.
And in passing it from, you know, region to region, did you guys do syncs at all, you know, catching up on notes during that under attack process and then eventually, you know, settling on that single account team?
Like, was it simply just pinging another colleague some notes or was it, you know, hopping on a Zoom or, you know, what have you?
Yeah, that's a good question. I think, you know, for most of the handoffs that we did, and we did do a lot for this particular case, whereas in others there may be just one handoff.
Those handoffs, I prefer for them to be, you know, get on the call with the customer, make sure all the context is there.
If the customer's asleep and we're still kind of helping behind the scenes, maybe I'll just sync up with that solutions engineer in that different region, you know, on a time that works for us.
But in general, it's really useful to have the customer on the line.
That way, if there are any gaps or anything that, you know, I might misinterpret, we're not playing telephone and kind of transforming that information into something else.
We want to make sure everything's clear and everyone's in sync and what's going on.
Awesome, awesome. And something you had kind of mentioned, and I kind of want to put in another plug for someone else on Cloudflare TV, is we kept talking about bot management.
And, you know, I'm sure that kind of piqued a few people's interest.
If you guys want to learn anything more about that, I believe there's gonna be three sessions.
The only one I have off the top of my head is from Calvin Shirley, who's the bot management SME, the subject matter expert.
He's going to be giving a session at 6pm on Wednesday, so you can learn more about what that product looks like with us.
I know it's been mentioned a lot.
It's great at, you know, helping us identify and mitigate these intelligent attacks that come in.
Which definitely came in handy when Kabir was helping out with this specific instance.
Well, awesome. Great. And so I think, I'm not sure if you had answered this question.
I was a bit curious. I apologize if you had said this before, but in terms of kind of the team members on the customer side, was it just kind of the security team that was involved?
Who was there?
Yeah. So, you know, we have, when we have under attacks come in, the companies that come in needing help are of all shapes and sizes.
And so as a result, you might get someone who is a consultant for a small organization that they've worked with before that's helping out with this.
On the flip side, you might get a whole security team that has, you know, built out solutions and they have, you know, they're some of the brightest minds in security on the phone with you trying to figure out how to mitigate this attack and what solutions are in place.
And so the conversations are very different between each of those.
In this case, it was just a single person, one of the only IT folks on that team.
And, you know, as a result was definitely a little bit tired.
But we were able to solve that. So in any case, I think we're coming towards the top of the hour here.
Yes, yes, yes. Oh, yeah. Yeah.
So thank you. I'm gonna go ahead and kind of announce the next one. Kabir, thank you so much for, you know, telling your story, describing what the process was and what the nitty-gritty looked like, passing around the under attack that happened around the world, right?
Appreciate everybody for tuning in. Your hosts here, myself, Weston Eakman, and Ankur.
Next, please stay and listen to Candice as she speaks on the spotlight on Latino excellence.
She's great. I always love listening to her speak. And she's very insightful and very intelligent.
All right, everybody, you guys have a great day and awesome.