Cloudflare on Cloudflare: Developing a system where it trusts no one
Presented by: Chase Catelli, Derek Pitts
Originally aired on October 2 @ 12:00 PM - 12:30 PM EDT
Join Chase Catelli, Sr. Cybersecurity Strategist, and Derek Pitts, Director of Security, as they detail Cloudflare's journey from traditional VPNs to a full Zero Trust security model.
Hear how they achieved this massive transition by prioritizing user experience and securing vital executive buy-in. They reveal the crucial role of infrastructure as code in accelerating implementation, the importance of security keys for all employees, and the strategies for deploying a robust, layered security suite including email security and secure web gateway. Discover their lessons learned in achieving user adoption and building a system that trusts no one.
English
Cloudflare on Cloudflare
Security
VPN
Zero Trust
Transcript
Hi everybody, I'm Chase Catelli and I'm a senior strategist here on our security team.
And I'm joined by Derek Pitts, who's a director of security engineering here.
And today we're going to be talking a little bit about Zero Trust, and specifically our own implementation of the Cloudflare Zero Trust products here internally at Cloudflare.
Well, why don't we get started by you giving us a little bit of background about your role here at Cloudflare, and then we can kind of dive into Zero Trust.
So I started at Cloudflare in early 2020, and I joined the enterprise security team at the time.
And part of that role was to implement Zero Trust, get us off of two-factor soft tokens, get us off of our VPN, and just kind of scale up our Zero Trust strategy and get us to where we are today.
Somewhere along that journey, we've had some reorgs, and I now currently lead our Customer Zero Cloudflare security team.
And the purpose of the Customer Zero team is that we implement Cloudflare products to secure Cloudflare, both the corporate and the production-facing stuff.
So that is how we focus on that.
So before we came in and we started implementing all of these Cloudflare products, today we're talking about Zero Trust, but we have a suite of other applications.
What was the experience like for developers?
A lot of times we talk about how access to these applications is such an important part of being able to do your job.
Did they have a good experience using our old VPN?
How was that?
That was a big pain point that Cloudflare employees and engineers had.
And it's very similar to many people who are using VPNs.
The problem is that you're constantly dialing into the VPN.
Many of the VPNs are not widely distributed.
So you could have people in Europe trying to connect to a VPN that's in San Francisco, which is kind of the model that Cloudflare had. It doesn't really scale well, and you don't get a lot of security out of it.
You're in what's generally referred to as the castle-and-moat model with that. What that does is keep you on the outside, but once you get on the inside, you can do whatever you want.
The Zero Trust model is different than that. In the Zero Trust model, your identity or your authorization is checked at every point of access to a specific application.
And so that lets you better control and better know what people are accessing.
And it actually helps improve the use case for the developers.
And it's a lot smoother implementation for them.
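To make the contrast with castle-and-moat concrete, here is a minimal Python sketch of checking identity on every request, per application. It is an illustration only and not how Cloudflare Access is implemented; `Identity`, `AppPolicy`, `handle_request`, and the group names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identity:
    email: str
    groups: frozenset


@dataclass(frozen=True)
class AppPolicy:
    # Hypothetical policy: which groups may reach this one application.
    app: str
    allowed_groups: frozenset


def authorize(identity: Identity, policy: AppPolicy) -> bool:
    """Evaluate one identity against one application's policy."""
    return bool(identity.groups & policy.allowed_groups)


def handle_request(identity, app, policies, audit_log):
    # Every request is checked and logged; being "inside" grants nothing.
    policy = policies.get(app)
    allowed = policy is not None and authorize(identity, policy)
    audit_log.append((identity.email, app, allowed))
    return allowed
```

Because the decision is made per application and every decision is recorded, the log shows who reached what, which a single VPN login does not.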
And was there a particular catalyst that caused Cloudflare to want to make this pivot to Zero Trust, away from the VPN?
Or was it kind of just a general, you know, we're experiencing a lot of latency and we want to move to a model that's more efficient?
I think a big driver, the reason that Cloudflare Access got built in the first place, was that there was a big VPN outage. There was also an incident during that outage where the people who were responsible for fixing the problems that Cloudflare was having couldn't efficiently access the systems they were trying to get to, because they were stuck on the outside of a castle, essentially.
We were using a third-party technology, and a lot of what Cloudflare does is ask: can we do it better?
Can we do it more efficiently? Can we leverage our global network? And that is basically what Access does.
So instead of connecting to a VPN concentrator in San Francisco, people connect to the Zero Trust service running in their local point of presence, which runs all over the world.
That's a great starting point on why Cloudflare decided to move to this model of Zero Trust, but now let's get into how that worked.
So when we talk about Zero Trust here at Cloudflare, we're talking about a suite of different products. Derek, you mentioned our Access product, but we also have things like our Secure Web Gateway, our data loss prevention tool, and CASB.
So did we do all of that at once, or did we phase the approach? What did our own implementation look like?
So what's interesting is some of those products that you mentioned didn't exist in 2020 when we started this.
So the CASB tool didn't exist.
A couple of the other ones didn't exist until later on in our lifecycle.
So we adopted those in the customer zero model where we were the very first user, the beta users of those tools.
But we started with the Zero Trust tool, Cloudflare Access.
And that was so that we could do two things. We wanted to replace our VPN, which would get us a lot better performance, like we just talked about. But we also wanted to be able to selectively enforce a specific kind of two-factor authentication.
We wanted to move off of TOTP, the OTP codes that people were using at the time, which we generally call soft tokens. From a security point of view, those soft tokens are phishable: you can get a phishing message that redirects you to a fake website, which then asks you to put in your token.
We wanted to move to an unphishable version of that, using WebAuthn and FIDO2.
So our journey was twofold: replace the VPN with something more efficient to make our engineers happy and, as a security improvement, get that auditability and authorization on each application and also implement WebAuthn as a requirement for those. We chose YubiKeys as our enforcement mechanism for MFA, and that allowed us to begin that journey.
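One reason a WebAuthn assertion resists the fake-website attack described here is that the browser records the origin it is actually talking to, and the server can reject anything produced for a different origin. A TOTP code carries no such binding. The Python sketch below shows only that origin-and-challenge check. The origin `https://sso.example.com` and the function names are assumptions for illustration, and signature verification is deliberately left out; a real deployment would rely on a maintained WebAuthn library for the full ceremony.

```python
import base64
import hmac
import json

EXPECTED_ORIGIN = "https://sso.example.com"  # hypothetical relying-party origin


def b64url_decode(value: str) -> bytes:
    # WebAuthn encodes the challenge as base64url without padding.
    padding = "=" * (-len(value) % 4)
    return base64.urlsafe_b64decode(value + padding)


def client_data_is_acceptable(client_data_json: bytes, expected_challenge: bytes) -> bool:
    """Check the type, origin, and challenge fields of clientDataJSON."""
    data = json.loads(client_data_json)
    if data.get("type") != "webauthn.get":
        return False
    if data.get("origin") != EXPECTED_ORIGIN:
        # A look-alike phishing site shows up here as a different origin.
        return False
    received = b64url_decode(data.get("challenge", ""))
    return hmac.compare_digest(received, expected_challenge)
```

A cloned login page can relay a typed six-digit code, but an assertion produced on the clone carries the clone's origin and fails this check.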
Perfect. So really laying our foundation up front by defining our requirements: we want to use FIDO2-compliant hard keys, we want to make sure that we're using WebAuthn, and we know the product that we're using already because we're using Cloudflare's own product. Starting out with those really strong identity and access management requirements, prior to moving into the actual migration of these tools over to Cloudflare Zero Trust, means we have a really strong foundation to start on.
So let's move into like when we actually started migrating these applications off the VPN and over to Zero Trust.
Did we do it, all the applications at once?
Did we chunk it up?
Like, was there like a beta group of applications that we went with at first before we started like moving forward with everything?
The strategy that we used was selective enforcement, and we kind of did two things at once.
We wanted to get buy-in from our user base for this change in MFA because there's a lot of resistance to change, humans being humans.
Change equals bad, pretty much no matter what, when you're trying to get humans to change their behavior.
So what we did was take a lot of the people we knew were vocal, or leaders, in those groups, across several different engineering teams and a couple of different sales teams, and we created a beta testing group. We gave them hard keys, the YubiKeys, and we had them practice. We had one-on-one sessions with them and walked through how it works, how it's better than soft tokens, how it's easier than typing a code, that kind of stuff.
And then we also basically made a hierarchical list with our crown jewels: these are the applications with our customer data.
These are the applications that, if compromised, can cause the most damage at Cloudflare.
So we wanted to make sure that while we were migrating towards a more secure environment, we were doing it with the most important things first.
We also took from that list the least important things, like our web company directory.
And we said, this is the first thing that we're going to try, because if it's broken, it's not super impactful, and it does actually get a lot of usage. So we could selectively enforce security keys and move it over to Zero Trust at the same time.
So we did that, and we started having pages that, instead of just giving people an error, basically told them: you signed in with TOTP. This app actually requires a security key.
You need to sign out and sign in again to switch to your security key.
And that self-help journey really, really helped adoption.
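The selective enforcement and self-help page described here can be sketched in a few lines of Python. This is an assumption-laden illustration, not Cloudflare's configuration: the method labels, the `REQUIRED_METHOD` table, and `access_decision` are hypothetical names.

```python
from typing import Optional, Tuple

# Hypothetical labels for how the user authenticated in this session.
SECURITY_KEY = "security_key"
TOTP = "totp"

# Hypothetical per-application requirement. Apps not listed accept any method.
REQUIRED_METHOD = {"company-directory": SECURITY_KEY}


def access_decision(app: str, method_used: str) -> Tuple[bool, Optional[str]]:
    """Return (allowed, self-help message shown to the user when denied)."""
    required = REQUIRED_METHOD.get(app)
    if required is None or method_used == required:
        return True, None
    message = (
        f"You signed in with {method_used}, but {app} requires {required}. "
        "Sign out, then sign in again with your security key."
    )
    return False, message
```

Returning an actionable message rather than a bare denial is what lets users fix the problem themselves instead of opening a support ticket.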
Our waves were about one month per set of apps.
So we did that very first alpha set of apps.
And then we picked our most secure apps and we did, I think, two.
And then we did five, and then we did 10, and then we did 100, and then we had a big long tail of apps which at the time didn't necessarily support WebAuthn. That's interesting to say looking back about five years later: even iOS didn't support WebAuthn at the time. I think it was iOS 13.2 that added support for WebAuthn on your mobile device.
So we were kind of right at the bleeding edge of a lot of support for WebAuthn.
And so we had some trade-offs to make for some apps where we had slow adoption, where we were hitting some blockers, some things that we hadn't anticipated, some things we hadn't tested. But that did allow us to be on that leading edge and to gain the security that we needed on the most secure apps that we fully controlled in-house.
Perfect. And if I recall correctly, we were going through our access implementation right around the same time that COVID happened, right?
So we have this like paradigm shift where everyone's coming into an office and then overnight, almost everyone's remotely working from all over the world, from their houses, from coffee shops, perhaps.
And now we have all of these employees all over the world, and we need to make sure that they have a secure connection back to these really important internal resources. Did that impact the implementation at all?
It did, actually. It was really bad timing for us because it turned into a logistical nightmare. We had already ordered physical security keys for every person at the company. We had events planned for big all-hands at each of the offices, because at that point everybody came to the office every day. We were getting cookies made with some fancy slogans to, you know, make people happy about this change that we were providing to them.
But that didn't happen because we stopped going to the office.
So then we actually did some interesting things where we got one or two people from each office to go in and stuff envelopes.
And we mailed the security keys to each person when it became apparent that we weren't going back to the office really quickly.
That was interesting to go through.
And then we had a lot of, like, our threat model changed a lot, too, from a security point of view, right?
We went from, okay, Cloudflare has this network, and we have all of our network monitoring set up in our offices.
We're still relying a little bit on the VPN for the castle-and-moat model.
Not really, but it was one of the mitigating factors in our threat story for people accessing our applications.
And we went to: these are everyone's modems at home.
Are they patched? Are they secure? Are they on a malicious network in an apartment building?
What could also be part of that traffic? And so then we started thinking about how do we also secure the endpoint?
Because we're no longer securing the office network.
We're now securing the laptop. And that's when we started looking towards Cloudflare WARP, which was released around the same time.
So we started going through the journey of securing the laptop as well as the application and having that end-to-end Zero Trust infrastructure built out.
We started saying like, okay, now we know that the device is there.
Let's look at the software patch level.
Let's look at: is our EDR agent running?
Has it been updated?
Start looking at the impossible-travel things.
Are people suddenly jumping from one continent to another? Those types of things are what you can start doing with the posture checks and posture rules.
All right, so moving on. We talked a lot about Access, so let's talk about some of the other products that are in our Zero Trust suite.
We talked about phase one, laying that foundation. We talked about phase two, doing that Zero Trust access, or Zero Trust Network Access, migration: moving off of the VPN and onto the Zero Trust model.
So the next step: what product do you move to after that?
Are we implementing Secure Web Gateway?
Are we looking at our email security?
What was kind of the next step there?
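The posture checks mentioned just before this question (patch level, EDR agent running and updated, impossible travel) can be sketched in Python as follows. Everything here is an assumption for illustration: the `DevicePosture` fields, the minimum versions, and the 1000 km/h speed threshold are hypothetical and are not Cloudflare's actual rules.

```python
import math
from dataclasses import dataclass


@dataclass
class DevicePosture:
    os_version: tuple   # e.g. (14, 5)
    edr_running: bool
    edr_version: tuple  # e.g. (7, 3)


MIN_OS = (14, 0)   # assumed minimum patch level
MIN_EDR = (7, 2)   # assumed minimum EDR agent version


def posture_ok(p: DevicePosture) -> bool:
    return p.os_version >= MIN_OS and p.edr_running and p.edr_version >= MIN_EDR


def km_between(lat1, lon1, lat2, lon2) -> float:
    # Haversine great-circle distance on a sphere of radius 6371 km.
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def impossible_travel(prev, curr, max_kmh=1000.0) -> bool:
    # prev and curr are (latitude, longitude, unix_seconds) of two logins.
    hours = (curr[2] - prev[2]) / 3600.0
    dist = km_between(prev[0], prev[1], curr[0], curr[1])
    if hours <= 0:
        return dist > 0
    return dist / hours > max_kmh
```

A login from San Francisco followed an hour later by one from London would be flagged, while the same pair a day apart would not.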
I think we implemented the email security and Secure Web Gateway and WARP around the same time.
Those were kind of parallel efforts because there's not a huge overlap, but they were both a security help.
Since we were just talking about WARP, we can say the Secure Web Gateway was kind of the next logical step for us.
So we did WARP and Secure Web Gateway at the same time. That gave us secure DNS, so with Secure Web Gateway we could check for malicious zones and malicious URLs, and we could block those from being accessed from people's laptops. So when there are places that have known malware on them, known problems, people just can't get to those zones and those sites.
So that helps add security to the endpoints, as well as the layer of, you know, we are already checking to make sure EDR was running.
And so we didn't necessarily just want to rely on EDR.
We wanted to layer on more and more security so that we were ensuring that we had more protection.
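As a rough illustration of blocking known-bad zones at the DNS layer, the Python sketch below refuses to resolve a hostname if it, or any parent zone, is on a blocklist. The blocklist entries, `is_blocked`, and `resolve` are hypothetical; this is not how Cloudflare Gateway is implemented.

```python
BLOCKED_ZONES = {"malware.example", "phish.example.net"}  # placeholder entries


def is_blocked(hostname: str, blocked_zones=BLOCKED_ZONES) -> bool:
    labels = hostname.lower().rstrip(".").split(".")
    # Check the name itself and every parent zone, matching on label boundaries.
    return any(".".join(labels[i:]) in blocked_zones for i in range(len(labels)))


def resolve(hostname: str, upstream):
    """Return None for blocked names; otherwise ask the upstream resolver."""
    if is_blocked(hostname):
        return None
    return upstream(hostname)
```

Matching on whole labels means `cdn.malware.example` is blocked but an unrelated name such as `notmalware.example` is not.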
And so we added that. Then we also started out in the DoH mode for WARP, which is just doing the secure DNS, but it's not pushing all the traffic through the WARP tunnel.
As we were very early testers, there was a big learning curve to that for us.
We were developing, like, how does it work?
How does it interact with video conferencing tools?
How does it interact with sites that have certificate pinning?
Because we started doing TLS inspection on some of the sites.
And so through all that learning, I think it helped us improve the products a lot.
We had a big initiative to make sure all of those things worked for us so that customers, whether they're smaller or bigger than us, didn't have the same problems that we had.
So we use the products every day so that we're finding the bugs, we're reporting them directly to the product teams so that our customers aren't.
And then for the email security, at that time, we started looking for email security so that we could add a layer on top of what our Google Workspace was giving us at the time.
We were just using the standard Google spam filters and malicious filters.
We started seeing more targeted campaigns at us, so they weren't necessarily rising to the level where Google was blocking them before we were getting them.
We were re-reporting them and then they were being removed, but kind of after people were getting those in their inbox.
And so what we did was we looked at the Area 1 email security product at that time.
We tested a few other ones.
And that one was kind of the best fit for us.
So we adopted that.
We put it in front of our users so that it was reviewing links, following links to make sure that there's no malware, and doing analysis on any files and attachments that are there.
This email security also covers things like faking that you're a person: pretending to be a vendor, sending a fake invoice, trying to steal money through that kind of fraud.
So that was a big help for us, and we see a lot of value out of it. It helps us prevent a lot of those things from getting to us.
Yeah, and email security paired with Secure Web Gateway paired with Zero Trust: those three tools work really interestingly together, because if your company is experiencing a phishing attack, each of them covers a slightly different layer in that layered security model.
Email security is hopefully able to block a lot of those coming in. If one does get through and a user enters their credentials, at the end of the day, if you're enforcing a hard token, you'd still require that token, a physical thing that you have, in order to get into the system.
And then as a response to that, we're able to use our Secure Web Gateway product to block that malicious domain that might be spoofing our SSO login page or something like that.
So these products all kind of like give us that full layered security approach to preventing like a more sophisticated phishing attack.
There are two examples of specific attacks that we blogged about.
The first one I remember very well because it was during the first big Texas ice storm.
And so we were all sitting inside.
We had a big, big incident because it was a big targeted attack. People were getting a lot of phone calls pretending to be Cloudflare IT to try to get people to enter their credentials.
That was when we were dealing with a big long tail of: should we move off of TOTP?
Should we move everyone to security keys? We had a big decision point, and as part of that incident we made the decision to hard cut everything over to security keys and WebAuthn, so that we're no longer allowing TOTP.
And so we made that big change for probably 50% of our applications during that attack so that we didn't have anything to worry about from the phishing attempts.
You know, we were using email security to prevent that in email.
But like I said, people were getting text messages, people were getting phone calls.
So we blocked all of the things that were known, but there was a lot of pivoting around by the attackers at that point.
And then the second time, I think it was about a year, a year and a half later, it was a targeted SMS phishing attack.
They had completely cloned our login page for SSO at the time.
The URL looked very, very similar.
All the graphics were the same.
And what was interesting is, for some reason, like 50 people on the security team got the targeted attack SMS. We were actually at a team dinner at the time in Austin, and all of us got a text message at the same time, and we're like, wow, weird.
Then we started to look through this. We saw that it was targeted at us and that there was a fake login page. We added blocks in Gateway for all that stuff, and we started notifying all our employees about this targeted attack.
A few people were tricked by it, because it's very human to trust things that you get on your phone which seem legitimate. Your kids are screaming, things are happening, life is happening, and you're just like, I need to do this thing.
But that's why we have layered security. They may have entered their username and password, but it was impossible for the attackers to use an OTP token because we didn't support them. So that was where it stopped.
Some people, I think it was three or five people, had to rotate their passwords, and that was it.
We did a big forensic analysis of that, and our blog talks through how we did a lot of those things.
One of the interesting philosophy or strategy changes that we made: previously we were on the fence about newly registered domains, whether we should outright block them, send them to browser isolation, or just leave them alone, because our sales team is a lot of the time dealing with brand new customers and a lot of newly registered domains.
So we were trying not to have a big impact on the way that they do their day-to-day work.
And so we had left it as just normal. We switched that to browser isolate all of the newly registered domains, which prevents any malware from being dropped on people's devices and things like that.
So that if there had been an attack associated with that phishing page where it tries to drop malware on your machine, it wouldn't have been successful.
And so that was a philosophy change that we made during that attack.
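The block, isolate, or allow decision for newly registered domains can be written down as a small policy function. This Python sketch is an illustration under stated assumptions: the 30-day window, the action names, and `gateway_action` are hypothetical and do not reflect Cloudflare Gateway's actual settings.

```python
from datetime import date, timedelta

NEW_DOMAIN_WINDOW = timedelta(days=30)  # assumed definition of "newly registered"


def gateway_action(domain: str, registered_on: date, today: date, known_bad: set) -> str:
    if domain in known_bad:
        return "block"
    if today - registered_on < NEW_DOMAIN_WINDOW:
        # Render the page remotely so nothing it serves runs on the laptop.
        return "isolate"
    return "allow"
```

Isolating instead of blocking is the compromise described above: a sales rep can still open a brand new customer's site, but a malware drop from a look-alike domain does not land on the device.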
That's a great example.
All right.
Well, we've covered a lot of the products in our Zero Trust suite, but there are two that we haven't really talked about, and those are really around data security.
So can you tell us a little bit about our implementation of our cloud access security broker and then our data loss prevention tools?
So we started with our CASB tool, which is the abbreviation for cloud access security broker. I actually never remember the name of that, so I'm glad you knew it.
What we use it for is we have it integrated with our Google Workspace and Salesforce and applications like that.
We use it to ensure that we don't have configuration drift, because as tools go through their general life cycle, admins leave and join the team.
So settings get changed.
We want to make sure that those settings remain in a secure configuration.
And so we have alerting set up to our SOC that lets them know when something has drifted and is no longer secure.
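Configuration drift detection of this kind boils down to comparing current settings against a known-good baseline and reporting the differences. The Python sketch below assumes hypothetical setting names and a hypothetical `find_drift` helper; it is not the CASB product's API.

```python
def find_drift(baseline: dict, current: dict) -> list:
    """Return (setting, expected, actual) for every setting that left the baseline."""
    findings = []
    for key, expected in sorted(baseline.items()):
        actual = current.get(key)  # None means the setting is missing entirely
        if actual != expected:
            findings.append((key, expected, actual))
    return findings
```

For example, with a baseline of `{"require_2fa": True, "external_sharing": "off"}`, a tenant whose sharing setting was changed to `"anyone_with_link"` yields one finding that could be forwarded to the SOC.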
And then we also use it for Google Workspace.
We can see things like applications being authorized, different things like that.
We have it with our GitHub.
We can tell you, is your repo set up securely?
Can people force commit to main without any reviews?
Things like that.
You know, basic fundamental security goals, the standard best practices.
It helps you keep those enforced.
And for our data loss prevention tool, we have that set up through Gateway and also with the CASB integration, which allows us to use the predefined data profiles like credentials and secrets.
Do people have any credentials exposed? Like, have they been committed to GitLab, or are they sitting in Google Docs?
Things like that, where you want to make sure that there are no credentials being leaked.
And again, this is another layer of security.
We have all these things set up in our CI/CD pipelines, but we just want to make sure that we're doing the belt-and-suspenders approach.
We want to make sure there are really no secrets being leaked in our repos.
We want to make sure that people aren't copying all their secrets and dropping them in Google Docs to just sit there.
If they mess up a sharing setting and there's a Google Cloud or Google Workspace token sitting in there, or a Cloudflare token, something like that, we want to make sure it doesn't get leaked and used and abused.
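A credentials-and-secrets scan of the kind discussed here can be approximated with a few regular expressions. The patterns below are illustrative assumptions and are not the predefined DLP profiles mentioned above; `PATTERNS` and `scan` are hypothetical names, and production scanners use far broader rule sets.

```python
import re

# Illustrative patterns only; each maps a label to a compiled regex.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "generic_assignment": re.compile(
        r"(?i)\b(?:api[_-]?key|token|secret)\b\s*[:=]\s*['\"]?[A-Za-z0-9_\-]{20,}"
    ),
}


def scan(text: str) -> list:
    """Return (line_number, pattern_name) for every suspected secret."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), 1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits
```

Running the same scan in more than one place (pipeline, repositories, shared documents) is the belt-and-suspenders idea: a secret missed by one layer can still be caught by another.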
Cool. Well, let's move into some of the lessons learned. So if you had to pick like one or two things, like what would be some of the biggest takeaways that we learned from this?
Like if we were going to do it again, like what would we change?
Was there anything that kind of stood out as a lesson learned?
I think the thing that we would keep the same, and the advice that I've given to a lot of different companies I've talked to over the last five-ish years, is that having good executive buy-in makes this kind of journey possible.
We had really good buy-in from our IT leaders, our CEO, and our security leaders at the time, and they were our champions.
We talked about this at a company all-hands.
We had these messages coming out from those leaders, co-signed by all three of the big leaders in those groups, which helped with the resistance to change and helped show that we were serious about it.
That is the biggest advice that I give to people: build that trust with those leaders. You do that by talking to them about it early on, getting buy-in from your specific leader, and then having those beginning journeys for people to test things on, including them in the betas, and having kind of white-glove conversations with them.
How do you get their buy-in by showing them that it's actually better?
One of the things that I think is really helpful for this is building in self-support.
So if security is leading this charge, which it was for Cloudflare, have a close partnership with IT, so that you're not throwing hundreds of support tickets over the fence to them. You're building trust with them.
You're showing: hey, we really want users to be able to fix it themselves. So we have that error page that tells them, hey, you signed in with the wrong MFA factor, please use the other one, instead of them just being broken.
I think one of the lessons that we learned is that our initial windows for those test apps were too long for adoption. I think about a month is the right time. If you're adopting Zero Trust, or you're adopting WebAuthn or security keys, giving people a month to get used to it is the right time, because most people don't go on vacation for longer than two weeks. So you're not going to surprise a bunch of people, but you're also not giving them so long that they're like, well, they're never going to do it, and then the cut is a surprise to them. So that works as long as you keep up a good cadence.
One of the other things that we focused on was infrastructure as code, so that we could use DevOps practices to really accelerate towards the end part.
A lot of the foundation we built over the first year allowed us to take that last 50% of the apps during that incident and get them switched over in a matter of two or three days. Manually repeating things and going through GUIs would have taken much longer, and it would have been more error-prone.
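The reason infrastructure as code makes a bulk cutover fast is that the desired state is declared once and a tool works out the changes. The Python sketch below imitates that plan step for an MFA requirement across many apps. It is a toy model with hypothetical names (`desired_policies`, `plan`, the `"mfa"` field) and says nothing about which tooling Cloudflare actually used.

```python
def desired_policies(apps, require="security_key"):
    # Declare the same requirement for every app in the inventory.
    return {app: {"mfa": require} for app in apps}


def plan(current: dict, desired: dict) -> list:
    """Compute the changes needed to move from current to desired state."""
    changes = []
    for app, want in sorted(desired.items()):
        have = current.get(app)
        if have != want:
            action = "update" if have is not None else "create"
            changes.append((action, app, want))
    return changes
```

Switching a few hundred apps then becomes a one-line change to the declared requirement followed by a reviewable plan, rather than hundreds of manual edits in a GUI.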
This was great.
Thanks so much for sharing all of this. It sounds like Cloudflare, just like any other customer, implementing Zero Trust went through its own journey and had its own set of challenges, so we really appreciate that.
We know we're doing a couple different versions of this episode.
We're talking to a bunch of different perspectives and leaders around our organization.
Talked to Derek today.
We have a couple more on the roadmap, so feel free to tune in to those other ones, and we'll talk to you later.
Thanks so much, Derek.
Have a good one. Always great to chat with you, Chase.
I'm sure I'll see you in our next meeting.