Cloudflare TV

🔒 Security Week Product Discussion: SSL/TLS Product Announcements & Private Key Handling

Presented by Patrick Donahue, Dina Kozlov, Nick Sullivan
Originally aired on 

Join Cloudflare's Product Management team to learn more about the products announced today during Security Week.

Read the blog posts:

Tune in daily for more Security Week at Cloudflare!


Security Week

Transcript (Beta)

Hello and welcome back to Cloudflare TV. We're here for security week. Hopefully you checked the blog over the weekend.

On Saturday we had some great posts, and I'm joined here by those blog posts' authors.

I'm really excited to talk to them and have them take you through what they announced, and to look back at Cloudflare's evolution on some of the technologies that we're discussing today.

So I'm joined here by Nick Sullivan. Nick, I'll let you introduce yourself and Dina, I'll let you introduce yourself after and then we'll get started.

Okay. Yeah. Thanks, Patrick. I'm Nick Sullivan. I lead Cloudflare Research, which is a group within Cloudflare that looks at new technologies and how they can be adapted to Cloudflare's customers and sort of the future of the Internet.

I've been at Cloudflare a while. I've worked on all sorts of different technologies across the stack.

Previous to the research team, I led the cryptography team, which was involved in pushing forward standards for cryptography and helping develop some of the products that we're going to be talking about today.

Great. And little known fact before we get to Dina, Nick was the star of the Cloudflare basketball team, which we used to have.

And so maybe one day we'll get back to that, but it was fun playing with him as well.

Go ahead, Dina.

Hi, everyone. I'm Dina. I'm super excited to be here. I have been on a few teams at Cloudflare.

I started out on the crypto team with Nick and then I moved on to the DNS team and I was there for a while and the addressing team.

And now I'm the product manager for the SSL TLS team.

It's funny, Dina, because you've bounced around, you've learned a lot of different products and technologies.

It kind of reminds me of when I was doing partnerships and then started asking questions about SSL, and ended up doing the role that you're doing today.

So fun time learning a lot of different things here.

So I want to talk to you about the blog post that you did on Saturday.

Really interesting post. I've got some history here.

I want to get into some of the specifics, but can you just tell everyone that's listening at home, what did we actually launch?

What did we announce? So we announced Advanced Certificate Manager.

It is a new, flexible, customizable way to manage your certificates on Cloudflare.

And so a little bit of history there for certificate issuance.

In 2014, we launched Universal SSL. This was huge. It gave every single one of our customers their own certificate, which allowed them to keep their traffic protected.

And so this covered the root and first-level subdomain.

And this was sufficient for most of our customers. But some of our customers came back and they wanted something a bit more fine-tuned.

Back then, we were sharing certificates amongst customers.

And so they wanted their own dedicated cert.

They wanted their own branding, because the Universal cert was only provided with Cloudflare branding.

They wanted to be able to set their validity periods.

They wanted to cover more subdomains. And so that was what dedicated certs allowed you to do.

And so while that lasted us a really long time, customers started coming back to us and they wanted stricter security rules.

They wanted to be able to set different validity periods.

They wanted to choose their CA.

They had special requests. And more importantly, they were asking, why am I spending $5 to $10 per certificate?

And so we decided to move away from a model where you're purchasing each certificate to a model where you're purchasing Advanced Certificate Manager, which has these advanced features.

And instead, it gives you up to 100 certificates that you can then go and configure how you'd like.

That sounds great.

And so let's get into a little of the history there. Nick was leading a lot of those efforts back when we launched Universal SSL.

And so, Nick, at the time, we were working on, I think, one specific certificate authority.

And so when Dina said they were shared, what did she mean by that?

What did you actually do back then?

Yeah. So back in the day, because we had millions of different customers and our databases were optimized to scale to the tens of thousands range, that's sort of where we were at technologically at the time.

What we did was we took around 20 customers, their root apex hostname as well as a wildcard underneath it, and added them all to the same certificate as subject alternative names.

And so when a request would come in, the handshake has the hostname for what the client is trying to go to in a field called SNI.

So if you're trying to go to example.com, the SNI field would say example.com, and we would pick the appropriate certificate that had that hostname in its list of 20 to 40 customers and present that certificate.

And as Dina mentioned, it also had Cloudflare branding. So it said this is a Cloudflare certificate.

This was a very efficient way for us to provide services for millions of customers without necessarily issuing millions of certificates.
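The SAN-packing and SNI-based selection Nick describes can be sketched as a small matching routine. This is an illustrative sketch, not Cloudflare's actual code; the hostnames are placeholders, and each inner slice stands in for the SAN list of one shared certificate:

```go
package main

import (
	"fmt"
	"strings"
)

// pickCertificate returns the index of the certificate whose subject
// alternative names (SANs) cover the SNI hostname sent by the client,
// or -1 if none match.
func pickCertificate(sni string, sanLists [][]string) int {
	for i, sans := range sanLists {
		for _, san := range sans {
			if matches(sni, san) {
				return i
			}
		}
	}
	return -1
}

// matches implements exact and single-label wildcard matching:
// *.example.com covers www.example.com, but not example.com itself
// and not a.b.example.com.
func matches(host, san string) bool {
	if strings.EqualFold(host, san) {
		return true
	}
	if strings.HasPrefix(san, "*.") {
		suffix := san[1:] // ".example.com"
		rest := strings.TrimSuffix(host, suffix)
		return rest != host && rest != "" && !strings.Contains(rest, ".")
	}
	return false
}

func main() {
	shared := [][]string{
		{"example.com", "*.example.com", "shop.test", "*.shop.test"},
		{"another.example", "*.another.example"},
	}
	fmt.Println(pickCertificate("www.example.com", shared)) // picks certificate 0
	fmt.Println(pickCertificate("another.example", shared)) // picks certificate 1
	fmt.Println(pickCertificate("unrelated.org", shared))   // -1, no shared cert covers it
}
```

In production the lookup would of course be indexed rather than a linear scan, but the shape of the decision is the same: SNI in, one of the packed certificates out.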

And I think the cool thing at the time, and I remember reading about it before as an employee, I think I joined the year after and had some fun with renewing those.

But the thing that was fascinating to me is just giving that away for free.

And that was pretty cool to see. In order to do that (and we talk a lot with customers about what the cost is for us from an issuance perspective), I think the thing that people might not be aware of is that all of those certificates are present on every machine in every data center around the world, to improve handshake performance and not have to pull data from different sources.

And so that storage is expensive.

So if you're storing a lot of certificates, that means that you can't store something else.

And so the combining of the subject alternative names was a way at the time to allow for that.

And Dina, I think I read in your post, we're doing upwards of like 5 million certificate issuances per day.

And so clearly we figured out how to scale that and move to giving people the ability to customize the subject field, as we talked about, removing some of that Cloudflare branding.

So it's come a long way. And so it's pretty interesting to see. One of the things you mentioned was certificate lifetime.

What do you mean by that? What does that actually mean?

Why does a certificate expire? So that is how long a certificate is valid for.

And in general, the whole industry is moving towards shorter certificate validity periods or how long they're active for.

This is for a few reasons.

One is it helps phase out old protocols. So for example, with SHA-1 deprecation, if the validity periods of certificates were shorter, then it'd be much easier to phase out and improve the ecosystem overall.

Another reason is it encourages automation.

Essentially, the more times you do something, the more it encourages you to automate that.

So for example, when we started offering shorter validity periods, it gave us a reason to really put in work to help improve our certificate issuance and renewal pipelines.

And so right now the Certificate Authority Browser Forum, they set the maximum lifetime for certificates to be 27 months.

But they're also looking to drop it to be 13 months to kind of encourage everyone to move in that direction.

Yeah. And I think we'll get to this when we talk to Nick later about his blog post with talking about Heartbleed.

But I think back then, was it five years, Nick, from a validity perspective?

How long were certificates when you started working at Cloudflare?

How long could you get them for?

Yeah. Heartbleed was actually before Universal SSL.

So the certificates that we issued by default back then were five-year certificates. The validity time was five years, and they were signed with an algorithm called SHA-1, which Dina mentioned, which was getting long in the tooth.

And so if those certificates were less than five years, then it could have been an easier process for the entire industry to move on to more modern algorithms.

Yeah, that's a good point. A lot can change in five years, as we've seen at Cloudflare, and just within both of your worlds, from a cryptography and an SSL, TLS perspective.

And so I think it's great, that drive to shorter and shorter, and the CA/Browser Forum has been really pushing this.

And so I think Let's Encrypt has shown that you're able to automate these with protocols like ACME, and it's great to see the industry getting more nimble, essentially, to make those changes.

So Dina, what certificate lifetimes can people now set with ACM?

They can now set 14, 30, 90, and 365 days.

Got it. And what would be a use case for a very short-lived cert?

What have you seen in the industry where people would want to do something that short?

Yeah, typically, the shorter certificates we've seen are up to 30 days. Going down to 14 days is an innovation.

It is going to test out some of the clients out there, because there may be some clock skew.

So we'll be able to tell if certain clients don't have the correct time by doing this rollout.

But 14-day certificates are useful for everything, useful for websites, for IoT systems, anything in which the server is able to automate the renewal.

Having a shorter lifetime is better, because if there's a compromise event or if there's any other reason that that certificate would have some issues, being able to renew it for a 14-day window continuously gives you a better chance.

It makes a lot of sense. And I think one of the other things that you and I have worked on in the past is certificate transparency logs.

And so it will be interesting to see, as more and more of these devices are getting short-lived certificates, if those can keep up.

Are there initiatives going on there to think about it?

I know maybe sharding has been one, but can you explain what happens with these certificates when they get issued?

So if Dina is working with a customer and they start clicking around on the dashboard and issue a cert, what happens behind the scenes with respect to CT?

Yeah. So if folks aren't aware, certificate transparency is an effort to provide a publicly auditable database of all certificates that are trusted on the web.

And different organizations run these certificate transparency logs. And in order for your certificate to be trusted by the Chrome web browser, it has to already be included in at least two of these trusted logs.

So certificate transparency is a great system for detecting if say a certificate authority issues a certificate that doesn't follow the CA browser forum rules or has been compromised in one way or another, or is being used to issue certificates that are not authorized by the domains that they're valid for.

And so certificate transparency is sort of a volunteer run process.

There's a lot of organizations that are large that run logs, including Google, ourselves, Let's Encrypt is a large CA that runs them, Sectigo had a log as well as DigiCert.

So there's certificate authorities, there's browsers, there's related parties like us who help run these logs.

And because this is an append only log, it means it can only grow. So the more certificates they have, the bigger these logs can be.

And so the way the process works is when you issue a certificate, you take that certificate, you send it to a couple logs, and they give you a timestamp that says, we're going to include it in our logs.

And then that timestamp goes into the certificate, and then Chrome will trust it.

And so we'll see over the next few years or so, how these logs are able to scale to handle more certificates.

But there are some strategies that you mentioned that are in play, including sharding, which is what we do with our Nimbus certificate transparency log.

Effectively, we have one log for every year.

So for all the certificates that expire in 2021, they all go into one log.

And then after the year rolls over, we can kind of retire and freeze that log.

It's not necessary anymore, because all of those certificates have expired.

And so this is the strategy that different logs have taken, is there's a certain maximum size that you can expect for the number of certificates issued in a year.

Now, that's a little bit deep and in the weeds, but... No, that was great. You taught me most of the stuff I know about this world, and so I appreciate the continued education.

Dina, I think one of the things that was cool that your team was able to do with building on that CT log was the monitoring of those.

So talk to me about what can they get from an email perspective when certs are issued and sort of what is that useful for?

So they can get an email notification for when someone issues a certificate for a similar kind of domain.

And so that way, if someone is trying to trick your customers into going to their website, for example with a look-alike version of your domain, you can be notified.

And that way, you can go and contact that CA or you can kind of figure out if someone is trying to spoof your domain.

Yeah, and I think, you know, Nick, you've probably seen this over the years as well, but anybody can generate a private key and their certificate signing request, which we'll get to in a second.

But if you're able to convince a certificate authority to sign that, to put their stamp of approval on it, saying, hey, we trust this, and the browsers trust us...

And therefore, you know, if you're using Chrome, for example, or Firefox, or Brave, which is what I like to use as a browser, and you're going to one of these sites, and it chained back to that root of trust, then you can essentially impersonate that site (obviously, you might need to couple it with some DNS poisoning) and try to steal that data coming in.

And so it was cool to see the team ship the ability, as you mentioned, Dina, to get notified of that and then take action, right.

So there's a whole revocation process, and we'll get to how that works a bit later.

And there's different, you know, mitigation strategies depending on how high-value the domain is, right, there's different techniques.

And so we'll get to that.

So cool, I want to jump back to the ACM post. So certificate signing requests, CSRs, you know, what are those?

And why do customers care to, you know, take control of that process versus us doing it for them?

Yeah, so today with ACM, we offer either DigiCert or Let's Encrypt as our two certificate authorities.

Some customers want Cloudflare to generate the private key and keep control of it.

But they have their own certificate authority that they would like to use.

And so with CSRs, what they can do is they generate the certificate signing request on Cloudflare, and then they take that to their certificate authority, the certificate authority issues a certificate for them, and then they come back to Cloudflare and they upload it.

Yeah, and as we mentioned, there are different controls over the private keys, first and foremost, right?

And we spent a lot of time thinking about how do we protect those private keys.

And, you know, Nick has written some very detailed documentation that we share with customers that are interested in this.

And I think most, you know, 99.9% (don't quote me on this number, but I think it's probably pretty close) of customers just want us to take that process on for them and, you know, go through a lot of the stuff that we've done to separate things (we're going to get into this in the Heartbleed post in a second) to manage that for them.

But in other cases, you know, they want to take a bit more control over that.

And so we'll talk about some of the things that we shipped to make that happen.

So beyond CSRs, there was one other thing, which I think was cipher suite customization, right?

What are cipher suites, and why might customers want to, you know, restrict the cipher suites that are configured or usable for their properties?

So a cipher suite is a set of algorithms that help secure a network connection that uses TLS.

And so during the client hello and server hello, they negotiate which cipher to use.

And so with Cloudflare, what we traditionally do is we prioritize higher-security ciphers.

But now we're going a step further and we're giving customers the ability to say only allow the strongest ciphers and do not allow the weakest ones.

And so this is very common for banks or very high-security customers that cannot take any risks and can only allow through the strongest ciphers.

Yeah. And I think we work very diligently to try to keep a list that balances the security and performance and ability to serve older browsers, right?

I think that's one of the things, we talked about SHA-1 before. We didn't get into this, but we built some neat technology at the time, when a client hello is coming in, to figure out: is this a modern browser that can use the optimal cipher suites and certificates and signing algorithms?

Or is this an older browser that would be better off served with some sort of different certificate or different technology to still provide HTTPS connectivity rather than just throwing up an error?

Because we're probably all dialed in from the latest and greatest MacBooks and really fast computers and Internet connections, but that's not how all of the world operates, right?

There's places in the world where the technology is lagging behind and the refresh cycles are longer.

And as a mission, I think it's been really cool to see us provide that HTTPS capabilities for those sites and for those people wherever they are in the world.

Great. So I think that was it from ACM.

I want to get into, we've talked about certificate signing requests and private keys.

I want to hand the moderator over to Nick. He's got a lot of background here, but you and I worked on this other blog post together.

So I'm going to let Nick ask the questions and I'll probably throw a few back to him to get some help given his knowledge here and his deep understanding of this, but Nick, take it away.

Right. So Patrick and Dina, you published this really great blog post over the weekend about this whole idea of Keyless SSL with HSMs, and those sound like jargon terms.

So let's deconstruct that a bit and help explain this to a more broad audience.

So what is Keyless SSL? Yeah, I'll take the first one and then I think we'll hand some of these off between each other. But Keyless SSL, as we just talked about, Keyless is, I joke with customers, kind of a misnomer, right?

There's a key that's used somewhere, right?

But I like the name of it because it does describe what is actually happening at the Cloudflare edge.

And so Keyless is a way where you may have, and this goes back to the cipher suite conversation, you may have very specific policies that your organization has developed that say, here's exactly how we're going to treat this key material and the provenance of it, of where it actually gets created and the security around that key creation.

And you may, most customers that I mentioned want us to manage that process and hold onto it for them and rotate these as certificates expire.

But if you're an organization that has these very strict key management policies, historically what you've done is you've generated those and kept them on your own infrastructure, right?

And as the world moved to cloud-based services or edge-based services like Cloudflare, there's a whole bunch of security services, things like DDoS mitigation and web application firewall and all these other things that customers want to take advantage of, but they want to keep the key behind their own firewall, right?

With their own infrastructure, according to whatever policies they deem.

And so they can do that.

And I think, Nick, you were obviously involved very early on in this, but we built a way to separate that handshake at the edge.

And so if we have the certificate loaded there, as I mentioned, and available in all those data centers, but we don't actually have the key, right?

The customer has the key on their infrastructure.

And so during that TLS handshake, that key is needed for some parts of it.

And actually, I'm going to throw it to you to explain to people how that key is used, because you have a really great way to describe this.

Sure. Well, so when you're connecting to a website or web service using a secure protocol like TLS, you want to both encrypt the traffic and learn the identity of the server.

And so there's this process that we've mentioned several times here, but I can go more specifically into it, called a handshake, which is some messages sent back and forth between the client and the server.

And as Dina mentioned, there's one aspect of it, which is cipher negotiation.

This is where the client and the server support different ciphers, and they negotiate one to use for the connection going forward.

And that helps choose how the data is encrypted as it's sent in transit.

There's another aspect, which is key agreement: the cryptographic "let's negotiate what key we'll use to actually encrypt things."

This uses public key cryptography.

And then the third part is authentication. And that's where the certificate comes into play.

And so for authentication, every website has a certificate.

The certificate has the name of the site on it in the subject alternative name, and then there's a corresponding private key.

And so the way that the server proves that that certificate belongs to it and that it's fully controlled and it is who it says it is, is because there's a private key and that private key does a digital signature operation on the handshake once it's done.

So these first two aspects actually don't need the certificate at all.

So you can negotiate ciphers, you can negotiate a shared key without necessarily using the private key or the certificate.

And so what Keyless SSL does is it takes this third action, which is the digital signature with the key from the certificate and takes it out of the process that's actually establishing the connection and puts it somewhere else.

And that somewhere else could be inside of a specialized piece of hardware, which we'll go into in this case, or it could be running on a server that's controlled by the customer itself rather than Cloudflare, or it could even be a different server controlled by Cloudflare at a different location in the world.

That was really helpful. Thanks. Let's talk for a second about that piece of software that you could run on your own server, right?

Or you need to run on your own server that our edge will talk to.

We wrote this in Go, right?

Golang. And we open sourced it. Can you talk a little bit about why was it important for that to be open source for our customers?

Yeah, there's actually multiple versions of this protocol.

We started out with a version in C that was written by our CTO, John Graham-Cumming and myself back in the early days of Cloudflare.

But it was important for us to have the server that does the private key operation open source so that our customers can fully vet what it is.

And if they have any compliance or other specific restrictions on what sort of software they can run, they can take the software, they can compile it with their own tool chain, and they can run it in their own data center with full visibility as to what the code is.

It also makes it easier for deploying in different environments, whether that's Windows servers or Linux servers or cloud environments.

So this piece of software, Gokeyless, is something that's on Cloudflare's GitHub page.

So any one of our customers can take it and run it.

Yeah, and that's been really important as I've worked with customers over the years to tell them about this technology and how they use it.

I think it's important to see if something has access to your private key, you've already decided you want to keep it in your own infrastructure, right?

And so we had earlier in the week or last week, actually, now we're into our second week, we talked about supply chain risk, right?

And so if you're running some piece of software that another company has given to you, you want to know what's going on there, right?

You want to know that this is not sending your private key up to Pastebin or something, right?

And so we provide the ability, if you so choose, to build from source and add whatever logging you want, and you can see actually what's going on there.

And so I actually didn't know that you and John Graham-Cumming, our CTO, wrote the first version.

I knew it was in C, and I've hacked around with it a bit as I've run it, but did not know that.

I think that's a theme of this Security Week: taking code that John wrote and putting it out to pasture for some newer technology.

So he's always excited to hear that, but that's pretty cool.

A fun history note: early days of a startup, when you've got your CTO writing a lot of code and customers are able to use it successfully to solve their problems.

So cool. So let's talk a little bit about the post itself, right?

So we announced the ability to now take that key, which previously would typically sit in a directory on a Linux server, which is what most of our customers are using, right?

And you could do your file system controls or your server controls, or you could put it on some encrypted file share, but we wanted to go further, right?

So we heard from a lot of banks in particular, and credit card companies, and people that are dealing with financial services transactions, that said, cool, glad you have this ability to separate it.

We've paid $100,000 for these beefy things called HSMs, right?

These big boxes that have special controls where they're either tamper-evident or tamper-resistant (I don't think you can describe something as tamper-proof), where you can generate your key on them, and so they're resistant to different types of attacks that a regular server may not be, right?

And so one of the things is known as a side channel attack and some of them are related to timing and things like that.

Can you speak a little bit, Nick, to what are those attacks that someone may do and how can that tell you information if you're generating your private key on a server versus a hardware security module?

Yeah, so hardware security modules, HSMs, are dedicated pieces of hardware that are meant to protect keys.

That's the main thing that they do and they protect keys from a number of different scenarios.

You mentioned timing attacks and side channel attacks.

An HSM is designed to withstand physical attack, which is something that traditional servers and software systems are not.

If you go into a data center and you have access to the power reading and the power usage of a server, you can actually extrapolate different things about what's going on in that computer.

If you have access to how fast it takes for certain requests to be answered and certain operations to happen, you can extrapolate information about the data inside.

And in some cases, with physical access, you're able to extract keys, which is something that people definitely do not want.

And a hardware security module, an HSM, is designed not only to do the same things that a server does, so do private key operations, and do them in a performant manner, but it's also designed to be resilient against power analysis attacks.

So if you look at how much power is used by the HSM, it should be relatively constant.

You shouldn't be able to extract things. Same with timing. It shouldn't take a different amount of time to do things based on what the key is.

And it also has tamper-evident and tamper-resistant aspects to it, which is part of the real cost of what an HSM is.

There's epoxy, you build it once and you put a bunch of glue on it.

So that if someone tries to actually plug into the CPU and really try to get more direct side channel attacks, they have to dig through a lot of physical barriers.

And those physical barriers are also, they have tripwires in them.

So the whole machine will break if you try to hack them. So HSMs are often required by, as you mentioned, financial services because they have the highest level of both physical and logical security for keys.

Thanks. That was a really helpful explanation.

I think the thing that was fascinating to me was learning that they even detect things like if you're trying to X-ray them, right?

So if you're trying to figure out what the insides of the HSM are, what they're doing, and where you might want to avoid when you're trying to tamper with it, the machine can actually detect that and in some cases overwrite or, you know, zeroize the keys. If it's detecting that you're attacking, it's better to destroy the key material in some cases than to let an attacker get at it.

So fascinating. So what we did here, and what Dina and I worked on: originally we took the keys and allowed you to put them in a physical device that was, you know, sitting in your data center.

And there's a joke that we gave an engineer on the team at the time a space heater.

And so I said, Chris, you know, we need to find a place for this demo machine so we can test our integration with it.

And it was, you know, in his office in San Francisco and it kept it a little warmer than the rest of the office.

But that machine was used through a standard API known as PKCS11.

If you've worked with it in the past, I'm very sorry.

It's not the most fun thing to work with, but it is a standard, right?

And as far as the standard goes, certain companies will implement, you know, portions of it and hopefully it'll interoperate.

And so we announced the integration at the time. There's been a lot of consolidation in the space, right? There was Gemalto and Thales with the Luna series and several others, and you can look up the specific hardware modules that we supported at the time.

But we did this integration, and if you use PKCS11, there's a URI format that you can give us along with a shared object, which is just a way for the gokeyless binary to figure out how to talk to these HSMs, right?

And so they needed to do things like certain types of authentication, right?

So HSMs have these slots, and then in the slots, there's tokens.

And a lot of these terms have continued forward, even though they're virtualized in some cases, as we'll get to in a second when we talk to Dina about it.

But there's a need to interoperate with them at a physical level, right?

And so typically this can be done over a network connection, so over IP.
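The PKCS#11 URI format (RFC 7512) is what carries those slot, token, and module details in one string. A rough parser sketch; the token name, object label, and module path below are made-up examples, not real configuration values:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// parsePKCS11URI splits an RFC 7512 PKCS#11 URI into its path
// attributes (token, object, slot-id, ...) and query attributes
// (module-path, pin-value, ...).
func parsePKCS11URI(uri string) (path, query map[string]string, err error) {
	u, err := url.Parse(uri)
	if err != nil {
		return nil, nil, err
	}
	if u.Scheme != "pkcs11" {
		return nil, nil, fmt.Errorf("not a pkcs11 URI: %q", uri)
	}
	path = map[string]string{}
	for _, attr := range strings.Split(u.Opaque, ";") {
		k, v, ok := strings.Cut(attr, "=")
		if !ok {
			continue
		}
		dv, err := url.QueryUnescape(v) // path attributes are percent-encoded
		if err != nil {
			return nil, nil, err
		}
		path[k] = dv
	}
	query = map[string]string{}
	for k, vs := range u.Query() {
		if len(vs) > 0 {
			query[k] = vs[0]
		}
	}
	return path, query, nil
}

func main() {
	uri := "pkcs11:token=keyless-token;object=example-key" +
		"?module-path=/usr/lib/libvendor-pkcs11.so&pin-value=1234"
	path, query, err := parsePKCS11URI(uri)
	if err != nil {
		panic(err)
	}
	fmt.Println("token: ", path["token"])
	fmt.Println("object:", path["object"])
	fmt.Println("module:", query["module-path"])
}
```

The module-path is the vendor's PKCS#11 shared object mentioned above; the token and object attributes tell the library which key inside the HSM to use.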

And so we released this and customers started using this, but we started getting into a conversation where they said, you know, this is great for my on-premise data centers, but hey, I'm moving to the cloud, right?

Maybe you're going a little bit slower than if you were to start a company today, where you'd be sort of cloud-first from the outset.

But if you've got all this infrastructure you've built up over many years, it's going to take some time to get it there.

And so you can't really ship an HSM to a cloud provider and say, hey, plug this in for me, right? And run this in your data center.

Or if you're able to (I think Google Cloud actually does let you do it), you're only able to send it to maybe one.

These are very expensive machines. And so what the cloud providers have done is virtualize these HSMs and expose them in a way that you can integrate with them.

And so Dina, tell me what we announced today and how does that relate to these cloud services that our customers are starting to move towards from their own on-premise systems?

Yeah, so we announced support for common cloud HSMs. The main two are Azure's Managed HSM, which they launched last September.

And then the other one is Google cloud HSM.

And so we configured these, tested them out, and added support for their native APIs, which are not PKCS11.

And so now customers can continue to use Azure Managed HSM or Google Cloud HSM, which they've probably already been using, and integrate it with our keyless product.

Yeah. And that, when you said native APIs, that reminded me.

So the PKCS11 interface actually worked initially with AWS's CloudHSM, sometimes called CloudHSM v2, and also worked with IBM Cloud HSM.

And so we had some support for cloud HSMs and actually Azure had a dedicated HSM.

Although I think they're kind of nudging clients towards the managed HSM.

And so it worked with some of them, but some of them like the Google cloud, for example, and the one you mentioned Azure managed, those had their own native APIs where you can provide a much richer experience and more capabilities than implementing to the standard.

And I think it's neat to see how these different cloud providers are innovating and moving in a direction of providing additional capabilities.

So great to see us add those. And presumably if more spring up, we'll add it, but I think for those at home listening, if you are using keyless and you have an HSM, talk to your account team, they'll get in touch with Dina and we'll get it into the roadmap from a prioritization perspective.

So the new ones, how do we actually go about testing this?

What do we actually do for Azure, for example?

So for Azure, we set up a virtual machine on Azure, and then we provisioned and activated the HSM.

And then we set up gokeyless, the key server.

And then, once you get the URI of where your key is located, you update your gokeyless YAML file, which tells us where to go and get the key.

And then we tested it end to end and saw that the TLS handshake was successful.
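To make that concrete, a gokeyless YAML config pointing at a cloud-hosted key might look roughly like the sketch below; the field names, hostname, port, and key URI are illustrative assumptions rather than the product's exact schema:

```yaml
# Illustrative gokeyless key server config; field names are approximate.
hostname: keyserver.example.com
port: 2407                    # port the keyless client connects back on
private_key_stores:
  # URI telling gokeyless where the private key lives (e.g. a cloud HSM)
  - uri: https://example-hsm.managedhsm.azure.net/keys/example-key
```

The `uri` line is the piece Dina mentions: it points the key server at the key's location, whether that's a PKCS11 module or a cloud HSM's native API endpoint.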

Well, yeah, it was a fun process to work on. It actually was not that much of a lift, right?

Because we'd already done a lot of work for keyless. It was really just integrating the specific providers.

And so great work by you and team to get that out.

And I know we've got some customers that are testing it and anxiously awaiting it.

One thing that we didn't cover, just to go back for a second to the HSM stuff, that I wanted to bring up was the concept of FIPS 140-2 and levels one through four.

Nick, you've kind of been in this world for a while. What is FIPS and why is this relevant in this case?

Well, FIPS stands for Federal Information Processing Standards.

These are basically standards for processing information in the US government.

FIPS 140-2 relates to cryptography and what types of cryptographic modules can be used when processing data relating to the US federal government.

So this is a standard that's used beyond just government processing; it's used widely in industry, in the financial industry as well.

But it stipulates certain things about the types of cryptography that's used and the type of cryptographic modules and whether or not they've passed some stringent audits.

And so if you have cryptographic aspects of your system that are certified with FIPS 140-2, and this is an evolving standard, there's a FIPS 140-3 that just came out, then you pass a certain bar that allows certain compliance regimes to consider your software acceptable for managing different technologies.

And so there are definitely some HSMs that qualify and some that don't.

There's different hardware tokens that qualify and some that don't, but a lot of customers consider FIPS validation to be an important aspect, in particular because it fits under different compliance regimes, such as FedRAMP, which is important for contracting for the US federal government.

Yeah, that does help. And I know FedRAMP is something we're working on, and so that's very top of mind for our customers in the government.

The thing that was interesting to me as I worked with Dina on the blog post and the capabilities is that the cloud providers are leading with their advertisement of that FIPS 140-2, typically level three.

And so levels one, two, three, and four impose progressively stringent requirements on the tamper evidence and the other hardware defenses of the machine.

So it's interesting to see that they are advertising that.

And when we get calls from customers who have some internal policy around key handling and key storage and are looking for that level, that's typically when we recommend keyless with HSM support.

So great. That was fun. I want to go, Nick, to your post now. And so the reason that we're doing a lot of this stuff is that private keys are incredibly sensitive.

And as we mentioned before, if you have access to them in a pre-perfect-forward-secrecy world, you could decrypt traffic, and with them you can now impersonate a site. It's just very scary to think about your private key being exposed.

And so this actually happened at scale. Tell us about that. Take us back through that process.

I know this is many years ago, but I think it's a lot of learnings that we've incorporated into how we build cryptographic product today.

Yeah. So around seven years ago, I think it's just coming up on the seven year anniversary of the announcement.

There was a bug in a piece of cryptographic software called OpenSSL, which was open source and frankly ubiquitous.

It was used in almost every web server, including Cloudflare's web servers and those of large companies like Yahoo, which was a big company at the time.

As well as, I think that you could say there's e-commerce, there's banking, there's all sorts of different websites used OpenSSL as the core library to do TLS termination on their servers.

And Heartbleed was a bug in one aspect of the OpenSSL library, a very rarely used feature called Heartbeats.

And what this enabled was, if you were running OpenSSL on your server and running TLS, an attacker could send a malformed heartbeat message to the server, and the server would respond with data that went beyond the bounds of what was supposed to be returned.

So in particular, it could return up to 64 kilobytes of just raw memory from the server.

And so this raw memory could contain anything at all on it that has gone through the server.

So it could contain passwords, usernames of people who had logged in.

And the scariest part, and the part that it wasn't necessarily clear could be done when it was first announced, was that you could imagine this memory actually containing the private key.

In which case, this means that somebody who wanted to exploit this bug, which turned out to be very, very simple to exploit, could potentially steal these private keys from the majority of sites on the Internet.
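The mechanics Nick describes can be sketched in a few lines of Python; this is a toy simulation of the over-read, not OpenSSL's actual code, and the buffer contents are invented:

```python
# Simplified simulation of the Heartbleed over-read (CVE-2014-0160).
# Names and memory layout are illustrative, not OpenSSL's internals.

# Server "memory": the heartbeat payload sits next to unrelated secrets.
memory = bytearray(b"ping" + b"user=alice;pass=hunter2;PRIVATE_KEY=...")

def heartbeat(claimed_len: int) -> bytes:
    # The bug: the server trusts the attacker-supplied length instead of
    # the actual payload size (4 bytes here), echoing adjacent memory.
    return bytes(memory[:claimed_len])

# An honest client asks for exactly the 4 bytes it sent:
print(heartbeat(4))               # b'ping'

# An attacker claims a larger payload and receives neighboring secrets:
leak = heartbeat(40)
print(b"PRIVATE_KEY" in leak)     # True
```

The fix was simply to validate the claimed length against the actual payload before responding.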

And then after that, potentially impersonate them for the lifetime of those certificates.

And as we mentioned, those certificates at the time were up to five years in length.

And so this was kind of a pretty big catastrophe for the Internet.

It revealed a number of different flaws because usually people didn't expect that all sorts of private keys, like the majority of them, would be compromised at the same time.

And it became sort of national news.

And I remember even my parents would call me and say, what is this Heartbleed thing?

And they had never shown that level of interest in my work up until that point.

Yeah, it was a scary time. I remember reading quite a bit about it.

And I dealt with some of the follow-up after the fact. I wasn't involved as you were, obviously, early in it.

But I think one thing I thought was pretty cool was the contest we had.

Didn't we sort of throw the gauntlet down on our blog and talk to people about that?

Yeah, yeah. There was some internal debate because we had actually patched Heartbleed before it was made public.

And so there was a risk calculation that we sort of needed to do with respect to revoking and reissuing these certificates.

And the key point, the piece that we didn't know and that would actually help determine our path forward, was this: it was said that you could dump memory with Heartbleed, but it wasn't actually confirmed that you could extract a private key from a server.

So what we did is we set up a challenge.

We took a vulnerable version of OpenSSL, ran it with Nginx, which was the web server we were using at the time, and opened it up to the world to say, please have at it.

Try to extract this private key. And some really fun stuff happened.

So because Heartbleed lets you dump any server memory, what people posted to the site was actually showing up in memory dumps sent to other people.

So you would have things like someone would post a fake private key, and then someone would email us and say, hey, look, we got the private key.

And it wasn't the real private key.

And so we weren't sure. Honestly, when this was launched, we weren't sure whether this was possible.

But within less than 24 hours, people started pouring in with responses that proved that they had stolen the private key from these servers.

And so this made it absolutely clear to us that if anyone had knowledge of this and was able to try to build this attack and use it against Cloudflare before we were patched, that all of these keys were at risk.

So we decided to revoke and reissue every single one of the certificates managed by Cloudflare.

Great. I want to get to that process in a second. But I think it's fascinating.

Nothing like putting a challenge out there to have the Internet at large prove that this can be done.

And it's great that we were able to get advance notice of this and patch that.

And we're going to talk in a different session today about the patch gap and what you should be doing and how Cloudflare can help you and buy time for you to patch your own systems for that intermediary.

But really scary that this was possible.

Do we still use? I know we used OpenSSL at the time.

Do we still use OpenSSL? No. In fact, we switched to another project, which was actually a fork of OpenSSL.

At the time that Heartbleed was revealed, OpenSSL was in somewhat of a sorry state.

There weren't too many developers actively working on it, and they weren't very well funded.

This is something that they've gotten funding since then.

But different users of the library decided that that model wasn't something that they could rely on.

So for example, the OpenBSD team launched something called LibreSSL, which was a fork of OpenSSL meant to improve some of the security practices.

And at Google, they launched something called BoringSSL, which was meant to take this complex system OpenSSL that has all these bells and whistles and make it boring and just focus on what's important and fix up the practices.

So we've been using BoringSSL, which has turned out to be a very good choice because over the last several years, there have been quite a few vulnerabilities, new vulnerabilities against OpenSSL.

And almost always, they are in pieces of code that didn't make it into BoringSSL or had been rewritten in BoringSSL.

So we haven't had to patch as frequently. Yeah, I think it's funny to think of boring as a good quality, but in cryptography, you want to minimize the surface area for attack.

So the smaller the piece of code you're running, the easier it is for people to understand.

And it didn't have the cruft. A lot of it, I think, as you mentioned, was rewritten or written from scratch.

And only certain pieces were brought along by, I think it was Adam Langley and the Google team.

It became a good thing for us to use BoringSSL.

And I think there was actually just another OpenSSL security issue last week, if I'm recalling correctly from scrolling through Twitter.

So great to see we did that. Let's go back to the response that we had at Cloudflare at the time.

Like, OK, you get notification of this, you patch our servers to protect against it.

But then what do you do with all of those certificates and private keys that we'd previously issued, as we talked about with Universal SSL?

Yeah, so this was before Universal SSL by a few months, actually. So it might have been a much more hairy situation had all of the free customers had certificates at the time.

But still, all Cloudflare Pro customers and above had certificates.

And this was a hefty amount, over 100,000. And so what we did was assume that the certificates and the private keys for these certificates had been compromised.

And so we went through the process of revoking them.

And revocation is effectively a way to tell users, hey, this certificate that used to be valid is no longer valid.

Please don't trust this anymore.

If someone uses this, it's malicious in one way or another. And in doing so, in revoking over 100,000 certificates within the span of one day, we discovered a number of very serious problems.

Some were anticipated and some were not.

The first of which had to do with the different mechanisms that were used to indicate certificate revocation.

So if you're ready for a quick deep dive, I can kind of jump into what the different mechanisms are for a browser or a client to know whether a certificate has been revoked.

Yeah, so the classic technique here is called a certificate revocation list.

It's effectively a file with a list of serial numbers, published by a certificate authority and signed by that certificate authority.

And this is like, these are the list of everything that's revoked.

And they have a certain lifetime and clients will refresh these CRLs occasionally before they get stale.

So that every week or so, they'll get a new set of revocations.

And then from that point onward, they can distrust those serial numbers.
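The client-side CRL behavior described here can be sketched roughly as follows; the class and field names are illustrative, not any browser's real implementation:

```python
# Sketch of client-side CRL handling: cache a signed list of revoked
# serial numbers, refresh it before it goes stale, and distrust any
# certificate whose serial appears on the list.

from datetime import datetime, timedelta

class CachedCRL:
    def __init__(self, revoked_serials: set, fetched_at: datetime,
                 max_age: timedelta = timedelta(days=7)):
        self.revoked = revoked_serials
        self.fetched_at = fetched_at
        self.max_age = max_age      # refresh roughly weekly, as described

    def is_stale(self, now: datetime) -> bool:
        # Once past max_age, the client should re-download the CRL.
        return now - self.fetched_at > self.max_age

    def is_revoked(self, serial: int) -> bool:
        return serial in self.revoked

crl = CachedCRL({1001, 1002}, fetched_at=datetime(2014, 4, 10))
print(crl.is_revoked(1001))                     # True
print(crl.is_stale(datetime(2014, 4, 20)))      # True: refresh before trusting
```

The drawback Nick describes next follows directly from this design: the entire list must be downloaded, so a mass revocation makes the file balloon for every client.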

There is a large drawback to this, which is that that file can grow pretty substantially if you revoke a number of certificates at the same time.

And so when we revoked a lot of these certificates, it grew from a couple of kilobytes into the megabyte range.

And the interesting thing about this is that part of our agreement with our certificate provider was that we would be the CDN for their CRL.

And so this became a very large file that we would serve to all the browsers in the world that understood CRLs.

And so this actually became a huge traffic spike against Cloudflare that looked like the equivalent of a distributed denial-of-service attack.

So a number of our internal systems kicked in. We had to refactor the way that we were doing caching.

It was really quite a surprise. It was kind of like hosting the Super Bowl without knowing that you're going to be hosting it.

So that was one big event. And so we discovered that CRLs are not something that really scales in the case of mass revocation.

But the good part is that browsers and other clients had anticipated this.

And the IETF, which is the standards body that helps organize these things, had thought ahead and built another protocol that's more lightweight, that doesn't require downloading of a massive multi-megabyte file every several days.

And that's called OCSP, the Online Certificate Status Protocol, not OSCP, which is the Offensive Security Certified Professional certification.

But so OCSP is a very simple protocol where you ask the certificate authority, is this serial number revoked?

And they say yes or no. And if they say no, they sign the response and give it an expiration.

So they say, OK, it's not revoked.

You don't have to ask me again for seven days. And so OCSP is this really nice lightweight thing.

So browsers that had implemented it would, if they see a certificate, then they query the OCSP server and say, hey, is this revoked or not?

And then if it's not revoked, then great, we'll trust the website. Well, that turned out to have a pretty big issue as well, which is OCSP servers are run by certificate authorities, and they don't have very high availability guarantees.

And so there are times in which these OCSP servers are slow to respond or down altogether.

And so if you're a browser, you're potentially stuck in the situation where you have this connection, you're trying to show a web page to the user, but you're waiting for this OCSP to come back, and it might never come back.

And so you don't want to give the user a bad experience.

And so typically what browsers had done at the time is just put a cap on how long they'd wait for OCSP.

And so if OCSP didn't come back, they would just show the site anyway.

So from a security perspective, that's actually pretty weak because an attacker could just block any requests to the OCSP server, use a revoked certificate, and then the user would be exposed to the attack.
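That browser-side "soft-fail" logic can be sketched like this; it's illustrative pseudologic, not any real browser's code:

```python
# Sketch of soft-fail OCSP checking: wait for the responder up to a
# deadline, and if no definitive answer arrives, show the site anyway.

from enum import Enum

class OCSPStatus(Enum):
    GOOD = "good"
    REVOKED = "revoked"
    TIMEOUT = "timeout"   # responder slow, down, or blocked by an attacker

def should_trust(ocsp_result: OCSPStatus) -> bool:
    if ocsp_result is OCSPStatus.REVOKED:
        return False      # hard-fail only on an explicit "revoked" answer
    # GOOD or TIMEOUT both proceed. This is the weakness: an attacker who
    # blocks the OCSP request is treated the same as a healthy "good".
    return True

print(should_trust(OCSPStatus.TIMEOUT))   # True -- a revoked cert slips through
```

The attack Nick describes is exactly the `TIMEOUT` branch: block the responder, present the revoked certificate, and the browser proceeds.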

And so OCSP wasn't working, CRLs weren't working. And then there was a third mechanism that was built into Google Chrome called CRLSets, which is effectively taking a bunch of CRLs, putting them all together, and shipping them with the browser.

And with CRLSets, they had the issue of potentially growing too large.

So they made an executive decision at Google to only put in the very high value certificates, so-called extended validation certificates, ones that cost a lot of moolah to get because they do extra validation of the owner of the certificate.

And so when we revoked these 100,000 certificates, they were DV certificates, which are validated in a much simpler way, just proving ownership of the domain through some inline validation.

But in any case, DV certificates are the vast majority of certificates, and their revocations weren't being checked because they weren't in this CRLSet.

So in fact, the interesting and non-scalable way that we solved this was that we actually submitted a patch to Chromium to look for certificates managed by Cloudflare that were issued before Heartbleed and just mark them as revoked in the code, which is obviously not something you can do in general.

You can't add a specific patch for every single time a certificate is revoked.

So we found at this point that revocation is a very difficult problem and, frankly, not solved on the web at the time of Heartbleed.

And this set in motion what became a long-term roadmap for us, which is to help make that situation better in the case of another large problem like Heartbleed were to happen again.

That's a great history and a great synopsis of the different options of revocation.

We've got just a couple of minutes left.

I actually have to get to another Security Week Cloudflare TV segment, and I don't want to be late for that.

So can you just maybe spend two minutes and then we'll wrap or maybe a minute and a half on what are some of the other advancements?

I know we talked about things like Must Staple. Can you cover that briefly?

And GeoKey Manager. What are some other things that we've done to try to get to a world where key protection and revocation is done in a better way than it was done that many years ago?

Yeah. With OCSP, having a separate request for OCSP is not very reliable.

So there's another mechanism called stapling, which provides the OCSP response inside the handshake.

And there's a new feature in TLS certificates that was introduced after Heartbleed that allows certificates to tell the browser, you must have a stapled OCSP response.

So you must have an OCSP response in the handshake in order to accept the certificate.

And this actually pretty much solves the revocation problem as long as the server is able to get that OCSP response 100% of the time.

And so we now support this for custom certificates on Cloudflare.

If you have a Must-Staple certificate, we will always staple the OCSP response for it.
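The client-side rule that Must-Staple introduces can be sketched roughly as follows (illustrative logic, not a real TLS stack):

```python
# Sketch of the Must-Staple check: if the certificate carries the
# must-staple extension, a handshake with no stapled OCSP response is
# rejected outright (hard-fail), closing the soft-fail loophole.

from typing import Optional

def accept_handshake(cert_must_staple: bool,
                     stapled_ocsp_good: Optional[bool]) -> bool:
    # stapled_ocsp_good: True = staple says "good", False = "revoked",
    # None = no staple was presented in the handshake.
    if stapled_ocsp_good is None:
        return not cert_must_staple   # hard-fail if the cert demands a staple
    # A staple is present; reject if it reports the cert as revoked.
    return stapled_ocsp_good

print(accept_handshake(cert_must_staple=True, stapled_ocsp_good=None))   # False
```

Note how the missing-staple case flips from "show the site anyway" to a hard failure, which is why the server must reliably fetch and staple the response.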

Other things that we did, I mentioned how in Heartbleed, the reason that keys were accessible is because an attacker could dump server memory.

So what we did was take the keys out of the memory of Nginx and put it into a separate process.

And we talked about gokeyless. This is actually something we use internally, universally.

So we run Gokeyless servers in various data centers, and that's the piece that has access to the key, whereas the Internet-facing process does not.
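The process split described here can be sketched as follows; the class names are made up and a hash stands in for a real signature, but the point is that only the key server process ever touches the private key:

```python
# Sketch of the keyless split: the Internet-facing process never holds
# the private key; it forwards signing requests to a separate key server.

import hashlib

class KeyServer:
    """Runs in a separate, locked-down process or data center."""
    def __init__(self, private_key: bytes):
        self._private_key = private_key   # never leaves this process

    def sign(self, digest: bytes) -> bytes:
        # Stand-in for a real RSA/ECDSA signature over the digest.
        return hashlib.sha256(self._private_key + digest).digest()

class EdgeServer:
    """Terminates TLS but keeps no private key in its memory."""
    def __init__(self, key_server: KeyServer):
        self.key_server = key_server

    def handshake_signature(self, handshake_transcript: bytes) -> bytes:
        digest = hashlib.sha256(handshake_transcript).digest()
        return self.key_server.sign(digest)   # a remote call in reality
```

With this split, a Heartbleed-style dump of the edge process's memory reveals session data but not the long-term private key.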

We extended that to something called GeoKey Manager to allow customers to configure where in the world their keys were kept.

So if there are certain data centers that have higher, more stringent security settings or processes, they can choose, OK, just keep my keys in the higher security data centers.

And we'll incur a small latency in order to actually make that request outwards to the secure location.

And we're always perfectionists here. And if there's a problem, we're going to try to solve it.

So that latency issue is something that we've been working with the IETF to solve, using a technology called Delegated Credentials, which allows very short-lived keys to live at the edge while your long-lived keys live in a secure location.

And rather than having to reach out to the key server every single handshake, you can actually just use the local key and get that updated from the key server asynchronously.

And so this is coming out in Firefox Stable in the next version in May, and it's currently on the road to becoming an RFC and a standard.
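Conceptually, delegated credentials work something like this sketch; HMAC stands in for the real asymmetric signature over a delegated public key, and all names and numbers here are illustrative:

```python
# Conceptual sketch of delegated credentials: the long-lived key (kept
# only in the secure key server) signs a short-lived credential that the
# edge uses for handshakes, refreshed asynchronously before it expires.

import hashlib
import hmac

LONG_TERM_KEY = b"held-only-in-the-secure-key-server"

def issue_credential(edge_public_key: bytes, ttl_seconds: int, now: float) -> dict:
    expiry = now + ttl_seconds
    msg = edge_public_key + str(expiry).encode()
    return {
        "edge_key": edge_public_key,
        "expiry": expiry,
        # Stand-in for a signature by the certificate's long-lived key.
        "signature": hmac.new(LONG_TERM_KEY, msg, hashlib.sha256).digest(),
    }

def verify_credential(cred: dict, now: float) -> bool:
    msg = cred["edge_key"] + str(cred["expiry"]).encode()
    expected = hmac.new(LONG_TERM_KEY, msg, hashlib.sha256).digest()
    return hmac.compare_digest(cred["signature"], expected) and now < cred["expiry"]

cred = issue_credential(b"edge-pubkey", ttl_seconds=3600, now=0.0)
print(verify_credential(cred, now=100.0))    # True: within the validity window
print(verify_credential(cred, now=7200.0))   # False: credential has expired
```

Because the credential expires quickly on its own, a compromised edge key is only useful for hours rather than the full lifetime of the certificate, and no per-handshake round trip to the key server is needed.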

So those are all the different things. Maybe that's too long.

Oh, no, this is great. I feel like we could do a whole separate session on that stuff, and we probably should.

But I really want to thank you both for your time and enjoyed learning about a lot of the history here at Cloudflare.

Nick and Dina, thank you for sharing your insights on what we're shipping and how we're working to protect customers' keys as we go forward.

So thanks, everyone. Thanks. Bye.