Scaling SSL Certificate Issuance
Presented by: Tom Lianza, Mihir Jham
Originally aired on June 30, 2020 @ 2:30 PM - 3:00 PM EDT
Mihir and Tom talk about the technical challenges they've overcome in the last several years of scaling Cloudflare's SSL products.
Transcript (Beta)
All right, I'm Tom and I'm here with Mihir and we are going to talk about Cloudflare's SSL products and scaling them from an engineering perspective.
I'm an engineering director at Cloudflare.
Mihir is an engineer on the SSL team. Just by way of introduction, when I started at Cloudflare in 2015, we were just under 200 people.
Before Cloudflare, I had been working at startups and the reason I knew about Cloudflare was we used Cloudflare at my startups.
And the reason I used Cloudflare at my startups was actually because I was tired of dealing with SSL.
That was the number one thing I wanted to just make somebody else's problem and just deal with my application.
And then I came to Cloudflare and it became my problem again.
So these days, I still work with the SSL team, as well as a number of other core control plane teams.
And Mihir is currently on the SSL team. Yeah. Yeah, same.
I work on the SSL team. I started in 2016, on Halloween 2016, so it was a fun day to start work.
I did not come dressed up. Since then, I've been on the SSL team, helping solve SSL problems and also scaling problems, which we'll get into in a bit.
Before Cloudflare, I was working at a coding bootcamp.
I was helping teach the data structures and algorithms section of the course.
And I was also building out software for the company, things like internal tools for tracking student application analysis and basically any integrations they had with other third party vendors.
These days, I work on the SSL team and my role is basically to ship products, keep the lights on for all our systems, and also help fellow team members get their stuff through and help them with anything that might come up.
Right on. So it's kind of interesting, I think, the space that we're in with SSL, because when I joined, I think most websites only used SSL on the checkout path or something.
SSL was reserved. Some websites didn't use it at all.
The thinking was that nothing they transmitted was sensitive enough to need encryption. And there was a perception, and for a while a reality, that there's a performance penalty to be paid for doing SSL.
So aside from sites that I worked on that didn't use SSL at all, the ones that did only did it on a certain subset of pages.
You have a secure portion of your website.
And as a result, certs and SSL generally weren't as top of mind. It wasn't like today, when the browser warns you heavily that a site is not secure if it doesn't have SSL turned on.
And the main product that drew me to Cloudflare, and that I think is still our most widely used product, in part because it's free, is Universal SSL, where Cloudflare developed the infrastructure to give everyone a cert who wanted one, which was amazing.
And I think it was without peer at the time. To me, it was magic. And I think it was enabled by the fact that SNI became an increasingly powerful capability that was supported by a subset of clients, which allowed Cloudflare to get a lot of customers to share certs, which allowed us to provide SSL for a lot of people without having to manage one cert per customer at the time.
So Universal SSL was the main SSL product when I joined.
And Cloudflare also had a custom certificate product.
So you could bring your own certificate and upload it to Cloudflare.
You'd still have the DDoS and other protection on Cloudflare,
but using your own certificate that you got from some other CA of your choosing.
The immediate challenge when I walked in was that Cloudflare, like most startups, was built as a monolith.
Almost every startup starts that way and probably should start that way.
You've got to get product-market fit before you go architecting complicated solutions to scale; you need to have enough customers that you have to scale first.
And Universal SSL was certainly part of the monolith. The code base just grew over time, to the point where lots of different teams were contributing to it.
And there weren't clean divisions and interfaces the way you might hope for in a more mature system.
We had multiple CA partners, like we still do.
Cloudflare is a CA, but not a browser-trusted CA. We have other products that issue certs, but they're not browser trusted.
And so we had to integrate with our partners, and we used them in ways that their typical customer wouldn't.
And as Universal SSL became more and more popular and successful, we had to work really closely with them to keep the whole chain of systems working, so we could get certs issued and renewed quickly.
And, you know, over time, I think the shared certificate pattern was critical for Cloudflare to be able to offer Universal SSL to so many people.
But the downsides of customers sharing certificates started to intensify.
Because when you have multiple disparate customers on one certificate, and then you need to renew it or add, remove, you know, make modifications,
they're all in the same bucket.
And so the first thing that everyone needs to do before they're issued a browser-trusted certificate is called DCV, domain control validation.
You don't get a certificate for a domain if you can't prove that you control it.
And for Cloudflare, because we provide DNS for people, we can provide that DCV for them.
But some of our customers are partial zones. Do we still call it partial zones?
Is that the word that I should use?
I think we call it CNAME setup. Okay, CNAME setup. So for customers that didn't use us for authoritative DNS, we couldn't prove that they controlled the zone for them.
We don't have that power. So they needed to do a step themselves. So whenever we needed to renew or update a certificate, we couldn't just do that for them,
because there's that step where they need to re-prove that they control the domain.
Which might be fine if you were dealing with your own cert, but it becomes a lot less fine when you're sharing one with a lot of other people.
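To make the shared-certificate DCV problem concrete, here is a minimal sketch in Python. The `Hostname` type, its fields, and the helper names are hypothetical and are not taken from Cloudflare's systems. It only models the rule described above: a name on a full zone can be validated by the DNS provider, while a CNAME-setup name needs its owner to act, and one such name can hold up a certificate shared by everyone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hostname:
    """Hypothetical record for one name on a certificate."""
    name: str
    full_zone: bool                   # True: the provider runs authoritative DNS
    customer_validated: bool = False  # only relevant for CNAME-setup zones


def dcv_satisfied(host: Hostname) -> bool:
    # With authoritative DNS, the provider can publish the proof itself.
    return host.full_zone or host.customer_validated


def blocking_hostnames(cert_hostnames: list[Hostname]) -> list[str]:
    """Names that still need their owner to act before the cert can be reissued."""
    return [h.name for h in cert_hostnames if not dcv_satisfied(h)]


shared_cert = [
    Hostname("a.example", full_zone=True),
    Hostname("b.example", full_zone=False, customer_validated=True),
    Hostname("c.example", full_zone=False),  # owner has not re-proved control
]

print(blocking_hostnames(shared_cert))  # ['c.example']
```

With one hostname per certificate, the same check only ever blocks the customer who has not completed the step.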
CAA became a bigger and bigger issue as it became more widespread.
I don't know that I know exactly what that stands for. 2017 is when it became a standard where, if you had a CAA record, then the CAs would stop issuing if it was incorrect, that is, if they were not authorized to issue that certificate.
So this is where a customer can say, these are the specific CAs that I allow to issue certificates for this domain that I control.
And so if a customer had done that at some point in their existence, and then moved to Cloudflare, and the CA partner we used at the time wasn't on their list, then we wouldn't be able to get a cert for them, and then you get into shared certs.
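The CAA rule can be sketched as a small check. This is a simplified illustration, not a full implementation of the CAA specification: it assumes the relevant record set for the domain has already been looked up, it only considers the `issue` tag, and it ignores `issuewild`, flags, and parameters. The CA domain names are made up.

```python
def ca_may_issue(caa_records: list[tuple[str, str]], ca_domain: str) -> bool:
    """caa_records is a list of (tag, value) pairs for the relevant record set."""
    issue_values = [value for tag, value in caa_records if tag.lower() == "issue"]
    if not issue_values:
        # No "issue" property at all: issuance is not restricted.
        return True
    for value in issue_values:
        # Anything after ';' is a parameter list; a bare ";" names no CA at all.
        issuer = value.split(";", 1)[0].strip().lower()
        if issuer == ca_domain.lower():
            return True
    return False


# A customer pinned their domain to one CA before moving providers.
records = [("issue", "ca-one.example")]
print(ca_may_issue(records, "ca-one.example"))  # True
print(ca_may_issue(records, "ca-two.example"))  # False
print(ca_may_issue([], "ca-two.example"))       # True
```

The second call is the situation described above: the partner CA is not on the customer's list, so no certificate can be issued until the record changes.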
And that's a pain point as well. And brand checks were really big in the early days.
I don't think it's as common now.
Or maybe not. Now, different CAs have different policies with respect to brand checks.
But the idea there is, if you try and get a certificate that looks to be a violation of a trademark, or phishing, or something that the CA believes doesn't look right to them, they won't issue you a certificate.
And that's very well intentioned.
But some of the rules from different CAs mistakenly catch completely legitimate usages.
You know, if there's a country they're concerned about, and your domain name has that country's name in any part of it, they wouldn't issue certs.
And again, put one of those domains on a cert with other customers, and we get into issues there.
So we had to work through a lot of those sorts of problems over the years.
And then there's the infamous one, around the time we joined: the deprecation of SHA-1 certs.
So the spirit at Cloudflare was that we want to make sure everyone's encrypted and everyone's able to reach your site, to make the Internet better and more secure.
And SHA-1, you know, was the weakest type of certificate signature that was allowed, and eventually it was deprecated.
And we wanted to make sure our customers had support for it for as long as the standards bodies would allow.
But ultimately, we needed to pull that from everybody.
So that was yet another process in which we needed to change a bunch of certificates.
And those are some of the pain points.
And so all of that, the shared certificate model, the changing rules of SSL, the fact that it was built into a monolith, led us in engineering to want to really break this out into a new system and expand the number of CA partners we work with, and product wanted us to build a whole bunch of other stuff.
And I think that's right around when you joined me here. We were both trying to rewrite this system and launch the notion of dedicated certificates
when you joined, right? Yeah, so we launched dedicated certs in September of 2016.
And with all the operational challenges that you just mentioned regarding the shared cert model, we decided that dedicated certs would be a paid offering where, for $5, you can buy dedicated certs just for your website, and we'll give you a wildcard and the apex.
And then we had a $10 offering, where you can add up to 50 custom hostnames.
So things like second-level subdomains, nth-level subdomains, etc.
Right. So for example.com, the universal cert would cover the apex, example.com, and *.example.com.
So you get any one level of subdomain, but you couldn't do a.b.example.com with universal; the dedicated offering is where you could opt in to do that.
But now we finally had a way to do that. Right. Besides bring your own.
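The coverage rule is easy to check in code. The sketch below is a simplified matcher, assuming the common rule that a wildcard stands in for exactly one left-most label; the function name is made up for illustration and it skips edge cases such as internationalized names.

```python
def covered_by(hostname: str, sans: list[str]) -> bool:
    """Return True if hostname matches any name listed on the certificate."""
    host = hostname.lower().rstrip(".")
    for san in sans:
        san = san.lower()
        if san == host:
            return True
        if san.startswith("*."):
            # The wildcard replaces exactly one label, so compare the
            # hostname's parent domain with the part after "*.".
            _, _, parent = host.partition(".")
            if parent == san[2:]:
                return True
    return False


universal = ["example.com", "*.example.com"]
print(covered_by("example.com", universal))      # True
print(covered_by("www.example.com", universal))  # True
print(covered_by("a.b.example.com", universal))  # False
```

Adding `a.b.example.com` as an extra name on the certificate is what makes the third lookup succeed.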
We were moving into this world of having dedicated certs, but also more configurability for your certs, like adding more SANs, for example.
And the other thing I think product was asking us for...
SAN, so subject alternative name. Yes. That was an acronym that I struggled with when I first joined.
Everything I knew about SANs in 2015 was storage area networks.
And everything we spoke about, yes, was the zones, the domain names that appear on a certificate.
Right. And also at that time, product was asking us to build SSL for SaaS, which is our offering where SaaS providers can offer our SSL services to their customers.
So think of an example like a website builder. The website builder has example.com, and tom.example.com is your business and you're running it on that.
But then you want to have www.tom.com pointed at that, because you don't want to rebuild the whole website that you built your business on.
And so we wanted to build an offering that could support issuance at scale for certificates that potentially belonged not to our customers, but to our customers' customers.
Right. That was the real challenge of it: whoever owned tom.com wasn't a direct Cloudflare customer, they were indirectly leveraging a SaaS platform who was our customer.
So coming up with a solution for that was huge.
Right. So back in 2016, we decided not to leverage the pipeline that we had at that time, which was our universal and custom certificate pipeline.
And we decided to move to this model of dedicated certs: one website or one hostname per cert, or however many custom hostnames you would like.
And we decided to build this away from the monolith as well.
And we decided to build it on a bunch of Go microservices that would allow us to scale horizontally, as well as break the logic down into its simplest form.
That was the biggest challenge, basically moving away from the shared certificate model.
The other decision we faced when we built this was whether to build all the parsing of cert metadata and inserting it into the database ourselves, so that we could search our certificates and have more data to query against.
Should we build it in-house or should we leverage something?
And we stumbled upon Netflix's open source offering called Lemur.
And that allows us to basically, anytime a cert comes in and we'd like to issue a cert, we can communicate with the CA partner using an integration they already had in place.
And then also, once we issue the cert, break that cert apart, parse it, and store all the metadata in a relational format so we could query it in the future.
And we still use that today. So it's worked really well and scaled really well for us.
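As a rough illustration of why storing certificate metadata relationally is useful, here is a small sketch using Python's built-in `sqlite3`. The schema and function names are invented for this example and are not the schema of the open source tool mentioned above; the metadata is assumed to have been parsed out of the certificate already.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE certificates (
        id INTEGER PRIMARY KEY,
        issuer TEXT NOT NULL,
        not_after TEXT NOT NULL      -- ISO 8601 date, sorts lexicographically
    );
    CREATE TABLE sans (
        certificate_id INTEGER NOT NULL REFERENCES certificates(id),
        hostname TEXT NOT NULL
    );
""")


def store(issuer: str, not_after: str, hostnames: list[str]) -> int:
    """Insert one certificate's metadata and its SANs; return the new row id."""
    cur = conn.execute(
        "INSERT INTO certificates (issuer, not_after) VALUES (?, ?)",
        (issuer, not_after),
    )
    cert_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO sans (certificate_id, hostname) VALUES (?, ?)",
        [(cert_id, h) for h in hostnames],
    )
    return cert_id


def expiring_before(cutoff: str) -> list[tuple[int, str]]:
    """Every (certificate id, hostname) pair whose cert expires before cutoff."""
    rows = conn.execute(
        """SELECT c.id, s.hostname FROM certificates c
           JOIN sans s ON s.certificate_id = c.id
           WHERE c.not_after < ? ORDER BY c.id, s.hostname""",
        (cutoff,),
    )
    return rows.fetchall()


store("ca-one.example", "2020-09-01", ["example.com", "*.example.com"])
store("ca-two.example", "2021-03-01", ["tom.example"])
print(expiring_before("2020-12-31"))
# [(1, '*.example.com'), (1, 'example.com')]
```

Once the fields live in tables, questions like "which hostnames are on certs that expire this quarter" become a single query instead of a scan over raw certificate files.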
And so, yes, the idea was that we would start building these new offerings on this new platform while still serving Universal SSL at the same time.
So it wasn't like we were shutting down the existing platform completely.
It was a decision made with the idea in mind that eventually we'd like to move everything to this new system.
Yeah, somehow they wouldn't let us stop shipping things just to do all that.
So we had to figure a way to do both at the same time.
Yeah, so it was like, it was a fun challenge, of course.
How do we make sure both systems work in line with each other, and slowly chip away at the old system and migrate all that active traffic onto the new system?
So when we moved from the shared cert model to the new model, a few things happened.
I mean, it seemed like we made one thing much, much better, which was that the state machine for a certificate got simpler, and so did its relationship to customers, you know.
But the number of certificates exploded, which seemed to create its own set of problems.
Yeah, those challenges are more, I would say, common to what you see in the industry.
The shared certificate model had very, very tightly coupled business logic.
It was very, very hard to break out of that.
Whereas once we went to a one-cert-per-customer model, the challenges we faced were the scaling challenges you always hear about, like memory management or CPU usage.
We had to optimize our code to handle that.
We had to tune things because we were issuing a certificate per customer.
We had to tune our services to adapt to our CA partners' systems, because as dedicated certs and SSL for SaaS were growing, we had a step change in terms of scale, especially with SSL for SaaS.
When we built our pipeline, we had in mind a steady state of customers coming in on a daily basis, on a weekly basis,
one that would never have a big step change.
Whereas in SSL for SaaS, if we signed a large SaaS provider, they would want to migrate all their customers onto our platform in a given timeframe.
It's like, okay, in a week, we want to onboard 100,000 websites, whereas before it was a few thousand websites a day.
And so we had to build our system to serve both: the steady traffic coming in from signups that use Cloudflare directly for authoritative DNS, and our customers' customers coming in.
And as we were building that, we also had to tune our services to adapt to our CA partners' systems, because they were also seeing spikes of traffic and having scaling issues.
And so having a good partner, of course, helped a lot. But working together with them and helping them scale their systems was also something that we were involved in.
I remember in the early days, there were certain operations with partners that were faster than others.
And our customers, you know, just want SSL: click a button, give me a cert.
But there were certain operations where we wound up minting certificates ahead of time, because that was slower, and then adding SANs to them, because that was faster.
So we would sort of order certs in advance, like get ahead of the customers, because then we knew we could add and remove SANs a little faster afterwards.
And I think a lot of it was just working with our partners to make sure we could fit their APIs and behaviors to what our customers wanted.
It was not ideal, you know, it added complexity, but it seemed to get better over time.
You know, SSL took off and became de facto for every website.
The CAs also had to scale. Yeah. And the other problem that's very fascinating, and I'm sure you faced the same, is that there is no notion of partial availability.
It's one or zero. It works or it doesn't.
You either have TLS termination at the edge or you don't. And if you don't, it's a big outage for a customer.
So it's an interesting problem to solve because it either works or not.
And the thing you're solving for is speed: can you get that certificate up as fast as possible?
Because we have ways to mitigate any downtime while customers are migrating to us.
Yeah. The thing that I really disliked when I did this myself, before using Cloudflare, was that you'd go buy your cert from whomever and put a calendar event a year out to remind you to do it again.
And then you'd go and look up the instructions again for what you did. I mean, these days, obviously, bigger companies have much better configuration management and a lot of the CAs have easier APIs.
But it was a thing that was so painful.
And the fact that Cloudflare had an offering that took away the renewal fear, the renewal worry, meant one less ticking time bomb in your infrastructure to have to think about.
So yeah, it's not just the speed of initial issuance, which is obviously important during onboarding, but also the ongoing renewals.
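The calendar reminder can be replaced by a check that runs on a schedule. The sketch below is only an illustration: the 30-day renewal window is an assumed value, not a number from the conversation, and `needs_renewal` is a made-up name.

```python
from datetime import datetime, timedelta, timezone


def needs_renewal(not_after: datetime, now: datetime,
                  window: timedelta = timedelta(days=30)) -> bool:
    """True once 'now' falls inside the renewal window before expiry."""
    return now >= not_after - window


issued = datetime(2020, 6, 30, tzinfo=timezone.utc)
not_after = issued + timedelta(days=90)  # a 90-day certificate

print(needs_renewal(not_after, issued))                       # False
print(needs_renewal(not_after, issued + timedelta(days=61)))  # True
```

A periodic job that runs this check over every stored certificate and enqueues the ones that return `True` removes the human from the loop.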
Right. It's huge. Yeah, so after we launched dedicated certs and SSL for SaaS in 2017, in 2018 we decided to start moving some of our new signups using Universal SSL to the new pipeline as well.
And that's when we really hit some of our big scaling issues in the new pipeline.
So we had issues with our database. Basically, were the queries that we wrote initially performant at that scale?
No. So we had to go and do an audit of all the queries, or the worst offending queries, and start improving those.
We had lots of issues with lock contention on some of our table rows.
And we had to figure out how to shard these tables to reduce that lock contention.
And as we're doing this, the one thing we did as an MVP was like, okay, we want to build an SSL issuance pipeline.
Let's just build it on a DB-backed queue to begin with.
And then once we hit scale, we'll worry about that problem at that time.
And that happened in 2018. And we basically had to move our queuing system out from just using Postgres at that time and we moved to Redis.
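For readers who have not built one, a DB-backed queue looks roughly like the sketch below. SQLite stands in for Postgres so the example is self-contained, and the table and function names are invented; this is not the team's actual schema or code. The point is that every worker has to claim its next job inside a transaction on the same table, which is simple to build and is also where contention tends to show up as the number of workers grows.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# transaction boundaries below are explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("""CREATE TABLE jobs (
    id INTEGER PRIMARY KEY,
    hostname TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending'
)""")


def enqueue(hostname: str) -> None:
    conn.execute("INSERT INTO jobs (hostname) VALUES (?)", (hostname,))


def claim_next():
    """Atomically mark the oldest pending job as running and return it, or None."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT id, hostname FROM jobs "
            "WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],)
            )
        conn.execute("COMMIT")
        return row
    except Exception:
        conn.execute("ROLLBACK")
        raise


enqueue("a.example")
enqueue("b.example")
print(claim_next())  # (1, 'a.example')
print(claim_next())  # (2, 'b.example')
print(claim_next())  # None
```

Moving the queue to a dedicated system takes this claim step out of the relational database, so issuance jobs no longer compete with ordinary queries for the same locks.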
And the analogy I liked from our product manager at that time was, we're basically an 18-wheeler going 120 miles an hour down a highway, and we're changing the engine while the speed just keeps increasing.
So that was an exciting problem, fun problem.
Lots of sleepless nights, but we're here and we made it all work.
We're still using the Redis queues, right? Yeah, we are. Yeah, and so yeah, all those issues popped up and those are good problems to have.
People are using the product.
And at the one-year mark in 2017, we also had some issues with renewals.
It was the same thing, issues with our queuing system.
So it was all coming together at the same time.
Yeah, the team put their heads down and we got it done. I mean, the fundamental infrastructure has really come such a long way.
It was cool to see ACM launch recently, which feels like icing on the cake in terms of advanced functionality for people who really want to turn the knobs and dials of SSL.
What are some of the things you think about that product that people would want to use?
Right, so ACM is a new product offering which has launched; it's called Advanced Certificate Manager.
The whole idea is that, in the past, we solved the problem of, hey, you come onto Cloudflare, you get a cert for free.
And we make sure that's fast and we make sure that's reliable in terms of renewals, and that's great.
With dedicated certs, we went into this world of, I want to add 50 custom hostnames.
I want to be able to terminate TLS at nth-level hostnames.
There are more things with certs in the web PKI space that you can do that people want as we go on.
Things like, I want to only have certificates that are valid for 90 days.
I want to be able to rotate my certs every 90 days for security reasons. It improves the security posture a lot.
And even the industry is moving towards a short-lived certificate space.
Things like, I want to have advanced cipher suite management.
So we usually issue certificates, two certificates, for our paid products.
One is an RSA and one is an ECC-based certificate. A short-lived RSA and ECC-based certificate.
Customers want to be able to say, I want only these ciphers that I trust, that basically are in line with my audit requirements and policies.
And I want to basically only allow TLS termination for clients connecting with these ciphers.
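A cipher allowlist of this kind boils down to an intersection between what the client offers and what the customer permits. The sketch below illustrates that idea only; it is not how any particular TLS stack or Cloudflare's edge implements negotiation, and the `negotiate` helper is hypothetical. The suite names are standard IANA names.

```python
def negotiate(client_offered: list[str], allowed: list[str]):
    """Pick the first allowed suite (in the server's preference order) that
    the client also offers. Returns None if there is no overlap, in which
    case the handshake would be refused."""
    offered = set(client_offered)
    for suite in allowed:
        if suite in offered:
            return suite
    return None


# The customer's policy: only these suites, in this order of preference.
allowed = ["TLS_AES_256_GCM_SHA384", "TLS_AES_128_GCM_SHA256"]

modern_client = ["TLS_AES_128_GCM_SHA256", "TLS_CHACHA20_POLY1305_SHA256"]
legacy_client = ["TLS_RSA_WITH_3DES_EDE_CBC_SHA"]

print(negotiate(modern_client, allowed))  # TLS_AES_128_GCM_SHA256
print(negotiate(legacy_client, allowed))  # None
```

The trade-off is the one an audit policy makes on purpose: clients that only speak suites outside the list cannot connect.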
And now that we've opened this box, there's a lot more things we can add.
And we will be adding in the future, such as, I want to get certificates with certain extensions.
So OCSP stapling and must-staple certificates are going to become a big thing.
And eventually we will add support for that too. And as the web PKI field adds more things to allow a better security posture, we are set up for success for that.
Yeah, so for people who aren't familiar, OCSP is how a client that receives a certificate can check that it is still valid with an independent third-party service.
And then OCSP stapling is when you sort of staple that attestation to the certificate so they don't have to do a second hop.
Right, it improves performance essentially.
It doesn't have to go back to the CA. And while we do staple and support OCSP, there is going to be a type of certificate where, by policy, you should not trust it unless the OCSP response comes with it.
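The difference between ordinary stapling and a must-staple certificate can be shown as a small client-side decision. This is a simplified model with made-up type and field names; real clients differ in how they behave when nothing is stapled, for example by querying the CA themselves or by failing open.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Handshake:
    """Hypothetical summary of what the server presented."""
    must_staple: bool              # the certificate demands a stapled response
    stapled_status: Optional[str]  # "good", "revoked", or None if nothing stapled


def accept(h: Handshake) -> bool:
    if h.stapled_status is not None:
        # A stapled response is present, so the client can decide locally.
        return h.stapled_status == "good"
    # Nothing stapled: acceptable only if the certificate does not demand it.
    return not h.must_staple


print(accept(Handshake(must_staple=False, stapled_status=None)))    # True
print(accept(Handshake(must_staple=True, stapled_status=None)))     # False
print(accept(Handshake(must_staple=True, stapled_status="good")))   # True
```

The second case is the policy change: a missing staple turns from a performance cost into a hard failure.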
We don't. And the other thing with ACM is that when we launched it, it also helped pay off a lot of the technical debt we incurred while building this new system.
A lot of things on the back end, we have more knobs to turn to basically speed up things, just a lot more control to basically offer a more reliable experience for our customers.
Very cool.
I think ACM seems really interesting. I have yet to play with it, but it's on my list of things to do.
Let's see if I can get picky about cipher suites for my personal website.
Thanks for chatting with me. Yeah, thank you. Yeah.