Scaling SSL Certificate Issuance
Presented by: Tom Lianza, Mihir Jham
Originally aired on June 30, 2020 @ 2:30 PM - 3:00 PM EDT
Mihir and Tom talk about the technical challenges they've overcome in the last several years of scaling Cloudflare's SSL products.
Transcript (Beta)
All right, I'm Tom and I'm here with Mihir and we are going to talk about Cloudflare's SSL products and scaling them from an engineering perspective.
I'm an engineering director at Cloudflare and Mihir is an engineer on the SSL team.
By manner of introduction, when I started at Cloudflare in 2015, we were just under 200 people.
Before Cloudflare, I had been working at startups and the reason I knew about Cloudflare was we used Cloudflare at my startups and the reason I used Cloudflare at my startups was actually because I was tired of dealing with SSL.
That was the number one thing I wanted to just make somebody else's problem and just deal with my application.
And then I came to Cloudflare and it became my problem again.
So these days, I still work with the SSL team as well as a number of other core control plane teams and Mihir is currently on the SSL team.
Yeah, same, I work on the SSL team.
I started in 2016, Halloween 2016, so it was a fun day to start work.
I did not come dressed up. Since then, I've been on the SSL team, helping solve SSL problems and then also scaling problems, which we'll get into in a bit.
Before Cloudflare, I was working at a coding bootcamp.
I was helping teach the data structures and algorithms section of the course and I was also building out software for the company.
Things like internal tools for tracking student application analysis and basically any integrations they had with other third-party vendors.
These days, I work on the SSL team and my role is basically to ship products, keep the lights on for all our systems, and also help fellow team members get their stuff through and help them with anything that might come up.
Right on. So it's kind of interesting, I think, the space that we're in with SSL because when I joined, I think most websites only used SSL on the checkout path or something.
SSL was reserved.
Some websites didn't use it at all. The thinking was that nothing being transmitted was sensitive, so it didn't need to be encrypted, and there was a perception, and for a while a reality, that there was a performance penalty to be paid for doing SSL.
Aside from the sites I worked on that didn't use SSL at all, the ones that did only used it in a certain subset.
You had a secure portion of your website and, as a result, certs and SSL generally weren't as top of mind.
It wasn't like today, when the browser warns you heavily that a site is not secure if it doesn't have SSL turned on.
And the main product that drew me to Cloudflare, and that I think is still our most widely used product, in part because it's free, is Universal SSL, where Cloudflare developed the infrastructure to give a cert to everyone who wanted one. That was amazing, and I think it was without peer at the time. To me it was magic. I think it was enabled by the fact that SNI became an increasingly powerful capability supported by a broad set of clients, which allowed Cloudflare to have a lot of customers share certs, which in turn allowed us to provide SSL for a lot of people without having to manage one cert per customer at the time.
So Universal SSL was the main SSL product when I joined. Cloudflare also had a custom certificate product, so you could bring your own certificate, one you got from some other CA of your choosing, and upload it to Cloudflare. That way you could still have DDoS and other protection on Cloudflare while using your own certificate.
The immediate challenges when I walked in were Cloudflare, like most startups, was built as a monolith.
Almost every startup starts that way and probably should start that way.
You got to get product market fit before you go architecting complicated solutions to scale.
You need to have enough customers to have to scale first and Universal SSL was certainly part of the monolith.
So it's a code base that just grew over time, to the point where lots of different teams were contributing to it and there weren't clean divisions and interfaces the way you might hope for in a more mature system.
We had multiple CA partners, like we still do. Cloudflare is a CA, but not a browser-trusted CA.
We have other products that can issue certs, but they're not browser-trusted, so we had to integrate with CA partners, and we used them in ways that their typical customer wouldn't. As Universal SSL became more and more popular and successful, we had to work really closely with them to keep the whole chain of systems working so we could get certs issued and renewed, and issued quickly.
And over time, I think the shared certificate pattern was critical for Cloudflare to be able to offer Universal SSL to so many people, but the downsides of customers sharing certificates started to intensify. When you have multiple disparate customers on one certificate and then you need to renew it, or add, remove, or make modifications, they're all in the same bucket. The first thing that everyone needs to do before they're issued a browser-trusted certificate is called DCV, domain control validation.
You don't get a certificate for a domain if you can't prove that you control it and for Cloudflare, because we provide DNS for people, we can provide that DCV for them but some of our customers are partial zones.
I don't know if we have a different, do we call it partial zones?
Is that the word? I think we call it CNAME setup.
Right, CNAME setup. So for customers that didn't use us for authoritative DNS, we couldn't prove on their behalf that they controlled the zone.
We didn't have that power, so they needed to do a step themselves. Then whenever we needed to renew or update a certificate, we couldn't just do that for them, because there's that step where they need to re-prove that they control the domain. That might be fine if you were dealing with your own cert, but it becomes a lot less fine when you're sharing with other people.
CAA became a bigger and bigger issue as it became more widespread.
I don't know that I know exactly what that stands for.
2017 is when it became a standard that if you had a CAA record, a CA would stop issuing if it was not authorized by that record to issue the certificate.
So this is where a customer can say, these are the specific CA or CAs that I allow to issue certificates for this domain that I control. If a customer had done that at some point in their existence and then moved to Cloudflare, and the CA partner we used at the time wasn't on their list, then we wouldn't be able to get a cert for them. Put that on shared certs and it's a pain point as well.
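To make the CAA behavior described above concrete, here is a minimal sketch in Python of the kind of check a CA performs before issuing. The helper names (`relevant_caa_set`, `ca_may_issue`), the dictionary standing in for DNS, and the CA identifiers are all hypothetical. It is a simplification of the CAA rules in RFC 8659: it ignores `issuewild` and CNAME handling.

```python
def relevant_caa_set(domain, caa_records):
    """Walk up from `domain` toward the root and return the first
    non-empty CAA record set found (a list of (tag, value) pairs)."""
    labels = domain.rstrip(".").split(".")
    for i in range(len(labels)):
        name = ".".join(labels[i:])
        records = caa_records.get(name)
        if records:
            return records
    return []


def ca_may_issue(domain, ca_identifier, caa_records):
    """Return True if `ca_identifier` is allowed to issue for `domain`.

    `caa_records` is a hypothetical stand-in for DNS lookups:
    a dict mapping a domain name to its CAA (tag, value) pairs.
    """
    records = relevant_caa_set(domain, caa_records)
    # Parameters after ';' are ignored in this sketch.
    issuers = [value.split(";")[0].strip()
               for tag, value in records if tag == "issue"]
    if not issuers:
        # No "issue" property published: issuance is not restricted.
        return True
    return ca_identifier in issuers


# The customer only authorized one (made-up) CA before moving providers.
caa = {"example.com": [("issue", "ca-one.example")]}
print(ca_may_issue("www.example.com", "ca-one.example", caa))
print(ca_may_issue("www.example.com", "ca-two.example", caa))
```

On a shared certificate, a single hostname that fails this check for the CA in use blocks issuance or renewal for that cert, and so for every other customer on it.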
And brand checks were really big in the early days so I don't think it's as common now or maybe not.
Now different CAs have different policies with respect to brand checks, but the idea there is that if you try to get a certificate that looks like a trademark violation, or phishing, or something that doesn't look right to the CA, they won't issue you a certificate. That's very well-intentioned, but some of the rules from different CAs could mistakenly catch completely legitimate usages. If there's a country they're concerned about and your domain name has that country in any part of its name, they wouldn't issue certs. And again, put one of those domains on a cert with other customers and we get into issues there.
So we had to work through a lot of those sorts of problems over the years.
And then there's some of the infamous ones. This was around the time we joined: the deprecation of SHA-1 certs.
So the spirit at Cloudflare was that we want to make sure everyone's encrypted and everyone's able to reach your site, just to make the Internet better and more secure. SHA-1 was the weakest type of encryption that was allowed, and it was eventually deprecated, and we wanted to make sure our customers had support for it as long as the standards bodies would afford.
But ultimately we needed to pull that from everybody.
So it was yet another process in which we needed to change a bunch of certificates and those are some pain points.
And so all of that, the shared certificate model, the changing of the rules of SSL, the fact that it was built into a monolith, led us in engineering to want to really break this out into a new system and expand the number of CA partners we work with, and product wanted us to build a whole bunch of other stuff.
Right. And I think that's right around when you joined me here, we were both trying to rewrite this system and launch the notion of dedicated certificates when you joined.
Yeah, we launched dedicated certs in September of 2016. With all the operational challenges that you just mentioned regarding the shared cert model, we decided that dedicated certs would, at first, be a paid offering where, for $5, you can buy a dedicated cert just for your website, and we'll give you a wildcard and the apex.
And then we had a $10 offering, where you can add up to 50 custom hostnames.
So things like second-level subdomains and nth-level subdomains, et cetera.
Right. So if you had example.com, a universal cert would cover the apex, example.com, and *.example.com, so you get any one level of subdomain, but you couldn't do a.b.example.com with universal.
That would be where you could opt into that.
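The coverage rule just described (apex plus a single-label wildcard, but not deeper subdomains) can be sketched as follows. The function name `cert_covers` is hypothetical, and the matching is simplified to the common case where a wildcard stands in for exactly one left-most label.

```python
def cert_covers(hostname, sans):
    """Return True if any name in `sans` covers `hostname`.

    A wildcard SAN such as "*.example.com" matches exactly one
    left-most label, so it covers "www.example.com" but not
    "a.b.example.com" and not the apex "example.com".
    """
    hostname = hostname.lower().rstrip(".")
    for san in sans:
        san = san.lower()
        if san == hostname:
            return True
        if san.startswith("*."):
            first_label, _, rest = hostname.partition(".")
            if first_label and rest == san[2:]:
                return True
    return False


universal_sans = ["example.com", "*.example.com"]
for name in ["example.com", "www.example.com", "a.b.example.com"]:
    print(name, cert_covers(name, universal_sans))
```

Covering `a.b.example.com` therefore needs either that exact name or `*.b.example.com` added as a SAN, which is what the custom hostname option allows.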
But now we finally had a way to do that besides bring your own. Yeah. So we were moving into this world of having dedicated certs, but also certs that were more configurable, like adding more SANs, for example.
And the other thing I think product was asking us for. So subject alternative name.
Yes. It was an acronym that I struggled with when I first joined.
Everything I knew about SANs in 2015 was storage area networks and everything we spoke about.
Yes. It was the zones, the domain names that appear on a certificate.
Right. And also at that time, product was asking us to build SSL for SaaS, which is our offering where SaaS providers can offer our SSL services to their customers.
So think of an example like a website builder. The website builder has example.com, and tom.example.com is your business and you're running it on that.
But then you want to have www.tom.com pointed at that, because you don't want to rebuild the whole website that you built your business on.
And so we wanted to build an offering that can support issuance at scale for certificates that were potentially not for our customers, but for our customers' customers.
Right. That was the real challenge of it is, you know, whoever owned tom.com wasn't a Cloudflare direct customer.
They were indirectly leveraging a SaaS platform who was our customer.
So coming up with a solution for that was huge.
Right. So that was back in 2016. We decided to not leverage the pipeline that we had at that time, which was our universal and custom certificate pipeline.
And we decided to move into this model of dedicated certs: one website or one hostname per cert, or however many custom hostnames you would like.
And we decided to move away and build this away from the monolith as well.
And we decided to build this on a bunch of Go microservices that would allow us to scale horizontally, as well as break down the logic into its simplest form.
That was the biggest challenge, basically moving away from the shared certificate model.
The other choice we had at that time, when we decided to build this, was whether we should build all of this parsing of cert metadata and inserting it into the database, so that we can search our certificates and have more data to query against.
Should we build it in-house or should we leverage something?
And we stumbled upon Netflix's open source offering called Lemur.
That allows us, any time a cert request comes in and we'd like to issue a cert, to communicate with the CA partner using an integration it already had in place.
And then also, once we issue the cert, to break that cert apart, parse it, and store all the metadata in a relational format so we can query it in the future.
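As an illustration of what storing certificate metadata in a relational format buys you, here is a minimal sketch using Python's built-in `sqlite3`. The two-table schema and the helper names are assumptions made for this example; this is not the schema of the tool mentioned above, and the certificate fields are passed in already parsed rather than extracted from a real certificate.

```python
import sqlite3


def open_store():
    """Create an in-memory store: one row per cert, one row per SAN."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE certificates ("
        "id INTEGER PRIMARY KEY, issuer TEXT, not_after TEXT)"
    )
    db.execute(
        "CREATE TABLE sans ("
        "cert_id INTEGER REFERENCES certificates(id), hostname TEXT)"
    )
    db.execute("CREATE INDEX sans_hostname ON sans (hostname)")
    return db


def store_cert(db, issuer, not_after, hostnames):
    """Insert already-parsed certificate metadata and return the new cert id."""
    cur = db.execute(
        "INSERT INTO certificates (issuer, not_after) VALUES (?, ?)",
        (issuer, not_after),
    )
    cert_id = cur.lastrowid
    db.executemany(
        "INSERT INTO sans (cert_id, hostname) VALUES (?, ?)",
        [(cert_id, h) for h in hostnames],
    )
    return cert_id


def certs_for_hostname(db, hostname):
    """Find every stored cert that lists `hostname` as a SAN, soonest expiry first."""
    rows = db.execute(
        "SELECT c.id, c.issuer, c.not_after FROM certificates c "
        "JOIN sans s ON s.cert_id = c.id "
        "WHERE s.hostname = ? ORDER BY c.not_after",
        (hostname,),
    )
    return rows.fetchall()


db = open_store()
store_cert(db, "Example CA", "2021-01-01", ["example.com", "*.example.com"])
store_cert(db, "Example CA", "2020-10-01", ["example.com"])
print(certs_for_hostname(db, "example.com"))
```

Once the SANs are rows rather than bytes inside a certificate blob, questions like "which certs cover this hostname" or "what expires next month" become ordinary indexed queries.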
And we still use that today. So it's worked really well and scaled really well for us.
And so yes, the idea was that we would start building these new offerings on this new platform while still serving Universal SSL at the same time.
So it wasn't like we were shutting down the existing platform completely.
It was a decision made with in mind that eventually we'd like to move everything to this new system.
Yeah, somehow they wouldn't let us stop shipping things. We had to figure out a way to do both at the same time.
Yeah. So it was a fun challenge, of course.
How do we make sure both systems work in line with each other, and slowly chip away at the old system and migrate all that active traffic onto the new system?
So when we moved from the shared cert model to the new system and new model, it seemed like we made one thing much, much better: the state machine for a certificate, and its relationship to customers, got simpler.
But the number of certificates exploded, which seemed to create its own set of problems.
Yeah, those challenges are often more, I would say, more common to what you see in the industry.
The shared certificate model was very, very tightly tied to our business logic.
It was very, very hard to break out of that.
Whereas once we went to a one-cert-per-customer model, the challenges that we faced were scaling challenges.
The ones you always hear about: memory management, CPU usage issues.
We had to basically optimize our code to help there.
Because we were issuing a certificate per customer, we had to tune our services to adapt to our CA partners' systems.
Because at that time, as dedicated certs and SSL for SaaS were growing, we had a step change in terms of scale, especially when we got SSL for SaaS.
When we built our pipeline, we had in mind a steady state of customers coming on a daily basis, on a weekly basis.
And that would never have a big step change. Whereas in SSL for SaaS, if we signed a large SaaS provider and they wanted to migrate all their customers onto our platform in a given timeframe, it's like, okay, in a week we want to onboard 100,000 websites, whereas before it was a few thousand websites a day.
And so we had to build our system to serve both: the steady traffic coming in from our zones, from signups that are using Cloudflare directly for authoritative DNS, versus our customers' customers coming in.
And while we were building that, we had to tune our services to adapt to our CA partners' systems, because they were also seeing spikes of traffic and having scaling issues.
And so having a good partner, of course, helped a lot. But also working together with them and helping them scale their systems was something that we were involved in.
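The talk does not say how the services were tuned to the CA partners' capacity. One common way to absorb a step change like this (100,000 sites in a week is roughly 14,000 a day, several times the steady rate) without overwhelming a downstream API is a token bucket. The sketch below is only an illustration of that general technique; the class and parameter names are hypothetical, and the clock is passed in explicitly to keep the example deterministic.

```python
class TokenBucket:
    """Allow bursts up to `capacity`, then a sustained `rate_per_sec`."""

    def __init__(self, rate_per_sec, capacity, now=0.0):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity  # start full so an initial burst is allowed
        self.last = now

    def allow(self, now):
        """Return True if a request may be sent to the partner at time `now`."""
        elapsed = now - self.last
        self.last = now
        # Refill for the time that has passed, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate_per_sec=1, capacity=2)
for t in [0.0, 0.0, 0.0, 1.0, 1.5, 2.0]:
    print(t, bucket.allow(t))
```

Requests that are refused stay in the issuance queue and are retried later, so a bulk migration drains at the rate the partner can sustain while steady signups still get through.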
Yeah, I remember in the early days, there were certain operations with partners that were faster than others.
And our customers just want SSL: click a button, give me a cert.
But there were certain operations where we'd wind up minting certificates ahead of time, because that was slower, and then adding SANs to them, because that was faster.
So we would sort of order certs in advance, like get ahead of the customers, because then we knew we could add and remove SANs a little faster afterwards.
And I think a lot of it was just working with our partners to make sure we could fit their APIs and behaviors to what our customers wanted.
Yeah, it was not an ideal, you know, added complexity again, which seemed to get better.
Over time, you know, SSL took off and became the de facto standard for every website.
The CAs also had to. Yeah. And the other problem is very fascinating.
I'm sure you faced the same: the notion of availability is one or zero, it works or it doesn't.
You either have TLS termination at the edge or you don't.
And if you don't, it's a big outage for a customer.
So it's an interesting problem to solve, because it either works or not.
And you got to make sure that the speed of the thing that you're solving for is, can you get that certificate up as fast as possible?
Because we have ways to to mitigate any downtime while customers are migrating to us.
Yeah, the thing that I really disliked when I did this myself, before using Cloudflare, was that a year after you buy your cert from whomever, you'd have put a calendar event to remind you to do it again, and you go and look up the instructions again for what you did.
I mean, these days, obviously, bigger companies have much better configuration management, and a lot of the CAs have easier APIs, but it was a thing that was so painful.
And the fact that Cloudflare has had an offering that took away the renewal fear, the renewal worry, it's just one less ticking time bomb in your infrastructure to have to think about.
So yeah, not just the speed of initial issuance is obviously important during onboarding, but then just like the, the ongoing renewals, right?
Yeah, so after we launched dedicated certs and SSL for SaaS in 2017, in 2018 we decided to start moving some of our new signups using Universal SSL to the new pipeline as well.
And that's when we really hit some of our big scaling issues in the new pipeline.
We had issues with our database. Were the queries that we wrote initially performant at that time?
No. So we had to go and do an audit of all the queries, or the worst offending queries, and start improving those.
We had lots of issues with lock contention on some of our table rows.
And we had to figure out how to shard these tables
to reduce that lock contention. And as we were doing this, the one thing we had done as an MVP was, okay, we want to build an SSL issuance pipeline.
Let's just build it on a DB-backed queue to begin with. And then once we hit scale, we'll worry about that problem at that time.
And that happened in 2018. And we had to move our queuing system off of just using Postgres at that time.
And we moved to Redis.
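The talk does not describe how the tables or queues were actually partitioned. As a rough illustration of why sharding reduces contention, the sketch below routes each job to one of several independent queues using a stable hash of a key, so workers assigned to different shards never compete for the same items. The names (`shard_for`, `ShardedQueue`, the `zone-a` key) are hypothetical, and in-memory deques stand in for whatever backing store is used.

```python
import hashlib
from collections import deque


def shard_for(key, num_shards):
    """Map a key to a shard index with a hash that is stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedQueue:
    """A set of independent FIFO queues, one per shard."""

    def __init__(self, num_shards):
        self.shards = [deque() for _ in range(num_shards)]

    def enqueue(self, key, job):
        # All jobs for the same key land on the same shard, in order.
        self.shards[shard_for(key, len(self.shards))].append(job)

    def dequeue(self, shard_index):
        """Pop the oldest job from one shard, or None if it is empty."""
        shard = self.shards[shard_index]
        return shard.popleft() if shard else None


queue = ShardedQueue(num_shards=4)
queue.enqueue("zone-a", "issue")
queue.enqueue("zone-a", "renew")
shard = shard_for("zone-a", 4)
print(queue.dequeue(shard), queue.dequeue(shard), queue.dequeue(shard))
```

With a single table acting as the queue, every worker contends for locks on the same rows. Splitting by key spreads that contention out, and a dedicated queueing system removes it from the relational database altogether.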
And the analogy I liked from our product manager at that time was that we are basically an 18-wheeler going 120 miles an hour down a highway.
And we're changing the engine while the speed just keeps increasing and we're going down the road.
So that was an exciting problem. Fun problem. Lots of sleepless nights, but we're here and we made it all work.
We're still using the Redis queues, right?
Yeah, we are. Yeah. And so all those issues popped up, and those are good problems to have.
People are using the product. At the one-year mark in 2017, we also had a bit of an issue with the renewals.
It was the same thing, issues with our queuing system. So it was all coming together at the same time.
Yeah. The team put their heads down and we got it done.
Seems like we've gotten, I mean, the, the fundamental infrastructure has really come such a long way.
It was cool to see ACM launch recently, which feels like just icing on the cake in terms of advanced functionality for people who really want to turn the knobs and dials of SSL.
What are some of the things about that product that you think people would want to use?
Right. So ACM is a new product offering that just launched; it's called Advanced Certificate Manager.
The whole idea is that in the past we solved the problem of, hey, you come onto Cloudflare, you get a cert for free.
And we make sure that's fast and we make sure that's reliable in terms of renewals.
And that's great. With dedicated certs, we went into this world of, I want to add 50 custom hostnames.
I want to be able to terminate TLS on nth-level hostnames.
There are more things in the web PKI space that people want as we go on. Things like, I want to only have certificates that are valid for 90 days.
I want to be able to rotate my certs every 90 days for security reasons.
It improves the security posture a lot. And even the industry is moving towards a short-lived certificate space.
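For short-lived certificates, renewal has to be automated and has to start well before expiry. The sketch below shows one simple way to decide when a certificate is due. The function name and the 30-day renewal window are assumptions chosen for illustration, not a description of the actual renewal policy.

```python
from datetime import datetime, timedelta, timezone


def renewal_due(not_after, now, renew_before=timedelta(days=30)):
    """Return True once `now` is within `renew_before` of the expiry time.

    `renew_before` is an assumed safety margin: it leaves time for
    validation and for retries if the CA is slow or unavailable.
    """
    return now >= not_after - renew_before


issued = datetime(2020, 7, 1, tzinfo=timezone.utc)
not_after = issued + timedelta(days=90)  # a 90-day certificate

print(renewal_due(not_after, issued + timedelta(days=59)))
print(renewal_due(not_after, issued + timedelta(days=60)))
```

With a 90-day lifetime and a 30-day window, each certificate becomes due for renewal on day 60, so a pipeline moving from one-year to 90-day certificates handles roughly four times as many renewals per certificate per year.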
Things like, I want to have advanced cipher suite management.
So we usually issue two certificates for our paid products.
One is an RSA key certificate and one is an ECC key certificate.
A short-lived RSA and ECC key certificate. So customers want to be able to say, I want only these ciphers that I trust, that are in line with my audit requirements and policies.
And I want to only allow TLS termination for clients connecting with these ciphers.
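The cipher restriction described above amounts to intersecting three lists during the handshake: what the client offers, what the server prefers, and what the customer allows. A minimal sketch follows; the function name is hypothetical, and the suite names are OpenSSL-style identifiers used only as sample data.

```python
def negotiate_cipher(client_offered, server_preference, customer_allowlist):
    """Pick the first suite, in server preference order, that the client
    offers and the customer allows. Return None if there is no overlap,
    in which case the handshake would be refused."""
    allowed = set(customer_allowlist)
    offered = set(client_offered)
    for suite in server_preference:
        if suite in allowed and suite in offered:
            return suite
    return None


server_preference = [
    "ECDHE-ECDSA-AES128-GCM-SHA256",
    "ECDHE-RSA-AES128-GCM-SHA256",
    "AES128-SHA",
]
allowlist = ["ECDHE-ECDSA-AES128-GCM-SHA256", "ECDHE-RSA-AES128-GCM-SHA256"]

# A modern client that offers an allowed suite connects.
print(negotiate_cipher(["ECDHE-RSA-AES128-GCM-SHA256", "AES128-SHA"],
                       server_preference, allowlist))
# A legacy client that only offers a disallowed suite is refused.
print(negotiate_cipher(["AES128-SHA"], server_preference, allowlist))
```

The trade-off is visible in the second call: tightening the allowlist to satisfy an audit policy also cuts off any client that supports only the excluded suites.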
And now that we've opened this box, there are a lot more things we can add, and will be adding, in the future, such as, I want to get certificates with certain extensions.
So OCSP stapling and must-staple certificates are going to become a big thing.
And eventually we will, we will add support for that too.
And as the web PKI field adds more things to allow a better security posture, we are set up for success for that.
Yeah. So for people who aren't familiar, OCSP is how a client that receives a certificate can validate with an independent third-party service that it is still valid.
And then OCSP stapling is when you sort of staple that attestation to the certificate.
So they don't have to do a second hop.
Right. It improves performance, essentially. And while we do staple and support OCSP, there is going to be a type of certificate where, by policy, you should not trust it unless you also fetch OCSP.
Is that right? We don't, we don't want to issue those.
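The difference between ordinary stapling and a must-staple certificate can be sketched as a client-side decision. This is a deliberately simplified model: the function name and the string status values are hypothetical, and real clients differ in how strictly they treat a missing OCSP response. The must-staple marker itself is the TLS Feature certificate extension defined in RFC 7633.

```python
def client_accepts(cert_must_staple, stapled_status):
    """Decide whether a client accepts a certificate during the handshake.

    cert_must_staple: True if the certificate carries the must-staple marker.
    stapled_status:   "good", "revoked", or None when the server stapled nothing.
    """
    if stapled_status == "revoked":
        return False  # a stapled revocation is always fatal
    if cert_must_staple and stapled_status != "good":
        return False  # must-staple: a missing staple means do not trust
    # Without must-staple, a missing staple is tolerated in this sketch.
    return True


for must_staple in (False, True):
    for status in ("good", None, "revoked"):
        print(must_staple, status, client_accepts(must_staple, status))
```

The operational consequence for whoever terminates TLS is that with must-staple, failing to fetch and staple a fresh OCSP response in time turns into a hard outage rather than a slower handshake.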
Right. So, yeah, the other thing with ACM is that launching it also helped pay off a lot of technical debt
that we incurred while building this new system.
On the back end, we have a lot more knobs to turn to speed things up or slow things down, just a lot more control to offer a more reliable experience for our customers.
Very cool. I think ACM seems really interesting. I have yet to play with it, but it's on my list of things to do, to see if I can get picky about cipher suites for my personal website.
Thanks for chatting with me.
Yeah. Thank you.