Terraform & Control Plane
Presented by: Garrett Galow
Originally aired on September 21, 2021 @ 4:30 PM - 5:00 PM EDT
Best of: Cloudflare Connect 2019 - UK
Join Cloudflare Director of Product Garrett Galow for a tour of Terraform and learn how you can use it to deploy infrastructure as code with Cloudflare. Featuring a guest appearance from Cloudflare customer lastminute.com.
English
Cloudflare Connect
Transcript (Beta)
How's the day so far? Good? I hope you've been enjoying it. I unfortunately haven't been able to enjoy it so much myself, but I'm looking forward to watching the recordings.
So my name is Garrett Galow. I'm on the product team here at Cloudflare, primarily responsible for APIs, account and user management, those sorts of things.
Today we're talking about Terraform and the control plane. So maybe you're familiar with Terraform, maybe you're not.
If you're not familiar with Terraform, what is it?
It's basically a way of treating infrastructure or services as code.
So it provides a configuration language that you can use to define resources and services and deploy and manage those like you would maybe do an application.
Since it's declarative and stateful, you can use it to see what's going to happen before it actually happens, to confirm that, yes, this is actually the intended thing that I want to do.
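As a rough illustration of that "see it before it happens" idea, here is a minimal Python sketch of a plan step: it compares recorded state with desired configuration and only reports what would change. This is not how Terraform is implemented; the function name `plan` and the sample records are assumptions made up for the example.

```python
from typing import Dict, List, Tuple

Resource = Dict[str, str]


def plan(current: Dict[str, Resource],
         desired: Dict[str, Resource]) -> List[Tuple[str, str]]:
    """Return (action, resource_name) pairs without changing anything."""
    actions: List[Tuple[str, str]] = []
    for name in sorted(desired):
        if name not in current:
            actions.append(("create", name))
        elif current[name] != desired[name]:
            actions.append(("update", name))
    for name in sorted(current):
        if name not in desired:
            actions.append(("delete", name))
    return actions


# Hypothetical recorded state and desired configuration.
current_state = {
    "dns_www": {"type": "A", "value": "192.0.2.1"},
    "dns_old": {"type": "A", "value": "192.0.2.9"},
}
desired_config = {
    "dns_www": {"type": "A", "value": "192.0.2.2"},
    "dns_api": {"type": "A", "value": "192.0.2.3"},
}

for action, name in plan(current_state, desired_config):
    print(action, name)
```

Nothing is applied here; the list of actions is what a reviewer would confirm before the real change runs.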
And Terraform has a kind of plug-in provider model, so therefore it can work with hundreds of services.
Cloudflare has a Terraform provider ourselves that we brought in-house and officially support, which supports a broad set, probably 95-plus percent, of Cloudflare services.
I was here in London a year ago talking about Terraform right after we had officially adopted the plug-in, and two things came out of that.
One, someone asked me afterward, are you actually going to support this?
Like, it'd be great, like, this seems awesome, but if you don't actually support it, we're not going to be able to use it, and I'm happy to say that a year later we are officially supporting it and we have support for basically all of our services.
The other thing that happened at that event was I was doing a demo where I was provisioning some load balancers over services across a couple clouds, and an hour before my demo, I had to do an AV check like this.
I get up, and my laptop would not project onto the screen properly.
And so I had to move my entire demo from my laptop onto a colleague's laptop.
And if you've ever done a demo before, you know, it's not always the most robust thing.
You have kind of configuration in random spots. It's not really meant to be reproducible.
But because I had written my entire demo in Terraform, all I pretty much did was copy my TF files over to my colleague's laptop, install Terraform, and deploy that.
And within a few minutes, I had my entire demo rebuilt on my colleague's laptop and was able to give the demo flawlessly.
That shows the very small-scale case, you know, where I'm just trying things out.
Terraform makes it really easy to share with other people what it is that I provisioned.
But it takes a lot more than that to use Terraform in a production environment to manage all of your services.
So I'm really happy to say that we have lastminute.com here, who's going to tell us about how they've been able to automate pretty much all of their infrastructure deployments, including Cloudflare, using Terraform.
So I'd like to invite Francesco and Lorenzo up to the stage.
Thank you very much. Okay.
So thank you for the presentation, Garrett. So, well, I'm Lorenzo. I'm part of the SRE team.
I'm the team manager of the platform division at lastminute.com. With me is Francesco, who is an SRE tech specialist.
And as you can see in the picture, this is the team that was involved in the Cloudflare integration.
So some of the people started with the configuration.
Others with the automation. And others followed the roadmap to deploy the websites.
So as you can see here, some numbers about our company.
We are an online travel agency. The most relevant number that you can see here is 23,000 passengers per day that we handle with our platform.
And we are a company based in Chiasso, Switzerland.
But we are spread across 12 different European regions, like Spain, Paris, Germany, France, and so on.
So just a quick introduction about our brands. lastminute.com is the main one, which is available in the major countries in Europe.
But we also have key brands that are part of the geo-localized business.
So weg.de for Germany, Rumbo for Spain, Volagratis for Italy, Bravofly for France.
But we also have other kinds of business, like metasearch, led by Hotelscan and Jetcost.
Here you can see our stack.
So Kubernetes and Docker are the core of our platform, which now handles 90% of our business.
On top of Kubernetes, we have Java-based microservices.
We have an ecosystem of JDK 11, OpenJDK, Kotlin, Hazelcast, Spring Boot.
The data is managed by MySQL, monitoring by Graphite and Grafana, and logging by Fluentd and Graylog.
So again, some numbers. More than 200 microservices, that honestly are a lot.
700 virtual machines that unfortunately manage the legacy. So raise your hand if you have no legacy in your environment.
This means that we handle 2K of pods in Kubernetes.
So 8K Docker containers, 300 physical machines, and 260 database schemas.
We have 300 developers. So how can a team of 12 people, because in SRE we are 12 people, handle all this heterogeneous environment? With our pillars.
The first is automate all the things. That means we have to avoid any manual interaction with the infrastructure.
We need to be able to rebuild from scratch at any time with the same data.
This means that you have control of your platform.
Keep it simple means that you must not build a complex infrastructure.
This means that sometimes you should be able to guide your developers and try to focus as much as possible on keeping it easy, because you have to maintain your platform.
API first. With these pillars, we changed the mindset and we were able to promote infrastructure as code.
So the only way to, let's say, go forward with the technology.
And fail safe. This means that if you are able to have a pipeline to promote a change, you are able also to fail, because if you are able to restore the previous version, you don't care about possible failure.
The challenge that we had was a change from the previous CDN vendors to Cloudflare only.
So we started a POC in 2018 with Cloudflare. We ran the POC with some KPIs, like the WAF, which was the first feature we evaluated Cloudflare on.
And then with the automation, we were able to migrate 800 websites in just three months.
That honestly was a challenge. How? Well, Francesco now will explain better some details.
Thanks very much, Lorenzo, and hello everybody.
So the automation that we built on top of the amazing Cloudflare API started in September 2018.
At that time, the Terraform provider did not have all the enterprise features available.
And so we built a custom piece of software, Mjolnir, written in Go, that orchestrated both the Cloudflare API calls missing from the Terraform provider and the creation of the resources via Terraform.
With that, basically we integrated all the missing features, like partial zones and the TLS API, that were not present at the time.
But the team has now made them available in the Terraform provider, so we will be able in a few days to remove Mjolnir and migrate all of the resource management under the Terraform provider.
With this automation, managing one zone was simply putting some records inside the TF files and starting the pipeline.
So in under two minutes, we have all our zones configured. This is an example of the list of resources that we use in Terraform.
The snippet of code is from the Terraform provider documentation.
And it just shows that in 20 lines you can configure a zone with all the default parameters.
So it's very simple. On top of that, we also added DNS records, firewall rules, page rules, and rate limiting, which are very important to us.
Our use case is using the GitOps flow. So we have all of our code stored in GitLab, which is our SCM.
And by merge request or pull request, the team is able to work on the code base.
Merging the request starts a pipeline that applies the Terraform configurations.
We store the remote state in an S3 bucket, so multiple developers can work on the same project at the same time.
And Terraform orchestrates the creation of resources in the Cloudflare API.
Starting from that, Cloudflare is configured and ready to protect our infrastructure on the origin.
Because we have so many websites and 37 top-level domains, all for the different brands, we needed consistency.
And to achieve consistency, we needed modularity to apply all of these changes in the same way and in a desired fashion.
We used a Terraform feature called modules. So modules are basically packages that you can combine and you can abstract infrastructure with.
And you can instantiate modules many times with different configurations.
In the picture here, you see we have decided to have one top-level domain matched with one Terraform file.
We store the JSON record in a JSON.txt record coming from a previous run of Terraform.
The module zone config that we created just for our use case has an instance for each of the top-level domains.
And inside the module, it accepts variables or uses defaults to drive the creation of all the third-level domains under the top-level domain.
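The "one module, many instances" pattern described here can be sketched in Python by generating Terraform's JSON configuration syntax, with one `module` entry per top-level domain. This is only a sketch under assumptions: the module path `./modules/zone_config`, the variable names `zone` and `subdomains`, and the domains are all hypothetical, not lastminute.com's actual layout.

```python
import json
from typing import Dict, List


def module_instances(tlds: Dict[str, List[str]]) -> dict:
    """Build one module block per top-level domain (Terraform JSON syntax)."""
    modules = {}
    for zone, subdomains in sorted(tlds.items()):
        instance_name = "zone_" + zone.replace(".", "_")
        modules[instance_name] = {
            "source": "./modules/zone_config",  # hypothetical module path
            "zone": zone,                       # hypothetical input variable
            "subdomains": subdomains,           # hypothetical input variable
        }
    return {"module": modules}


# Hypothetical domains; each gets its own instance of the same module.
config = module_instances({
    "example.com": ["www", "static"],
    "example.ie": ["www"],
})
print(json.dumps(config, indent=2))
```

Because every instance points at the same module source, a change inside the module applies to every domain in the same way, which is the consistency argument made above.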
Another challenge for us was using multiple DNS providers.
We don't yet use Cloudflare DNS. So the partial zone feature was very handy to us because we have three different AWS accounts in which we store our DNS zones.
Those come from the acquisitions that we made in the past. And so we had to manage different accounts to populate the TXT activation records that come from the creation of the zone in Cloudflare.
We did that by using the provider aliasing feature in Terraform so that we are able to specify a specific provider inside the module and tell that module that it should use certain credentials to connect to a certain provider.
In this way, for example, you see lastminute.com and lastminute.ie are managed by the main AWS Route 53 account, while weg.de is managed by its own AWS account.
And we were able to do that without much effort just by adding the provider in the main configuration file and aliasing that into the module.
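A minimal sketch of that aliasing idea, again emitting Terraform's JSON syntax from Python: several aliased `aws` provider configurations are declared once, and each module instance is told which one to use through its `providers` map. The alias names, AWS profile names, module path, and domains are assumptions for illustration, and the exact JSON shape should be checked against the Terraform documentation before relying on it.

```python
import json
from typing import Dict


def aliased_config(accounts: Dict[str, str],
                   zone_to_account: Dict[str, str]) -> dict:
    """Declare one aliased AWS provider per account and bind modules to them."""
    providers = [
        {"alias": alias, "profile": profile}  # profile names are hypothetical
        for alias, profile in sorted(accounts.items())
    ]
    modules = {}
    for zone, alias in sorted(zone_to_account.items()):
        if alias not in accounts:
            raise ValueError(f"unknown provider alias: {alias}")
        modules["zone_" + zone.replace(".", "_")] = {
            "source": "./modules/zone_config",     # hypothetical module path
            "zone": zone,                          # hypothetical input variable
            "providers": {"aws": "aws." + alias},  # pick the aliased provider
        }
    return {"provider": {"aws": providers}, "module": modules}


config = aliased_config(
    {"main": "main-dns", "brand": "brand-dns"},
    {"example.com": "main", "example.ie": "main", "example.de": "brand"},
)
print(json.dumps(config, indent=2))
```

Two zones share the `main` account while the third uses its own, mirroring the split between accounts described in the talk.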
Here is an example of what we are able to do. We can either bulk update all 800 domains and all 37 top-level domains, changing a WAF rule from simulate to ban, or we can pinpoint the change in the module configuration.
This sort of automation with such high numbers was not possible using the API only.
Terraform is the driver that enabled us to get this sort of thing done.
We talked a bit about the automation, but the side effect, let's say, of using Cloudflare is that when we migrated from the previous vendors to Cloudflare, we also decreased the time to load by 20%.
That was an unexpected goal for our company, so we were happy about it.
Then, having the opportunity to use the API, which honestly I really appreciated, we created a sort of Go probe in our Kubernetes environment that, using the API, gathers all the information about our zones and our websites and creates the data for our monitoring stack, so Graphite and then Grafana for the rendering.
This project is now available on our GitHub, it is open source under Apache 2.0, and we also have a link to our company blog.
So, well, just thanks for the support of Upgrade, the company that helped us with Cloudflare, and then a big thanks to the Cloudflare team that supported us during the POC and also by scheduling a recurring meeting and giving feedback about the automation that we put in place, with a lot of, let's say, useful suggestions.
Now, I will leave Francesco to explain a bit this demo, that is, let's say, how we create and we handle a domain.
This is a video demo that we did because we don't trust the network. So here I show that we commented out one of our domains.
I'm doing this with our production account because we are very confident.
But mainly this is our project in which we are managing all the zones.
And right now what you can see here in the control plane, there is not a static root.com zone, which is where we serve our static assets.
Terraform runs, and what it does, it's populating the state, refreshing the state for all the zones and all the records.
In here, actually, it's calling also the Go software to populate the TLS API that we are migrating out of the Terraform provider.
Now the record has been generated, and the DNS has been put into the AWS account.
So now the Terraform runs all the projects, and you can see in the static root.com definition we have a lot of page rule targets.
And immediately, one minute after we create the zone, it goes back into the control plane.
If I switch it a little bit.
So one minute of rendering 800 domains, refreshing the domains later.
We see it's now going into the TXT record creation.
And I added 64 resources to my Cloudflare account, to our Cloudflare account.
So right now, it shows that everything is created, and not only the zone is activated here with all the 30 and more page rules that were configured in the module, it's also creating the DNS records in the AWS account that are used then to activate the zone.
So in a couple of minutes, we have our zone ready to proxy traffic.
And in fact, if we now take one of the DNS records that we have created that are present on the Cloudflare network, but are still served by another network.
I think it's here. Well, yes, here it's also showing that we have the availability of all the monitoring and analytics tool for the record already.
The page rules again.
So now, I think if I go a little bit DNS, okay. We are now ready to...
Currently, it's on another provider. So you can see that it's served by some other CDN.
And now, if we point to the new endpoint in Cloudflare, we will have the CF-Ray header, which is the identification of the request.
And that's basically it for us. Thank you very much.
Thank you. Thank you. Thank you. I think it's pretty incredible how well-oiled that machine is.
So you make a single change.
It's impressive that you're willing to delete something in production and recreate it on the fly.
That shows, I guess, true trust in things. I'm hoping that the network does hold up for my demo.
But moving on. So I want to talk about a couple things that have come down the pipeline recently and something that is still in beta to talk about.
So obviously, if you are already using Terraform and then you come to Cloudflare, it's easy.
You just start creating all of your resources inside of Terraform.
But if you've been a Cloudflare customer for a long time and maybe you're thinking about adopting Terraform, one of the really big challenges you face is, how do I get everything I have in Cloudflare into Terraform?
If you've actually looked at Terraform at all, the built-in rules for importing resources, which could be DNS records, page rules, zones, are really, really painful.
It takes an ID plus the zone name, in some cases, to go and fetch the state.
And you still have to write the actual configuration yourself. So we've released something we call cf-terraforming.
It's based on another open-source library called Terraforming, which is built for AWS.
We basically built our own version of it for Cloudflare resources.
And so what it allows you to do is take all your existing Cloudflare resources and put them into Terraform.
So we'll actually go through and generate not only the configuration, the descriptions of the settings for your resources, but also the state, such that after you run this for something like your DNS records on a zone, you can run terraform plan and see that there are no changes that need to be made.
Syncing config and state ensures that things work perfectly from the first time.
You basically import using our tool, and then you're ready to go.
And you can even merge it inside of existing Terraform files.
You don't have to necessarily treat Cloudflare outside of the environment that you may already have configured for Terraform.
We released that earlier this year, so that's now on GitHub and available, open-source.
So take a look at it.
I'd love to hear your feedback if you do try and use it. It's something we're going to continue to invest in.
We pretty much have parity with what we support in Terraform already, and we'll keep making the tool better.
Switching gears slightly, another thing that we're working on that's currently in a beta state right now is what we're calling API tokens.
So if you're using Terraform, you're using our API, you may be using our API in other ways.
Probably...not probably.
The most common complaint that we hear about our API is, I'm scared to use it.
It gives too much power. If you use our API keys, the way it works is you get one per user, and that key has access to everything that user does.
So if you have access to 10 accounts and 500 zones, and you're a super admin in those accounts, then your API key can do any action that you could do in the dashboard.
In very, very limited cases, that's fine.
But if you're trying to deploy something onto a server somewhere, you really don't want to risk the fact that if somehow someone got onto that server and stole the key, they'd basically be able to, like, wipe your Cloudflare account.
So what are API tokens? API tokens aim to solve this problem.
The first thing is, you know, if you're actually using the API in a production environment, if for some reason you need to roll that key, you'd basically have to take a hit to your production availability because if you needed...for the time you roll the key and redeploy the key into your environment, you basically can't make API calls.
So the first thing you can do is you can create multiple keys, multiple tokens.
More importantly, for each of those tokens, you can actually scope them to specific resources with specific permissions.
So you could say, I want a token that can edit DNS records on a specific zone.
So now if you have an application or a service that only needs a very limited access to Cloudflare, you can actually give it that proper access.
And last is, you know, really treating tokens more securely.
So being able to see when was this token last used. So if you think something may have happened but you're not sure, you can actually see, oh, the token got used five minutes ago.
I haven't been using this token. That means something bad happened.
Or the ability to roll the secret but not let it be visible again. So if someone logs into your account, they maybe can't see your secret.
This is currently in beta right now, currently a closed beta, and I'm going to show you a quick demo around this.
So we're on the Cloudflare dashboard, and just to – if you go to a user profile, notice that it may look a little different than you're used to.
I now have a few different top options, and one of these is API tokens.
So if I select that, I see on the bottom part of the screen the good old API keys section that you've probably seen before.
But above it, I now see API tokens. And so we're going to create a token.
And so I'll call this demo token. And you'll notice there's permissions that can be defined, and there's resources that can also be defined.
I want to kind of shortchange this a little bit, so I'm going to actually go to our templates and start with one of our templates.
So part of the thing is, you know, this is a new system.
It may take some getting used to. And also, you know, a lot of people want to do similar things with our API, so we figured we could probably make it easier for you to get started.
So I'm going to select the Edit zone DNS template. And so you notice it has permissions for zone, for DNS, to edit.
I'm actually going to change that to read in case someone takes a picture of the token, they can't use it to do anything bad.
And then I'm going to include a specific zone, and in this case, I'm going to pick my personal website, garygayla.com.
And then once I've selected that, I'm going to look at a summary.
And so here, if I was selecting a more complicated token setup, I could actually see and understand what it is that this token is going to be able to do.
In this case, it will be able to read DNS records for one of my zones.
So now I'll create the token, and I get this. Two things on this screen.
At the top, you see what looks like the secret, which is the actual secret of the token that you use to make API requests.
Beneath it, you'll see a curl that you can use to verify the token.
So you can use this if you have a token to see is this token active and whatnot.
So we'll actually start by showing that. Going to curl here.
Put that in and make sure it looks all pretty when we use it. And you'll notice we get a result back.
It says that this token has an identifier, it's active, and you see the message, this API token is valid and active.
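The curl check shown in the demo can be approximated in Python. This is a sketch, assuming the token is sent as a Bearer credential to Cloudflare's token verification endpoint; the token value is a placeholder, and the request is only built here, not sent.

```python
import urllib.request

# Cloudflare API endpoint for verifying an API token.
VERIFY_URL = "https://api.cloudflare.com/client/v4/user/tokens/verify"


def build_verify_request(token: str) -> urllib.request.Request:
    """Build (but do not send) the request that checks whether a token is active."""
    return urllib.request.Request(
        VERIFY_URL,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


# Placeholder secret; a real token would come from the dashboard.
request = build_verify_request("EXAMPLE_TOKEN")
print(request.get_method(), request.full_url)

# To actually call the API, pass the request to urllib.request.urlopen()
# and parse the JSON body, which reports the token's status.
```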
So right now, we're sort of, you know, when you create tokens, they're long-lived.
But sort of on the roadmap, we'll be adding support and exposing support for things like not before, not after logic.
So I can say, I want a token that is only alive until three weeks from now, and at that point, it will die.
Things like that. So if you want to give someone a token, but you don't want it to sit around forever, you can make it expire.
So that's all good. Let's actually see if we can use it.
So instead of having to actually write out requests myself.
So what we have here is just a DNS request to the zone that I gave this permission to, to view the DNS records.
And so if I send that, notice it completes. I can see my DNS records.
I have a few of them. Great, no problem. What if I try and delete this?
So if I want to actually delete one of these DNS records. So let's say, let's take this one and say, delete that record.
Parse that out.
And I'll get an unauthorized access. So this token, since it doesn't have those permissions, it cannot edit DNS records, I therefore cannot delete it.
If I also tried to access DNS records of another zone, it would be prohibited.
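The behavior in this demo (read allowed, delete refused, other zones refused) can be modeled with a small Python sketch. It is only a toy model of scoped tokens, not Cloudflare's implementation; the class name, the permission tuples, and the zone names are assumptions.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class ScopedToken:
    """Toy model: a token is a set of (service, action) pairs limited to zones."""
    permissions: FrozenSet[Tuple[str, str]]
    zones: FrozenSet[str]

    def allows(self, service: str, action: str, zone: str) -> bool:
        # Both the zone and the exact permission must have been granted.
        return zone in self.zones and (service, action) in self.permissions


# Hypothetical token: read-only DNS access on a single zone.
demo_token = ScopedToken(
    permissions=frozenset({("dns", "read")}),
    zones=frozenset({"example.com"}),
)

print(demo_token.allows("dns", "read", "example.com"))  # listing records
print(demo_token.allows("dns", "edit", "example.com"))  # deleting a record
print(demo_token.allows("dns", "read", "example.org"))  # a different zone
```

A stolen token of this shape could list one zone's DNS records and nothing more, which is the contrast with the per-user API key described earlier.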
So that's sort of the start of it.
This is in, as I said, this is in a closed beta right now.
If you are interested in trying this feature, either reach out to your CSM or email me at gg at Cloudflare.com.
We're looking for more people to test this. We plan to roll it out in a more open way quite soon.
And that is it. Any questions? All right.
Thank you very much. There's too much that goes into creating high-quality video today that's just simply still too hard for many of our customers.
Most cloud providers don't actually provide a turnkey solution for creating high-quality video.
They provide bits and pieces of the equation, but there's no provider that provides an end-to-end solution from rendering to streaming.
They'll provide bits and pieces that now you have to kind of cobble together to build an amazing product.
Our focus now is how do we simplify and streamline that by providing a deeply integrated, simple and easy-to-use solution.
A big part of what we do at Cloudflare is, as we focus on helping build a better Internet, is take complicated things and make them simple and to enable them to just literally be able to go to Cloudflare, to log in, to point their video asset at Cloudflare, and then on the other end be able to pull a player out of Cloudflare and place it wherever they need to be able to deliver the video.
And that's it.
There's a triplicate where you can do something either well or fast or cheaply.
And so we're striving for all three because we really need it. We need it to be really good because otherwise why would anyone use the service?
You've got an entire Internet out there, use something else.
We need it to be fast because people have no patience.
And we need it to be cheap enough that we can stream to millions of users without it becoming uneconomical.
So you have to get all three, and Cloudflare's a really important part of offering all three.
If you want to deliver a video to anybody on the globe, there really is no better network to put it on than Cloudflare because we can guarantee the highest quality experience to somebody who is in New York City and someone who's in Djibouti and someone who's in Sydney.