The Curious Case of Caching CSRF Tokens
Presented by: Junade Ali
Originally aired on September 20, 2021 @ 5:30 PM - 6:00 PM EDT
This segment will consist of a talk discussing a real life performance problem faced by a Cloudflare customer, and how it was resolved.
English
Performance
Customer Stories
Transcript (Beta)
Hello.
Good afternoon from what is in London at the moment, the warmest day of the year.
And welcome to Cloudflare TV, wherever you're joining us from. I'm Junade, and today I'm going to spend a bit of time talking you through an interesting case study to do with both performance and security. I worked on it a few years ago, but it's nevertheless an interesting story.
So let's dig into things. We'll start off with why performance matters.
Performance is obviously an important thing.
We've got a bunch of statistics here as to why it's important, especially for e-commerce businesses, where, you know, it ties in with things like conversion rates and so on.
And so lots of businesses really care about delivering their websites as fast as they can to their customers.
There we are.
So when we talk about web performance, we can really subdivide the process into three crude, high-level steps.
The first step is the connection and request time.
That's really the time to make a DNS query, connect to the web server and establish a secure connection. The second step is the page render, where the web server needs to render the page.
It needs to, you know, do any database queries, call APIs, write logs, render the page and send the response back to the client. And then the third step is the actual response speed where, you know, you download that web page, load any assets on it and pull in those resources.
And in effect, those three steps are kind of the high level as to what ties in with web performance.
Most people, when they're addressing web performance issues, usually tend to think that the issues are almost always around response speed, the actual download time to their sites.
They tend to do lots of optimization measures on that front, you know, minifying the assets which are pulled, JavaScript, CSS, et cetera, using various different tactics to do that.
But this case study really focuses on the middle scenario, the page render, and it is to do with the kind of the caching side of things there.
So, in December 2016, Cloudflare kind of opened up a new feature to our business plan customers.
We kind of expanded the availability of this. It's called Bypass Cache on Cookie, and effectively, it allows a Cloudflare customer to basically configure their site such that when a cookie is set, it bypasses the cache.
That will bypass the cache at the Cloudflare Edge, and that allows for instances where, for example, you cache everything on the site, all static assets, but the moment a user logs in, the moment they interact, the moment they do something dynamic, they are no longer accessing the site via the cache for those HTML assets, but they'll still benefit from caching for static resources like CSS and JavaScript.
And so, I worked on this a while ago, 2016, December 2016, and this feature went out, and it was, you know, quite smooth, and it went out quite effectively.
Then, almost about six months later, there was an interesting case where a customer joined Cloudflare with a credit card on our business plan.
They're running an e-commerce website, which was using Magento 1.9.
Magento has since significantly increased its performance, but at the time, out of the box,
the e-commerce platform wasn't the fastest solution, and obviously, they wanted to improve their performance, so they looked to this feature, this Bypass Cache on Cookie feature.
Now, they had already attempted a lot of the things to do with, you know, response speed and the browser rendering content.
They enabled HTTPS. When you enable HTTPS, you get the advantages of HTTP/2, which meant the site itself was faster.
They also enabled things like server push, which means when the first asset is loaded, it immediately starts pushing, you know, the static resources, the CSS, the JavaScript, without having to wait for the page to be rendered and then those to be downloaded.
They inlined their critical path CSS, you know, the styles which are required to load the critical path.
They put those inline. They used WebP, so they enabled a feature called Cloudflare Polish.
When you toggle this on, images are delivered by Polish in a newer image format called WebP whenever that is faster to deliver.
And they also lazy loaded images and things as well.
So, you can see these optimizations they've applied really are tied to the side of things, which are to do with the response speed, with the browser rendering, rendering the page.
So, if we go back to our model of what affects performance here, they've got the first step, you know, which is really handy.
They were using Cloudflare's DNS solution as they onboarded to us.
They were using our out-of-the-box optimizations for SSL, lots of those different things.
Then they'd also, you know, enabled the features to get a faster response, the things which rewrite their website to make the browser's job a lot faster.
But the gap was really the page render, the middle step. So, in the scheme of things, these optimizations didn't make a significant difference as a whole.
And coming back to kind of connection and request time, you know, removing unnecessary redirects was quite important.
And then out of the box, when someone uses TLS on Cloudflare, when they, you know, use HTTPS, there are a bunch of features which we have enabled and which are optimized for speed as well as security: session resumption, OCSP stapling, and prioritizing fast elliptic curve cryptography ciphers.
So, in effect, that means the ciphers which use elliptic curve cryptography are prioritized if the browser supports them, which can be faster than more traditional methods. There's also dynamic TLS record sizing.
So, these were things which were already behind the scenes enabled for them.
But what it came down to is that what they needed to optimize was, you know, what we see here, which is an example of what a slow time to first byte looks like.
You have, you know, in this context, the time to first byte, shown in that lime color, which is substantial for the first request for that HTML page, whereas the other things around it, you know, the DNS lookup, the initial connection and the download of the content itself, take a relatively short amount of time.
But that lime green bar on the first request is what takes up a substantial amount of the time to make the request.
And this was effectively blocking everything else.
It was meaning that any other optimizations they made would not yield a substantial improvement, you know, as a whole.
They could, you know, even if they say by 50% cut down the amount of content that was downloaded, in the scheme of things, the time to first byte was the most substantial thing at hand.
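To make that point concrete, here is a small worked example in Python. The timings are made up for illustration and do not come from the customer's site; the only thing the sketch shows is that halving the download barely moves the total when time to first byte dominates.

```python
# Hypothetical timings in milliseconds for the first HTML request.
# None of these figures come from the talk; they only illustrate the ratio.
timings_ms = {
    "dns_lookup": 20,
    "initial_connection": 60,
    "time_to_first_byte": 1500,
    "content_download": 200,
}


def total(timings):
    """Total time for the request, summing every phase."""
    return sum(timings.values())


def improvement(before, after):
    """Fraction of the original total time that was saved."""
    return (total(before) - total(after)) / total(before)


# Scenario 1: cut the amount of downloaded content by 50%.
halved_download = {**timings_ms, "content_download": timings_ms["content_download"] // 2}

# Scenario 2: serve the HTML from an edge cache, so the origin never renders it.
cached_html = {**timings_ms, "time_to_first_byte": 50}

print(f"halving the download saves {improvement(timings_ms, halved_download):.1%}")
print(f"caching the HTML saves {improvement(timings_ms, cached_html):.1%}")
```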
And here we are. So, this is kind of the cookie-based caching structure the customer used.
Initially, the visitor would go to a site, you know, they're just browsing around and go to a product page.
At the Edge, the Cloudflare Edge network, what would happen is they would make, you know, an HTTP GET request for that asset.
That would hit the cache and it would just be returned straight to the visitor.
There's no customization needed at all. It can be served the same to every single user so it can be cached.
And any of the, you know, other static assets other than the HTML, the JavaScript, the CSS, the images, that can also be cached as well.
Then the moment the user interacts with the page, the moment they add something to the cart, what effectively happens is a POST request would bypass the cache as things are.
It goes straight to the origin. And when it goes straight to the origin, the origin then can determine what is sent back in response dynamically.
And so, it would miss the cache on that request and then on the way back, it would set a cookie.
And, you know, the cookie could be something which sets up a session, sets, you know, give some other form of indication that the user has interacted with the site.
And then from that point on, Cloudflare knows to bypass that element of the cache on their server.
There are some other solutions which can actually be done where, for example, in the cache key at the Edge, you can insert, say, the session ID that's used and that would give the user, you know, the benefits of caching alongside this functionality.
That's called custom cache keys.
But in this situation, we're dealing with a scenario where the user is, or the customer rather, the person running the site is basically using this in quite a straightforward configuration.
There is one state for users who are anonymous and there is another for those who have loaded their web pages dynamically.
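The two states described above can be sketched as a toy edge in Python. This is a simplified model, not Cloudflare's implementation: the `Edge` class, the `PHPSESSID` cookie name and the status strings are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field

# Assumed cookie name that marks a visitor as "dynamic"; illustrative only.
BYPASS_COOKIES = {"PHPSESSID"}


@dataclass
class Request:
    method: str
    path: str
    cookies: dict = field(default_factory=dict)


@dataclass
class Edge:
    cache: dict = field(default_factory=dict)
    origin_hits: int = 0

    def fetch_origin(self, req):
        """Stand-in for the slow round trip to the origin server."""
        self.origin_hits += 1
        return f"rendered {req.path}"

    def handle(self, req):
        has_bypass_cookie = any(name in req.cookies for name in BYPASS_COOKIES)
        # Only anonymous GET requests are eligible for the cache.
        if req.method != "GET" or has_bypass_cookie:
            return "BYPASS", self.fetch_origin(req)
        if req.path in self.cache:
            return "HIT", self.cache[req.path]
        body = self.fetch_origin(req)
        self.cache[req.path] = body
        return "MISS", body


edge = Edge()
print(edge.handle(Request("GET", "/product/1"))[0])   # first anonymous view
print(edge.handle(Request("GET", "/product/1"))[0])   # second anonymous view
print(edge.handle(Request("POST", "/cart/add"))[0])   # interaction goes to origin
print(edge.handle(Request("GET", "/product/1", {"PHPSESSID": "abc"}))[0])
```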
So, this configuration can be made via a Cloudflare page rule.
I've just given an example here as to what it could look like on a Magento site.
They enable Cache Everything. That tells Cloudflare to cache everything at the network Edge, but they say to bypass the cache if any of these cookies are seen on a request: external_no_cache, PHPSESSID or adminhtml. The pipe operator there is effectively an or statement.
They can also set the amount of the time to live those assets would live for at our network Edge before they would be purged.
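As a rough illustration of how such a pipe-separated rule could be evaluated, the sketch below parses a `Cookie` header and checks each cookie name against the pattern. The spellings `external_no_cache`, `PHPSESSID` and `adminhtml` are the likely written forms of the cookie names spoken above, and matching the pattern against whole cookie names is an assumption; the exact matching semantics at the edge may differ.

```python
import re
from http.cookies import SimpleCookie

# Assumed written form of the rule value; "|" acts as "or".
BYPASS_PATTERN = "external_no_cache|PHPSESSID|adminhtml"


def should_bypass(cookie_header, pattern=BYPASS_PATTERN):
    """Return True if any cookie name in the header matches the pattern."""
    cookies = SimpleCookie()
    cookies.load(cookie_header)
    matcher = re.compile(pattern)
    # Assumption: the pattern must match the whole cookie name.
    return any(matcher.fullmatch(name) for name in cookies.keys())


print(should_bypass("theme=dark"))
print(should_bypass("theme=dark; PHPSESSID=abc123"))
```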
So, for this customer, there was an interesting kind of dilemma when they enabled this.
On one hand, they found that after turning this on on a staging site, they couldn't actually add items to a shopping cart on the first attempt, but it did work on subsequent attempts to make that request.
And they contacted Cloudflare support.
This was about six months after this feature went out and this support ticket was passed my way in the morning.
I came into the office and someone on the Singapore team had handed something over and asked for my help on this, to do a deep dive and identify what the issue was.
And that's kind of where the story starts.
So, when debugging an issue like this, there are a few different steps to follow.
The first question is, is Cloudflare working as it should?
A few steps to do this.
The first is awareness of anything which may have gone out which could have impacted the feature, whether things like the automated tests in the software are working as they should and haven't been altered,
and whether there was a deployment on our side which could have introduced any bugs.
So, that looked all clear.
And the other thing, of course, is the support team member, the technical support engineer who had looked at this issue initially, they had already done some end-to-end testing themselves, they'd already done some diagnostics on their site, but it's often good to cross-check this work just to make sure there hasn't been anything skipped, even though our technical support engineering team are very, very technical and good at their job, there's always scope for human error in anything, so it's good to cross-check any work that goes on.
And then, thirdly, are there any recent Magento bugs?
And I couldn't find any. This is something which sometimes happens, especially if you're dealing with large software platforms, the large CMS platforms, a network like Cloudflare, we often end up hearing about bugs in pieces of software like that, sometimes before the vendor hears, and so it's often also good for us to check that and do some sanity checking.
And so, after those first priorities are done, and you know those individual components are working, we can then try to replicate the issue as a whole.
We can set up an identical version of Magento somewhere, spin up a server, put it behind Cloudflare with an identical configuration, see what's going on under the hood, really, and try to replicate the issue.
And this is where we're then able to identify what the issue is.
So, it's really about starting with this problem and trying to break it down into its component parts and see what is actually at fault here.
And fortunately, this allowed us then to replicate the issue, which was quite a positive step forward.
And the results of this debugging process were, it was to do with cross-site request forgery protection, which is kind of the initial instinct you'd get at looking at a problem like this, but having followed this process, we're certain that this is basically what the problem is at hand.
Cross-site request forgery protection on the add to cart form. So, the moment a user clicked the add to cart form on a site, it had cross-site request forgery protection enabled based on a session cookie and a token as a hidden field.
So, cross-site request forgery protection, for those of you who maybe aren't familiar with it, addresses a problem whereby if you have a form on a website, someone else could take that form, put it on their site, and when a user submits that form, it could do something on a third-party website.
So, imagine you have, say, a banking website, and there's a transfer money page.
If someone's logged in, and then they go to a different site, and maybe there is a hidden form somewhere on that other site, it could hit the endpoint on the banking site and basically trigger that action.
And as for the way cross-site request forgery protection works, there are a few different ways of achieving this.
OWASP has some quite good guidelines, but you have a session cookie, and this session cookie basically determines a user ID, and then you have a token, which is set on as a hidden field in that form.
And so, this is generated, it could be from a cryptographic hash function with a timestamp, perhaps, and this basically allows that process for it to validate that the user has actually come from that form on that page, and no one is trying to do a cross-site submission of that form.
So, that's where the name comes from, it's a cross-site request forgery, because someone is mimicking the behavior of that form.
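A minimal sketch of that session-plus-token scheme follows. It uses an HMAC over the session ID and a timestamp rather than a bare hash, which is one common way to build such a token; the secret, the lifetime and the function names are hypothetical and this is not Magento's implementation.

```python
import hashlib
import hmac
import time
from typing import Optional

SECRET_KEY = b"hypothetical-server-secret"  # assumption: kept only on the server
MAX_AGE_SECONDS = 3600                      # assumption: tokens live for an hour


def _sign(session_id: str, issued: int) -> str:
    message = f"{session_id}:{issued}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def issue_token(session_id: str, now: Optional[int] = None) -> str:
    """Create the value that goes into the form's hidden field."""
    issued = int(time.time()) if now is None else now
    return f"{issued}:{_sign(session_id, issued)}"


def validate_token(session_id: str, token: str, now: Optional[int] = None) -> bool:
    """Check the submitted token against the session cookie's ID."""
    current = int(time.time()) if now is None else now
    try:
        issued_text, digest = token.split(":", 1)
        issued = int(issued_text)
    except ValueError:
        return False
    if not 0 <= current - issued <= MAX_AGE_SECONDS:
        return False
    expected = _sign(session_id, issued)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), digest.encode())


token = issue_token("alice-session")
print(validate_token("alice-session", token))    # same session as the form
print(validate_token("mallory-session", token))  # token replayed by another session
```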
Magento, due to CSRF vulnerabilities, applied on a blanket basis, CSRF protection to all forms, which is definitely a good step forward.
This was a while before, I think, we even rolled out the Bypass Cache on Cookie behavior.
And this, from Magento 1.8, 1.9, basically broke some caching behavior there, because users were then no longer able to dynamically do bypass cache on cookie behavior, because what would end up happening, the user would basically make a submission, they would add something to the cart, and only at that point would the cookie be set, they would have something added to their shopping basket, or they wouldn't have something added to their shopping basket, but they'd at least have the cookie set so they could attempt it the second time.
But that's not ideal, you know, if someone clicks add to basket, you want it to work on the first attempt, basically.
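The first-attempt failure can be reproduced in a few lines. This is a simplified simulation under stated assumptions: the `Origin` class, the `form_key` field name and the token derivation are illustrative stand-ins, not Magento's code.

```python
import hashlib
import hmac

SECRET_KEY = b"hypothetical-server-secret"


def token_for(session_id):
    """Token bound to one session; a stand-in for the platform's own scheme."""
    return hmac.new(SECRET_KEY, session_id.encode(), hashlib.sha256).hexdigest()


class Origin:
    def __init__(self):
        self.session_count = 0

    def new_session(self):
        self.session_count += 1
        return f"session-{self.session_count}"

    def render_product_page(self, session_id):
        # The hidden field is baked into the HTML at render time.
        return {"form_key": token_for(session_id)}

    def add_to_cart(self, session_id, form_key):
        """Return (success, session ID to set as a cookie on the response)."""
        if session_id is None:
            # No session cookie yet: the token cannot be valid, but a cookie is set.
            return False, self.new_session()
        ok = hmac.compare_digest(token_for(session_id), form_key)
        return ok, session_id


origin = Origin()

# The page was rendered once for some earlier session and then cached at the edge.
cached_page = origin.render_product_page(origin.new_session())

# An anonymous visitor gets the cached page (no cookie) and clicks add to cart.
first_ok, cookie = origin.add_to_cart(None, cached_page["form_key"])

# The cookie now bypasses the cache, so the next page view is rendered for them.
fresh_page = origin.render_product_page(cookie)
second_ok, _ = origin.add_to_cart(cookie, fresh_page["form_key"])

print(first_ok, second_ok)
```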
And I'm happy to say, in Magento 2, this was dramatically stepped up.
They basically instead use some secure JavaScript practices to dynamically insert CSRF tokens, which kind of gets around a lot of these issues.
I'll let you know as to some more resources that can be used to identify that.
But at this time, Magento 2 really wasn't an option.
It's quite a big platform shift as well.
So it's not the easiest thing to migrate to, even if it is something which is available and the customer can upgrade to, it would take them, you know, a serious amount of time to do.
They could disable the functionality on very low risk forms, you know, but again, that still isn't ideal.
It's low risk, it isn't no risk, and ideally, as a security company, you want to eliminate as much risk as you can.
You could use, at the time, we had something called EdgeCode, but now we have Cloudflare Workers.
Cloudflare Workers allows you to write JavaScript at the edge, which means you can do a lot of these things at the network edge, inserting CSRF tokens and doing validation and things like that, really.
But ultimately, the outcome we recommended was we found a plugin which basically used a secure AJAX way to fill in the CSRF tokens on the site, which got around the caching issue.
And yeah, we updated our guidance, we told them, you know, this is how you install the plugin, this is how you solve this particular problem.
When you set this mode, it will basically, you know, use that method.
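The idea behind that approach can be sketched as follows: the cached HTML carries an empty token field, and a small uncached request fetches a per-session token that is then written into the form. The sketch is in Python for consistency and only models the flow; in a browser the fill step would be done by JavaScript, and the endpoint, field name and storage here are hypothetical, not the plugin's actual design.

```python
import secrets

# Cached at the edge and identical for every anonymous visitor.
CACHED_PAGE = (
    '<form action="/cart/add" method="post">'
    '<input type="hidden" name="form_key" value="">'
    "</form>"
)

sessions = {}  # session ID -> token; stands in for the origin's session store


def token_endpoint(session_id=None):
    """Uncached origin endpoint: returns (session ID for the cookie, token)."""
    if session_id is None or session_id not in sessions:
        session_id = secrets.token_hex(16)
        sessions[session_id] = secrets.token_hex(32)
    return session_id, sessions[session_id]


def fill_form(page, token):
    """What the in-page script would do after the token request returns."""
    return page.replace('name="form_key" value=""', f'name="form_key" value="{token}"')


def add_to_cart(session_id, form_key):
    expected = sessions.get(session_id)
    if expected is None:
        return False
    return secrets.compare_digest(expected.encode(), form_key.encode())


session_id, token = token_endpoint()
page = fill_form(CACHED_PAGE, token)
print(add_to_cart(session_id, token))  # works on the first attempt
```

The cached page itself never contains a token, so it stays safe to serve to everyone.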
And so I actually wrote a blog post about this particular strategy on the Cloudflare blog, and you can go and read it if you search the curious case of caching CSRF tokens, and you can read more about the technical measures behind the scenes there.
But the story doesn't really end there. There were a few other interesting bits which followed.
So earlier on, I referenced the OWASP guidance as to how CSRF tokens can be added to websites.
There's a few different ways to do CSRF protection on sites.
There is one way in which, for instance, a site injects a cryptographically secure token into the form to validate that the user has gone through that path.
And that's one way. And there's another layer you can add on top of this. And someone actually wrote a Cloudflare worker, which does this particular piece of functionality.
And what this does as well is check whether the Referer and Origin headers, which are received by the network edge, are consistent with the journey of the customer.
You know, that they're making the request across the same site.
If it's from a different origin or a different referrer, then there is definitely something going on cross-site there.
The special thing is those headers, the Origin and Referer ones, have a kind of special protected status.
So they're not really ones which can be spoofed in an AJAX request, which does improve the security somewhat.
But really in this context, the really critical thing to note is that, you know, both these measures used in tandem kind of operate really quite well together as ways of mitigating this attack strategy.
So it kind of isn't an either or, but it is a both situation.
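A header check of that kind might look like the sketch below. It is not the Worker mentioned above; the function names are hypothetical, the headers are assumed to arrive in a plain dictionary with canonical names, and rejecting requests that carry neither header is a policy choice made here for illustration.

```python
from urllib.parse import urlsplit

SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin_of(url):
    """Reduce a URL to (scheme, host, port), or None if it is not a usable URL."""
    parts = urlsplit(url)
    try:
        port = parts.port
    except ValueError:
        return None
    if not parts.scheme or not parts.hostname:
        return None
    return (parts.scheme, parts.hostname, port or DEFAULT_PORTS.get(parts.scheme))


def is_same_origin_request(method, headers, site_url):
    """Accept state-changing requests only when Origin or Referer matches the site."""
    if method.upper() in SAFE_METHODS:
        return True
    expected = origin_of(site_url)
    claimed = headers.get("Origin") or headers.get("Referer")
    if not claimed or expected is None:
        return False  # assumption: missing headers are treated as suspicious
    return origin_of(claimed) == expected


site = "https://shop.example"
print(is_same_origin_request("POST", {"Origin": "https://shop.example"}, site))
print(is_same_origin_request("POST", {"Origin": "https://evil.example"}, site))
```

This check is meant to sit alongside the token, not replace it.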
The other thing I was just going to add briefly is the other thing, which is around, you know, actually doing the CSRF tokens by a Cloudflare worker.
You can do that as well. We now have HTML rewriting at the network edge.
So you can rewrite a bit of HTML to add a CSRF token in somewhere, and you can do that validation at the network edge as well.
So long as you've got something to validate that when it goes to the origin server as well.
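The rewriting step amounts to adding a hidden field to each POST form as the page passes through. The sketch below shows only that transformation, using a regular expression in Python for brevity; at the edge this would be done with the platform's streaming HTML rewriter rather than a regex, and the `csrf_token` field name is an assumption.

```python
import html
import re

# Matches the opening tag of forms whose method is POST (case-insensitive).
FORM_OPEN_TAG = re.compile(
    r"<form\b[^>]*\bmethod=[\"']?post[\"']?[^>]*>", re.IGNORECASE
)


def inject_csrf_field(document, token, field_name="csrf_token"):
    """Insert a hidden token field right after every POST form's opening tag."""
    hidden = (
        f'<input type="hidden" name="{html.escape(field_name)}" '
        f'value="{html.escape(token)}">'
    )
    # A function replacement keeps backslashes in the token from being interpreted.
    return FORM_OPEN_TAG.sub(lambda match: match.group(0) + hidden, document)


page = (
    '<form action="/cart/add" method="post"><button>Add</button></form>'
    '<form action="/search" method="get"></form>'
)
print(inject_csrf_field(page, "abc123"))
```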
So you can really, in fact, use these different strategies at the network edge and ultimately drive performance up.
It doesn't need to make a round trip to the origin.
It is done a lot closer to the user. It doesn't need to wait for an entire request to be formed for the user to do that.
So from this stage, there are a few different conclusions which we reached.
First of all, the ability to cache anonymous page views is really, really powerful.
Things are faster for most users.
Most users, you know, they will not end up necessarily buying something from a website and often the most critical situation or when you want stuff to be fastest, it can often be during those earliest phases when you're trying to capture a user, when you're trying to convert them into a customer, et cetera.
It has other benefits as well. That has significant performance benefits for the customer.
Their origin server, their workload goes down in effect. They're freed up to do other things.
In certain situations, you may wish to use that custom cache key functionality we spoke about earlier, where you want a customer to have the other benefits as well:
they get a degree of caching even when they're doing things while logged into a site.
But in many cases, many small to medium sized businesses don't need that.
They just need to cache the anonymous things.
Using JavaScript or Cloudflare Workers to introduce these states when a page is cached is a really powerful functionality.
There are certain things you need to be cautious of when you are doing things with JavaScript in this fashion.
The one thing that comes to mind, for instance, is on cookies, you want to make sure you have things like the Secure flag and the SameSite flag enabled.
In the blog post I spoke about earlier, I've included some of these bits of guidance towards the end.
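As a small illustration of those two cookie flags, the sketch below builds a `Set-Cookie` value with Python's standard library (the `samesite` attribute needs Python 3.8 or later). The cookie name `session` and the `Lax` policy are assumptions; the right SameSite value depends on the site.

```python
from http.cookies import SimpleCookie


def session_cookie_header(session_id):
    """Build a Set-Cookie header value with the Secure and SameSite flags set."""
    cookie = SimpleCookie()
    cookie["session"] = session_id
    cookie["session"]["path"] = "/"
    cookie["session"]["secure"] = True     # only sent over HTTPS
    cookie["session"]["samesite"] = "Lax"  # not sent on cross-site form posts
    return cookie["session"].OutputString()


print(session_cookie_header("abc123"))
```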
The other thing, of course, is you want to make sure your GET requests remain safe.
If you need to do something which is non-idempotent, for those types of requests you really want to use POST, PUT, DELETE.
Making sure that those things are RESTful makes it a whole lot easier to make this process more secure in general.
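A minimal sketch of keeping GET safe: the handler below refuses to change the cart over GET, so cached or repeated GET requests can never have side effects. The routes, status codes and in-memory cart are hypothetical.

```python
cart = []  # in-memory stand-in for a user's cart


def handle(method, path, item=None):
    """Tiny dispatcher: reads go over GET, state changes only over POST."""
    method = method.upper()
    if path == "/cart" and method == "GET":
        return 200, list(cart)  # read-only, safe to repeat
    if path == "/cart/add":
        if method != "POST":
            return 405, "use POST to change the cart"
        cart.append(item)
        return 201, list(cart)
    return 404, "not found"


print(handle("GET", "/cart/add", "shoes"))   # rejected, nothing changes
print(handle("POST", "/cart/add", "shoes"))  # state change goes through POST
print(handle("GET", "/cart"))
```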
Here we are. There are a number of remaining questions which are at hand here.
We've addressed this issue for this particular customer, but it took about a year for the support request to come in.
The feature was released in December 2016, but the support request came in in October 2017.
It wasn't just one. There were a few customers who asked about this.
There were multiple requests about this issue, but we hadn't seen these requests come in before the specific issue had materialized.
If we go back to see the initial state where things were at for the customer here, when we were debugging this issue, we knew there weren't any regressions in Cloudflare, so there was no immediate need.
We knew this had been a longstanding issue with Magento, and so therefore, this particular issue couldn't really have been addressed.
It wasn't something that materialized very recently, and that's why customers brought it to our attention.
There must have been something very different at hand.
We knew there were a lot of customers using this functionality, but specifically in this e-commerce setting, we started getting these requests at this point.
Why was this the case and why did this kick off? It turns out it was Black Friday or approaching Black Friday in the US, and e-commerce website owners specifically, they were concerned about the performance of their web properties.
They were concerned about load and these types of issues.
That's why they adopted Cloudflare, and that's why we noticed this issue around that specific timing.
It wasn't because something had materialized as a brand new issue and it was something we need to address.
It turns out in response to a market condition, the customers effectively, our customers had to react to that.
They had to use our functionality in that process.
They had encountered this specific issue. Going through this process and helping our customers diagnose that is an interesting use case.
We get to see it at Cloudflare. We get to be able to help customers. At this particular time when I was working on this functionality, it really helped me gain insight into how things look for the business side.
Obviously, I think Black Friday this year and over the next few years, we'll hopefully not see scenes like this.
I'm not sure what that looks like. There we are. Thank you so much for staying with me and going through that little journey and that case study today.
I hope you enjoy the rest of Cloudflare TV's programming.
There is an email address on this page if you have any questions which you want to ask me or you want to get in touch.
Those will be forwarded to me. Thank you so much for taking the time.
And I hope you enjoy the rest of the programming for today. Thank you.