How to Prepare for Black Friday
Presented by: Simon Wijckmans, Deon Roos
Originally aired on September 29, 2021 @ 11:00 PM - 11:30 PM EDT
Ecommerce customers often ask for advice on how to prepare for high load scenarios like Black Friday, Cyber Monday, or product launches. This segment will cover a range of best practices to secure your online shopping experience.
English
Transcript (Beta)
Hello everyone, my name is Simon and together with my colleague Deon, I'll be presenting some general best practices on how best to prepare for peak moments like Black Friday.
We're going to be splitting the content over two presentations, one today and one later this week, which will be presented by my colleagues Paolo and Vince.
So let's get started. So we're going to do this in a bit of a Q&A session.
So we're going to ask a question first and then show how we, as SEs, would normally answer that question.
So: I run an apparel ecommerce store and we're preparing for a large Black Friday sale.
We have customers all over the world, and last year one of our servers failed due to CPU overload.
Other customers complained that our website was very slow. So what would be the best way for us to prepare in terms of static performance?
So in this question, there's quite a few interesting things to note.
I marked them down in orange. So one piece of it is a large Black Friday sale.
This would be a high peak moment, and something for which we'd like to verify the configuration up front if you already have an existing config on Cloudflare.
If you're new to Cloudflare, you'll do what I'll be presenting in a couple seconds.
We have customers all over the world, so you can expect them to come from a variety of Cloudflare points of presence.
So you might want to incorporate that into your strategy.
And then last year, they had some CPU overloads.
That would mean that they probably didn't have a CDN in place, or that they didn't account for the additional load.
So how would we deal with static performance here? Firstly, let's quickly talk about how the CDN actually works that we offer.
So the Cloudflare CDN is a pull CDN, which means a customer would take the first hit in that region.
So let's say I'm currently in London.
No one else has ever been to that website or ever loaded those objects in that area.
I, as the first customer, would see a miss, which means the content is served to me directly from the origin.
The Cloudflare edge would then cache that content.
From then on it would be a hit, and it would no longer be requested from the origin.
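As a toy sketch of that pull-CDN behaviour: the first request for an object in a region is a miss and goes to the origin; the edge then stores it, so later requests are hits. Real edge caching is far more involved (TTLs, revalidation, eviction); this only illustrates the miss-then-hit flow, with an invented example object.

```python
# Toy model of a pull CDN edge: first request misses and fills the cache,
# subsequent requests for the same path are served from the edge.

class EdgePop:
    def __init__(self, origin):
        self.origin = origin   # maps path -> content, stands in for your server
        self.cache = {}

    def get(self, path):
        if path in self.cache:
            return "HIT", self.cache[path]
        content = self.origin[path]        # first request pays the origin trip
        self.cache[path] = content
        return "MISS", content

london = EdgePop(origin={"/logo.png": b"...image bytes..."})
print(london.get("/logo.png")[0])  # MISS: first visitor in the region
print(london.get("/logo.png")[0])  # HIT: served from the edge from now on
```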
So what do we actually cache? That's a really good question. So generally, those would be what we call static objects.
Think of product images, CSS files, JavaScript, sometimes even HTML files if you specifically configure that.
On the right of this page, you see a whole bunch of extensions that by default we always cache.
What is important to know is Cloudflare respects your cache control headers.
So if you have cache control headers in place, it is very much a click and go scenario.
You can quickly onboard onto Cloudflare and we will cache just as you would normally cache.
So there's a picture of what a cache control header looks like in your browser.
You can see that through the developer's tools quite easily.
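To make that concrete, here is a small sketch (the header value is a made-up example) that splits a Cache-Control value into its individual directives, which is essentially what a cache does before deciding whether and how long to store an object:

```python
# Minimal sketch: parse a Cache-Control header value into its directives.

def parse_cache_control(value):
    """Split a Cache-Control value into a {directive: value-or-None} dict."""
    directives = {}
    for part in value.split(","):
        part = part.strip().lower()
        if "=" in part:
            name, _, val = part.partition("=")
            directives[name] = val
        else:
            directives[part] = None
    return directives

header = "public, max-age=14400, s-maxage=86400"   # illustrative value
parsed = parse_cache_control(header)
print(parsed["max-age"])   # how long a browser may keep the object, in seconds
print("public" in parsed)  # True: shared caches such as a CDN may store it
```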
And so I'm going to quickly show you what good results look like. In the Cloudflare dashboard, under cache analytics, you will find a graph that should look something like this.
So this is a good example because, as you can see, the orange line is the content that is served by Cloudflare.
As you can see, the majority of requests are actually being served through us.
So almost all content is being served as a hit.
Only about 20k requests daily, quite stably, are being requested from the origin directly.
So this is quite a good number. Now the next slide is sort of a horror story.
This is my demo site. There's not a lot of traffic on there.
And there's also a lot of content I specifically marked to be served dynamically.
So as you can see, the orange line is below the blue line. That being said, obviously, this is not a real life scenario, but there's a lot of content that Cloudflare is not really touching and I'm just serving directly every time.
So as you can see, I have some misses. I have some revalidates in there and quite the amount of dynamic content.
So what you can do in the analytics pages, you can filter just on, for instance, the misses or the dynamic and then look in what the path was of that specific object.
So, for instance, I see here that Instagram.svg and favicon.ico, an icon file, are not being cached.
Let's say that's like 10k requests that are not being cached that have an extension of .ico or .png.
That probably indicates that there's no cache control header in place for that specific object.
And so you might want to spend some specific time on making page rules to bypass or override the cache settings on those specific objects so that they are actually being cached.
So how do you do that? Very simply, you go into the dashboard and go into the page rules.
You select the specific path you want to target and you create a rule that looks something like this.
So: cache level everything. You can change the edge cache TTL to whatever you'd like.
You can even push that up to a year. And you can then select whether you want to adhere to the origin cache control or override it.
So that's basically all that that specific page rule does. And so caching is now working.
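The same rule can also be created through the Cloudflare API by POSTing a page-rule body to your zone. The sketch below builds such a body; the URL pattern and TTL are example values, and the exact field names should be double-checked against the current API documentation before use.

```python
# Sketch of the JSON body for creating a cache-everything page rule via the
# Cloudflare API (POST /zones/<zone_id>/pagerules). Values are illustrative.
import json

def cache_everything_rule(url_pattern, edge_ttl_seconds):
    """Build a page-rule payload that caches everything under url_pattern."""
    return {
        "targets": [{
            "target": "url",
            "constraint": {"operator": "matches", "value": url_pattern},
        }],
        "actions": [
            {"id": "cache_level", "value": "cache_everything"},
            {"id": "edge_cache_ttl", "value": edge_ttl_seconds},
        ],
        "status": "active",
    }

rule = cache_everything_rule("example.com/static/*", 86400)  # one day at the edge
print(json.dumps(rule, indent=2))
```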
Cool. But what happens behind the scenes? We have something called tiered caching, which is purely for enterprise customers.
And what that specifically does is: if you have customers all over the world, and some of them are in less populated areas where our data centers see less use, it's very likely that we will have a tier 3 POP there.
So we have multiple tiers in locations.
And so that tier 3 POP would probably not have that cached content available straight away.
Instead of going directly to your origin, we'd look for the content on a tier 2 POP first.
If that one doesn't have it, then we'll go to a tier 1 POP.
And then each and every tier 1 POP has a direct connection to your origin in case they don't have it either.
So it is likely that you will have a couple of connections coming from multiple tier 1 POPs around the world to your origin.
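The lookup order just described can be sketched as a simple chain: a tier 3 POP asks a tier 2 POP, which asks a tier 1 POP, and only the tier 1 POP falls back to the origin. The tier structure and fill-on-the-way-back behaviour here are a simplified illustration, not the actual implementation.

```python
# Toy model of tiered caching: check tier 3, then tier 2, then tier 1,
# and only contact the origin from tier 1, filling the tiers on the way back.

def tiered_get(path, tier3, tier2, tier1, origin):
    """Return (where_found, content) for a request arriving at a tier 3 POP."""
    for name, cache in (("tier3", tier3), ("tier2", tier2), ("tier1", tier1)):
        if path in cache:
            return name, cache[path]
    content = origin[path]                 # only tier 1 contacts the origin
    for cache in (tier1, tier2, tier3):    # populate the tiers on the way back
        cache[path] = content
    return "origin", content

t3, t2, t1 = {}, {}, {}
origin = {"/shoe.jpg": b"bytes"}
print(tiered_get("/shoe.jpg", t3, t2, t1, origin)[0])  # origin: first fetch
print(tiered_get("/shoe.jpg", t3, t2, t1, origin)[0])  # tier3: now cached
```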
So this is tiered caching. Pro, Business, and Free customers will not have this.
They will see a lot of connections coming from all the POPs.
But for enterprise customers, we can go one further. This is a very specific use case, but it's very interesting to customers who have a very wide and very global customer base.
It's called custom cache topology. And what that does is basically put a single designated tier 1 POP in front of the topology that I just explained.
So you'll always see one POP or maybe a secondary POP connecting to your origin, but that is it.
Not all tier 1 POPs in the world will connect to your origin.
So basically, the content would leave the origin to that tier 1 POP, and from there on it will populate across the whole Cloudflare network.
So only one connection can lead to the whole world having that content available.
This is super powerful. And that is a custom cache topology.
So everything is cached. Everything is working. The topology is awesome.
What's happening behind the scene is working great. But sometimes you want to get rid of cache.
Let's say you change the prices on your objects or on your articles in your eCommerce store.
Or let's say you've got new pictures available or an item is no longer in stock.
You just want to get rid of the cache. You can either just purge by URL, which any customer of Cloudflare could, or you can get into the more advanced options, such as purge cache by hostname, purge cache by tag, or purge cache by prefix.
One very interesting one to mention here is the purge cache by tag.
So let's say you have 100 million objects whose cache you need to purge, and you want to do that specifically.
It's interesting because you can add tags to all those objects, and you can make an API call containing up to 30 tags at once.
You can reduce the amount of API calls you have to make that way quite efficiently.
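Since each call takes at most 30 tags, purging many tags means batching. A minimal sketch of that batching logic, with invented tag names; each resulting payload would be POSTed to the zone's purge endpoint (check the current API docs for the exact path and limits):

```python
# The purge-by-tag endpoint accepts up to 30 tags per call, so a large purge
# is split into batches. Each payload maps to one API call.

MAX_TAGS_PER_CALL = 30

def purge_payloads(tags):
    """Split a tag list into purge-request bodies of at most 30 tags each."""
    return [
        {"tags": tags[i:i + MAX_TAGS_PER_CALL]}
        for i in range(0, len(tags), MAX_TAGS_PER_CALL)
    ]

# 70 tags -> 3 API calls, instead of one purge-by-URL call per object.
payloads = purge_payloads([f"product-{n}" for n in range(70)])
print(len(payloads))             # 3
print(len(payloads[0]["tags"]))  # 30
```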
That's the one that I prefer using. But of course, any architect could use whatever they want to make this work.
You can select a specific hostname.
You can select a specific prefix. You can do it via UI. You can do it via API.
It just works. And so that was all related to how our CDN works.
Obviously, even if our CDN is working great, that last mile, a lot of stuff can still happen there.
So what about just improving, compressing, etc., the static objects?
So Cloudflare also has the support for auto minify.
What auto minify is, is let's say you wrote a wonderful ecommerce platform.
And you have some duplicated content in your HTML and your CSS files, or your JavaScript is just not written very efficiently.
We can turn on auto minify and we can reduce the amount of lines of code that you actually need to have that functionality work.
It's a pretty cool feature. And it really helps when you're using platforms like no-code platforms that output quite verbose or spaghetti code.
This could really help. We can also polish images. There's two ways we can do this.
The one that we use by default is lossless. Basically, with lossless compression you can always go back to the original quality of that image.
So it doesn't really impact the quality, but the size is not reduced that much.
Lossy, by contrast, does reduce the quality somewhat, but also produces smaller pictures.
So for instance, if you have a lot of customers that use your ecommerce platform via an app, then they might actually benefit from using that lossy because it's a small screen anyway.
And it wouldn't matter that much if the quality was a little lower.
We also support the WebP format, and we offer Brotli compression as well.
So that's just a different way to encode a website. We also have Mirage image optimization.
And basically what Mirage does is analyze a picture and turn it into a lightweight placeholder version.
And in case a customer comes from a very low bandwidth area, we'd be able to serve the content that way.
So basically, it's a different way of rendering.
And then we also support Rocket Loader, which basically prioritizes the visible content of a website ahead of the JavaScript.
So that way, the load time of the page appears to be a lot faster.
And also, we can do HTTP/2 prioritization. So that's it from my end, really.
And I'm going to hand it over to Deon. Thanks very much, Simon. And thanks for giving us some really great, insightful information on the CDN and caching.
Great. The next thing is, I want to get into a question that I got. I run an e-commerce store in South Africa.
And it's that time of the year again where we see peak loads due to Black Friday.
What visibility do you offer from a traffic perspective of traffic flowing through Cloudflare?
So first of all, I want to get into understanding the flow of traffic.
Before I get into explaining all about Cloudflare Logs, I want to take a step back and explain why logs are so important.
Without Cloudflare, in the first scenario, you can see a client who communicates with a server or origin.
The logs are recorded on the origin server and then can be used to analyze things like client location, errors, and application health.
With Cloudflare, in the second scenario, you can see on the right hand side that the communication now goes from the client to the Cloudflare POP and then to the origin server.
A thing to note about this flow of traffic is that there's more information available through Cloudflare Logs and Enterprise Log Share.
This gives you a lot more information about what actually is happening on the objects as they pass through Cloudflare.
Something to note that this is also an enterprise only feature.
Today, logs are very important for people in all types of businesses.
Logs are used to visualize what's actually going on from a client's perspective.
They can also be used to analyze trends for a longer period of time.
They can also be used to compare week on week and month on month trends. Customer alerts can also be created when possible error codes are triggered.
I'll show you in a bit how that can be used.
Another thing to note is who would need these logs.
In most cases, DevOps engineers to make sure that their deployment didn't cause any errors and to make sure that they're seeing normal responses.
Developers can use logs to check that their code was also deployed and being used in the most optimal way.
Or you could also do this for everybody. People like CEOs could use this to pull annual reports based on trend analysis.
Just to also note what is included in logs, URLs, errors, security events, geolocation and geography, and caching information.
How to understand Cloudflare logs.
So Cloudflare pushes its logs in JSON format, as you can see on the right hand side, which makes it easy for analytics applications to work with.
On the right hand side, you can see an example of what they look like.
This example has just a few fields that Cloudflare offers.
There are many more fields that can be added to this as well. Looking at these logs, there are a few fields that are good to monitor.
For client side errors, you would look at something like the edge response status field.
These would be errors from the client to the Cloudflare edge.
From the origin side, you can look at things like the origin request or origin response status fields.
And these errors are from Cloudflare's POP to your origin. You can also find other useful fields for the web application firewall, geolocation, caching information, or client IP information.
In most situations, you would make sure that your requests have a 200 response.
A 200 response means that the object has been served from the origin and successfully reached the client in a good state.
In other situations, you would see 500 class errors.
These are bad errors and they mean something has gone wrong. Another example of how to look at this is when we see 200 versus 500 errors, you can clearly see that something happened during this time period on this graph, whether it was due to a load or the origin server being unavailable.
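The 200-versus-500 comparison just described is straightforward to reproduce from the JSON logs. A small sketch, tallying responses per status class; the `EdgeResponseStatus` field name matches the log fields mentioned above, and the sample lines are fabricated for illustration:

```python
# Tally responses per status class (2xx, 5xx, ...) from NDJSON log lines.
import json
from collections import Counter

def status_classes(ndjson_lines):
    """Count responses per status class from Cloudflare-style JSON log lines."""
    counts = Counter()
    for line in ndjson_lines:
        status = json.loads(line)["EdgeResponseStatus"]
        counts[f"{status // 100}xx"] += 1
    return counts

sample = [
    '{"EdgeResponseStatus": 200}',
    '{"EdgeResponseStatus": 200}',
    '{"EdgeResponseStatus": 503}',
]
print(status_classes(sample))  # Counter({'2xx': 2, '5xx': 1})
```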
There are a few different types of 500 class errors.
The common reasons for 500 errors are that the origin server is unavailable.
The origin takes too long to respond, which can result in a timeout.
Network errors caused by packet loss or latency.
Cloudflare also provides a very useful API that can be used to traceroute from our POP to your origin.
You can then also, as a customer, traceroute from your origin back to Cloudflare's POP, and that will give you a good understanding of what the flow of traffic looks like in both directions.
This can be very useful when troubleshooting network problems and issues.
Next, I want to get into is how to implement logs.
I want to show you, I want to go through a few options that we have.
The first option that we have is what we call Logpull, via a REST API.
In most situations, Logpull will be used where you don't have push functionality available.
There are some drawbacks to Logpull. One is that you can only pull a certain amount of data over a period of time.
The other drawback is that you need to specify a time range, from a start to an end.
And also, this time range can't end right at now; it has to end at least five minutes in the past, which gives you delayed logs.
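A sketch of building a Logpull request window that respects that constraint, with the end of the range placed five minutes in the past. The endpoint path in the printed example is indicative; confirm it against the Logpull documentation before use.

```python
# Build a (start, end) time window for Logpull, ending five minutes ago.
from datetime import datetime, timedelta, timezone

DELAY = timedelta(minutes=5)   # logs are not available closer to "now" than this

def logpull_window(duration_minutes, now=None):
    """Return (start, end) RFC 3339 timestamps for a window ending 5 min ago."""
    now = now or datetime.now(timezone.utc)
    end = now - DELAY
    start = end - timedelta(minutes=duration_minutes)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return start.strftime(fmt), end.strftime(fmt)

start, end = logpull_window(10)
print(f"GET /zones/<zone_id>/logs/received?start={start}&end={end}")
```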
The next implementation is Logpush, which is the one we recommend, and we have multiple integrations for this, including AWS S3, Google Cloud Storage, Sumo Logic, and in the future, we're going to be supporting generic S3 compatibility.
The analytics integrations that we have are Datadog, Elastic, Graylog, Google Cloud, Looker, Splunk, and Sumo Logic.
Today, I'm going to give you an overview of how to configure Logpush with Google Cloud Storage and ELK.
In this example, I will show you how to configure your logs to push to Google Cloud Storage, then configure Logstash to pull the logs from Google Cloud Storage and index them with Elasticsearch, and finally display all the data in Kibana.
For step one, you would need to log into your Cloudflare dashboard, select the Analytics tab, and then select the Logs sub-tab.
In there, you would then configure the push job to push to Google Cloud Storage.
This is really easy to do and can be completed in a couple of minutes.
Once you're done with that, your logs will be pushing to your Google Cloud Storage.
In step two, you can then configure Logstash to collect the logs from Google Cloud Storage.
This then pushes the logs to Elasticsearch. Something to note is that you can also configure Logstash to easily collect logs from an S3 bucket using a similar method.
The last step is to download the dashboards from our GitHub repo and import it into Kibana.
Once this is done, you will then be presented with the dashboards.
At the last step, you can now see all the wonderful dashboards that go with this and the data.
These dashboards include everything from performance and caching, security, which includes WAF and firewall, and bot management, if you have that enabled on your zone.
The last thing I want to step into is once you've got these solutions in place, you can then use them to create alerts.
Alerts are very important for customers out there because these give you early warning signals for when things are going to go wrong.
For example, you could create an alert that basically says: whenever I get more than ten 500 errors over a period of time, send me an alert.
You can send an alert to an email, PagerDuty, ServiceNow, etc.
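The alert rule just described boils down to a threshold check over one evaluation window. A minimal sketch, using plain integer statuses as they would come out of the parsed logs; the threshold of ten is taken from the example above, and the sample window is fabricated:

```python
# Fire an alert when the number of 5xx responses in a window exceeds a threshold.

def should_alert(statuses, threshold=10):
    """True when the count of 5xx responses in the window exceeds threshold."""
    errors = sum(1 for s in statuses if 500 <= s <= 599)
    return errors > threshold

window = [200] * 90 + [502] * 12   # fabricated one-window sample
print(should_alert(window))        # True: 12 errors > 10
```

In a real pipeline this check would run on each batch of pushed logs, and a `True` result would trigger the notification to email, PagerDuty, or similar.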
Thanks very much for joining me today.
I've also included a link. This will give you all the information that you need to get started with logs and set up into your environment.
Next, my colleague Simon and I would like to go through some questions and answers.
Yeah. So, Deon, let's say I have an on-premise storage option.
What can I do? What options are available to me? Currently, you can use Logpull with the REST API, but we're also looking at developing new Logpush methods that will be able to push to generic S3 buckets.
And there's a lot of other methods coming in the future.
Simon, let's say my caching is optimal. What else can I do to reduce my calls to origin?
Well, you could potentially consider using Cloudflare Workers to run certain tasks at the Cloudflare edge.
We will be covering that in more detail in the next session later this week.
So, definitely tune in if that's something you're really interested in.
Hey, so, Deon, once the request has passed through the Cloudflare network, how soon are the logs available?
Logpull is available about five minutes delayed, whereas Logpush with our new log stream is available up to 30 seconds delayed.
Cool. And so, how long are these Cloudflare logs then kept for?
Seven days at the moment, but in the future, we're not going to be storing any logs and we're just going to be pushing them.
Simon, what is the maximum duration that I can store objects in Cloudflare cache?
That's a really good question. So, we can actually push this up to one year, but keep in mind that it's possible that we do maintenance in the meantime, which would obviously result in evicting that object.
So, it could be re-requested from the origin in the meantime.
It is an interesting feature, but you have to be realistic when using that, obviously.
So, Deon, you just mentioned how long the logs are available for, et cetera, but what happens if my cloud storage destination is temporarily unavailable?
Very good question, and one that I've actually been thinking about for some time.
Cloudflare is intuitive enough to retry.
So, I'm glad to say that we will retry for a period of time, and if that period elapses, then we'll disable the push job so that it doesn't keep failing.
But the great thing about it is, if we do manage to reconnect, then we'll carry on pushing the logs and allow you to catch up.
Cool. And how long would that retention be if, for instance, my storage option is down for a couple of hours?
I think we retry for up to 10 minutes or an hour, and then we fail. Okay.
That's really interesting. Thank you, Deon. Thanks. All right. So, I think those were most of the questions that we had.
We didn't get any questions sent in via the chat, so I suggest we go on to the next part, and that's talking about what is yet to come in a later session this week.
So, later on this week, we'll be talking more about how rate limiting can help you in a high-peak load situation like the one we discussed today, for instance, Black Friday.
We'll be talking about how to use bot management and how a serverless solution like, for instance, our Cloudflare Workers could help you make the best out of those days, and in case of something going wrong, you would still be able to serve a substantial amount of your customers.
Also, we'll go through how to set up your origins to work securely together with Cloudflare.
So, how do you lock down those zones, et cetera. That's pretty much it from our end for today's session.
If you have any questions, don't hesitate to reach out to us via the live studio at Cloudflare.tv, and then we'll see you next time.
Thank you. Cheers. ...also costs the nonprofit money.
Cloudflare has been amazing in helping us identify these threats.
So as threats are happening in real time, we can then be aware of what country they're originating from, what kind of threat that that is, and then share that information with our customers.
And the beauty in that is, it's not taking up bandwidth or resources on our side.
How does Raise Donors help make things easier for your customers?
Just last week, we had a customer send out a massive newsletter, but they put in the wrong URL.
So what are they going to do about that?
Well, in that case, we used the Edge Workers so that when the request comes in, we could actually manipulate that URL and have it actually complete as it was intended to.
They were so thankful that Raise Donors was able to step in and help quickly and easily.
And we were able to do that all because of Cloudflare, which was phenomenal.
What advice would you give to all the nonprofits that are out there coping and trying to stay afloat right now?
But if it is something you love to do and you're failing, well, you're learning, and it's only going to help you even more so.
So be bold, don't be shy, jump in headfirst and go for it.