Immerse Sessions: How Surfshark supercharges its VPN infrastructure with Cloudflare Workers
Presented by: Kęstutis Saldžiūnas, Tim Dowdall, Mark Dembo
Originally aired on February 6 @ 12:00 PM - 12:30 PM EST
Kęstutis Saldžiūnas, Technical Product Manager at Surfshark, sits down with Tim Dowdall, Senior Manager, Solutions Engineering at Cloudflare, and Mark Dembo, Principal Solutions Architect at Cloudflare, to discuss how Surfshark leverages Cloudflare Workers to enhance their VPN service infrastructure. Saldžiūnas shares insights from managing a rapidly growing privacy and security platform serving nearly 3 million customers.
In this conversation, they explore how Surfshark uses Cloudflare Workers to handle over 10 billion requests monthly while improving response times up to 3x in regions without local data centers. They examine Surfshark's journey from using Workers for credential stuffing protection to managing 40% of their traffic through the platform, and why moving computation to the edge with Workers and KV storage has helped reduce infrastructure costs while enhancing security and user experience.
The discussion concludes with a look at the future of AI integration in VPN services and Surfshark's plans for machine learning models at the edge, as well as their transition to configuration as code and enhanced monitoring capabilities to support their rapid growth. #EdgeComputing #Security #VPN #CloudflareWorkers
I'd like to just personally thank you for having a conversation with me last week while you were on PTO, still in Japan.
I really appreciated you taking time out of your personal holiday to talk with me and give me an insight into exactly what it is you're doing with Workers at Surfshark.
So could you just introduce yourself and a little bit about Surfshark and we'll go from there.
Hello, thank you for having me here.
I'm Technical Product Manager at Surfshark.
Surfshark offers a suite of security and privacy tools to protect our users' digital activities and bring them privacy.
And it's an easy-to-use application.
The brand is actually best known for the VPN we provide, but we also have a suite of products like Alternative ID, Antivirus, Search, and Alert, and we've been working with Cloudflare for more than six years.
And as other participants have said, we also started from using the free tier, and afterwards we migrated to Enterprise.
Right now we believe we have quite a huge footprint on Cloudflare, as we have about three million customers, nearly three million, and during a month we have around ten billion requests from these applications, whether mobile applications, desktop applications, or browser extensions, and Cloudflare handles them successfully. It's a critical component in our architecture and we rely heavily on it.
Yeah. So, just so I understand how Surfshark manages a VPN request for one of your three million customers: do you manage the ingest in the first place, and then how does it go through our platform and out the other side?
Yes, this is the first front end that customers using our products face. On that platform we load balance the traffic, and depending on the product it goes to whatever system we've designed behind Cloudflare.
Yeah, and do you use Workers as a key component for that first touch with the customer, when they're choosing the server point, or the server list? Is that correct?
Yes, this is correct. Just let me clarify a bit: when our customers use the VPN service itself, of course it uses our own infrastructure, but when we talk about the management of the overall service, then yes, that all goes through Cloudflare, through all the stages in use, going through the WAF, the application firewall, down through all the rules, until we reach the final stage, Workers.
And here we have about 40% of requests handled in some way by workers and 30% of that traffic handled purely by workers.
So it means that we have a huge offload from our infrastructure onto the Cloudflare platform.
And also we have low latency because of that.
We could reach even about three times faster responses in the regions where Surfshark data center is not available.
Yeah, so we just saw Mark do a good example of deploying a Workers application in seconds.
Was that a key component of you deciding to use workers?
The speed of deployment or was it speed of operation or latency?
What was the key? I think, yes.
It's easy to use, really. I haven't tried that demo myself just now, but I'll try next time. But we have a lot of documentation on Cloudflare.
There's a shallow learning curve, a lot of examples and use cases, and when you know your
domain very well, you can write a Worker precisely for a specific function to handle it, and it works amazingly fast. It's also easy for the team to understand what that functionality handles. For example, on Cloudflare we have geolocation data, so instead of going downstream into our backend systems to retrieve the location for a specific IP address, the data needed to provide to the customer,
we can immediately take it in the Worker and expose it in one, or let's say five, milliseconds.
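For readers following along, here's a minimal sketch of the kind of Worker being described, returning Cloudflare's geolocation data straight from the edge. The fields come from the `request.cf` object Cloudflare attaches to each request; the response shape itself is an illustrative assumption, not Surfshark's actual API.

```typescript
// Minimal sketch: answer a geolocation lookup entirely at the edge,
// using the metadata Cloudflare already attaches to every request.
export default {
  async fetch(request: Request): Promise<Response> {
    // request.cf is populated by Cloudflare with geo/network metadata.
    const cf = (request as any).cf ?? {};
    const body = {
      country: cf.country,   // e.g. "LT"
      city: cf.city,
      timezone: cf.timezone,
      colo: cf.colo,         // the Cloudflare data center that served this request
    };
    // No round trip to the origin: the response is built and returned at the edge.
    return new Response(JSON.stringify(body), {
      headers: { "content-type": "application/json" },
    });
  },
};
```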
Wow, very impressive. And I mean, the primary reason for that was around customer experience, right?
So when they're using the application, the service is as fast as possible.
What was the reason for you deciding to offload that workload? Because was it being too slow for customers?
Were they complaining about how long it was taking to return the list?
Actually, we were driven to this by the threat landscape, by how those attacks played out.
All those online threats were driving us to look for a way to handle the ongoing spikes and to leverage the platform available on Cloudflare, for example to rate limit these requests rather than pass them on to our infrastructure, because doing that ourselves would be complex, and when the tools are already available you can just take them and use them.
Yeah, and I think you mentioned to me that in fact one of the challenges you had was that the app itself was essentially almost an own goal.
It was a DDoS of yourselves, to an extent, in some cases. Yes, in some earlier phases, in some situations when things were maybe not fully
tested, there were cases where the application itself could loop and make a lot of requests to some endpoints, and in that situation Cloudflare was a perfect tool to mitigate it.
Can I ask a question? How many requests did you say you have a month again?
Above 10 billion.
Okay, so one of the common criticisms that I face on the internet is serverless compute is freaking expensive.
How did you handle, what is your take on this?
You're handling over 10 billion requests per month. Are Workers expensive for you, or are they cheaper than what you had before?
Is the value good?
What do you say to the claim that serverless is too expensive for production, which is what you hear online sometimes?
Of course, we have a budget; everything is limited, definitely.
But you always have a choice: you have a budget, and you know how to re-architect the system to work differently and what that costs, so you keep it in balance.
So, it depends only on the
management perspective: how does this overall architecture impact user experience, and is it worth investing more?
For example, if it works okay with a given approach, you just let it go and look for solutions to implement in the future, maybe to migrate from a pull architecture to, for example, a push one.
So you're saying you went to management and said, hey, here's the trade-off, it might cost a few dollars, but the user experience is so much better that we can make that investment, right? Is that what you said?
Yes, yes, we need to evaluate what the benefits of it are.
So when we were talking about that particular point, you also said that it is still hosted on AWS behind, so it will fall back to AWS.
But actually what you found by using workers was you mitigated the compute cost with AWS.
Is that right? You actually spent less with AWS because you were doing it in workers and avoiding that.
Definitely. If you can do some computation at the edge, it gives us lower cost
downstream, because we do not need to pay for the intermediate steps we'd otherwise do in the backend.
Also, if we are capable of doing some computation at the edge, we do not need to pay for, for example, a fraudulent transaction.
If we can immediately determine that something is fraud, we can immediately block it, or do some mitigation by
passing additional parameters to the origin on how to treat the request.
Yeah, that's one of the use cases you mentioned to me, was that you talked about the login security.
You had a lot of credential stuffing attacks.
Tell me a little bit about how you used Workers and KV to handle that. Actually, this was mainly the first use case we started to work with, because
as we were growing, we got more attention from bad actors, who were trying to take over accounts.
So with Workers, we found that was the best way to block these attempts.
Initially, we started to use JA3 fingerprints,
counting them to determine which ones were fraudulent and blocking them, together with the bot score data provided.
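A rough sketch of the counting pattern described here, tallying requests per JA3 fingerprint in Workers KV and blocking heavy repeaters. The `botManagement.ja3Hash` field is only available on zones with Bot Management enabled, and the binding name, threshold, and TTL below are illustrative assumptions rather than Surfshark's actual setup.

```typescript
// Sketch: count login attempts per JA3 fingerprint in Workers KV and block
// fingerprints above a threshold. KV is eventually consistent, so this is a
// coarse, best-effort counter rather than an exact one.
// Types such as KVNamespace come from @cloudflare/workers-types.
interface Env {
  FP_COUNTS: KVNamespace; // hypothetical KV binding
}

const LIMIT = 100; // illustrative threshold

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const ja3 = (request as any).cf?.botManagement?.ja3Hash as string | undefined;
    if (ja3) {
      const key = `ja3:${ja3}`;
      const count = parseInt((await env.FP_COUNTS.get(key)) ?? "0", 10) + 1;
      await env.FP_COUNTS.put(key, String(count), { expirationTtl: 3600 });
      if (count > LIMIT) {
        return new Response("Too many attempts", { status: 429 });
      }
    }
    // Otherwise forward the login request to the origin as usual.
    return fetch(request);
  },
};
```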
Unfortunately, at some point, when Chrome shuffled the TLS extension order, this became less reliable.
So afterwards we implemented Turnstile together with the Worker functionality.
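A minimal sketch of pairing Turnstile with a Worker as described: the Worker verifies the token the widget issued to the client before letting a login attempt through. The siteverify endpoint and default form field name are Cloudflare's documented ones; the secret binding name is an assumption.

```typescript
// Sketch: verify a Turnstile token inside a Worker before forwarding a login.
interface Env {
  TURNSTILE_SECRET: string; // hypothetical secret binding
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const form = await request.clone().formData();
    const token = form.get("cf-turnstile-response"); // default widget field name

    const verify = await fetch("https://challenges.cloudflare.com/turnstile/v0/siteverify", {
      method: "POST",
      body: new URLSearchParams({
        secret: env.TURNSTILE_SECRET,
        response: String(token ?? ""),
        remoteip: request.headers.get("CF-Connecting-IP") ?? "",
      }),
    });
    const outcome = (await verify.json()) as { success: boolean };

    if (!outcome.success) {
      return new Response("Challenge failed", { status: 403 });
    }
    return fetch(request); // token is valid, pass the login through
  },
};
```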
It works very well right now; we've actually almost forgotten about that kind of fraud, and it's always a nice surprise to see that the count of successful logins is higher than the count of failed logins. So now we focus more on other threats, on optimizing the performance of the overall system, and on how we handle processes of change in the system.
Again, we talked about this in one of the other sessions: technology really works when it's doing what it needs to do and you can forget about it, assume it's a solved problem, and it's just handling the issue. It's great to hear that at this point you just don't worry about that anymore, it's handled. What a great thing, because now you can focus on making the experience better for customers instead of dealing with bad actors, because in that regard, and there are other regards, but in that regard, it's solved properly.
Of course, we can't be relaxed, because all the time, bots evolve.
The bad actors are really very advanced and invent new ways to attack. That's why we always follow the news that's coming out.
We also have to have very good monitoring to detect when traffic patterns change, because a lot of weak points could still be found
in the systems.
Gotcha. We also talked briefly about how you use workers and durable objects around restricted networks as well and those sort of things.
Do you want to tell me how that evolved?
We first started with Workers KV storage, and it has some benefits, it's robust, because you have these KV values in every edge data center, but writing a KV value takes some time.
So it means that synchronizing could take, according to the documentation, up to one minute, though in practice it can take around 10 seconds, and it's always improving.
So when you need to collect a
lot of client signals in real time and make some decisions based on that, that's where Durable Objects come in to help. They're not as easy to develop with, comparatively, but when you have a real need you manage it; they're just simple JavaScript classes you can work with.
Yeah, I can perhaps elaborate for the audience, because this is already getting super technical. Workers is our serverless compute, which is by default stateless, but at some point you need data. Workers KV is our globally distributed, eventually consistent, object-like data storage. That means if you do an update, it will take a while until it's propagated globally, so if you always need real-time data it's not a great choice. Why is it a great choice for so many things? You can hit it with millions of reads per second and it will not fail; that's one of the things customers love about KV. And then we have Durable Objects, which is a really interesting concept, because your data, let's say you have user data, typically you would have that in one table in your database, and that lives in one place. What you do with Durable Objects is define a user object, and then for every user the data is atomic. My user object, if I create it right now, will live in Warsaw; if my colleague creates one in the Bay Area, it will live in San Francisco. Why is that important? Well, low latency, right? So that's the concept there. But you're completely right, on the developer experience there's definitely a learning curve for Durable Objects, but it's a really interesting concept.
But what we also noticed is that the dashboard keeps evolving, with a lot more features added to manage Workers KV storage and also Durable Objects and to see what's there, so it's changing continuously.
It's getting there, yeah, a hundred percent.
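To make the "one object per user" idea above concrete, here's a minimal Durable Object sketch: a per-user counter whose state is strongly consistent and lives wherever the object was first created. The class and binding names are illustrative, and the types assume `@cloudflare/workers-types`.

```typescript
// Sketch: a per-user Durable Object that keeps a strongly consistent counter.
export class UserCounter {
  constructor(private state: DurableObjectState) {}

  async fetch(_request: Request): Promise<Response> {
    // Reads and writes to this object's storage are serialized, so the
    // increment is atomic even under concurrent requests.
    const current = (await this.state.storage.get<number>("count")) ?? 0;
    await this.state.storage.put("count", current + 1);
    return new Response(String(current + 1));
  }
}

interface Env {
  USER_COUNTER: DurableObjectNamespace; // hypothetical binding
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const userId = new URL(request.url).searchParams.get("user") ?? "anonymous";
    // Every user maps to exactly one object instance, wherever it was created.
    const id = env.USER_COUNTER.idFromName(userId);
    return env.USER_COUNTER.get(id).fetch(request);
  },
};
```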
So outside of what you've been doing with Workers, you've also been using some of our smart CAPTCHA technology to help with fraud, catching fraudulent attempts to use your service.
Can you tell me a little bit about that?
When we got a problem with JA3
due to those changes in Chrome browsers,
luckily that technology on Cloudflare was already mature enough to start using.
Besides that, we also use JavaScript detections. It means that Cloudflare runs JavaScript on the client side, collects all the signals, and with that we can get a reliable answer from Cloudflare
on how to proceed with Turnstile. What's more, by defining how risky the environment the client is working in is, we can set what type of widget to use, and we have a lot of flexibility here in how to use it and a lot of information on how to proceed. Brilliant.
And also, the other thing, going back to Workers again: you've had success moving some of these functions that were previously in AWS or in some other environment, maybe legacy, maybe a hyperscaler. But looking to the future, what other key areas would you be looking to move into a Workers environment, rather than where they are today?
We have quite a lot of things already in Workers KV, so we have all kinds of dictionaries here, like currencies, country lists, our servers. Everything that can be calculated easily at the edge should, I think, be there, because being closer to the customer, we can immediately provide the results.
Right now we provide VPN status information and the context the application is working in; for anything additional, we think it over and compare the cost of development and maintenance.
We can cache and keep the server list here, for example, and do calculations, exposing the nearest servers or the servers dedicated to specific regions.
So everything that is publicly accessed could be processed on Cloudflare.
So it's only a matter of investment and what the gain is of migrating there.
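As a concrete illustration of the server-list idea, here's a sketch that keeps the list in Workers KV and filters it at the edge by the requester's country from `request.cf`. The binding name, key, and data shape are assumptions, not Surfshark's real schema.

```typescript
// Sketch: serve a region-filtered VPN server list straight from the edge.
interface VpnServer { host: string; country: string; load: number; }

interface Env {
  DICTIONARIES: KVNamespace; // hypothetical KV binding holding the full list
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const country = ((request as any).cf?.country as string) ?? "US";

    // The full list is small and read-heavy, a good fit for KV.
    const all = (await env.DICTIONARIES.get<VpnServer[]>("servers", "json")) ?? [];

    // Prefer servers in the requester's country, least-loaded first.
    const nearest = all
      .filter((s) => s.country === country)
      .sort((a, b) => a.load - b.load)
      .slice(0, 5);

    return new Response(JSON.stringify(nearest.length ? nearest : all.slice(0, 5)), {
      headers: { "content-type": "application/json" },
    });
  },
};
```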
I'm sure you were as entertained as I was with Mark's last segment
on AI.
It was entertaining and informative. That's what everybody wants all the time. Do you see any areas where you would want to be using any of our AI models with Surfshark VPN as a service?
Of course, this is a very promising area, so we are closely following how it's evolving, because, for example, for payment
endpoint protection we are using our own rules engine.
So based on the collected data, we decide how to proceed with the payment.
We are right now developing our own machine learning model, to migrate from the rules engine to a machine learning model and from there get the answer about whether a
transaction is fraudulent.
So one of the options could be to have such a machine learning model at the edge. Another thing we need to do, while we are growing at such a pace, is to have better monitoring of events on Cloudflare and how all the types of transactions are mitigated, in real time,
because it could be that for some reason some public API is blocked, but you do not notice it.
Another thing, as a company, as we are growing, is that we need to rely more on configuration as code, because previously we were mostly using the dashboard, the user interface.
Now we are going to use more API tools and keep all the configuration in Ansible, to manage changes in the platform.
This is the focus.
And of course, as an example, we have recently launched one Worker to serve our short names and resolve them into long URL addresses.
So, as Page Rules are being deprecated, we also have a good opportunity to migrate, not to some other complex tool, but to write our own, more custom, forwarding Worker.
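A sketch of the kind of custom forwarding Worker described, resolving short names to long URLs from KV, which is a common way to replace redirect-style Page Rules. The binding name and choice of status code are illustrative.

```typescript
// Sketch: resolve short names to long URLs at the edge.
interface Env {
  SHORT_LINKS: KVNamespace; // hypothetical KV binding: short name -> long URL
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const slug = new URL(request.url).pathname.slice(1); // e.g. "/promo" -> "promo"
    const target = await env.SHORT_LINKS.get(slug);

    if (!target) {
      return new Response("Not found", { status: 404 });
    }
    // 301 vs 302 depends on whether the mapping is expected to change.
    return Response.redirect(target, 301);
  },
};
```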
Yeah, very interesting.
Bright future ahead for Surfshark, no doubt, especially with the Workers platform. It'll be interesting to see what you do with AI.
Obviously, it'd be something beyond summarizing; you might be using voice or imagery or all sorts of things.
It'd be quite interesting to see what you do with it. I think that's all the questions I had.
Mark, do you have anything that you want to add? Yep. I would love to know: how was your experience initially, going to Workers for the first time?
I mean, if you're used to, I don't know, running your Kubernetes or your Docker containers, I don't know what you guys
use, it's a different paradigm, right? So how was the learning curve for your team? It went very smoothly. For me, as a first impression, since I've been in this world quite a long time, it works like Oracle PL/SQL block building, if you have ever tried it. It works like this, you just write. And that's just the difference:
Oracle is only for the enterprise, and this is for the whole world.
I could not have said it any better.
I might have to record this.
It's like that was a setup, isn't it?
I need your PayPal account after this talk.
Like, no, definitely.
One more question from my end.
What is the feature you would really love for us to develop or release?
Is there anything where you're like...
I have a list.
You have a list.
There are a lot, yeah, but just to mention a couple: we need backup of Workers KV storage, because right now we have to iterate over it and it takes time to extract the values; and registering domains through the API, because we have a lot of domains, 200 or even more, that we use intensively in restricted networks, so we actually generate them automatically.
So it's enough, I think, for the beginning.
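On the backup wish: today that means walking the namespace yourself. A rough sketch using Cloudflare's documented REST endpoints for listing keys and reading values; the account ID, namespace ID, and token are placeholders.

```typescript
// Sketch: export every key/value pair from a Workers KV namespace via the REST API.
// Endpoints: GET .../storage/kv/namespaces/{id}/keys (paginated by cursor)
//            GET .../storage/kv/namespaces/{id}/values/{key}
const ACCOUNT_ID = "<account-id>";     // placeholder
const NAMESPACE_ID = "<namespace-id>"; // placeholder
const TOKEN = "<api-token>";           // placeholder
const BASE = `https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/storage/kv/namespaces/${NAMESPACE_ID}`;
const headers = { Authorization: `Bearer ${TOKEN}` };

async function backupNamespace(): Promise<Record<string, string>> {
  const dump: Record<string, string> = {};
  let cursor = "";
  do {
    // List keys one page at a time; the API returns a cursor for the next page.
    const res = await fetch(`${BASE}/keys?limit=1000${cursor ? `&cursor=${cursor}` : ""}`, { headers });
    const page = (await res.json()) as {
      result: { name: string }[];
      result_info?: { cursor?: string };
    };
    for (const { name } of page.result) {
      const value = await fetch(`${BASE}/values/${encodeURIComponent(name)}`, { headers });
      dump[name] = await value.text();
    }
    cursor = page.result_info?.cursor ?? "";
  } while (cursor);
  return dump;
}

backupNamespace().then((dump) => console.log(JSON.stringify(dump)));
```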
Yeah, that's stuff to work on.
I think there might be stuff in the pipeline.
You should follow the Cloudflare blog pretty closely.
Follow the blog.
Keep looking at your dashboard.
I mean, it seems to be the two messages coming out of this session.
Keep track.
We move fairly fast.
Lots of features are coming out on a regular basis.
So it's always a good idea to keep an eye out,
especially on the blog, and then occasionally look at the dashboard, and it might just be there, you never know. Right, right. Thank you so much, I really appreciate you spending the time, and again, like I said, you spent time while you were in Japan, of all places, to speak to lowly Cloudflare. I really do appreciate it; it made a big difference to us. Thank you for your time, very interesting. Thanks a lot, and thanks to Mark as well.