How Cloudflare Built This
Rustam Lalkaka, Director of Product at Cloudflare, will interview PMs and engineers on how specific products came to life.
This week's guests: Engineering Intern Ilya Andreev & Annika Garbers, Product Manager for Magic Transit.
Welcome to this week's episode of How Cloudflare Built This. Wendy was calling for quiet on the set before we went live.
And I didn't follow her instructions. So we've had a little bit of a hiatus.
I think the last episode was about a month ago, where we chatted with Matthew about how Matthew and Michelle and Lee built Cloudflare, sort of early days.
And today we're joined by two very special guests, Annika Garbers and Ilya Andreev.
And for those of you that don't know me, I'm Rustam Lalkaka. And I'm Director of Product here at Cloudflare, focused on our performance and networking products.
Annika and Ilya, do you want to introduce yourselves? Hi, I'm Annika.
I'm a Product Manager with Magic Transit. And I'm Ilya, currently an engineering intern on the Cache team, but previously I was on the Magic Transit team.
Well, thanks for joining us today. So you both have had relatively short but interesting tenures at Cloudflare so far.
And I'm curious to sort of talk through a couple different pieces of that, actually.
Wendy's still giving me crap.
So, you both started as interns, college interns, last summer, the summer of 2019.
Is that right? And sort of, I think both took sort of different paths to arriving at Cloudflare as interns.
So, actually, before diving into the internship, why don't we talk about like what you guys studied, what your sort of path to getting here was.
Ilya, you want to go first? Oh, yeah, sure. So, when I joined what was then the Argo team, I joined...
Before getting into Argo team and all that, like, where'd you go to school?
What are you studying? Yeah, let's step back a little bit.
So, I'm an international student at the University of Illinois, which is a well-known engineering school, but I'm actually a math major, but very strongly interested in engineering and computer science, obviously.
But I always had a feeling, like, I'm an international student from Russia, and there is a strong mathematical culture there.
And I was always thinking that if I put enough effort into always also studying computer science, the formal mathematical education can bring something else to the table that would be useful long term.
So, so far it has been working out pretty well. And my path with Cloudflare started as a rising junior last summer, when I interned on the Argo team.
And why, why'd you, I mean, I guess, first off, how'd you find Cloudflare?
And then why'd you choose to join Cloudflare as an intern? Yeah, it's an interesting story, because Cloudflare has a lot of offices in big, well-known cities, but it also has an office in Champaign, Illinois, which is the place where University of Illinois at Urbana-Champaign is.
And as a huge engineering school, there is so much talent.
And some of the graduates of the school also essentially started the Champaign office and started teams there.
And I was introduced to Nick, the manager, one of the managers working at the office, a little over a year ago.
And he offered an interview for, for an internship there, and I decided to go ahead with it.
And yeah, that's how the story goes. The rest is history.
And Annika, what about you? Yeah, so I went to Rochester Institute of Technology, RIT, in upstate New York, for a dual degree in industrial engineering and engineering management.
So like Ilya, my educational background is a little unconventional for people that work at tech companies, but people come here through all different kinds of paths.
In college, I did a whole bunch of random stuff with and outside of that degree.
I like built race cars, I did like social justice stuff.
But I always knew that I really liked tech companies, and software as a space, and I wanted to, to work in tech.
So after a couple of different internship experiences at different companies, I ended up at Cloudflare, kind of, kind of randomly, I actually had an internship lined up with a different company last summer that ended up falling through at the last moment.
And through some connections in my network, which I'm so grateful for, I was connected to Rustam at Cloudflare and ended up working with Ilya and another intern on the Argo team last summer, which was awesome.
And so grateful for that serendipity.
And yeah, I think the serendipity is appreciated by a lot of folks.
What was the project? So it sounds like you two, and Zhengyao, the third intern, were sort of all grouped together early on.
What were, what was the project, and what were your respective roles on that, on that team?
Yeah, so we built a project called Real Origin Monitoring, or Origin Monitoring.
And the problem that we were trying to solve with this was that Cloudflare, as a proxy, sees all of the traffic that goes to our customers' origin servers.
So we know when our customers' origin servers are down, there's a problem with them.
But what the actual end user experience was often, for our customers, is if an origin server is down and we, Cloudflare, can't get to that origin server, we'll serve an error page to the end user, the eyeball.
And so the end user would think either, oh my gosh, this is a Cloudflare error page, there's something wrong with Cloudflare, or, you know, just that the end website is down. But often the end user would find out about the problem before the person that actually owns the website and is using Cloudflare.
And that's a frustrating experience for them, and they're kind of thinking like, hey, if Cloudflare already knows that my website is down, why don't you just go ahead and tell me about it?
So we decided to make a service that told them about it.
Sounds easy. What took so long? And there is a distinction, before we go into details: you could configure active health checks with load balancing or some other Cloudflare products, but those are more upper-end products, and we were targeting simpler origins and users that don't want anything so sophisticated. You just sign up for Cloudflare and it works, and if your site goes down, you will be alerted about it.
Right, so there's some, some class of user who's, like, sophisticated enough to configure monitoring and set up fancy probing systems and all that, and they were presumably covered without this functionality, but a large section of our customer base and the Internet at large, you know, is run by folks who don't have those capabilities, and it sounds like this was sort of targeting that segment.
Was that, sort of, speaking to the project management, product management aspect of this, was that distinction between sort of different personas and what the real use case was clear from the start, or it took some research to sort of get there?
Yeah, it definitely took some research. We started with kind of an ambiguous problem, this, like, amorphous, like, we don't tell customers when their origins are down, so we talked to all kinds of different customers.
We talked to some enterprise customers that were already using other tools to probe their origins and do this kind of discovery and had pretty sophisticated monitoring and alerting systems, and then we talked to some that were in that kind of use case that Ilya talked about, where they were just, like, I want to, I want to use Cloudflare, I want to not even think about it, I just want it to be, like, plug and play, everything works, and then only if there's an issue, tell me about it.
So, narrowing down on that use case, those types of customers that want a solution that just works™, was part of the process early on.
Got it. So, once you did that research and, or, I guess, it sounds like the work kind of split into two chunks, right?
There's sort of more classic product management, like, figure out what we need to build, who to build it for, et cetera.
Ilya, what were you and Zhengyao sort of doing while Annika was sort of doing the research on the users?
Yeah, I was just thinking about it. It was very interesting.
So, Annika joined, it was Zhengyao, the other engineering intern, also from the University of Illinois.
He joined a little earlier than me, and then I joined, and Annika did.
And before Annika came, our task for the summer was sort of: let's alert customers, with a very loose definition of customers, when origins are having problems.
And actually, what we started with, Zhengyao and I started looking at large enterprise customers, and then we looked at the data and tried to categorize different types of failures that they might be experiencing.
And then we also started looking at smaller customers, and it was a completely different type of problem, because obviously the types of failures are different. Like, a big customer doesn't just accidentally misconfigure their endpoint.
It almost never happens. But it might happen to a smaller customer.
And until Annika came, we didn't actually know who we were building this for, and it was very hard to drive forward.
So, to answer questions with data, you first need to ask the questions.
And those questions normally come from the product definition that you're trying to build.
And until Annika came, we didn't have, we didn't really have the questions we wanted to ask.
We were just poking at the data and looking at it very unproductively. And then Annika came, and we started, you know, making some more interesting progress.
Was that a learning experience for you?
Like, I feel like in college, sort of taking computer science classes and things like that, you're taught, like, here's a data structure, and here's some, like, problem set that involves implementing the data structure, and you write some code, and then it's like, it's done, right?
And with software engineering, writing the code is usually the easy part, right?
Like, figuring out the problem to solve, and the right way to solve it, and all that is much more difficult.
How much of that was new to, how much of that process was new to you guys?
The whole process was new to me, and I think to Zhengyao as well. To both of us, it was very new in that, with this data science-y project, you couldn't achieve, like, perfection.
You needed to say that, hey, by this criteria, we're going to capture 90% of use cases, and then for the next criteria, we're going to capture 90% of those use cases, and so on and so forth.
So, you sort of weed out some use cases that you just can't capture, given the data that you have, and you don't always have access to all the data in the world, and it's, those were the constraints we were operating under.
Let's dig a little deeper on that.
So, I remember sort of getting project updates from the team, and you started out trying to query, like, these giant data sets, and the queries were just, like, either timing out or giving you garbage.
How did you get from, like, that, which is, like, sounded pretty broken and miserable as an experience, to something that actually worked and delivered high-quality results?
Yeah, Annika, do you want to talk about how, like, there is a number of Cloudflare errors that fall under the category of "there is a problem with the origin," but we only focused on one?
Do you want to talk about that? Yeah, I think some of this was about taking the problem space that we had of, like, there is a problem between Cloudflare and an origin server, so, like, the origin server is down.
What does that actually mean, and, like, piecing that out?
And so, some of that was talking to customers.
Some of that was also, we learned about through talking to people on our SREs, our reliability engineering and networking team, who were able to really put in context for us, there's a whole range of, like, 500 errors, or the 5XX errors, the class of errors that we were trying to work with, and some of them are more easily actionable or are definitely related to, sort of, customer-side problems, and then some of them are much more difficult and might be related to, like, lots of different types of problems that could happen on the Internet in the path between Cloudflare and an origin server.
So, through those conversations, we were able to kind of narrow in a little bit on the specific types of 5XX errors that we wanted to look at, because we figured that those would be the most actionable and the most useful for customers to know about.
So, to be more concrete here: Cloudflare has those, not internal, they're Cloudflare-specific errors, the 52Xs, like 521, 522, and you can look them all up.
They describe different problems, like, hey, your SSL certificate is misconfigured, or the origin actively reset the connection, or we couldn't even reach the origin, which is a different error.
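The 52X codes Ilya is referring to are publicly documented by Cloudflare. As a quick reference, here's a sketch of a lookup table (the descriptions follow Cloudflare's support docs; the `describe` helper itself is just illustrative):

```python
# Cloudflare's publicly documented origin-related 52X error codes.
CLOUDFLARE_52X_ERRORS = {
    520: "Web server returns an unknown error",
    521: "Web server is down (origin refused the connection)",
    522: "Connection timed out",
    523: "Origin is unreachable",
    524: "A timeout occurred",
    525: "SSL handshake failed",
    526: "Invalid SSL certificate",
}

def describe(code: int) -> str:
    """Return the short description for a Cloudflare 52X error code."""
    return CLOUDFLARE_52X_ERRORS.get(code, "Not a Cloudflare 52X error")
```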
And, at first, we were trying to look at all of them at once. It wasn't making any sense at all.
And then, Annika told us, hey, we need to provide the customers with actionable advice.
We need to be sure that there is a problem, and we need to actually be able to tell them in a few sentences, like, how they can fix it.
And, within an intern project, the error that met those criteria was 521, where the origin actively reset the connection.
And so, we focused on that one single error, and tried to dig deeper into, like, what causes that, for how long are origins down when they are down, and all sorts of things. And what is an origin was another interesting question.
I think you asked a little bit earlier about the process, and what was different between the experience of, like, figuring out the problem that we were trying to solve, and the direction for the solution at Cloudflare, versus, you know, in a typical college class, where you're given sort of a crisper problem.
I had a little bit of product management experience before my internship at Cloudflare, so I was kind of familiar with the, like, customer interview and question-asking process, but what was really unique for me at Cloudflare was the amount of data that we had available to dig into to answer our questions, once we had those questions.
So, we were able to do a lot of analysis on the different error types, make all kinds of really interesting visualizations to show, like, these are the places that the errors are clustering, and, like, only for these certain types of customers.
So, there wasn't really, like, it wasn't like we were just taking guesses and poking around in the dark.
We actually had tons and tons and tons of data to answer the questions that we were asking, which I think is pretty unique and is fairly unique to Cloudflare, just because we have so many customers and so many requests and so much data.
Yeah, interesting. And it sounds like figuring out which data sets to actually use, like, you needed to figure out what needle you were looking for before you started digging around in the haystacks, right, which sounds obvious in hindsight, but, like, when, you know, yeah, you're like, oh, here's a database, and it's got some interesting information in it.
Let's see what comes out. That doesn't go so well. And, I mean, we've all been guilty of trying that, right?
Like, oh, like, let's just run some queries and see what comes out.
So, it sounds like, you know, you had some thesis on what alert types were actionable, and then we were able to make some progress on finding data sources that helped identify when those classes of errors were happening.
How did you actually ensure that that thesis, how did you go about proving that that was true, right?
Like, that all the research and stuff that you had done to that point was accurate and reflected real customer problems?
Yeah, we tested it a whole bunch.
So, we first tried, basically, with super small segments of free Cloudflare customers just tested the email out.
So, we would enable the notification and try a specific message, and then gather feedback from customers.
We would ask for feedback, actually, in the email, and lots of people- Hold on.
I think we skipped a step, right?
Like, where is the email coming from? Why don't we walk through, like, the sort of life cycle of a- Errors start happening, then what?
Of a discovery. Yeah, Ilya, you want to take it? Yeah, sure. So, errors start happening.
Okay, and here's the answer to another question Rustam asked previously.
Like, from the engineering perspective, what was very new to both me and Zhengyao is that Cloudflare runs a huge distributed system, even though we were building this simple origin monitoring project, which is some service that is running in some centralized location.
It's analyzing data passively, and then sending some emails out to somewhere.
You still need to understand where the data is coming from, and where it's coming from is now 200 plus different locations, and it's aggregated somehow, then sampled somehow, and then just a little bit of a- Yeah, it was a lot to learn very quickly.
So, what the service does, in a little more detail: every few seconds or so, it goes and looks for failed HTTP requests, essentially, and then it tries to see which unique origins those requests belong to, and which of those origins had a sufficient amount of traffic in the past some time period, so we can actually confirm that this origin is receiving a normal amount of traffic and we can make statistically sound decisions based on those requests failing. And then if, based on a bunch of criteria, we decide that, hey, it looks like this origin was up within the past 24 hours, but for the past five minutes it stopped responding to requests successfully, we decide we should send the owner of that origin an email saying, hey, we're having a problem, here's how you fix it. And then there is also a number of smart things taken into account, like we aggregate the origins for the same account, because people can have multiple of them.
We also don't send too many emails within a day. And again, we didn't send an email if an origin had been down for the past 24 hours, because then the person should already have noticed if they could be reached, or otherwise they probably don't care.
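The decision logic Ilya describes, check for healthy baseline traffic over the past day and then for consistent failures over the past few minutes, might be sketched roughly like this. All names and thresholds here are illustrative assumptions, not Cloudflare's actual implementation:

```python
from datetime import datetime, timedelta

MIN_BASELINE_REQUESTS = 100            # assumed: enough traffic to judge statistically
FAILURE_WINDOW = timedelta(minutes=5)  # "stopped responding for the past five minutes"
BASELINE_WINDOW = timedelta(hours=24)  # "was up within the past 24 hours"

def should_alert(requests, now):
    """Decide whether to email the origin owner.

    requests: list of (timestamp, succeeded) pairs for a single origin.
    """
    recent = [ok for ts, ok in requests if now - ts <= FAILURE_WINDOW]
    baseline = [ok for ts, ok in requests
                if FAILURE_WINDOW < now - ts <= BASELINE_WINDOW]
    # The origin must have had enough healthy traffic in the past day...
    if len(baseline) < MIN_BASELINE_REQUESTS or not any(baseline):
        return False
    # ...and be failing consistently right now.
    return len(recent) > 0 and not any(recent)
```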
Makes sense. Okay, so that's the pipeline, right?
So origin goes down, data gets written to some error tracking database, and then there's some process sort of querying that database at some interval and then making a decision on whether or not to alert the customer, and then that alert goes out in the form of an email.
Is that the rough sketch of it? Yeah. Okay, and so to go back to the original question, how did you know that the alert was high quality and that users cared at all about this, right?
We've all heard stories of building a product, being like, oh, we understand exactly what the problem is, and then just totally whiffing, right?
How did you make sure that that didn't happen?
Yeah, so Annika would be the best person to answer how we ensured that the customer liked them, but on the engineering side, the obvious way to ensure that the alert makes sense is to actually, when you see based on your criteria that the origin is down, the natural thing to do is then to try to poke the origin yourself and see if it's up.
And well, we did run those tests many, many times for different criteria, until we reached the point where essentially everything that we flagged from the log stream as down really does appear down when we poke it, 100% of those origins in our live test. Then we think we can send that to the customer.
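The "poke the origin yourself" verification step could look something like this minimal sketch, assuming a plain TCP reachability check; the real system presumably did something more sophisticated:

```python
import socket

def origin_appears_down(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Actively probe an origin: True if we cannot open a TCP connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return False  # connected fine, origin is reachable
    except OSError:
        return True  # refused, reset, or timed out: consistent with a 521-style failure
```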
But Annika can explain more about how just sending an email is not sufficient.
It has to be pleasant to receive and all that, and it has to make sense, and Annika knows the most about that.
Yeah, so we sent basically those emails to test groups of customers.
We asked for feedback. We got feedback.
People responded to the emails, wrote to us on Twitter, all kinds of things, so that was really useful to have in terms of qualitative response, like whether they thought that the email was helpful or not, if there were things in it that were really confusing.
But then, as Ilya said, we also had a quantitative way to measure whether this was working. So we would probe the origins that the alert said were down, and then after a certain period of time we compared whether the origins that we'd sent an alert on were back up. So we looked at basically a control group and then the group that we sent the notifications to, and the ones that we were sending notifications to when their origin was down were consistently coming up again after they had the notification.
So, that was our quantitative way of measuring whether or not the notification was working.
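That quantitative check boils down to comparing recovery rates between the notified group and the control group. A toy sketch of the comparison; the data and field names here are made up for illustration:

```python
def recovery_rate(origins):
    """Fraction of origins that came back up, given dicts with a 'recovered' flag."""
    if not origins:
        return 0.0
    return sum(o["recovered"] for o in origins) / len(origins)

# Made-up example: notified origins recover more often than the control group.
notified = [{"recovered": True}] * 8 + [{"recovered": False}] * 2
control = [{"recovered": True}] * 4 + [{"recovered": False}] * 6
```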
I forgot about that part. I am amazed that we did that.
Yeah, I was about to say, I'm impressed. It's a smart group of interns.
So, I mean, it's interesting because everyone, when you talk about alerting, mean time to resolution is almost a buzzword, right?
You're like, oh yeah, MTTR needs to go down, and it's awesome that you were able to bake that into the actual design of the development process.
It's like, here's our MTTR target.
Is it going down? And then as we improve our product, can we keep getting MTTR to go down?
That's a really neat story, and a great example of being actually data-driven, right?
I think that was, I want to credit really great feedback from you and then also JPL, who's another product manager.
That was the feedback that we got on our original KPIs for this project, which were something like customer engagement with the emails.
If people are opening the emails, or if they are clicking on the notifications, or if there's increased traffic to the support page about 521s, all these kind of metrics that were actually sort of indirect, and the great advice that we got from you guys was, well, people opening emails doesn't actually mean that the problem's getting solved.
The problem is solved if the origins go back online, so how can you measure that?
And it was a little bit of additional work to do that, but I think it was worth it, and it led us in the right direction.
Cool. I think it was a less good idea than I now remember it being.
No, I'm just kidding. So, okay.
So, the product sounds super useful, and it sounds like customers thought it was too, and you were able to sort of iterate quickly and sort of keep getting the product to become more useful, and have like a clear success metric that improved as well.
That all sounds really neat. What allowed you to do that, right? We've all heard the sort of intern horror stories where folks show up at a company, and then they get assigned a desk, and then they sort of sit there for 12 weeks, and nothing happens.
What explains your ability to, as a team, be really productive and hit the ground running like that?
I have some ideas, but Ilya, what do you think?
I think you should start, because really, when you joined was when we started, in a way, to iterate faster, so you should talk about this a little bit.
I mean, I think there's two things that really stand out for me.
One is it was really awesome to work with other interns, because the three of us were just focused, like, laser focused on this project for the whole summer, so there weren't a ton of other distractions or other things that we were working on, and also, this was sort of like a pretty standalone project, too.
We were sort of embedded in the Argo team. Our mentors and managers were in the Argo team.
We got a lot of great advice from the Argo team, but this was a separate product from Argo, so there were very few dependencies on other groups, and that allowed us to go faster as well. And then I think the other thing was just setting MVPs and milestones for the project that were as slim as possible. So instead of, like, coming up with this grand vision for what we wanted it to be and working the whole summer to deliver that at the end, we were like, no, we're going to make this really, really, really small, almost embarrassingly small thing, ship that, and then learn from it and iterate on it fast. And I think that allowed us to make a lot more progress than we would have if we'd built toward a grand vision and then tried to release it all at the end.
That process you just described is way easier said than done, right?
What enabled you to be successful in actually scoping the thing down into something practical to ship?
That's a good question.
I think always coming back to the question of what is the actual value that we're trying to deliver to the customer, like, what does the customer care about, and what's the minimum thing that we can do in order to satisfy their requirements?
So, the, like, you know, maybe we could have delivered a bunch of, like, really fancy visualization and diagnostic tools in the UI that they could, like, look at and zoom in on and see all kinds of other analytics and information about their origin health, but really the core problem was just, like, I want to know before, or at least at the same time, you know, soon when my origin is down, and so just, like, just tell me about that.
Like, that's the simplest, smallest core of the problem, and so anytime our exploration questions or ideas about what other things that we could possibly do started reaching outside of that, it was good to just kind of, like, focus in and rethink about, like, okay, but are these actual other things that we're talking about contributing to solving that core problem?
And often they would be really cool, and, like, customers would probably like them, but not directly related, and so we were able to slim down and de-scope.
Makes sense. We talked about a lot of things that have gone really well.
What were some things that went less well over the course of building this thing?
Or, like, what things were hard that felt like they should have been easy? Ilya, do you want to talk about false negatives and false positives?
Yes, yes. I think that's the hardest.
The biggest challenge that we had for a while was that Cloudflare has a lot of data coming from all those requests going through the Cloudflare edge, but then accessing the data is actually a harder problem than you might think it is, because there's just so much of it, and you can't read too much without, you know, taking something down.
So, the data pipeline is set up so that you can either get a sample of the requests with, like, all request fields and all the information about each request, but only a sample of them, or you can get all the requests you want, but about each request, you only get a few fields. So we had to work around the problem: we had the data sources where we have all the requests with a little bit of information about each one, and just a portion of requests with a lot of information.
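One hedged sketch of how those two data sources might be combined: count 5XXs per origin from the cheap full stream, then use the detailed sample to estimate what fraction of those are the specific error you care about. Everything here, the field names and the helper itself, is an assumption for illustration:

```python
def estimate_521_count(full_stream, detailed_sample):
    """full_stream: (origin, status) pairs for every request (few fields each).
    detailed_sample: fully detailed sampled requests, as dicts with a 'status' key.
    Returns an estimated count of 521 errors per origin."""
    # Cheap pass over the full stream: count 5XX responses per origin.
    errors = {}
    for origin, status in full_stream:
        if 500 <= status < 600:
            errors[origin] = errors.get(origin, 0) + 1
    # Use the detailed sample to estimate what fraction of 5XXs are 521s.
    sampled_5xx = [r for r in detailed_sample if 500 <= r["status"] < 600]
    if not sampled_5xx:
        return {origin: 0 for origin in errors}
    frac_521 = sum(r["status"] == 521 for r in sampled_5xx) / len(sampled_5xx)
    return {origin: round(n * frac_521) for origin, n in errors.items()}
```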
So, how do we, given the fidelity of data, how do we not cause too many false positives?
How do we not cause any false positives, actually? Because I wouldn't want to send the wrong emails.
That's a scary one. Your origin's down! Yeah, it was very hard, and that's why I was talking about the filtering decisions, and why it's fine if we have a lot of false negatives.
Just so that we're all clear, in the context of this project, what's a false positive?
Yeah, a false positive is when we send an email when we shouldn't have.
It's a very bad thing. We don't want to send an email when we shouldn't have.
A false negative is when we don't notice that an origin's down, and not noticing that an origin is down is fine, as long as most of the times we do.
For the purposes of this project, it's a nuanced question, and we had to make decisions like: based on that request sample we're looking at, we want to set up some baseline of data we want to know about each origin, so that we know, hey, there is enough traffic flowing into it that we can make statistically significant decisions about the traffic that we're seeing.
Not that just when we see one request in the past hour and that request failed, we don't just decide, oh, it looks like the origin is down.
It doesn't make any sense.
We don't want to send email based on that. It sounds like each of these little edge cases sort of made the alerting algorithm a little more complicated.
Were you testing that sort of in the lab, or were you iterating and improving that algorithm by actually emailing customers?
There was a little bit of both, right?
We were obviously not... Yeah, I didn't remember.
I think, Annika, you were just telling me that you remember we made some mistakes when sending emails, but I didn't remember that.
Yeah, it was actually cool. I mean, one of the great things about the fact that Cloudflare offers a free service to lots of users is that a lot of people appreciate that, and they like Cloudflare and they love the fact that they get to use it for free. So a lot of people answered our emails when we sent an incorrect email, an email for a false positive, and were like, hey, I checked out my origin.
It looks like everything's fine.
I'm wondering where this email came from, but they would give us a whole lot of data and information and diagnostics from their perspective and other tools that they had that helped us chase down the source of those false positives.
That was cool that we were able to do that, and I'm grateful to those couple of users that went super in-depth with explaining to us exactly what they thought happened.
We sent a couple of them t-shirts. Shout out to those guys. In the email, there was literally a call to action that was like, if you think this is useful, please tell us, and if you think this isn't, please tell us.
Please tell us.
Yeah, totally. Just to sort of close the loop on this, what happened to the project?
It sounds like at the end of the internship, you had gotten to a good place and this was in production.
What ended up happening with the project and the product after you guys went back to school?
It's still running out there. It's still sending emails, and that's the most exciting thing about it.
It was a completed thing, and obviously, there are a lot of things you can build on top of it and ways you can expand it, but it also accomplished the mission that it needed to accomplish, and it's still running out there and sending those emails to this day.
We found a full-time engineering team to take ownership of the project.
I think that showed how useful it was.
We could have been like, oh, that's cool. Interns build this thing and then shove it into the corner and let it die, but we found a real cool full-time engineering team to take it over.
Then, Ilya, you actually flew out from Champaign to San Francisco at that point to do some knowledge transfer and share with that team how the thing worked, right?
Yes. Wait. Do we want to answer the question that we got before we...
Oh, we got a question? Yes. Yes. Ilya, do you want to read the question?
Yes. We were asked if we can explain what origin means again, and it's an interesting question because you intuitively think that an origin is something like a website or a web server, but when you actually need to make data-driven decisions, you need to define it very precisely.
Before defining precisely what an origin is, in general, what is an origin?
Annika, do you want to take this?
Yes. Yes. Sure. Without Cloudflare, when a user is trying to access a website, they send an HTTP request, and it has to go somewhere.
Presumably, that's a computer.
It could be your computer that you own. It could be someone else's computer, like Amazon's or Google's, that they own and that you're borrowing from them. But basically, the place where that request goes to be processed, you can roughly think about that as the origin.
When Cloudflare is in the picture, we sit in the middle, and we cache your data, and so sometimes, a lot of the time, requests that come to Cloudflare, we can just serve right back, but then sometimes, the requests do have to go all the way through Cloudflare back to the origin server.
It's where content originates from, which is, I guess, why it's called the origin.
Now, these two words, origin and eyeball, I probably use – I don't know.
I feel like if you're a Cloudflare employee, you say these two things hundreds of times a day, and if you're a normal person, you say eyeball once a year when you go to the doctor.
But yes, eyeball is our – it's not just a Cloudflare term.
It's an industry jargony term for a user or a visitor, because just talking about clients or servers gets confusing, so talking about eyeballs as the entity requesting content and then origins as the entity that ultimately serves content is how we keep all these things straight.
But Ilya, what was the point you were making about – it turns out, actually, defining origin at a deeper level is a much more complicated problem.
Why don't you speak to that? Yeah, so our first attempt was to define origin as a hostname.
It's essentially like the URL you're trying to access, like example.com.
That stopped working pretty quickly, because behind the same hostname, there might be several physical or virtual web servers with their own IP addresses, like when you're running on AWS or something.
If you have virtual servers running somewhere, you might misconfigure one of them but not the others, so you don't want to mix the data together when trying to decide whether the origin is healthy or not.
So, we need to make this definition more precise. So, we took the IP address into the definition.
It became the hostname plus the IP address. Then we ran into the same problem again because now, on the same IP address, you can have different ports and those would be the different origins.
I think, if I remember correctly, what we settled on eventually is the hostname, the IP address, and the port is what defines an origin.
Then we looked at the metrics for each of those.
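The definition Ilya describes can be sketched in a few lines of Python. This is a hypothetical illustration, not the team's actual implementation: the point is that keying every metric by the full (hostname, IP, port) triple keeps data from distinct backends behind the same hostname from being mixed together.

```python
from collections import defaultdict

class OriginStats:
    """Per-origin counters, where an origin is a (hostname, ip, port) tuple."""

    def __init__(self):
        # Each key is one origin; values are simple counters.
        self.requests = defaultdict(int)
        self.errors = defaultdict(int)

    def record(self, hostname, ip, port, ok):
        # The tuple key is the whole definition: same hostname but a
        # different IP or port counts as a different origin.
        origin = (hostname, ip, port)
        self.requests[origin] += 1
        if not ok:
            self.errors[origin] += 1

    def error_rate(self, hostname, ip, port):
        origin = (hostname, ip, port)
        if self.requests[origin] == 0:
            return 0.0
        return self.errors[origin] / self.requests[origin]
```

With this keying, a misconfigured virtual server behind `example.com` shows a high error rate on its own (hostname, IP, port) tuple, while a healthy server behind the same hostname stays clean.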
Cool, makes sense. Thanks for the question, Deepak. So, Annika was working in San Francisco.
Ilya flew out to San Francisco to meet with the engineering team that was sort of assuming ownership of the project after you guys left.
What else happened that week while you all were in the same place? I think the other thing to call out here is you were working remotely, basically.
Ilya and Zhengyao were in Champaign and Annika, you were in San Francisco.
Actually, before we get to what happened that week, how did that remoteness work and what made it work?
Obviously, super relevant in COVID times now, right? We had daily calls, which were very useful.
Yeah, just to chat and to sync every day; for the three of us, it was very helpful.
And then Annika would come with a lot of insights from the San Francisco office, where she'd be like, oh, I just talked to this person and they gave advice about this and now I know this.
And we would come with some information from Nick, the manager in the Champaign office, and then we would talk to each other.
I remember that was when we were in, like, sorry, go ahead, Annika.
The time zone thing was interesting to navigate. I'm in like sort of the opposite situation now because I'm on the East Coast and a lot of the people that I work with are on the West Coast.
But over the summer, I would wake up and then like be on the train on the way to the office and Ilya and Zhengyao would be like texting me being like, oh, we found out this cool thing, or like, look at this new cool visualization that we made, or like, oh, like this feature that we talked about yesterday, we're already ready to ship it.
And so I would show up at the office so excited, because there was, you know, two hours' worth of extra work already ready to go, which was actually really exciting.
I remember we were in like peak conference room shortage that summer.
And I remember like looking over at Annika, you'd be like huddled in the corner on the phone.
I like found a little spot in the corner that not many people like walk past or knew about or would use.
And so I'd just be like hunched over, that was probably not great for my like spine health, but we got a lot done.
Yeah, cool. Okay, so you guys were all in San Francisco.
Any funny stories or amusing things that happened while you were all there together?
This isn't a leading question at all. Annika, you should tell us.
Yeah, so Cloudflare has intern presentations at the end of internships, which are an awesome way for the company at large, and the other interns that you're working with, to learn about what you worked on over the summer and for you to showcase your work.
And then we also have a weekly all hands meeting. And so Ilya and Zhengyao came to San Francisco to do the intern presentation, and we'd also planned to do an all hands presentation.
And so we like spent the whole week like working on our slides and practicing and getting ready for this presentation.
And we were super hyped about it.
And then we showed up, I think, yeah, on a Thursday. But what actually happened was a little bit different than what we'd anticipated.
Much more exciting.
Can we just play the clip? Yeah, do it. So this is a recording of...
Actually, let me make sure the... Share computer sound.
This is a recording of our all hands meeting where you guys were meant to present.
And the speaker right now is Matthew, our CEO. So let's hit play. So the first thing that I have to do is...
Gosh, I'm already... is apologize to Annika and Ilya.
They were supposed to present today. They put together a whole presentation to get ready.
And so I'm very sorry. They're interns. Ilya is based in our Champaign office.
Annika is here. They're doing something on origin protection. But we had some other news this morning.
Are we okay with the news this morning? Okay. So here's the news.
First of all, introducing Project Holloway. We'll tell you more about what that means in a little bit.
We said back in Q1 of 2018 that one of the things that was really important was for us to not get distracted.
And there were going to be a lot of rumors and things that were going on.
And people were going to talk about what was going on at Cloudflare and everything else that was going on.
Every once in a while, some of the rumors actually turned out to be correct.
And so we filed to go public. I still feel excited when I see this.
We were sitting there in the audience, right? Yeah.
I remember feeling so bad. I was obviously super excited. But I was like, oh my God, these guys worked so hard on that presentation.
But I wouldn't remember an intern presentation for the rest of my life.
I feel like I'm going to remember this for the rest of my life.
Like, how many people get to experience something like that?
Yeah. And we didn't get to present. But the most knowledge comes from the practice rounds.
And we practiced a lot for that presentation.
And Ilya, the environment in the office that day was pretty nuts. People were just super excited.
So it was cool that you all got to see that. Yeah.
And then you even had like a cameo in the IPO Roadshow. I remember both of you showed up.
Just because we were so excited. We were like sitting in the audience looking so pumped about this.
Wait, what? I don't know if you even know that.
But yeah, Ilya, you're in the video somewhere. I don't even know how to find that at this point.
Okay. So that was last year. We've got like 15 minutes left. You're obviously back at Cloudflare, both of you now.
I guess first question for you, Annika.
You finished this 10, 12 week internship, whatever it was. And then we asked you to come back as a full-time employee.
Why did you choose to come back? That's a good question.
I'm going to have a cheesy answer, but it's honest. There's two reasons actually.
One of them is just I think Cloudflare's products are awesome. I think the product that I get to work on is really awesome.
I'm so excited about it.
I'm so excited about the vision for the future, the point that it's at right now, the incredible growth that it's experiencing and being able to be a part of that is really unique and huge.
And I'm so excited about it. And then the other thing is the people at Cloudflare, there's just so much brilliance and so many really good leaders.
And I just wanted to come back and continue to soak up all of the knowledge that I just barely started to soak up being around for the summer.
So those are my two reasons.
And I think both of them have been satisfied so far in my experience.
I'm really excited to see what happens next. No pressure. Cool. And then Ilya, same question to you.
You, I'm sure, had a lot of other options to do another internship at a different company.
Why did you come back to Cloudflare and potentially stay beyond that?
Yeah. Cloudflare, for me, the main reason is Cloudflare is very deeply technical.
There are very few companies that are this technical.
And from the engineering perspective, if you just chose a team randomly to work on, you would have a lot of fun.
There is just so much happening so quickly.
And the amount of agency, I don't know if it's the right word, that you get as an intern is just unparalleled.
And those are my reasons. It's very hard for me to envision an environment where I would learn faster than I'm learning here.
Yeah. So what product are you working on now?
I think we skipped that part. So yeah, Annika.
I can go. Yeah, I'm working on Magic Transit, which is Cloudflare's product for networks.
So you think of Cloudflare, sort of Cloudflare classic, as being all about websites.
And we have DNS and web application firewall and CDN and bots and all of these things that make websites and other things running on the web and with HTTP traffic faster and more secure and more reliable.
And Magic Transit is all of that faster, secure, more reliable goodness at the network layer.
So actually protecting and accelerating our customers' IP traffic, any packets that are on the Internet.
Yeah. So over the summer, so after my first internship, I also spent the semester working part-time on Magic Transit, because I didn't get enough.
You couldn't get enough.
Yeah, yeah. I was just so excited about it. And over the summer, I went ahead to learn a little different part of the Cloudflare stack.
I went up a few protocols, and now I'm working on the Always Online feature on the Cache team.
And so actually, a different story that came to mind. While you were wrapping up your internship last summer, you actually saw the beginning of Magic Transit, before it was even called Magic Transit.
What was that like? Yeah. So while we were building our little origin monitoring thing, the real professionals right next to us in the office were building a new huge service called Magic Transit.
And we got to see that a little bit. And I remember one particular night, everyone, all the full-time engineers stayed late.
And they were testing Magic Transit by putting the Austin office behind the service.
And they had to test outside of their working hours, because they didn't want to take the Internet connectivity down for all the engineers in the office.
And I think that night, it didn't go well. Luckily, no one was working over there.
That's why we dogfood. Yeah, that's literally dogfooding.
Yeah. So just to make sure our viewers understand. So because, as Anika said, Magic Transit is focused on IP networking, the way we were doing initial, beyond sort of lab validation of the product's usefulness and effectiveness was by actually taking chunks of the Cloudflare corporate network and putting them behind Magic Transit well before any customers ever saw it.
So our first customer was our Austin office network.
And to Ilya's point, we have hundreds of people working in Austin.
We don't want to take their Internet offline. But we do need a real test environment.
And so that was chosen as sort of victim number one.
And yeah, we had some bumps and bruises, right? But that's truly why we do it.
And now, well, actually, even before COVID, all the offices were using Magic Transit without issue, right?
So it sort of shows the life cycle. What's being a PM like?
And Ilya, you have some experience as an engineer on the product, too. What does a day-to-day look like working on Magic Transit?
Oh, good question. Um, it's a lot like being a PM intern, just, like, cranked up, like, 1000%.
So yeah, I talk all day.
I talk to a lot of people. I talk to customers. I talk to people on our sales team, on our marketing team, on our support teams and solutions engineering teams.
And my job is really just to learn about what all of those people, especially our customers, care about, what their problems are, and how we can work to build things that solve their problems.
So it's a lot of talking, a lot of learning, a lot of asking questions.
I think I totally agree with Ilya's statement that there is nowhere where I think I would learn at a faster rate than working at Cloudflare.
But it's really exciting. It's a really exciting place to be.
And then basically after all of that, after gathering all of this information, occasionally I do have to actually synthesize it and help the team figure out a direction that we're going to go for specific features.
So that could be something small, like we're going to introduce a specific type of visibility for our product, something that we had as an internal metric but want to give to customers to help them understand how the system is working better.
Or it could be a whole new big giant feature that we want to add to the platform that's going to deliver a huge amount of customer value and maybe bring some new customers in.
So broad range of stuff, lots of info, but it's really fun.
Ilya? Yeah, and as an engineer working on that team previously, it was different from the intern experience in that you have a very, very well-scoped small problem that is very complicated, and you need to work specifically on it.
Even Origin Monitoring was scoped as a whole product, but now you're working on a very, very specific engineering problem as an engineer.
You're saying that the problem space is larger, right?
Like, you have more, like, when you finish something, you're like, okay, what do I work on next?
And it's like, well, you have a thousand options.
Yeah, yeah, yeah. And you pick the one that is the highest priority and work on it very deeply.
And it's also a much higher-stakes, much more technical product, so you have to be very careful.
And the stakes are higher, right?
Rather than just sending too many emails, you can actually take part of someone's network off the Internet.
So you have to be very careful about how you're developing this.
Yeah, I mean, I think one of the interesting things about how systems are set up at Cloudflare is, Uzman explains this much better than I do, but we have sort of the edge data plane, which is like the, I have to avoid using the word core because that's the other thing I'm going to say, but the central systems that really process customer traffic in the 200 plus locations we have around the world.
And then we have sort of supporting services that help do things like bill people and alert people when their origins are down, or any number of other more ancillary functions.
And working in those two environments is very different, right?
You've seen both sides of that now. Yeah. Although you can cause an outage.
You have to be careful everywhere. Yeah, right. It's not to say the stakes are any different.
It's just that the sort of, yeah, the way we work in either of those environments is a little different.
Yeah. Different problems.
Yeah. If you guys were, if you two were trying to teach Annika and Ilya from a year ago, the most important lessons you sort of picked up in your time at Cloudflare, what would they be and why?
That's such a good question.
Do you mean like, if we were going to give ourselves something at the beginning of our internship that would have made our internship better, or just like in general, what's the, I mean, yeah, there's a couple different ways to tackle that.
Right. But like, what's the most, like, maybe surprising or non-intuitive thing you've learned that is important to being successful as software engineering professionals versus college students?
I think for me, again, this is going to sound cliched and cheesy, but like question, question every assumption.
I think like product management is a really interesting balance of using quantitative data, using qualitative information that you collect from customers.
And then also this third thing of like Spidey sense.
And like, once you've gathered enough data, you can, and seen enough things and you have experience, then there's a lot of sort of guesses that you can take that will lead you in the right direction.
But I think even continuing to question every one of those guesses and the assumptions that are at the root of those guesses has been a really important lesson that I've learned and I'm continuing to learn at Cloudflare.
I think Cloudflare people are really good at doing this, like looking at a problem and being like, well, why do we do it that way?
And sometimes there's a really good reason. Sometimes there's a really complicated reason, but there's usually a reason and asking that additional question of like, why, why is it this way?
Instead of just accepting that it is, I think is really important.
Yeah. Sort of just really being explicit about trying to avoid the conventional wisdom trap of like, oh, obviously software has X feature, so we need to build that feature now.
Yeah. Yeah. And if you get, if you get to the point where you're like, well, this is how we've always done it or like, well, this is how other people do it.
Like that's the wrong, that's the wrong path.
Most of the time. Most of the time. I can think of some painful counterexamples, but yeah.
What about you, Ilya? Yeah, I was thinking to answer this question as an engineer, and actually my answer relates a lot to what Annika just said.
I would say really slow down sometimes and think, think things through from the engineering perspective.
And you would often discover that you need to do one thing well, and a lot of things around it you don't need to do at all.
And when you really slow down and think about it, like you, you decide, oh, here's a lot of stuff that I don't need to do.
And I can really focus on something small and do that well instead.
And slowing down is actually a beneficial thing here.
And as a person, I learned to over-communicate, I think.
Well, maybe I'm just too talkative now. It really helps, especially in a remote environment with a lot of written communication. We have so much knowledge in everyone's heads, and sometimes you need to take that knowledge and put it into someone else's head, but you don't know how much of your knowledge is already there.
So just assume none of it is there and over-communicate, it really helps.
Yeah, as a person. I remember the first time I met you and you were a man of very few words.
So it's a turnaround for sure.
So we have three minutes left.
Any other sort of words of wisdom to impart on folks sort of thinking about looking for internships or pursuing jobs in companies at Cloudflare or companies like Cloudflare sort of in the near future or far distant future?
I think neither Ilya nor I majored in CS, and definitely there have been times at Cloudflare and in previous jobs where I've been like, oh man, it probably would have been really helpful for me to have taken an algorithms class or something.
But I think a lot of people will get either intimidated by that or just assume that like that is the sort of default and correct path into tech, especially for super technical companies like Cloudflare.
And I think that's wrong. I think you can learn a lot of stuff on the job.
I am like a pretty good example of this because I'm the product manager for a network layer product and I had almost zero networking knowledge before showing up at Cloudflare full-time and I've learned a lot in the past couple of months.
So I think, yeah, like don't be afraid.
Just be really excited to learn and to ask questions and to dig into things yourself and you'll figure it out.
You don't need a CS degree. It probably helps.
I'm just thinking about our PM team and just the backgrounds of the folks.
I mean, there's certainly some that fit the sort of classic profile: CS undergrad who then went to work at some big tech company as a PM out of school and all that.
I mean, I match part of that profile.
But then there's a lot of us that, our chief product officer, Jen has a public policy bachelor's, right?
And she's, you know, very technical; that doesn't get in the way of anything.
Yeah. Well, thank you both so much for joining.
And my sister is watching and says that I'm a bad influence on Ilya for encouraging the use of the green screen.
So it wouldn't be a Cloudflare TV episode without me getting some amount of trash talk from my family.
But yeah, thanks for joining and tune in in two weeks for the next episode of How Cloudflare Built This.
Thank you both. Thank you. Thank you, Annika. Bye.