📊 Data Engineering and Analytics at Cloudflare
Presented by: Harry Hough, Uday Sharma
Originally aired on April 13 @ 5:00 AM - 5:30 AM EDT
Join Harry and Uday, Big Data Engineers at Cloudflare, as they discuss what role data engineering plays in a data-driven organization.
Analytics
Business Intelligence
Transcript (Beta)
All right, cool. We're live. Let's do this. So hi, everyone. I'm Harry. So thanks for joining us today.
I'm a big data engineer at Cloudflare on the BI team.
And joining me today is Uday, who is a principal engineer who is also on the BI team.
We happen to work closely together on a number of different projects. I've been at Cloudflare now for about one and a half years working as a data engineer.
I previously worked for about four years as a sort of hybrid software engineer, data engineer at a small startup company also here in Austin.
Uday, would you mind introducing yourself before we jump in?
Yeah, sure. Thanks, Harry. Hi, I'm Uday. I've been at Cloudflare for the past two years now and I've been working in the data and analytics space for the past 12 years.
Cool. Yeah, just for the record, two years is a long time in Cloudflare years since the company is growing so quickly.
I believe we've at least doubled in size since then. When Uday joined, I believe there were only maybe two other people on the BI team and we have now grown to more than 25 people.
Thanks, Harry, for making me feel old. You're welcome anytime.
All right, so let's get started. Today, we're going to give some background on why you need data engineers and talk about how data engineering plays a really important role in any data-driven organization.
So next slide, please. Thank you.
So why do you need data engineering? The next few slides, I'm going to attempt to answer that question, but I think first, it's critical to have a little bit of background.
So let's pretend for the sake of this example that we're running a small startup e-commerce SaaS platform similar to something like Shopify.
The architecture is going to look something like in the slide above.
So basically, you might have some relational database, an API backend layer, and then a frontend layer with a storefront that might be kind of HTML, CSS, and then a control panel, which might be a more sophisticated web app that lets you control what's in your storefront and then also shows you analytics, like how many people are visiting your store, how much money they're spending, all this kind of stuff.
So let's say that I'm the CEO of this small e-commerce SaaS platform.
Like if I want to answer some basic questions about the state of the business, it's very, very, very easy.
All of this data is in a single database. There's no consistency issues, no distributed systems, no microservices.
We've crafted here a beautiful monolith.
Some people might not think monoliths are beautiful, but I think in this case, they're pretty good.
So we have some simple questions on the slide that we want to answer for our imaginary CEO.
It's extremely easy to answer these. So for example, what is the average lifetime value of our customers?
What are the top 10 customers with highest usage of our product?
In effect, the software engineer writing the SQL to answer these questions is acting as the data analyst.
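As an illustration of how simple these two questions are against a monolith, here is a minimal sketch using Python's built-in sqlite3 module. The schema, table names, and sample rows are assumptions made up for this example, and "usage" is approximated here as units ordered.

```python
import sqlite3

# Hypothetical monolith schema: one database holds customers and their orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL, units INTEGER);
    INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex'), (3, 'Initech');
    INSERT INTO orders VALUES
        (1, 100.0, 5), (1, 50.0, 2), (2, 300.0, 20), (3, 30.0, 1);
""")

# Average lifetime value: total spend per customer, averaged over customers.
avg_ltv = conn.execute("""
    SELECT AVG(total) FROM (
        SELECT SUM(amount) AS total FROM orders GROUP BY customer_id
    )
""").fetchone()[0]

# Top 10 customers by usage (units ordered), joined to their names.
top_customers = conn.execute("""
    SELECT c.name, SUM(o.units) AS total_units
    FROM orders o JOIN customers c ON c.id = o.customer_id
    GROUP BY c.id
    ORDER BY total_units DESC
    LIMIT 10
""").fetchall()

print(avg_ltv)
print(top_customers)
```

Both answers come from one query each because every table lives in the same database, which is the point being made about the monolith.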
In this type of situation or architecture, we really have no need for data engineers.
And I think that this is why it's hard for people to understand why data engineers are needed a lot of the time.
So we have an evolving architecture, right? So as the company grows in size, and our customers grow, we might have to make some changes to this architecture for scalability reasons.
One example of this is that relational databases tend to use row-based storage and the performance ends up being pretty slow for analytics type queries.
They also tend to prioritize consistency over performance or availability.
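To make the row-based versus analytics-friendly storage point concrete, here is a small conceptual sketch in plain Python. It only models the two layouts in memory; real database engines differ in many ways (on-disk pages, compression, indexes), and the field names are assumptions for illustration.

```python
# Row-oriented layout: each record is stored together, which suits
# fetching or updating one whole order at a time.
rows = [
    {"order_id": 1, "customer_id": 10, "amount": 100.0, "note": "gift wrap"},
    {"order_id": 2, "customer_id": 11, "amount": 25.5, "note": ""},
    {"order_id": 3, "customer_id": 10, "amount": 74.5, "note": "express"},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


def total_amount_rows(records):
    # Every full record is visited even though only one field is needed.
    return sum(record["amount"] for record in records)


def total_amount_columns(columns):
    # Only the single column the query needs is read.
    return sum(columns["amount"])


columns = to_columns(rows)
print(total_amount_rows(rows), total_amount_columns(columns))
```

An analytics query such as "sum of all amounts" touches one column out of many, so a layout that keeps each column together has far less data to scan.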
So the architecture in this slide looks pretty similar to our last slide.
It's just that we've now added a message queue and an OLAP analytics database.
But we're still serving data through the same API, and we still have the same front end, and of course, we still have our relational database.
So now what if we try to answer these same questions that we asked on the previous slide?
For example, what is the average lifetime value of our customers? We can still answer this question without any issue, right?
Like everything is still in the relational database except our analytics data.
So what about the second question?
What are our top 10 customers with the highest usage? We can answer this question, but it is getting a little more complicated, right?
We've got to query our analytics database instead of querying our relational database.
And in our analytics database, we might have the account ID, but we don't have account information.
So what if we want to know what those customers were billed for that usage?
It's not actually possible to join across databases.
We can export to a spreadsheet or something similar, right?
But it's not really a scalable solution, and it's not going to work.
In this example, it's top 10 customers, but what if it's top 100,000 or top a million?
We're really reaching the limits of spreadsheets. I'm sure people have tried this before, and it's definitely not a fun time.
We're also only asking to join and merge two different datasets together.
What if we had to merge five datasets or 10 or 100?
It's really slow and cumbersome. There's also really a limit to the complexity of queries and joins you can do in a spreadsheet, and even small changes are going to be manual, right?
This isn't SQL where we can just adjust our query.
We've got to re-import everything into our spreadsheet. I mean, this is clearly not a scalable process, and also the data is out of date immediately, unlike a SQL query, where every time we run it, we get the most up-to-date, fully consistent results.
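To make the cross-database problem concrete: since no single query can join the two stores, the merge has to happen outside them, whether in a spreadsheet or in code. The sketch below uses two separate in-memory sqlite3 connections as stand-ins for the relational database and the analytics database. The table names, columns, and the function `top_accounts_with_billing` are hypothetical.

```python
import sqlite3

# Stand-in for the relational database: account and billing information.
relational = sqlite3.connect(":memory:")
relational.executescript("""
    CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, name TEXT, billed REAL);
    INSERT INTO accounts VALUES (1, 'Acme', 120.0), (2, 'Globex', 900.0),
                                (3, 'Initech', 15.0);
""")

# Stand-in for the analytics database: usage keyed only by account ID.
analytics = sqlite3.connect(":memory:")
analytics.executescript("""
    CREATE TABLE usage_events (account_id INTEGER, requests INTEGER);
    INSERT INTO usage_events VALUES (1, 500), (2, 7000), (2, 3000), (3, 40);
""")


def top_accounts_with_billing(limit=10):
    """Join usage and billing by hand, because no query can span both stores."""
    top = analytics.execute(
        "SELECT account_id, SUM(requests) AS total FROM usage_events "
        "GROUP BY account_id ORDER BY total DESC LIMIT ?",
        (limit,),
    ).fetchall()
    if not top:
        return []
    ids = [account_id for account_id, _ in top]
    placeholders = ",".join("?" for _ in ids)
    info = {
        account_id: (name, billed)
        for account_id, name, billed in relational.execute(
            "SELECT account_id, name, billed FROM accounts "
            f"WHERE account_id IN ({placeholders})",
            ids,
        )
    }
    # Merge the two result sets in application code, keeping the usage order.
    return [(info[a][0], total, info[a][1]) for a, total in top if a in info]


print(top_accounts_with_billing(limit=2))
```

Every new question needs another hand-written merge like this one, which is the overhead the rest of the talk is concerned with.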
So as you can see, we're already running into some serious issues.
And honestly, this architecture at most companies would be considered extremely, extremely simple still.
So things are about to get a lot worse.
So data hell, or maybe we should call this, give me my monolith back. Because as our little e-commerce platform grows, right, we need to do some things to more easily scale our teams, not just our traffic.
So it does make sense to some degree to separate things to reduce single points of failure, as well as allow teams to push code and database changes without relying on other teams, and make sure that any changes that could affect other teams are well-understood API contracts, rather than just database changes.
This isn't always possible, but it's a good goal. So in this example, right, we have a billing service, an inventory service, an analytics service.
And effectively, we could have different teams completely responsible for these different services, which is a little bit more efficient.
We also have third-party data, such as customer support tickets, CRM data like Salesforce, marketing data, email provider data, and many, many more that we could need to bring in, right?
But otherwise, the overall architecture is pretty similar, especially in terms of our front end, right?
It does look pretty much the same.
But the bottom line here is, right, that our data is getting more and more distributed.
So one quick solution to this might be to start replicating tables across services, or even directly joining this data in code, right, which is not very efficient.
So do we need to replicate this data into 10 different systems? At this point, our scalability and reliability are ruined by going down this path.
We're just really creating a giant mess. And we're also probably not a data-driven organization.
So most questions are going to be one-offs, likely, right, around important events, such as Black Friday, and will require one-off engineering efforts as a result.
So this really means that we have enormous overhead to answer even the simplest questions.
So enormous overhead and not a scalable solution.
So honestly, at this point, I don't even think it's really worth trying to answer our original questions again, right?
We basically can't, like I said, without complex one-off solutions.
So how do we solve these issues?
What happens when your organization gets to this kind of level of complexity?
And even then, right, people have systems that are much, much more complex than even what we're showing on this slide.
So next slide, please. So the answer to this, right, really is data lakes and data warehouses, at least in the data engineering world.
And data lakes and data warehouses are really a fancy way of saying a data monolith.
Like while this is not a monolith in the traditional sense, right, and is actually a complex distributed system, it actually feels like, and more importantly, acts like one when you're using it, or at least tends to.
And there's been quite a lot of progress in the industry made, like even in the last few years, to improve the consistency of data lakes and make them act even more like a single system.
So we've basically created some giant data abstraction for our original monolith.
So really the job of the data engineer, right, like we're talking about, is to create a new single source of truth.
So in this example, we're showing a simple batch processing pipeline.
And this is what a simple processing pipeline might look like for a data engineering team that is part of a business intelligence team.
The key idea here is to pull data from many different sources, process them in Spark.
This could also be aggregates, security tables, or even just raw data.
We then write this data to our data lake in the second phase here, and then query them in a single distributed SQL processing engine in the final stage, which is known as a data warehouse.
This process also is known as ETL or ELT, which probably sounds familiar to a lot of people.
But this depends on exactly how you decide to load the data.
ELT, or extract, load, transform, is becoming more common, especially with data science teams who might need access to the raw data.
But obviously, the cost is higher. This means that all your tables effectively end up in a single giant database, which is able to be joined and queried at any scale.
Basically, we are back to where we started, like we were saying, with a single database again.
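The batch pipeline described above can be sketched in a few lines. This is a minimal, dependency-free illustration, not the team's actual pipeline: plain Python functions stand in for the Spark job, and an in-memory sqlite3 database stands in for both the data lake and the warehouse. All table names, the CSV content, and the function names are assumptions.

```python
import csv
import io
import sqlite3

# Hypothetical sources: a billing service database and a third-party CSV export.
billing_db = sqlite3.connect(":memory:")
billing_db.executescript("""
    CREATE TABLE invoices (account_id INTEGER, amount_cents INTEGER);
    INSERT INTO invoices VALUES (1, 12000), (1, 3000), (2, 90000);
""")
SUPPORT_CSV = "account_id,ticket_id\n1,T-1\n2,T-2\n2,T-3\n"


def extract():
    """Pull raw records from each source system."""
    invoices = billing_db.execute(
        "SELECT account_id, amount_cents FROM invoices"
    ).fetchall()
    tickets = list(csv.DictReader(io.StringIO(SUPPORT_CSV)))
    return invoices, tickets


def transform(invoices, tickets):
    """Normalize units and types so the sources can be joined later."""
    billed = [(account_id, cents / 100) for account_id, cents in invoices]
    ticket_rows = [(int(t["account_id"]), t["ticket_id"]) for t in tickets]
    return billed, ticket_rows


def load(warehouse, billed, ticket_rows):
    """Write everything into one queryable store."""
    warehouse.executescript("""
        CREATE TABLE IF NOT EXISTS billing (account_id INTEGER, amount REAL);
        CREATE TABLE IF NOT EXISTS tickets (account_id INTEGER, ticket_id TEXT);
    """)
    warehouse.executemany("INSERT INTO billing VALUES (?, ?)", billed)
    warehouse.executemany("INSERT INTO tickets VALUES (?, ?)", ticket_rows)
    warehouse.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, *transform(*extract()))

# With everything in one place, a cross-source question is a single query again.
report = warehouse.execute("""
    SELECT b.account_id, b.total_billed, COALESCE(t.ticket_count, 0)
    FROM (SELECT account_id, SUM(amount) AS total_billed
          FROM billing GROUP BY account_id) b
    LEFT JOIN (SELECT account_id, COUNT(*) AS ticket_count
               FROM tickets GROUP BY account_id) t
      ON t.account_id = b.account_id
    ORDER BY b.account_id
""").fetchall()
print(report)
```

This version is ETL, since the data is transformed before it is loaded. An ELT variant would load the raw invoices and tickets first and run the transformation as SQL inside the warehouse, which keeps the raw data available at the cost of storing more.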
Obviously, this process does come with many challenges on how to reliably replicate this data at scale, and how to make it easy to consume for users of all different parts of the organization.
So we can now answer our business questions again.
So a key responsibility of data engineers is to reduce data friction and enable analysts, data scientists, and others to easily access data.
We also need to ensure that we have a scalable and affordable storage of data.
So basically, in the previous slide, we were talking about replicating all of our data into a new system.
And as you can imagine, doubling the size of our data can become extremely expensive.
This system also means that data analysts and data scientists don't have to think about the where and how of the data.
It frees up their time to work on more advanced use cases and helps companies to be truly data-driven.
It allows direct, safe, and curated data access to marketing and sales and other stakeholders for whom it would not make sense to provide access to source systems, or whose queries could cause load issues that we don't expect.
Different from our original example, though, on the first slide, right, this means that the CEO is now not the only person who's data-driven.
This entire process we have just talked about also represents an important commitment to data-driven decision-making in an organization.
Like, we as a company have allocated resources and think being data-driven is important enough to do so and expend these enormous costs and resources.
So, Uday, would you mind taking us through how this space has evolved over time?
Yeah, sure. Thanks, Harry. I think the example that you gave is quite apt for explaining the overall concept at a high level.
And I do want to emphasize that nowadays, it's not just about answering the business questions, but it's also like the use of personalizations.
We use so many products online today that are somehow giving you or feeding you personalized information to some extent.
And in order to be able to achieve that, you do require a little bit of data engineering behind the scenes.
So with that said, I'd like to move on to the next slide where we'll talk about the evolution of this overall data engineering as a concept.
When I started my career back in 2008, I read this book from Ralph Kimball.
It was primarily considered as the holy grail for designing any decision systems back then.
And as you see, even though it was published in the 90s, it does have its impact today.
And it is still relevant to a lot of solutions, which are at small scale for different companies.
Because that book not only tells you the fundamental way of structuring the data and storing the data for your analytical use cases, but it also helps in minimizing the redundancy that you see in your current data footprint.
But obviously that was back in the days when the storage was expensive and the data was not growing at that pace.
As we move forward, storage became cheaper and cheaper over time, the data grew.
And the growth of data actually led to the rise of some MPP systems like Vertica and Teradata and whatnot.
But what didn't change even with that is the way we were processing the data.
And I remember back in the days, the only skill that you needed to be a data warehouse developer was that either you should be able to write a PL/SQL stored procedure, or you should be able to use one of those off-the-shelf ETL tools, like Informatica and Talend or whatnot.
But we started seeing some challenges as the data volume kept growing and the platform kept evolving.
And obviously, yeah, the 90s was one of the best times for pop music, and grunge was one of the favorite genres.
But today I'm not going to talk about the music, of course.
So moving forward, what we saw in the early 2000s was the evolution of Hadoop.
And as Hadoop came into existence, it fundamentally changed the way we store the data and how we process the data.
And even though it was introduced in early 2000s, the wider adoption actually happened a decade after.
And as it was being adopted, it also was evolving.
And the overall tech stack required for processing data in a distributed system was also evolving at the same time.
So that led to the rise of Apache Spark. And Apache Spark had a very quick adoption because people had already experienced some of the challenges with MapReduce and Hadoop.
And building on top of MapReduce, Apache Spark actually changed the game for how we process massive amounts of data in just a fraction of seconds.
It was still batch, but it just worked just fine. And it has only improved over time.
Over this course, what every developer realized is that SQL has been the bread and butter of interacting with data.
You cannot forget SQL, even though the other technology stack is evolving.
So while you need to know a programming language in order to be able to write a Spark job, the Spark developers, they have actually made that available as a SQL endpoint as well.
So that even people who are not familiar with writing distributed data code can simply write Spark SQL and actually execute a Spark pipeline.
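The point about offering SQL alongside a programming API can be illustrated by writing the same aggregation twice, once as hand-written code and once as a SQL statement. Spark is deliberately not used here so the sketch stays runnable with the standard library alone; sqlite3 is only a stand-in for a SQL endpoint, and the sample data and function names are assumptions.

```python
import sqlite3
from collections import defaultdict

# Hypothetical event data: (region, hits).
events = [("us", 3), ("eu", 5), ("us", 4), ("apac", 1), ("eu", 2)]


def totals_in_code(records):
    """Aggregate by writing the grouping logic by hand."""
    totals = defaultdict(int)
    for region, hits in records:
        totals[region] += hits
    return sorted(totals.items())


def totals_in_sql(records):
    """Aggregate by declaring the result and letting the engine do the work."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (region TEXT, hits INTEGER)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", records)
    return conn.execute(
        "SELECT region, SUM(hits) FROM events GROUP BY region ORDER BY region"
    ).fetchall()


print(totals_in_code(events))
print(totals_in_sql(events))
```

The SQL version says what result is wanted and leaves the execution plan to the engine, which is why someone who knows SQL but not distributed programming can still drive a pipeline.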
But it didn't end there. As I've seen in the last decade, this space has been continuously evolving over time.
And there are so many tools that are out there for processing data.
Some of them are actually using these core principles from the open source projects.
And they just have some kind of facade on top of it to make it simpler for some of the folks who do not have a big data engineering team in their organization, or they just do not want to invest in building things ground up.
And they just want to be able to use those tools. However, a couple of years ago, I did read this book from Dr. Martin Kleppmann, Designing Data-Intensive Applications.
And I believe this is the new holy grail for designing decision systems.
It covers both batch as well as streaming. And it just gives you a modern way of thinking on how to structure your applications in a distributed system.
It demystifies that. And the biggest difference that you see between this book and what we had back in the days with the traditional data warehouse is that earlier, the entire data was processed on a single server.
And that was the biggest monolith. And that monolith has been broken in this current world, in this distributed systems world.
And this book is actually the go-to resource if you're just getting started and trying to adopt the current tech stack.
Now I've talked about the evolution of the overall space here, but as a data engineer, I still believe you have to have a knowledge of the past and the current, and think about building applications for the future that can address all the use cases.
I was talking about building stuff. I'd like to move to the next slide where we're going to talk about what we do at Cloudflare as a data engineer.
Now a data engineer at Cloudflare plays an important role in not just building the data pipelines, but also maintaining and continuously evolving the data platform that we have.
We work on building the reusable utilities for our platform so that you don't have to do the same thing over and over.
And also make some of your data pipelines pluggable and configurable, and cut all the redundancy involved in the development process.
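One way to picture reusable, configurable pipeline utilities is a single function whose behavior is driven by a small configuration object, so each new pipeline is a new config rather than new code. This is a hypothetical sketch, not Cloudflare's platform: `PipelineConfig`, `run_pipeline`, and the field names are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PipelineConfig:
    """Declarative description of one pipeline (all names are hypothetical)."""
    name: str
    required_fields: tuple = ()
    rename: dict = field(default_factory=dict)


def run_pipeline(config, rows):
    """Shared utility: drop incomplete rows, then rename columns per config."""
    output = []
    for row in rows:
        if any(row.get(name) is None for name in config.required_fields):
            continue  # skip records missing a required field
        output.append({config.rename.get(k, k): v for k, v in row.items()})
    return output


# Two pipelines reuse the same code and differ only in configuration.
billing_cfg = PipelineConfig(
    name="billing", required_fields=("acct",), rename={"acct": "account_id"}
)
tickets_cfg = PipelineConfig(name="tickets", required_fields=("ticket_id",))

billing_rows = [{"acct": 1, "amount": 10.0}, {"acct": None, "amount": 5.0}]
ticket_rows = [{"ticket_id": "T-1", "account_id": 1}]

print(run_pipeline(billing_cfg, billing_rows))
print(run_pipeline(tickets_cfg, ticket_rows))
```

Keeping the logic in one shared function means a fix or improvement reaches every pipeline at once, which is the redundancy being cut.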
So as you're building this scalable process, you have to make sure that the overall pipeline that we have can handle a huge surge in the data as and when it happens.
And it doesn't really fail, especially for the critical mass.
Now, the last part is like, while building all these data pipelines and producing the output in different data sets, a data engineer is also responsible for connecting the dots between these data sets so that they can provide a consolidated logic or a simple way for our end users who are writing ad hoc SQL queries or using the data sets that we have produced in several reports or whatnot in a more consistent way.
And that's where I feel even though a data engineer is actually just working behind the scenes, they are the subject matter experts.
They know a lot more about the data than normally anyone else in that company.
But I know, Harry, we talked about all these things here.
Do you think the job is done at this stage?
Unfortunately or fortunately, I guess, depending on your perspective, the job is never done.
So Uday has touched on some of the history of data platform design and development, as well as the current state of our team's data pipeline.
So as you're talking about what's next, right, where do we go from here?
Data engineers are really the bridge between the data and those consuming the data.
This includes sales, marketing, product, and actually even teams within our own team, such as the analysts and data science teams.
They rely heavily on us to provide them with data accurately and quickly.
So data engineers are also expected to be the subject matter experts on the data.
No one knows the data better than us, and we're expected to have business and product context as well.
So how do we leverage this valuable expertise and knowledge?
I think this is something we think about so much as a team and also as a field in general.
So as I previously mentioned, we're building solutions for many different teams.
So how do we minimize data redundancy and control costs with things like aggregations of tables, as well as consolidate and propagate business logic with curated datasets?
If we achieve these goals, can we go beyond self-serve SQL and even self-serve analytics products, such as Tableau?
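As a small illustration of using aggregation tables to control cost, the sketch below builds a daily rollup once so that consumers query the compact curated table instead of rescanning raw events. It uses sqlite3 purely as a stand-in; the table names, columns, and sample dates are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical raw event table: one row per request.
    CREATE TABLE raw_requests (account_id INTEGER, day TEXT, bytes INTEGER);
    INSERT INTO raw_requests VALUES
        (1, '2021-04-01', 100), (1, '2021-04-01', 250),
        (1, '2021-04-02', 50),  (2, '2021-04-01', 900);

    -- Curated rollup: one row per account per day, computed once.
    CREATE TABLE daily_usage AS
        SELECT account_id, day,
               COUNT(*) AS requests,
               SUM(bytes) AS total_bytes
        FROM raw_requests
        GROUP BY account_id, day;
""")

# Consumers read the small rollup rather than the raw events.
rollup = conn.execute(
    "SELECT account_id, day, requests, total_bytes "
    "FROM daily_usage ORDER BY account_id, day"
).fetchall()
print(rollup)
```

Because the business logic (what counts as a request, how bytes are summed) lives in one rollup definition, every report built on `daily_usage` agrees with every other one.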
So Uday will talk a little bit about where our team is going and how we think about the future of data products internally at Cloudflare.
Sure. So as you said, self-service is actually the key thing for product adoption of your data and data democratization in an organization.
And that actually talks about the data delivery.
So the data can be accessed by some business folks who are SQL savvy, and they can write the traditional SQL query on top of our data warehouse or whatever we are using for the presentation there.
They can use either that, or they can use some canned reports in Tableau that we have today.
But the problem is sometimes this solution is also not scalable, because you cannot expect everyone in the company to know how to write SQL.
You cannot expect all the data that you need to explore to be able to fit in any reporting tool.
And this is not a new problem.
It existed 30 years ago. It exists today. It's just that the volume of the data that we are looking at is changing over time.
And there have been some intermediate solutions.
You can use an OLAP cube or some kind of metadata layer on top of it to minimize this redundancy in writing SQL.
But to be honest, those solutions do not fit the use case that we are trying to go for.
What we want to achieve is a state where an end user doesn't really need to view a report or write a SQL query by themselves, and insight is directly delivered to them for the context that they are looking for.
So if I'm a sales rep and I need to know which customers I need to talk to, I should get that insight from the system.
And that is what our vision is.
And we have started making some smaller efforts towards it by building some in-house applications that put the information in a customer's context.
And it allows people to look at who this customer is, what they're using, how much they're using.
And it actually improves the overall customer engagement with our go-to-market teams and the customers.
And it provides an end-to-end picture of how things are going as a business.
And the same thing can be applied when looking through a product lens: if I want to see how our product is growing, in which regions our customers are growing, and what features people are adopting over time.
And last but not least, if I'm using five products and I can benefit from five others or three others, how do we identify such customers without having a single person trying to figure out that, hey, this could be done, that could have been done? It has to be intelligent.
So that is where I see the next phase for the decision systems: they will eventually become more and more intelligent, and you will have more and more data science models that will be predicting and giving you an internal intelligence for the organization.
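The idea of automatically spotting customers who could benefit from products they do not yet use can be sketched with a very simple overlap heuristic. This is far simpler than the data science models the talk anticipates and is only meant to show the shape of the problem; the customer names, product names, and the function `suggest_products` are hypothetical.

```python
from collections import Counter

# Hypothetical product usage per customer.
usage = {
    "acme": {"cdn", "dns", "waf"},
    "globex": {"cdn", "dns"},
    "initech": {"cdn", "waf"},
    "umbrella": {"dns"},
}


def suggest_products(usage_by_customer, customer):
    """Rank products the customer lacks by how similar their adopters are."""
    mine = usage_by_customer[customer]
    scores = Counter()
    for other, products in usage_by_customer.items():
        if other == customer:
            continue
        overlap = len(mine & products)
        if overlap == 0:
            continue  # no shared products, so no evidence of similarity
        for product in products - mine:
            scores[product] += overlap
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [product for product, _ in ranked]


print(suggest_products(usage, "globex"))
print(suggest_products(usage, "umbrella"))
```

A heuristic like this only works once usage data from every product sits in one place, which ties the vision back to the warehouse discussed earlier.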
And with that, I'd like to close this talk.
And thanks, Harry, for giving an amazing overview of the data lifecycle for a data engineer, and to our viewers for watching us.
And I'd like to highlight that we're still hiring for our team. Please go and check out our page.
And if you are looking for a data engineer role, please either directly reach out to me or submit your application.
Thank you. Thanks for watching us.
Yeah. Thanks everyone.