📊 Building a Data Platform From Scratch
Presented by: Uday Sharma, Haz Darapaneni
Originally aired on August 2, 2023 @ 6:00 PM - 6:30 PM EDT
Join Haz and Uday as they discuss a focused approach for building an enterprise data platform from scratch.
English
Business Intelligence
Transcript (Beta)
Hi, everyone. Thank you for joining us. I am Uday. I'm a principal engineer at Cloudflare in the data and analytics teams.
And joining me is Haz. He's also a principal engineer at Cloudflare.
I joined Cloudflare a couple of years ago and have been working in the data space for the past 12 years now.
Haz and I go way back. We've worked on several initiatives together over the past five years.
Haz, can you talk a little bit about what you've worked on in the past and what you're working on now?
Yeah, sure, Uday.
So nice to see you on Cloudflare TV. So I've been working in the data space for the last 15 years and went through multiple transitions and implementations with different technologies during my journey.
As early members of the team, we got another opportunity to build a data analytics platform at Cloudflare.
And so far, it has been an amazing journey. So Uday, how do you feel about your journey at Cloudflare?
It's been great. I remember I was the third person to join the team, and we were just getting started in building the data platform and a lot of design discussions and then trying to understand which sources we're going to use, how are we going to use it.
And the primary goal was to get our financial KPIs out for going public.
So we have definitely accomplished a lot in a very short span of time.
And today, I'd like to take this opportunity to talk about the learnings from our current and past experience on designing and building a data platform from scratch.
So without further ado, let's dive into it.
I believe for any solution that you're designing, understanding why we're building it and what the use cases are going to be is the most important thing.
Once that is clear, designing and figuring out how we're going to do it becomes seamless.
And you actually get more clarity about the steps you need to include.
So I'd like to move on to the next slide, and I just want to quickly talk about some of the points that we considered while we were building this platform from ground up.
Can you talk a little bit about it? Yeah, sure, Uday. Thanks a lot.
I believe our designs should account for present and future use scenarios and any unforeseen situations, and also avoid redundancy.
These are the basic premises that we need to consider when we're building a new platform or considering a new architecture.
Here at Cloudflare, we consider a few things as a base to build a platform.
And the first thing is portability.
So when we say portability, it's about avoiding vendor lock-in and similar constraints.
And how did we achieve that? We made our platform run everywhere, in the sense that it is platform agnostic: it can run anywhere, on-premises or in any cloud environment.
And the second one is flexibility.
So our platform is also designed to integrate to any type of data source.
It doesn't matter what the source is, and the platform is flexible enough that it is not restricted by the limitations of third-party tools and technologies.
Therefore, we embrace the open-source technology stack, which gives us the flexibility to act fast when we need customizations that aren't readily available.
So we talk about these things a lot at our work. But what are your thoughts on the rest of the items?
Yeah, sure. I just want to stress one part on flexibility, because any decision about a tool or a platform at an organization can either serve the organization or harm it for years to come.
So it is very important to be flexible, and I'm glad we took that approach.
Moving on to the cost and performance part, I think these are things that require constant attention, because your platform is going to grow over time as the data grows.
And performance is something you'll always have to keep an eye on for anything new that you're building.
And it will vary from use case to use case.
But especially with the resources available at your fingertips in the cloud, cost is something that can seriously surprise you if you're not thinking long term.
So it's good to have a vision of what the future state of your platform is going to be and how much you will be spending, because it all adds up eventually.
Last thing, security: it is of the utmost importance at Cloudflare.
No tool or platform can be brought in or built without clearance from the security team.
And we have a thorough review process.
I really learned a lot through that process as we built this platform and introduced security reviews at each step.
Now, these are certain really good guidelines that we started our design discussions with.
But I'd like to jump onto the blueprint of our architecture, how things look today.
And I'm just going to move this here for the recording sake.
Yeah, so on the left, what you're seeing is the data sources that we have.
And it includes both internal as well as the external data sources.
As we get these data sources, we're pumping data into our data lake through our data ingestion framework, which we'll talk about in a second.
Once the data is ingested into the data lake, we do a lot of transformation on top of it and write it into curated data sets.
From there onwards, we have the data access layer, where our users write ad hoc queries to get just the data they need.
Or you have standard dashboards and reports that use these data sets.
We also serve the same data to our machine learning platform as well as to data science models, so they don't have to reinvent the wheel for transforming data sets we already have.
I know this is a very standard architecture for any ELT or ETL stack.
But even though it's basic, I think we made sure all the points we talked about on the previous slide were taken care of here.
Can you talk a little bit about that?
Yeah, sure. Thanks for the high level flow.
I just want to bring up one point before we get in there: security is in our blood.
In a sense, for anything we do, security takes precedence, and it has to go through reviews in multiple areas and multiple cycles to make sure we're doing it the right way.
So for any data platform, there are a lot of things we could consider.
But we kept it simple to make sure we weren't overcomplicating it.
For the platform, there are a few core components.
One is the combination of compute resources and storage. These two take a major stake in the platform.
So how did we make this happen with the things we care about: portability, scalability, and flexibility?
We leverage open-source applications like Spark and Kafka, packaged with Docker so they can run anywhere, and we can easily switch computation resources from one environment to another using a resource orchestrator like Kubernetes. Kubernetes has a lot of built-in features that let us scale to a certain level.
And you can always spin up new things to run on top of it, which is flexible enough for us to utilize.
Even for storage, we can switch from one system to another easily with our ingestion framework, as it is configured to write to any object storage, whether on-premises or in the cloud.
As we're talking about the storage part: we use Parquet as our major data storage format, with snappy compression.
It also comes with primitive schema evolution and data-pruning features via block-level indexes.
These are some of the cool features that Parquet supports, and we end up relying on them a lot.
We can talk a bit more about that later.
And then, when it comes to the scalability part, Kubernetes comes with built-in autoscaling at the cluster level and at the application level.
Here, an application means a data pipeline for us.
It can be configured with resource boundaries at the pod or container level, and we leverage Spark's dynamic allocation features to scale based on data volumes, which means we don't have to touch the pipeline when a huge amount of data comes in on certain days or in certain months.
It takes care of itself, as we enable scaling within low and high boundaries.
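The low and high boundaries mentioned here map onto Spark's standard dynamic allocation properties. As a hedged sketch (the helper function and the default values are hypothetical, not Cloudflare's actual configuration), the conf a pipeline might be submitted with could look like:

```python
def dynamic_allocation_conf(min_executors: int = 2, max_executors: int = 40) -> dict:
    """Build Spark conf entries for volume-based scaling.

    These are standard Spark property names; the min/max values are
    the low and high boundaries -- Spark adds or removes executors
    between them based on pending work.
    """
    return {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # Useful on Kubernetes, where there is no external shuffle service.
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    }

conf = dynamic_allocation_conf(min_executors=2, max_executors=40)
print(conf["spark.dynamicAllocation.maxExecutors"])  # 40
```

In practice these entries would be passed to `spark-submit --conf` or a SparkSession builder; the point is that scaling is a configuration boundary, not pipeline code.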
So I know we recently hit one milestone with our platform within a short span of time.
Would you like to share that one?
Oh, yeah, sure. Over the past decade, we've all heard the terms big data, big data developers, big data engineers.
And I've always questioned how big is big enough to count as big-data scale.
But jokes apart, we have achieved a milestone on our data platform: the object store holding our entire data lake is now about a petabyte.
We already had a petabyte-scale environment on-premises as well.
And the data lake keeps growing.
We want to keep an eye on it, since it adds to cost, and keep asking what use cases we can really serve with this much data.
Moving on: there are different categories of data architecture out there.
The most traditional one is batch, where people wrote ETL pipelines that loaded data overnight.
Then, as technology evolved, data-collection patterns evolved, and things got streamlined, streaming came along.
Then people started using a combination of stream and batch and calling it Lambda.
Now we have the Kappa architecture, which is a stream-first mindset.
So which category would you put this architecture in?
Yeah, that's a good question. I would say our platform is kind of a Lambda, a hybrid, as we already process data both in batch and in streams, based on different use cases, at scale.
The use cases, which we can talk about later, are what drive the decision in each situation.
And, as you mentioned, we made it a shareable data platform, which means it can support any use case.
As I said, we place the emphasis on use cases and scale.
Those are the key things driving our Lambda, hybrid architecture.
Yeah, and I want to emphasize the use cases is actually the key part.
It could easily be a stream-first approach. But at the same time, do we really need that?
Do all our downstream users need data in real time?
That's why we went with the Lambda approach. Because any solution that you have, like it comes at a cost, a cost of your dollars, as well as the cost of operation and maintenance.
So we just have to be mindful about that. I think that comes to my next point.
I know when we were building this, we thought about keeping it more portable.
And we wanted to make sure we could handle any change in the overall implementation decision tomorrow, like moving to a hybrid setup or a different cloud vendor.
How much time do you think it will take for us to make that change happen? Yeah, sure.
I'm glad you brought up that question. As we speak, there are a couple of points to make around that.
With the current design and implementation, I would say we can switch from one cloud to another, or to on-premises, within a matter of a few weeks, without any rework needed on the pipelines or the platform.
So that's kind of an amazing thing, achievable within a matter of a few weeks.
That's incredible.
It can take years, or at least multiple quarters, just to figure out what kind of infrastructure strategy or data platform strategy you're going to have.
And not only did we build this overall platform in a short amount of time, switching to a different approach would take even less.
So that is incredible. And it actually covers another point of flexibility.
It gives us the freedom to customize things the way we want, and it doesn't limit us to any particular tool or tech stack the way third-party tools would.
And it also keeps our jobs interesting for developers.
They are not just writing data pipelines, but they are also contributing towards building the platform.
So yeah, speaking of data pipelines: as you see, we have Airflow.
That's what we're using for orchestration of our data pipelines.
I'd like to talk a little bit about how we run our data pipelines day to day using our metadata-driven and self-healing approach.
On the left, you see the read connectors. On the right, you see the write connectors.
In between is where the magic happens: the process figures out when it's supposed to run, where to read from, where to write, and all that.
Haz will talk a little more about the implementation details.
But on the bottom, as you see, we have the Airflow as a scheduler.
And every scheduler comes with its own metadata. It can record when the jobs are supposed to run, where to read from, and set environment variables and whatnot.
But we didn't want to use just the Airflow standard metadata because we felt like tomorrow, what if we want to switch it to some other scheduler, right?
So we set up our own metadata layer, as you see at the top, which is just a database in Postgres, where we are maintaining the configurations for each job.
And it actually helps us in a lot of ways, which Haz is going to talk about shortly.
But the idea of keeping this metadata separate is that it goes beyond the current implementation.
So Haz, can you talk a little bit about what we have today and how we're using this metadata-driven approach?
Yeah, sure. Thanks for the high-level overview.
So, on the Airflow part: yes, we could have utilized Airflow's operational metadata.
But we decided to build our own custom layer instead.
The reason, again, comes down to flexibility.
Suppose tomorrow there's some other application we need to move to, or we need to move on-premises, or implement a new orchestration scheduler.
In that case, we would need to migrate everything built on Airflow's metadata to the new system.
But with our custom layer, everything is packaged as one solution, so we can move from one scheduler to another as quickly as we can.
And we can always keep enhancing our own layer, adding improvements to make it better.
With that, let's focus on the metadata-driven and self-healing pipelines.
If you look at it, there are two kinds of metadata that we keep in the PostgreSQL database.
One is the pipeline metadata; the other is the operational metadata.
Pipeline metadata is something the developer or engineering team inputs.
Operational metadata is captured as part of the process itself: whenever the process runs next, it can look up the last checkpoint, the number of records that have been processed, or the duplicate counts.
Those are the things that make up the operational metadata.
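To make the two kinds of metadata concrete, here is a minimal sketch. The field names are illustrative assumptions on our part, not the actual Postgres schema from the talk:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PipelineMetadata:
    """Configured up front by the engineering team."""
    source: str                        # e.g. a source database name
    table: str                         # table to ingest
    schedule: str                      # cron expression
    parallel: bool = True              # run tables in parallel at source level
    volume_threshold_rows: int = 10_000_000  # switch-to-parallel threshold

@dataclass
class OperationalMetadata:
    """Captured by the pipeline itself on every run."""
    last_checkpoint: Optional[str] = None  # e.g. a watermark timestamp
    records_processed: int = 0
    duplicate_count: int = 0

# A run reads the configured row, does its work, then records what happened.
cfg = PipelineMetadata(source="billing_db", table="invoices", schedule="0 2 * * *")
ops = OperationalMetadata(last_checkpoint="2023-08-01T00:00:00Z",
                          records_processed=52_000)
```

The split matters: the first table is human-edited configuration, the second is machine-written run state, which is why the framework can consult one without clobbering the other.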
Before we get into the metadata-driven pipelines themselves, I just want to talk a little bit about the reasons we decided to go with them.
At Cloudflare, our data services are very agile in nature.
We need to act fast to bring in new data sources.
And as engineers, we need to avoid redundancy.
The solution to all these problems is metadata-driven pipelines. In a metadata-driven pipeline, metadata is configured for a source connection and for the tables we need to bring in to process the data set.
Once configured, the pipeline can be scheduled at the source level, for all tables, or at the individual-table level.
If it is scheduled for all tables at the source level, there's an option to run them all in parallel or one by one.
We achieved running all the tables at once with our scaling features along with Spark sessions, because Spark doesn't provide that functionality at its basic level.
Even at the table level, data can be processed in parallel or single-threaded, depending on whether the data set is too large or beyond the thresholds we configure.
If the pipeline is scheduled at the source level to process all the tables and one table fails in between, the process does not stop there.
It goes on to process the other tables, ignoring the intermediate failure.
Once it has processed all the tables, it comes back and sees what has failed.
And it alerts the data engineers who are on call.
In all these aspects, the pipeline can make decisions on its own to run data in parallel or single-threaded, based on the data volumes or on the configuration that's part of the metadata.
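The run-everything-and-report-failures-at-the-end behavior can be sketched like this. It is a toy illustration with names of our own choosing, not Cloudflare's framework:

```python
from concurrent.futures import ThreadPoolExecutor

def ingest(table: str) -> None:
    """Stand-in for the real per-table ingestion; fails for one table."""
    if table == "orders":
        raise RuntimeError("schema mismatch")

def run_source(tables: list, parallel: bool = True) -> list:
    """Process every table; collect failures instead of stopping.

    Returns the failed tables so the framework can alert the
    on-call engineer once, at the end of the run.
    """
    failed = []

    def attempt(table: str) -> None:
        try:
            ingest(table)
        except Exception:
            failed.append(table)  # note the failure, keep going

    if parallel:
        with ThreadPoolExecutor(max_workers=4) as pool:
            list(pool.map(attempt, tables))
    else:
        for t in tables:
            attempt(t)
    return failed

failed = run_source(["users", "orders", "invoices"])
print(failed)  # ['orders']
```

The key design choice mirrored here is that an intermediate failure is data, not a crash: it is recorded and surfaced after the whole source has been attempted.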
Overall, our pipelines are self-healing, meaning that in the case of failure they stop at the failure point, since we enable checkpoints at each and every step.
So they don't have to go back to the starting point.
They always resume from the failure point, which we achieve with the checkpoints.
We keep checkpoints at each and every crucial step so a run never has to start from the beginning again.
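The checkpoint-and-resume idea can be illustrated with a toy example. The step names and the in-memory dict are hypothetical; per the talk, the real framework checkpoints into its operational metadata in Postgres:

```python
from typing import Optional

STEPS = ["extract", "validate", "transform", "load"]

def run_pipeline(checkpoint: dict, fail_at: Optional[str] = None) -> dict:
    """Run the steps after the last completed one, recording each success.

    `checkpoint` plays the role of the operational-metadata row: on
    failure we stop where we are, and the next run resumes from the
    failure point instead of starting over.
    """
    done = checkpoint.get("last_step")
    start = STEPS.index(done) + 1 if done in STEPS else 0
    for step in STEPS[start:]:
        if step == fail_at:
            raise RuntimeError(f"step {step} failed")
        checkpoint["last_step"] = step  # persisted after each crucial step
    return checkpoint

ckpt = {}
try:
    run_pipeline(ckpt, fail_at="transform")  # first run dies mid-way
except RuntimeError:
    pass
print(ckpt["last_step"])  # validate -- extract and validate survived

run_pipeline(ckpt)  # retry resumes at "transform", not "extract"
print(ckpt["last_step"])  # load
```

Because the checkpoint is written after every step, the retry pays only for the work that actually failed.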
With all this embedded into the pipelines, we named our processes intelligent data pipelines, meaning they make their own decisions and are self-healing on their own.
They can operate in a fast and optimized way.
OK, so that's the process. But what does it mean for the engineer?
An engineer just needs to configure the metadata with the appropriate information, and the process takes care of the rest on its own.
And this can be done with a few hours of work.
An engineer doesn't need to spend days of work to get data into the data lake.
By doing so, engineers can focus more on new technologies or on more challenging work in other areas.
I would like to cover some of the other cool features the pipelines support, starting with the Parquet format I mentioned previously.
It comes with primitive schema evolution. There are two kinds of data types in Parquet overall: primitive data types and complex data types.
Parquet supports schema evolution for the primitive types, but when it comes to evolving the complex types, it doesn't.
The reason we highlight this is that all our data sources are very rich in complex data types.
So any change in an array of structs would result in a data pipeline failure, and it would also alert one of the on-call engineers, which is not good. It's the kind of redundant failure we were hitting very frequently.
Since that capability is not available in Parquet, we had to enhance the primitive schema evolution to support complex data types.
We developed that part ourselves, on top of what was available, to support all the complex data types.
Now, if there is any change on the data source side, it automatically reflects on the Parquet side.
This is how we avoid redundancy: no such failures disturbing an engineer's day-to-day work.
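To illustrate the kind of complex-type evolution being described, here is a toy merge of nested schemas represented as plain dicts. This is our simplification for the sake of the example; the actual enhancement operates on Parquet/Spark schemas:

```python
def merge_schemas(target: dict, source: dict) -> dict:
    """Fold fields that appeared on the source side into the target.

    Recurses into nested structs, so a new field inside an
    array-of-structs element is picked up too, instead of
    failing the pipeline.
    """
    merged = dict(target)
    for name, typ in source.items():
        if name not in merged:
            merged[name] = typ                               # brand-new column
        elif isinstance(typ, dict) and isinstance(merged[name], dict):
            merged[name] = merge_schemas(merged[name], typ)  # nested struct
    return merged

# "tags" stands in for an array-of-structs element type that grew a field.
current = {"id": "long", "tags": {"key": "string"}}
incoming = {"id": "long", "tags": {"key": "string", "weight": "double"}}
print(merge_schemas(current, incoming))
# {'id': 'long', 'tags': {'key': 'string', 'weight': 'double'}}
```

The recursion is the whole trick: primitive-only evolution stops at the top level, while complex-type evolution has to walk into structs and array element types.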
A pipeline also comes with inbuilt access controls. When it creates a new table or source, it automatically assigns the proper ACLs for who can see that data set, which means we enable access controls as part of the pipeline itself.
Retry features are built in as well, for intermediate failures like timeouts or source-system unavailability.
Those are retried up to a certain extent, and only then does the pipeline fail.
So there are good features provided that cover all the data scenarios, and some data validations and audits are part of the pipeline as well, built into the framework.
It can also do data sampling, if an engineer wants to inspect a data set during the development phase.
And it also supports point-in-time analytical tables.
In the data warehousing space, most of us are aware of type-2, or history, tables.
In the past, we used to create those pipelines for each and every table, but these days we have built it into the pipeline framework itself.
If you configure a table as a type-2, point-in-time analytical table, the framework will automatically maintain the point-in-time table based on the changes coming from the source.
So there are a lot of features we embed into the pipelines, and they take care of things on their own without any manual intervention.
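A type-2 table keeps history by closing out the old row and appending a new one whenever a tracked attribute changes. Here is a minimal sketch of that update step; the column names (`plan`, `start`, `end`) are illustrative, not the framework's actual schema:

```python
from datetime import date

def apply_type2(history: list, change: dict, as_of: date) -> list:
    """Classic SCD type 2: close the current open row for this key
    if its tracked value changed, then append a new open-ended row."""
    out = []
    for row in history:
        if row["id"] == change["id"] and row["end"] is None:
            if row["plan"] == change["plan"]:
                return history           # no change, nothing to do
            row = {**row, "end": as_of}  # close the old version
        out.append(row)
    out.append({"id": change["id"], "plan": change["plan"],
                "start": as_of, "end": None})
    return out

history = [{"id": 1, "plan": "free", "start": date(2023, 1, 1), "end": None}]
history = apply_type2(history, {"id": 1, "plan": "pro"}, date(2023, 8, 1))
print([(r["plan"], r["end"]) for r in history])
# [('free', datetime.date(2023, 8, 1)), ('pro', None)]
```

Making this a framework feature, driven by one metadata flag per table, is exactly what removes the per-table history pipelines described above.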
The same goes for our deployments, with CI/CD features that we're making good progress on.
I know I didn't get to mention everything today. So what are your thoughts on these pipelines so far?
Yeah, I'd just say that all of this is the reason we were able to move so fast on getting data in. Spinning up a Spark cluster or a new ingestion doesn't take a week's time.
It may take just a couple of hours. And as the engineering team grew, all the developers started using this framework to ingest data into the data lake.
I believe this is the primary reason why we were able to move so fast.
So I'm glad we invested time in building this. And talking about developers, I think we have built some custom connectors for things that weren't available out of the box.
And that highlights the build-versus-buy mentality, versus being stuck with the limitations of a third-party tool.
We found a way around it and built something so we could just move forward.
At the bottom, we're showing the streaming pipeline. I know we have a really interesting streaming architecture, and we can't cover all of it in this short span; we have about four more minutes to go.
So Haz, can you give a high-level overview of some of the fun things we have in our streaming environment?
Yeah, sure. Definitely. I don't think we can cover most of it, but let's talk at a high level.
Unlike the batch side, the streaming pipelines don't run as scheduled pipelines.
We use Kubernetes deployment and controller features along with an internal lookup module, which is written in Go.
That module comes with all the features we talked about for the batch pipelines; it takes care of things on its own and captures the operational metadata as part of the process.
It lets engineers step away: it recovers if something goes down, and it adjusts resources on its own if there is any lag on the consumer side.
One of the interesting parts of the streaming stack is that we use a custom data format, Cap'n Proto.
If you look at outside market standards, most companies and organizations use JSON or Avro.
Some areas use other formats, but we use Cap'n Proto, which is very lightweight; the advantage is that it needs very little time for serialization and deserialization.
The Cap'n Proto website actually shows how it compares to the other traditional formats used out there.
I don't think we can cover much more of it.
Most of this streaming pipeline is written in Go, which is also lightweight, not a JVM-based language.
That's another cool thing we have on our side.
If you want to know more about the streaming side and its other components, we have well-written blog posts available on the Cloudflare blog.
You can definitely go through those.
And if you have any questions, we'll definitely answer them in the blog comments.
So I know we built all of this within a short span of time, and we're processing a lot of data as we speak.
Now that we've made our platform stable and strong, where do you see us in the future?
Yeah, that's a great question. And it's always exciting, because once you have built a platform and everything is just configuration-based, it takes away the fun from the developers: okay, what's next, right?
One thing I really see us doing is that, since we are capturing the pipeline-level metadata, we can leverage the same and extend it to capture the overall data lineage, so that at any stage, for any transformation we've run, we know where the data is coming from and what the logic is.
So that's the next stage, I think. And I also feel we can extend our current access layer from ad hoc use and reporting on a steady data set to more programmatic use cases.
We already have a few that we are currently working on, but eventually I see us serving more and more, with those feedback loops going as well.
I think I really enjoyed talking with you Haz on all of this.
I know this was a ton of work and I'm really proud of the team that has worked on this.
But again, like, thanks for joining me. Thanks for going through. And I'd like to thank our viewers as well for watching us.
Thank you. Thanks for having me.
And thanks to the viewers for listening to us. I know it's a lot. All right, take it easy.