📊 Aesop's DevOps
Presented by: Keith Adler, Lucas Stephens
Originally aired on July 7, 2023 @ 12:30 PM - 1:00 PM EDT
Panel discussion between two engineers about DevOps, including data and machine learning platforms.
English
DevOps
Transcript (Beta)
All right, welcome to Aesop's DevOps. I'm Lucas, and I'm here with Keith Adler. We work on the BI team at Cloudflare.
So Keith, you want to give a little introduction about yourself?
Yeah, so I work on the data science team within the business intelligence team at Cloudflare.
I'm a machine learning engineer, so I work a lot with data scientists and try to enable them on their platform and just get the best possible solution we can that works for everyone.
Nice. Yeah. And I work tangent to Keith.
So I am providing all the data that Keith's team uses. So I'm a big data engineer at Cloudflare.
So we wanted to come together and have a little discussion today about DevOps.
And I think it's a really interesting conversation because DevOps is something that's pretty new in the software industry.
It's a pretty new way of doing things and it's got its own challenges implementing on a data team.
So before we get started, I think it's good for anyone watching for us to go over maybe what is DevOps?
Because there are a lot of different definitions out there.
For me, the easiest way to understand it is to understand the history behind it.
So if you think back 30 or 40 years ago when software engineering was still first becoming a thing, everything was really, really simple.
Developers, whenever they needed to fix code, they would do it live in production.
They would literally telnet into the servers. There was no staging or development environment, there wasn't really a concept of testing, and things were very primitive.
And over time, as things got more and more complicated, you started to have sort of two disciplines emerge.
You had people who were code experts who knew how to write really, really brilliant processes.
And you had people that were systems experts that knew how to optimize the systems that actually ran the code.
And so over time, what you saw is the split between code, which is dev, and systems, which is ops.
And as this split became bigger and bigger, what you had is a huge skill gap.
And it brought us to sort of our modern way of developing software, which is the people that write code oftentimes aren't the ones tasked with deploying it or supporting it in production.
And so what DevOps is about is addressing that problem, right?
Because there are a few sort of obvious, maybe non-obvious problems that come with that approach.
The biggest one is that when you have code experts not maintaining their code in production or not deploying it in production, you're effectively allowing them to abstract away the systems.
And sometimes that's a good thing, but sometimes that's a bad thing, right?
Conversely, when you have systems experts maintaining code in production, they weren't the authors of code.
And so it's really hard for them to debug things sometimes.
And so there's a real silo gap there. And that's what DevOps tries to fix, and hence the name DevOps.
DevOps is dev plus ops. So what we're going to talk about today, I think, is a really interesting problem.
Because I think when you go into an organization, you have your engineering organization, you have product management.
And I think traditional software engineers have their own way of practicing DevOps.
But when you're on a data team, I think a lot of data teams are expected to be self-sufficient.
And so I think it presents its own challenges about how does a data team practice DevOps?
So maybe, Keith, you could walk us through what does the ML platform look like at Cloudflare, and how are you guys practicing DevOps?
Yeah, for sure. I mean, machine learning platforms are kind of their own story, if you will.
It really starts with how do you enable data scientists?
And so if you really look at the broad picture of what does a tech stack look like, it's a different structure than a typical software project.
So it's pretty common that you use the term model pipeline.
How do we get end to end from data to production?
And there are many stages along the way. You have your discovery phase, you have your feature engineering, you have to ingest all this data, you have to be able to split and experiment and train on the data.
And then once you have a model, you finally have that package, you've iterated on it, you have to build something out that you can deploy in production.
And it's often a tall task, because data scientists are usually more focused on the model, and getting to production is not a task you want to repeat manually every time. As you said, pushing to production and editing in production is not really ideal.
So it's better to build a system around that where you can automate a lot of those processes.
And so we really started from the deployment side.
We started with, okay, let's wrap everything in Docker containers, let's push out to Argo workflows and Argo CD so that we can have batch and stream microservices that are running.
Those run on Kubernetes so that we can scale and have efficient deployments, and we can use Helm and other tools to deploy them well.
But you have to monitor those services, you have to have logging, you want to trace your metrics, especially for streaming.
And all of that is important to having a real productionalized system.
And for each different model, it's its own challenge.
So some models are smaller: we just need this to be up in production in the next hour, something quick.
And then you have other ones where it's a large effort. This is a mission-critical model, we have to make sure that it's robust and can handle lots of transactions or lots of load, and it has to scale or use GPUs or some complex deployment state.
And so it's pretty common in production that you have to really harden your system to make sure that things run smooth.
But once you've done that for one model well, then it's easy to replicate that to the other models if you've automated it well using DevOps.
And so one example of this is visualization.
Instead of building a different set of visualization tools for every different model, it's better to at least wrap some of them into a shared pattern across each of those different layers.
And so that's one way that we can take some of the pressure off of data scientists to rebuild everything from scratch every time and have it be more automated.
But things like security are vital, because having security defined in code and iterated on as part of the platform means that you're not going to have different implementations of security or controls or user access management along the way.
And those are really important.
We don't want to pigeonhole our data scientists to be stuck with one tool or another, but I think it's important to have that opportunity to deliver on these different products, different models.
And all of that is pretty important. But I think data scientists and MLEs complement each other well.
And when I say MLE, machine learning engineer, because an MLE allows the data scientists to focus on the models, and that way you can work together to get a mesh of different models in production.
Yeah. I think you're hitting on a great point there. So from your perspective, and I'm sure you guys have discussed this as a team, let's pretend you're in Utopia.
What's the ideal workflow look like for a data scientist at Cloudflare?
Yeah. I mean, we'll start with the data is available. My job. Yeah. I mean, if you don't have data, you can't really start.
So I think you have a project that's well-defined and you have your data available, and it's something that's already ingested or able to be ingested soon.
It's something that you could quickly start on.
You have tools so that you can query it, like BigQuery SQL, or, for instance, PySpark SQL is one way that works if it's a very large data set.
You have a way to ingest it and look at it, look at that data frame, make sure everything is working properly.
And if you don't have that, then you can't get started. So it really is end to end because from there, feature engineering and experimentation are really important.
So tools like MLflow are really great because you can really categorize your different models, and you can use tools like DVC (Data Version Control), which allow you to get a better grasp of what data is going into your model, what data is coming out of your model, and what version of the model you're running on.
And then when you're actually deploying this, you know what you built it on.
And so you can be assured that what you're building is actually concrete. And then from there, that's where I take over and the deployment goes its way.
So, yeah. Nice.
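As a rough illustration of the versioning idea described above (knowing which data and which model version went into a deployment), here is a minimal Python sketch. It deliberately does not use the MLflow or DVC APIs; the `RunRecord` class, the `fingerprint` and `record_run` helpers, and the sample rows are all hypothetical stand-ins for what those tools would track.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One training run: which data and parameters produced which model."""
    model_name: str
    model_version: int
    data_sha256: str
    params: dict


def fingerprint(rows: list[dict]) -> str:
    """Hash a dataset deterministically so identical rows give the same ID."""
    canonical = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def record_run(registry: list[RunRecord], model_name: str,
               rows: list[dict], params: dict) -> RunRecord:
    """Append a new run, bumping the version number for that model name."""
    previous = [r.model_version for r in registry if r.model_name == model_name]
    record = RunRecord(model_name, max(previous, default=0) + 1,
                       fingerprint(rows), params)
    registry.append(record)
    return record


# Hypothetical example data: two runs of the same model on different data.
registry: list[RunRecord] = []
train_rows = [{"plan": "free", "requests": 120}, {"plan": "pro", "requests": 900}]
first = record_run(registry, "churn", train_rows, {"max_depth": 4})
second = record_run(registry, "churn",
                    train_rows + [{"plan": "biz", "requests": 50}],
                    {"max_depth": 4})
print(first.model_version, second.model_version,
      first.data_sha256 != second.data_sha256)
```

With a record like this attached to each deployed model, "you know what you built it on" becomes a lookup rather than a guess.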
Yeah. So it sounds like a lot of the basics of the platform are being done. What are some of the big things on you guys' roadmap as you start to roll out more and more functionality?
Yeah, definitely. I mean, you got to start from somewhere. So often with the business, time to market is the most important thing.
So starting on the deployment side of running Argo and Argo CD was important for us to deliver things quickly and be able to have different things that could run and scale independently of each other.
But I think some key things I'll bring up: monitoring and logging is definitely one.
Another is infrastructure as code.
A lot of times this infrastructure stack has to be something that's repeatable and maintainable.
And if you define all of your service accounts or all of your different tools around your infrastructure manually, then it's just existing on one webpage or in someone's head, instead of being in code where you can iterate on it and have a history of the changes and you can see who has access to what and how the systems work.
And clusters will be defined differently, but that's the benefit of Kubernetes is you can define it all in YAML and you can have that in your code base.
And then when it's in your code base, then you have testing you can do, you can have all your good software practices around pull requests and code review, and you can really work as a team to make sure you have the right solution and then progress through different environments.
So again, a great example for infrastructure as code is you have your development environment, but then you also want to get to production.
You have to have some steps in between. So having a repeatable infrastructure means that your infrastructure should be the same on each of those environments with just different credentials that are injected into each.
And we do that as Kubernetes secrets. So it's kind of standard practice, but then you have to secure those.
And that's also something you don't want to have to repeat manually every time because you might do it wrong one time and now you've exposed secrets or sensitive data.
It's important to do it right.
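The "same infrastructure, different injected credentials" pattern described above can be sketched in a few lines of Python. This is an assumption-laden illustration, not the team's setup: the talk says the real secrets are Kubernetes secrets, whereas here a plain mapping stands in for the injected environment, and `BASE_SETTINGS`, `EnvConfig`, `load_config`, and the `*_DB_PASSWORD` names are hypothetical.

```python
import os
from dataclasses import dataclass

# Shared, non-secret settings: identical in every environment (hypothetical values).
BASE_SETTINGS = {"replicas": 2, "image": "model-service:1.0"}


@dataclass
class EnvConfig:
    name: str
    settings: dict
    db_password: str

    def __repr__(self) -> str:
        # Never print the secret itself, so it cannot leak into logs.
        return (f"EnvConfig(name={self.name!r}, settings={self.settings!r}, "
                f"db_password='***')")


def load_config(env_name: str, environ=os.environ) -> EnvConfig:
    """Build the config for one environment; only the injected secret differs."""
    key = f"{env_name.upper()}_DB_PASSWORD"
    if key not in environ:
        # Fail loudly instead of silently starting with a missing credential.
        raise KeyError(f"missing secret {key}; refusing to start")
    return EnvConfig(env_name, dict(BASE_SETTINGS), environ[key])


# A dict stands in for the injected environment in this sketch.
fake_env = {"DEV_DB_PASSWORD": "dev-pw", "PROD_DB_PASSWORD": "prod-pw"}
dev = load_config("dev", fake_env)
prod = load_config("prod", fake_env)
print(dev.settings == prod.settings, dev)
```

Because the secret lookup is the only environment-specific step, there is no per-environment copy of the settings to drift out of sync or to get wrong by hand.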
Yeah, absolutely. You hit on a lot of different technologies and tools that you guys are looking at evaluating and using.
What does that process look like inside your team?
Do a lot of the data scientists end up inevitably learning about these different components of the platform, Kubernetes, things like that?
Yes.
The better the team, the more that there's a mutual understanding. Instead of having a model and you just toss it over a fence and now it's an MLE's problem, it's something that we work on jointly to make sure that it works properly.
So I work with people on my team. We've had models where our first time we deployed on a new environment, we found new bugs and new issues.
And that's normal, but you want to catch that in the early environments so that you're solid when you get to production.
And iterating on those with the data scientists, kind of gives you a better idea of how their models work and what they're focused on and what their features are.
But then you have the same challenge of how do you make sure that this is good going to production on their end.
And so building containers is something that they've become familiar with.
But having that be part of CI/CD as much as possible is the goal.
Yeah. So it sounds like data scientists, they can write code, huh?
Yeah, definitely. Definitely. That's a misnomer.
Data scientists are talented. Yeah. Very smart people. Yeah. I love listening to that because I think you hit on a lot of the core philosophies of DevOps, which is even though you guys have different specialties, you being more of a software engineer and our data scientists being more mathematically trained, it really shows that as you work together and as you go through the DevOps process, you end up learning from each other and there's less of a gap there.
There's much less of a gap.
And I think you mentioned our data scientists are now very familiar with Docker and building containers.
And I think that's great because at the end of the day, they're the ones authoring the models.
And so if they're more enabled to do their jobs, that's better for everyone.
I want to take some time now and talk a little bit about how things are done on the data engineering side of things.
So I think our platform is a little bit more simple than yours because our primary product is data pipelines, right?
We take data source, apply some transforms to it and move it to a destination.
And so for us, our primary destination is BigQuery and we have a lot of different sources.
We have some of our own self-hosted sources at Cloudflare, our first party data.
We have some integrations with Salesforce, Zendesk, things like that, different external APIs.
And so it's a lot to manage.
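The source, transform, destination shape described above can be sketched as follows. The team's real pipelines are Spark Scala jobs writing to BigQuery, so this plain-Python version is only an illustration of the structure; `run_pipeline`, the two transforms, and the sample ticket records are hypothetical, and a list stands in for the destination table.

```python
from typing import Callable, Iterable, Iterator

Record = dict
Transform = Callable[[Iterator[Record]], Iterator[Record]]


def drop_missing_ids(records: Iterator[Record]) -> Iterator[Record]:
    """Skip records that lack an id."""
    return (r for r in records if r.get("id") is not None)


def normalize_status(records: Iterator[Record]) -> Iterator[Record]:
    """Lower-case the status field so downstream queries are consistent."""
    return ({**r, "status": str(r.get("status", "")).lower()} for r in records)


def run_pipeline(source: Iterable[Record], transforms: list[Transform],
                 sink: list[Record]) -> int:
    """Stream records from the source through each transform into the sink."""
    stream = iter(source)
    for transform in transforms:
        stream = transform(stream)
    written = 0
    for record in stream:
        sink.append(record)
        written += 1
    return written


# Hypothetical source rows, e.g. from an external ticketing API.
source_rows = [
    {"id": 1, "status": "OPEN"},
    {"id": None, "status": "closed"},
    {"id": 2, "status": "Solved"},
]
destination: list[Record] = []
count = run_pipeline(source_rows, [drop_missing_ids, normalize_status], destination)
print(count, destination)
```

Keeping each transform as a small function over a stream is one way to let someone who did not write a given pipeline still read and debug it.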
So for us, I think one of the biggest benefits for DevOps is eliminating those bottlenecks.
As a data engineering team, we have a lot of different things to cover, a lot of different subject matter, if you will.
We've got different sort of pods on the team that work on different areas.
And when we practice DevOps, that enables us to really spread knowledge across the team, right?
So even though maybe I didn't work on the Zendesk data pipeline, I can still know it and I can still debug it because of the way we practice DevOps.
I think one of the big challenges we've faced, and I think this goes for our entire team, is with Cloudflare being a security company, most everything we do on the engineering side of things is self-hosted, right?
Because we believe in security and we don't really want to outsource our things to the cloud.
But for a data team, most modern data teams are done on the cloud, as is ours.
And so a lot of the tool chains and a lot of the dev tools that exist at Cloudflare today, they really didn't translate one-to-one with our use cases.
And so I think a big challenge for our team, one that we've tackled pretty well, is building our own DevOps tool chain.
So if you look at the data engineering platform, most of what we do is Spark Scala.
So we had to build our own tools to assemble jars.
We had to build our own tools to provision clusters. We use Airflow for orchestration.
So we built another CI builder that actually deploys DAGs automatically and updates them with the new version of our jars.
So there's a lot of work that goes into establishing these DevOps practices.
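The talk does not say how the CI builder updates DAGs with a new jar version, so the following is only one possible approach, sketched in Python: rewrite the versioned jar reference in a DAG's source text. The `bump_jar_version` helper, the `name-X.Y.Z.jar` naming scheme, and the example bucket paths are all assumptions.

```python
import re

# Assumed naming scheme: <name>-<major>.<minor>.<patch>.jar
JAR_PATTERN = re.compile(r"(?P<name>[\w-]+)-(?P<version>\d+\.\d+\.\d+)\.jar")


def bump_jar_version(dag_source: str, jar_name: str, new_version: str) -> str:
    """Point every reference to `jar_name-<old>.jar` at `new_version`."""
    def replace(match: re.Match) -> str:
        if match.group("name") != jar_name:
            return match.group(0)  # leave other jars untouched
        return f"{jar_name}-{new_version}.jar"
    return JAR_PATTERN.sub(replace, dag_source)


# Hypothetical DAG snippet referencing two jars.
dag_source = (
    'main_jar = "gs://example-bucket/jars/pipelines-1.4.2.jar"\n'
    'util_jar = "gs://example-bucket/jars/utils-0.9.0.jar"\n'
)
updated = bump_jar_version(dag_source, "pipelines", "1.5.0")
print(updated)
```

Run as a CI step after the jar is assembled, something like this removes the manual edit that would otherwise be needed on every release.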
But at the end of the day, those big benefits, I think they're critical because they're the only real way you can scale a team.
Which by the way, if you guys are interested in working at Cloudflare, you should check out our careers page because we are hiring a lot.
We've hired so many people since I joined the company. So as our team starts to grow, it becomes more scalable because a lot of these things, as you said, they're version-controlled, they're automated, and it lets developers focus on the end code at the end of the day.
Yeah, definitely. I mean, developer experience is really important when it comes to all teams, but with big data teams, it's kind of different.
So what would you say is the ideal workflow or ideal developer experience for big data?
Yeah, definitely. I think for me, a lot of time in data engineering is spent understanding the data.
And once you understand the data, it's tough for us because when you think about maybe traditional web development, what a lot of engineering jobs are doing nowadays, there's a focus on, okay, maybe I'm working on the back end or maybe I'm iteratively.
That doesn't really translate well when you're on our team and we're processing terabytes of data every single day, right?
That's not something that can fit onto my MacBook.
So for us, the optimal workflow is having that ability to, one, investigate all of our data and understand the source data.
And then two, as we're building the transformation logic, being able to iterate quickly and understand how different algorithms or different functions might work.
And that looks a lot different than traditional web development because all of our stuff is done in the cloud on big, big compute machines.
I think one thing our team has really enjoyed is making use of Google's Dataproc.
So it's their managed Spark. So what that enables us to do is not just run Spark jobs on a Spark cluster, but they actually provide several different interfaces for interfacing with the clusters.
So you've got a YARN UI that lets you see all the different jobs running.
You've got a Zeppelin notebook that lets you actually query data from Spark and run different processes.
So I think that's probably the big focus when it comes to a data engineering workflow is how do you make it very, very easy and very, very safe for a data engineer to investigate the data and transform the data quickly.
You can think about if you're a data analyst, a lot of your work involves querying and finding the right query and figuring out the right data to bring.
And so a lot of your workflow is running queries against the database and there's some latency there.
And that same problem kind of translates over to us. So building utilities around that, that's really where a lot of our investment into DevOps goes into play.
That's great. And I know one area that we both deal with, but maybe more so for you, is data scale and data security.
Like what access controls and how do you deal with sensitive data?
That's a different front. How do you handle that?
Yeah, absolutely. So we took an interesting approach with how we do PII and how we do authorization and role-based access control.
On the technical level for all of our processes running, obviously, we make use of service accounts, we make use of proper IAM permissions.
So from a technical perspective, all the processes running, that's how we secure that.
From a user perspective, I think that's where it gets more interesting because what you find with a lot of these different data warehousing technologies or these different cloud providers is they give you your own way of locking down access control.
And I think when we first got onto GCP, a lot of that functionality was really primitive.
I think at the time, they only allowed something called an authorized view, which is exactly what it sounds like.
You can create a view off of a source data set and only grant certain people or certain tables access to that.
And we really didn't find that that met our requirements because we want to get granular.
We want to say that people only have access to certain columns or to certain tables or even sometimes certain rows.
And so we actually built our own framework to handle our role-based access control.
And so we have essentially a giant ACL table that we are able to update and grant users access to specific data sets.
So our access control, it goes beyond a cloud provider.
And in that way, it kind of makes us more scalable because if we were to move cloud providers or even come on-prem someday, that framework for providing access and protecting PII is totally transportable.
And so that's something that we invested a lot of time in.
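To make the ACL-table idea above concrete, here is a minimal Python sketch of column-level and row-level filtering driven by a table of grants. The talk gives no implementation details, so the `AclEntry` layout, the role names, the `read_table` helper, and the sample customer rows are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AclEntry:
    role: str
    table: str
    columns: set[str]  # columns this role may read
    row_filter: Optional[Callable[[dict], bool]] = None  # None means all rows


# Hypothetical grants: analysts never see email; EMEA support sees only EMEA rows.
ACL = [
    AclEntry("analyst", "customers", {"id", "plan", "region"}),
    AclEntry("emea_support", "customers", {"id", "plan", "email", "region"},
             lambda row: row["region"] == "EMEA"),
]


def read_table(role: str, table: str, rows: list[dict]) -> list[dict]:
    """Return only the rows and columns the role has been granted."""
    entry = next((e for e in ACL if e.role == role and e.table == table), None)
    if entry is None:
        raise PermissionError(f"{role} has no access to {table}")
    visible = [r for r in rows if entry.row_filter is None or entry.row_filter(r)]
    return [{k: v for k, v in r.items() if k in entry.columns} for r in visible]


customers = [
    {"id": 1, "plan": "pro", "email": "a@example.com", "region": "EMEA"},
    {"id": 2, "plan": "free", "email": "b@example.com", "region": "APAC"},
]
analyst_view = read_table("analyst", "customers", customers)
support_view = read_table("emea_support", "customers", customers)
print(analyst_view)
print(support_view)
```

Because the grants live in data rather than in a provider-specific feature, the same table could in principle be enforced on a different warehouse, which is the portability point made above.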
It's definitely something, when you look at it from the outside, it's something that you could easily justify and say, it's not worth spending that much time.
Just use the provided solution for you.
But here we are, our team is over two years old now and it's really starting to pay off is what I see.
And I think that really summarizes a lot of the spirit of DevOps is being able to make those kinds of decisions.
Because inherently, when you look at a lot of the work behind operations, behind doing infrastructure as code or having automated testing or deployment or doing logging, monitoring, alerting, what you'll find is that none of that work sounds as good as something that is enabling the business.
And none of that sounds as good as writing the code. And so finding that right balance where you're able to prioritize things that you know are important, but are a little bit harder to justify to the business, that's a crucial part of adopting DevOps as a team.
Definitely. I mean, that was something that I think we deal with that too.
How do you prioritize what features you start with first?
We started on the deployment end because we want to make sure that we're able to get time to market all these different models.
But you have to start with the data.
So you also want to have at least something up so that the data and experimentation is easy to go with.
So I definitely can relate to that feeling of, well, you want to deploy it, you also want to be able to sleep at night.
So there's a balance of how do you ensure that it's ready to go?
Yeah. And it's the nature of software.
Nothing is done perfectly from the start because if it was, we wouldn't have jobs.
And I think the famous saying, code decays. As soon as you push something out to production, that has a lifespan.
And I think that applies to everything. It applies to your infrastructure, it applies to your monitoring.
And so where you start, I think is super, super important.
And what's interesting to me in my experience, both here at Cloudflare and at other places is there's sort of like this inverse trade-off between the two.
If you think about code, it's really, really easy to write code from scratch and get something working out with minimal features.
You've got things like Create React App that make it so you can spin up a front-end app in like 10 minutes, and it's easy.
Writing code from scratch is really, really easy.
Getting a lot of these infrastructure decisions and these DevOps tool chains right is a higher cost at the start.
And I think that's why a lot of teams nowadays, they shy away from that work.
I think one, there's a skill gap there, but two, there's a lot of time and effort to invest at the start.
And when you're trying to build a new product, you want to be able to move fast.
But when you look out over the horizon and you look over the long-term, what you find is that while a lot of the operations and infrastructure things have a high cost, that cost lowers over time, right?
Because once you build the CI/CD pipeline that works, it's very, very easy to maintain it.
It's easy to make sure it's still supporting your use case.
The same thing goes for infrastructure. The same thing goes for a lot of monitoring and alerting.
A good example for us, when we built our monitoring framework, that in and of itself is a lot of work, but now every single time we build a new data pipeline, plugging into that is trivial, right?
And that's really the big benefit there.
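One way "plugging into" a shared monitoring framework can be made trivial is a decorator that every pipeline entry point wears. The sketch below is an assumption about the general pattern, not the team's framework: `monitored`, the in-memory `METRICS` list (standing in for a real metrics backend), and the two example pipelines are hypothetical.

```python
import functools
import time

# In-memory stand-in for a metrics backend (hypothetical).
METRICS: list[dict] = []


def monitored(pipeline_name: str):
    """Record duration and outcome for every run of the decorated pipeline."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = "failed"  # assume failure until the call returns
            try:
                result = func(*args, **kwargs)
                status = "ok"
                return result
            finally:
                METRICS.append({
                    "pipeline": pipeline_name,
                    "status": status,
                    "seconds": time.perf_counter() - start,
                })
        return wrapper
    return decorator


@monitored("zendesk_tickets")
def load_tickets(rows):
    return len(rows)


@monitored("broken_pipeline")
def always_fails():
    raise ValueError("bad input")


load_tickets([1, 2, 3])
try:
    always_fails()
except ValueError:
    pass  # the failure is still recorded by the decorator
print([(m["pipeline"], m["status"]) for m in METRICS])
```

The up-front cost is writing the decorator once; each new pipeline then pays one line.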
And so over time, what you see is the cost of doing ops goes down over time.
I think the opposite is true for code, right? I think it's really, really easy to start writing code, but as you get bigger and bigger, your code gets more complicated and inherently gets more rigid.
And it's unavoidable.
You can do things like practice good design patterns and have good abstractions.
But at the end of the day, as you add more and more and more features, doing something costs more, right?
You can think about lots of companies nowadays, most people don't think about this, but lots of companies nowadays are giant monorepos.
I think a good example is GitLab itself is like a giant Ruby on Rails monorepo.
And so making a change there is really difficult. And so when you embrace DevOps, what you're really saying is you're thinking about the long term, right?
Because you are going to reach that stage where your code gets more and more complicated and the time and cost to change something and to add something is going to be a lot higher than on day one when you had nothing and you could just write something from scratch.
And so when you reach that point, where you definitely don't want to be is when you also have a high cost to deploying, to testing, to making infrastructure changes or to implementing logging, monitoring and alerting.
And so it's kind of like a dance, right? And being able to think that far ahead and be able to prioritize that far ahead.
Definitely.
I mean, the scale across a project and across a team can be different too.
So you'll have like one project where you're like, oh, I can just automate this for my team.
And I think that's all we need. Or it's just a smaller effort. But then at some point it becomes worth it to take that time to do it the right way.
And there's like an old proverb that says like, whatever is a temporary fix can become permanent unless something is done about it.
And that happens all the time in software.
And so building DevOps into your process can start to alleviate some of those issues where temporary fixes become permanent solutions.
And there are different offerings you can use.
You can use open source. You can use things like Cloudflare Workers or Cloudflare Pages.
They kind of like take some of that and do it for you.
They automate some of those things behind the scenes that make it easy to spin something up quick.
But the more times you have to iterate on one thing, the more times you need to do it right.
So if you're building a web page, you might do it one way.
But for instance, on the machine learning platform side, you have to do a lot of different models.
And so at some point, you need to have a common way where, as you said, you can just quickly plug in to different systems rather than having to do it each time manually.
And that consistent experience I think is really important.
I mean, as a business, you want to be reliable. Whether it's internal clients or external clients or whoever is using your model, you want that to be consistent for them as well.
So it's very important that you have a standard pattern for how you generate model results and where they go and how you evaluate them and checking for things like feature drift.
So if your different features are changing in values drastically and you need to know what's going on with your model, it's not just the software.
It's also the data that can change.
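A feature drift check like the one mentioned above can be as simple as comparing live feature values against the training baseline. The sketch below uses one basic heuristic (how many baseline standard deviations the mean has moved); it is not necessarily the method used on the platform, and the `drift_score` and `drifted_features` helpers, the threshold of 3.0, and the sample values are assumptions.

```python
import statistics


def drift_score(baseline: list[float], current: list[float]) -> float:
    """How many baseline standard deviations the current mean has moved."""
    spread = statistics.stdev(baseline)
    shift = abs(statistics.mean(current) - statistics.mean(baseline))
    if spread == 0:
        # A constant baseline: any movement at all counts as unbounded drift.
        return 0.0 if shift == 0 else float("inf")
    return shift / spread


def drifted_features(baseline: dict, current: dict,
                     threshold: float = 3.0) -> list[str]:
    """Names of features whose mean moved more than `threshold` deviations."""
    return sorted(
        name for name in baseline
        if drift_score(baseline[name], current[name]) > threshold
    )


# Hypothetical feature values at training time and in production.
training = {"requests_per_day": [100, 110, 90, 105, 95], "seats": [5, 6, 5, 4, 5]}
live = {"requests_per_day": [400, 420, 390, 410, 405], "seats": [5, 5, 6, 4, 5]}
flagged = drifted_features(training, live)
print(flagged)
```

Running a check like this on a schedule is one way to notice that the data, rather than the software, has changed underneath a model.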
Yeah. And I think that that's what's unique to our workflow. We have our own unique problems that maybe traditional engineering teams don't really face.
We only have a few minutes here.
So I want to have some closing thoughts here. With your experience, what are some of the common obstacles to adopting DevOps?
And what would be your advice on how to overcome those?
Definitely. Adoption is a hard one. I mean, part of it, as you said earlier, is balancing between features and DevOps.
Spending time to do things the right way does take some time and some investment from the business to say, I'm going to pay these people to work on things that aren't necessarily directly related to our market value.
But in the long run, they will pay off.
And the reason you make that investment in DevOps is that without it, you're not going to have that time to market, that consistency, that reliability that we've talked about today.
And that's a challenge. You've got to get buy-in from your team, your management, from different roles.
So big data engineers, machine learning engineers, you have data scientists, you have more traditional software engineers.
An important one is working with security. We have to have regular security reviews, early and often, to make sure that each of our tools is working and up to standard.
And those are all things that take additional time. But if you don't do them, you're really risking it in the long term and creating more work for you downstream.
You'll have more tech debt, and you won't have time to fix it.
And as I said, sometimes your tech debt becomes a permanent solution, and it's not good.
So it's sometimes easier to do it now than to root it out later. There'll be more effort.
Yeah. So I think for me, I want to speak to the fellow engineers out there.
I think in my experience, what I've seen is one of the big hurdles of having a DevOps culture on your team is the skill gap.
Lots of people don't think that they're able to do things like automate their deployments or do infrastructure as code or implement logging, monitoring, alerting, to do these things beyond writing code.
Because that's what they're experts at, is writing code. And I would challenge all of my fellow engineers out there, learn how to do these things.
Because at the end of the day, if you can be a software engineer that is able to go end to end, you're not just able to write code, but you're able to deploy it, you're able to support it, you're able to touch the infrastructure that it runs on, and you're able to also address and have a security framework around that, nothing can stop you.
And so for me, I think there are lots of tool chains out there. You've got things like GitHub Actions and CircleCI, you've got secrets management frameworks like Vault, you've got infrastructure as code like Terraform.
These are all things that are very much worth learning.
They're not things that are just meant for operations people.
Those are developer tools. And I taught you Terraform in like a week.
So I also, I would encourage you, these things aren't huge things to learn.
It's not like you're going to school and getting a CS degree. So if you want this change on your team, a lot of times the answer is just go out there and do it.
Yeah, definitely. I mean, as you said on Terraform, it's something that I just picked up and ran with.
And now we have a great infrastructure as code.
And I think it's really vital to do those things the right way. There you go. And that's our show.
Thanks, guys. That's the show. Thank you, everyone.