Cloudflare TV

Automating Project Management

Presented by Usman Muzaffar, Tom Lianza
Originally aired on 

Join Usman Muzaffar, Cloudflare Head of Engineering, and Tom Lianza, Cloudflare Director of Engineering, for an exploration of how Cloudflare uses automation to streamline project management.

Project Management

Transcript (Beta)

My name is Usman Muzaffar. I'm Cloudflare's Head of Engineering. I'm joined here this afternoon by Tom Lianza, Director of Engineering.

Tom, you want to say a few words about what you do at Cloudflare?

Yeah. These days, I spend a lot of time thinking about and helping with how we do things at Cloudflare.

When I joined, and for the first several years, it was just do, do, do.

And as we got bigger, we had to get a lot more thoughtful about how we do things.

And that's a lot of what we're going to talk about here. Yeah, exactly. We even debated what the title of the session should be.

It's Automating Project Management.

And that's sort of an unusual title. And someday, I'd be curious to see who's decided to watch a session on automating project management.

And so let's just start with, before we talk about automating anything, how do we do project management?

What do we use to keep track of stuff at Cloudflare? And how's that been so far?

Yeah. So certainly, before our time, Cloudflare had selected the Atlassian toolset, and we used JIRA very heavily.

To great effect, really. I think it is a best-in-class tool, extremely flexible, a lot of dashboarding and customizations, to the point where you can hire pure experts in it.

Yeah. Add any field you want, model any kind of workflow, super easy to create, all kinds of permission sets, good auditing, good tracking, pretty much everything you would ever want. If I can dream it, and if I can describe it in data, I can model it in JIRA.

And both of us have worked at companies which use JIRA before. I've been using it since 2004, I think.

It's a long time ago. Yeah. And even as you're saying that, though, you're talking a little like an engineer, like, I can model the data.

And so there's a clear mismatch between just what we do and live and breathe in JIRA and sort of the things we need a tool chain to do business-wide and with other constituents that don't live and breathe JIRA, like the eng team.

Yeah.

So you mentioned Cloudflare as it grew, and I think I joined right at that inflection point.

You'd already been at the company for a good 18 months or so before me.

But right at the time when I joined, it was getting to be that coordination.

It was just a classic communication problem, where the number of possible conversations between people grows in proportion to the square of the team size.

And so communication was becoming a bigger and bigger challenge.

It was crystal clear from the management team level what new projects we should be shipping, what the top five big things were that we should be paying attention to.

It's all in JIRA. And yet we have these meetings.

We started to have a couple of meetings where we're going over a few JIRA reports and it didn't quite, it wasn't efficient.

Do you remember some of those meetings where it was just pure JIRA and what was it that we were struggling with?

Yeah. I mean, I remember one thing that we still have to this day, but it almost has a legacy name is what we call an RM roadmap, which at some point were articulated as like those top five.

And then as we grew, it became more like software projects and their names got a little more inside baseball.

And for a while there, anyone could create one.

Anyone could own one. Sometimes they're owned by product managers.

Sometimes they're owned by engineering managers.

Sometimes they're owned by individual ICs. So there was not a lot of structure on that part of it.

And then it loses some of that connectivity to the business.

Like they can no longer read what that list is and what order is it supposed to be in?

Where did this come from? Yeah. And so the head of product at the time, he was interim head of product.

I remember he sent me a text saying, I'm asking facilities to order two of the biggest whiteboards they make.

Like whatever the maximum size is, like 16 foot monsters, we're going to have them put up on the wall and that's going to list the big projects.

And where are we? And everyone's going to pay attention to that board.

And so we did that. We actually had a couple of massive whiteboards on the main floor.

So what was, what do you remember about that?

I mean, I feel like it was very ceremonious when it came over and like, Oh, what are we erasing?

Like, we must've slipped a date.

He's got the eraser. And, but it was clear.

And it was actually so clear. I mean, we're a security company, right?

We had to erase it anytime we had a company event, an event that had external parties, because it was so big and obvious exactly what was going on, what was on track, what wasn't.

It was also missing a lot of information. It was impossible to drill down.

For one thing, it was literally marker on a whiteboard. So you couldn't go through, it was by definition stale the minute it was up there.

It was impossible to share with the global team. At one point, there were ideas of pointing a webcam at it and broadcasting that, if you remember.

And I remember people starting to spec that.

And at one point, finally, I was like, wait a minute, we, something is wrong with this picture, literally.

All of this information is in JIRA.

It just doesn't show up in this way. And so I remember trying to even get a font that matched the look and the color to match the look of the whiteboard and write a Python script that pulled from JIRA and literally reformatted the tickets and the rows and the columns and a little bit of metadata underneath each one.

So that without any new kind of data stored, just pulling from JIRA, it reorganized the information in a way that made sense to us.

What happened next?

Well, yeah, so I don't remember if that's when we called it the shipboard or if it was already called the shipboard on the wall.

You were so intent on making the font right that you chose a font that was non-standard, and then people had to install the WOFF file or whatever to see it in the non-Comic Sans version.

And I mean, I think it was around that point we did either invent or realize we needed to invent a layer of indirection between the projects that are engineering speak and the milestones that show up on the board and summarize what may be multiple internal projects.

Right. And so if we were to describe for our audience, just what are we talking about?

I think one of the key things was to create a JIRA ticket that represented something really big, one of our product lines.

You know, something like, if you go to Cloudflare's external website and click on products, you'll see that we have products for our CDN and our WAF and our DDoS protection and our load balancer.

And so we've created tickets in JIRA that represented these major product lines.

Then we created tickets in JIRA that represented big, huge hunking chunks of features, new versions, big things, quarter level efforts.

And by definition, those were owned by product managers and only product managers.

And then underneath those were the engineering efforts that were sometimes cross team.

And that was one of the most important things about the shipboard: a single ticket that represented the release now had, just using JIRA relations, a way to model all the big engineering efforts that went into it.

And by just laying that out as a tree structure, a sort of tree with small embedded trees on a calendar, you could at a glance say, for this product, that's what we're shipping in this month.

And this is the status of the engineering deliverables that are underneath it.
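
To make the structure concrete, here is a minimal Python sketch of that tree: a product line, a release under it, and the engineering efforts under the release, rendered as indented rows. The Issue class, its fields, and the keys and dates in the example are all illustrative assumptions, not Cloudflare's actual JIRA schema, and the sketch builds the tree by hand rather than reading JIRA issue links.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Issue:
    """One ticket in the tree (hypothetical shape, not a real JIRA schema)."""
    key: str
    summary: str
    status: str
    ship_date: Optional[date] = None
    children: List["Issue"] = field(default_factory=list)


def render(issue: Issue, depth: int = 0) -> List[str]:
    """Return one indented line per ticket, parents before their children."""
    when = issue.ship_date.isoformat() if issue.ship_date else "no date"
    lines = [f"{'  ' * depth}{issue.key} {issue.summary} [{issue.status}] ({when})"]
    for child in issue.children:
        lines.extend(render(child, depth + 1))
    return lines


# Made-up example: a product line, one release, two engineering efforts.
product = Issue("PROD-1", "Load Balancer", "active", children=[
    Issue("REL-7", "New release", "on track", date(2020, 8, 15), children=[
        Issue("ENG-101", "API changes", "done", date(2020, 8, 1)),
        Issue("ENG-102", "Dashboard work", "at risk", date(2020, 8, 20)),
    ]),
])

board = render(product)
print("\n".join(board))
```

Nothing new is stored here. The view is derived entirely from the tickets and their parent-child links, which is the property the speakers describe.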

And that transformed a particular meeting that we have, where we on a weekly basis go over this.

And I think that's really one of the themes of this whole tooling that we came up with.

Yeah, I mean, it was, it's also like, that was the point where, though we still live and breathe in JIRA, we could not create that representation in JIRA, that sort of nested tree and like, play with it exactly the way we want to see it.

We've iterated on this a number of times. But, but we then moved to a world where JIRA was still truth, JIRA was still accurate, updated, people could work in JIRA, but we needed to reflect that data in completely custom ways.

And that's where this custom tooling came in. That's right. And so part of the lesson for me was that, as a source of truth, as a system that is the official system of record, JIRA is fantastic.

I don't want to try to replicate any part of that.

But as a view, I want a view that is very bespoke to what the audience is looking for at that moment in time.

And so the shipboard, as we call it now, is actually on its third rev, as the board really evolved to the mechanics of the meeting that we run, and starts to take into account things like, what time zone are the participants in?

Let's get the London people to speak first, so they can excuse themselves and get on with their evening.

And if something is urgent, it should float to the top of the board.

So, Tom, if you have a project that's in trouble, let's talk about that first.

But while I have you, Tom, why don't you talk about all your other projects?

Because I don't want to have to tab back to you. Right, exactly.
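
Those ordering rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the office list, and the precedence between the rules (office first, then owners with a troubled project, then each owner's projects kept together) are guesses, not the shipboard's real logic.

```python
from collections import defaultdict

# Assumed speaking order: offices whose day ends soonest go first.
OFFICE_ORDER = {"London": 0, "San Francisco": 1}


def meeting_order(projects):
    """Order projects for review.

    Each project is a dict with 'name', 'owner', 'office', and 'at_risk' keys
    (all hypothetical field names).
    """
    by_owner = defaultdict(list)
    for project in projects:
        by_owner[project["owner"]].append(project)

    def owner_key(owner):
        items = by_owner[owner]
        office_rank = OFFICE_ORDER.get(items[0]["office"], len(OFFICE_ORDER))
        has_trouble = any(p["at_risk"] for p in items)
        # False sorts before True, so owners with a troubled project come first.
        return (office_rank, not has_trouble, owner)

    ordered = []
    for owner in sorted(by_owner, key=owner_key):
        # Keep each owner's projects together, troubled ones first.
        ordered.extend(sorted(by_owner[owner], key=lambda p: (not p["at_risk"], p["name"])))
    return ordered


projects = [
    {"name": "Billing", "owner": "Ana", "office": "San Francisco", "at_risk": False},
    {"name": "Cache", "owner": "Tom", "office": "San Francisco", "at_risk": True},
    {"name": "Analytics", "owner": "Tom", "office": "San Francisco", "at_risk": False},
    {"name": "DNS", "owner": "Raj", "office": "London", "at_risk": False},
]
agenda = [p["name"] for p in meeting_order(projects)]
print(agenda)
```

In this made-up data the London project is reviewed first, then Tom's troubled project followed by his other one, so nobody has to tab back to him later.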

So we created the, there's the version of the shipboard that still looks like you could put it on a wall.

But in fact, if you tried to do that, it'd be huge.

The company's very big, there's a lot of rows. And if, and since we want it to be time-based, it gets really wide as well.

And we want to review those products in a meeting.

So there is the reconstituted, shuffled version of the same data that you're alluding to, that we use to drive a meeting that couldn't have been attempted via just a JIRA database.

Yeah. And that's, that's, I think, a line that both of us use a lot as a database driven meeting.

And in some, in some cases, you know, the idea of a database driven meeting is very old in any kind of project management.

So for, if you're on a support team, you probably have some kind of review from your support system.

If you're on a sales team, some kind of pipeline review. Anybody who's ever had a standup where they review the tickets that their team has, or any kind of issue tracking system, those are basically meetings that are database driven. But the tool is presenting the information in the best possible way it can, and it has no idea how the people who are reviewing it want it presented, or what they in particular will ask: so if something is red, what are we doing about it?

Right? Yeah, this is the best part of the sort of next series of tools I feel like is, okay, so we know we're going to go through the projects in this order.

Okay. So we shuffle the view of the project. So they are in this order.

We also know what we're going to ask. Like, if the PM says this product is going to ship on this date, but the engineering team says it's going to ship later, we're going to ask about that.

Right.

So we automated that part too, and baked it into the tooling. So what does the tool do if it notices that the product manager said, yeah, this is shipping on the 15th of August, and the engineering manager says a key component of that is not going to be ready until the 20th of August?

What does the tool do? Yeah, I mean, this used to be a real problem, just a communication problem, but it was totally automatable.

So the tool spits out this series of warnings about what about your project doesn't look right, doesn't square with how we say we're doing projects.

And because we use JIRA as the backend and JIRA is sort of a structured database, we can apply, you know, basic logic to any of those fields.

But the model, the metaphor of a linter, really came from you.

Similar to a code linter, like how do we, what is a machine that looks at this project and spits out things that don't look right about it?

I don't know how you came up with that.
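
The date check described above could look something like this minimal Python sketch of a rule-based issue linter. Everything here is an assumption made for illustration: the project is a plain dict rather than data fetched from JIRA, the field names are hypothetical, the year is made up, and the second rule (a missing ship date) is just an example of how further rules might be registered.

```python
from datetime import date

RULES = []


def rule(func):
    """Register a check; each check returns a warning string or None."""
    RULES.append(func)
    return func


@rule
def missing_ship_date(project):
    # Illustrative extra rule: a release with no date cannot be checked.
    if project.get("ship_date") is None:
        return "no ship date set"
    return None


@rule
def deliverable_after_ship_date(project):
    # The mismatch from the conversation: engineering lands after the PM's date.
    ship = project.get("ship_date")
    late = [d["key"] for d in project.get("deliverables", [])
            if ship and d["due"] > ship]
    if late:
        return f"deliverables due after the ship date: {', '.join(late)}"
    return None


def lint(project):
    """Run every registered rule and collect the warnings."""
    warnings = []
    for check in RULES:
        message = check(project)
        if message:
            warnings.append(f"{project['key']}: {message}")
    return warnings


release = {
    "key": "REL-7",
    "ship_date": date(2020, 8, 15),
    "deliverables": [
        {"key": "ENG-101", "due": date(2020, 8, 1)},
        {"key": "ENG-102", "due": date(2020, 8, 20)},
    ],
}
warnings = lint(release)
print(warnings)
```

Keeping each rule as a small function means a new process rule becomes one more registered check rather than one more question somebody has to remember to ask in the meeting.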

So yeah, I can, I can speak to that a little bit. You know, when I was, when I was learning to code, I was, just in another session earlier today, I was mentioning how I don't have a CS degree.

So a lot of, a lot of, a lot of my learning and getting good at computers was self-taught.

And anything which helped point out problems I valued very highly.

And there's a class of very old tools called linters.

And the idea is lint, as in they're literally looking for problems, little mistakes that they can pick off, and that was something that I liked a great deal.

And you could run these tools against your source code and they would point out problems.

And sometimes the problems would be completely inane.

It would complain about you, you know, not paying attention to some error code or not paying attention.

And in many cases, it was like, oh my goodness, shut up.

That's not a, that's not a meaningful warning.

I know exactly what I'm doing here. But every now and then it would point out to something that was a real problem.

And it was so much cheaper to have the computer try to point this out to you and just quickly go through and go real problem, real problem, real, you know, not a real problem, not a real problem, real problem.

Let me fix that. And, and, and linters got more and more sophisticated and they can point out, you know, simple things like your, your formatting doesn't match the rest of the code.

And so it's just jarring to read and more complex problems.

Like you probably got a bug here. You're just not paying attention to it.

And the system can tell you it's a bug. And about 10 years ago, I had a job in support and part of the job of support is, is moving the ball along.

So you're, you're receiving problems from the customer and you're shuttling them to engineering.

And if the ball is ever stuck in your part of support, you've made a mistake.

So the first rule of support for us was, it should never be blocked on us.

Like either it's next action on the customer to get us more information or next action on engineering to fix something.

But what we noticed was is that we could accidentally pat ourselves on the back by going, yeah, we've, we've, the next action we said was on engineering, but we never noticed.

We never actually asked engineering, wait, did you fix this bug?

And if you did, we better, we better tell the customer, actually that's fixed.

You should try it again.

Or say we told the customer we're waiting for more information.

If we don't hear back from them in two months, we should probably assume they're not interested in this anymore.

Otherwise it just piles up. And the more I started to do ticket reviews, the more I realized it wasn't three or four different ways things could get stuck.

It was dozens of different ways things could get stuck.

And if this product is still being released, but is on, is on a maintenance contract and the customer is paying their bill, then we should be following up.

Like every time you described a reason, every time you asked a support engineer, why aren't you, why aren't we moving on this ticket?

They would give you a very reasonable explanation for why.

And then it was easy for me to say, well, given those circumstances, what we should do in that case is this.

And we started to make this giant list of rules, this encyclopedia, almost like the rules of a meeting, or like the CC&Rs, the rules of a housing association or a club, but nothing was enforcing them.

And so the, the idea that hit me is like, what if the tool, what if there was a tool that told you where you were potentially violating our own process rules?

And that had worked for me. Once we pulled that off, suddenly open tickets plummeted. It was very clear.

We could, a very small team could stay on top of a very complex set of products and it worked really well.

I totally shelved that idea for a decade because I didn't need it for a long time.

And then when I came to Cloudflare, I suddenly felt, oh my gosh, I think it's that same problem again.

It's, we know, we know we could ask smarter questions, but I don't want to ask them.

I don't want you to ask them. We're busy. Humans are really busy at Cloudflare.

I don't want to go around trying to run through a manual list of rules.

I want a computer to be the referee here.

And that's where we, we started with this idea of a, of a JIRA linter, an issue linter.

Yeah, I think it works both ways too, where not only do I not want to do that work and tap people on the shoulder, I don't want to be tapped on the shoulder either.

If I can, in the middle of my workday, an email pops up, Hey, you've, you didn't, you haven't told anyone the latest status of this project in a week.

Okay.

I'll go do that and move on with my day. It's so much better than feeling like, you know, my boss, whoever that may be is, is jumping down my back about not doing this.

Right. It's just, it's just, it's just an automated warning. But part of what we did was that automated warning, it goes out by email and I'm CC'd on it.

So I 99.9% of the time don't even look at them because I know the team will jump on them.
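
A stale-status reminder of that kind could be assembled roughly as follows. This is a hedged sketch, not the real tool: the project dict and its keys are hypothetical, the seven-day threshold is taken from the "in a week" example above, and the message is only built with Python's standard email.message.EmailMessage, not sent.

```python
from datetime import date, timedelta
from email.message import EmailMessage
from typing import Optional


def stale_status_warning(project: dict, today: date,
                         max_age: timedelta = timedelta(days=7)) -> Optional[EmailMessage]:
    """Build a reminder email if the status update is older than max_age.

    The project dict and its keys are hypothetical; nothing is sent here.
    """
    age = today - project["last_status_update"]
    if age <= max_age:
        return None
    msg = EmailMessage()
    msg["To"] = project["owner_email"]
    msg["Cc"] = project["manager_email"]  # the owner's manager is copied
    msg["Subject"] = f"[lint] {project['key']}: status is {age.days} days old"
    msg.set_content(
        f"{project['key']} has had no status update in {age.days} days. "
        "Please update the status."
    )
    return msg


project = {
    "key": "ENG-102",
    "owner_email": "owner@example.com",
    "manager_email": "manager@example.com",
    "last_status_update": date(2020, 8, 1),
}
reminder = stale_status_warning(project, today=date(2020, 8, 12))
print(reminder["Subject"])
```

Returning None when the status is fresh keeps the reminder quiet by default, so a message only shows up when a rule is actually being broken.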

And every now and then I'll be like, that's interesting. That's like the 11th warning in a row that this manager has.

And it's usually something's going on. Like they were, they, you know, they were out of the office for a little bit or someone wasn't covering or, and so then, you know, we can, we can catch up.

But the most delightful thing about this is when we run that status meeting, all the data is fresh.

Almost every, and no one wants to have the yellow triangle by their name that says that they, there's something about their date that doesn't make sense or something about their status that has, that doesn't make sense with our rules or is missing.

And so the result is that morning meeting can go through dozens and dozens of projects at a pace of about 10 or 15 seconds per project, and we can still separate signal from noise.

And the density is astounding. Density is astounding.

Yeah. But why do we still have the meetings? Oh, great question. So if it's that good, if the subject of the, of the thing is automate the project, why do we still have a meeting?

Why don't you start? What do you think? Why do we still have a meeting?

Yeah, I think the robot saves a lot of toil, a lot of what's the latest, you know, but there is something, well, first there's some basic things.

The reason my project has slipped is we're blocked on this, this other team.

We get them all in one room.

Like there's no, we, we get, these meetings are getting large, but it also means everyone's available at that instant.

Like something important is slipping.

We're blocked on this team. This human being is right there in the room with their bosses in the room, understanding the context of this dependency that they might not even have thought was very important.

And that, that we haven't figured out, I think, how to automate.

No, I don't think we have. So we, we still don't.

And I can't, I mean, the number of times that I'm like, well, why is this, why is this red?

Or why is this yellow? And immediately like people like this is red and yellow because of the priority that you said in that other meeting that that was more important.

And so we did that per your direction. And so, you know, part of part of this is how does the tool start to capture the knowledge that is built into the discussion?

And I think that's a, that's, that's the next evolution of this.

Before we, before we wrap, I want to talk about one other meeting, which is near and dear to both of our hearts, which is how do we really, it's a tech debt pay down meeting.

It's how do we make sure that we pay attention to things that cost us, that cost us mistakes.

And the title of the meeting is the incident review meeting.

But really, it should be the incident follow-up meeting, because that's really what it's about.

It's about the work that we know we need to do, to make sure that we follow up.

And, and that's, that's got its own board with some very interesting features.

You want to talk about that for a minute?

Yeah, so we call that board Eng Ops, because it's, I mean, we're all engineering, actually, that's kind of almost a legacy term, right?

Yeah, we're less, we're less similar than they are now. I think Cloudflare has always been very good at there's a problem, let's jump on it, let's fix it.

And quite good at root cause and writing reports and getting to the bottom of things. Legendary blog posts around that too, actually, yeah.

And identifying what we should do next. But there's the actual mechanics of doing the thing that you should do next.

Because all engineers are bombarded with people asking them to do things.

And, and, you know, we could say some blunt instrument, like, anything that would have prevented an incident is top priority.

And then you would see, you know, us over index on some, some long tail of things at the expense of other things our customers want to keep shipping.

That's right. So the way you get ready for this meeting is you create the JIRA tickets that, had you done them, would have prevented that incident from ever occurring.

And you use JIRA to structurally link those tickets to that incident, with a special relationship that we configured JIRA to have called prevents. We call them prevents tickets.

And then the incidents themselves have a slew of custom fields that represent attributes of the incident that we think are reflective of how much pain our customers felt, how long it took us to remediate the problem, how quickly we detected all these things.

And so we can derive a score: how painful was this incident?

And then when you have identified work that would have prevented an incident that was extremely painful by a machine generated score, we can, we can prioritize the most impactful prevents work without having a blunt instrument like all prevents work is number one.
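
As a rough illustration of that scoring idea, the Python sketch below computes a weighted score per incident and ranks prevents tickets by it. The attribute names, the weights, and the choice to sum scores when one ticket is linked to several incidents are all assumptions made for the example; the conversation does not say how the real score is computed.

```python
# Assumed weights: how much each incident attribute contributes to the score.
WEIGHTS = {"customer_impact": 10, "minutes_to_detect": 2, "minutes_to_remediate": 1}


def incident_score(incident):
    """Weighted sum of the incident's attributes (higher means more painful)."""
    return sum(weight * incident[name] for name, weight in WEIGHTS.items())


def rank_prevents(prevents, incidents):
    """Rank prevents tickets by the total score of the incidents they are linked to.

    prevents maps a ticket key to the incident keys it is linked to.
    """
    totals = {
        ticket: sum(incident_score(incidents[key]) for key in linked)
        for ticket, linked in prevents.items()
    }
    # Highest total first; ties broken by ticket key for a stable order.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Made-up incidents and links, purely for illustration.
incidents = {
    "INC-1": {"customer_impact": 4, "minutes_to_detect": 10, "minutes_to_remediate": 60},
    "INC-2": {"customer_impact": 1, "minutes_to_detect": 2, "minutes_to_remediate": 15},
}
prevents = {
    "ENG-201": ["INC-2"],
    "ENG-202": ["INC-1"],
    "ENG-203": ["INC-1", "INC-2"],
}
ranking = rank_prevents(prevents, incidents)
print(ranking)
```

The weights are where the judgment lives: changing them changes which prevents work rises to the top, without anyone having to declare all of it priority one.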

Right. And so to me, the best part about this is that I feel it worked. I mean, it was a bit of a gamble to see what this would this system actually reflect?

Can we trust the score? Can we trust the prevents? Can we trust that the meeting actually has it?

And so in this case, right, this is a meeting, I would never automate because the act of talking about will this actually prevent is really valuable.

And people learn a lot about systems. But the result is that by definition, whatever the top prevents ticket is, that means that of all the work of all the tech debt of all the features of all the things that are on our plate, if we agree that prioritizing quality is important, which there's no disagreement on at any level, then, then this, this ticket should trump anything else.

And this should get the full force of engineering, including my voice behind it, with no disagreement.

And that works, because I can say with confidence of all the things I know this one was very painful, because the computer scored it.

And in the end, the weights, the coefficient from that score reflect our values of what we think is important for the business.

And yeah, that's right. And you used my least favorite word, important, because people use it like a Boolean: it's either important or it isn't.

But there's so many shades of important.

I mean, what are we doing that's not important? So this is the mechanism by which we apply, you know, a score, a value that has some nuance to it, not just... Yeah.

And again, not a very complicated piece of software, the Eng Ops board. I think you and the crew cooked it up pretty fast and evolved it very quickly. It has no data store of its own, if I remember correctly. Everything comes from JIRA.

And so again, another example of something that is 100% based on information in JIRA, there is nothing living outside of it, and yet it would be virtually impossible to build in JIRA using its existing tool set, because it uses a synthetic score, it has multiple tabs, and it cross-links things and ranks them. It even has really nice touches: we run two versions of this meeting, one in the US time zone and one in the Europe time zone, so that our global teams can attend, and it's smart enough to know what issues should be covered in which time zone meeting and where the cutoff should be. And so it's very clear to team members what they're responsible for coming to that meeting.

And again, because it's backed by a whole bunch of linters that nudge people, I have the confidence that when we have that meeting, we're not wasting time going, wait, is this report up to date?

Has a root cause already been asked for, and have we already identified what the prevents tickets are? Those mechanics I can trust that the team has already done, and they were under no confusion that they were supposed to do it, because something reminded them.

Yeah, I think it also as an artifact of all of this scoring and automation, is we get data, and we can look at are we getting better or worse.

And JIRA will give you some data, you have more tickets this week or less tickets this week.

But now we're talking about whether the issues we are seeing in production are more impactful or less impactful over time, and what area of our software needs more attention.

And these are the sort of details that just some bespoke tooling.

Yeah, we have a whole bunch of other tools like this. We have tools that help summarize our work that we send out to everybody.

We have an equivalent tool that we work with support teams.

So it's sort of a cross report between JIRA and the support issues, issue tracking system.

What do you think is the natural future of all this stuff?

Tom, where do you think this is going to wind up?

I mean, I assume there's some future where, like I used to when I just worked in JIRA as an engineer, which was so fun, but it's no longer my luxury.

I just want to come up to my ticket list and be like, all right, let's go.

Let's go right now. Let's get some dopamine on board. Yeah. And I think a lot of this stuff, whether we're talking about projects or anything, you can extrapolate from the individual up to the team.

So I can picture a team's set of priorities.

Okay. You know, we really could have prevented these incidents that are really impactful.

These Zendesk tickets came in from customers.

Product desperately wants these few features because they're going to help these customers. And I can picture that we find a way to harmonize this stuff on a more automated basis.

So you're not going to four robot meetings or you have some, you know, almost like a team level or org level Kanban board for what we should, what we should be prioritizing.

And think about, so, you know, one of the things that it's, it's, it's as old as engineering, I think is the, you know, I, I got into this business to write code, to solve problems, to create great products and solutions for customers that matter.

I did not become an engineer to update JIRA tickets and keep project schedules on track and update Gantt charts.

Like that's, that's a whole, you know, to some extent that is literally all overhead.

How do, how do you sell this whole idea internally to an org that might be very skeptical?

I was skeptical even suggesting this at the beginning, especially that there's going to be an automated system that reminds you to, to update tickets that CC's your boss.

Like really, was this a good idea? Yeah. I mean, this is a whole, this could be a whole talk in and of itself about, about exercising influence in your career and growth, but, but letting people know that, that you've delivered something is pretty important.

You know, if a tree fell in the woods and nobody's there to hear it, does it matter that you did it?

We're actually trying to take work away from people, make it simpler and more automatic for you to communicate that things are happening, that you're delivering value and put it in context, you know, close JIRA, update this field right next to it.

So that, you know, information can be broadly known and shared across different teams and organizations.

I don't know how you, how you see it, but I don't know how, how else you get, you know, sort of celebrated or recognized if you're not, if you're keeping it all to yourself.

Yeah. Yeah. I think that's a big part of it. And I think another big part of it is having the, you know, without having the conversation of like folks, can we agree, we need to keep track of things and, and everybody will say, of course, we need something to keep track of things.

All right. Can we agree that any active project owes a one sentence update to the business?

And if it's more than a sentence, something's wrong, like it shouldn't be more than a sentence.

We should, it's literally one sentence. Here's where we are this week, good or bad.

Here's where we are this week on this effort. And I think it's important as leaders to sort of look people in the eye and be like, so if we agree on that much, and we agree that the subset of projects that you're working on, like you said, are the trees that we want the world to know that we felled, right?

Those are the, we don't, we don't need to know. Everyone doesn't need to know about every single bug we fixed.

Everyone doesn't need to know about every config change and every line of YAML that was edited.

But if we shipped something that took real effort, or we fixed a problem that took real effort, internal or external, let's talk about it.

And so if you can articulate the value enough that this is going to move the needle for the business, then I think it is worth telling your leadership chain.

And then the leadership chain is within its rights and responsibility to say, get us just the bare minimum to keep track of this.

And if, if we can, if we can all have a handshake deal around that, and then let the computer be the referee, like, because I don't want to be in the business of, of nagging people.

That's good.

So Tom, this was fun. I've actually really, this is one of my favorite parts of the job is exploring this space.

It's really, I feel like we're inventing a whole other discipline and a whole other science here.

It's very cool. Yeah. I think we helped automate a lot of this.

And we've automated some great things and we've definitely made some missteps, but I think we'll keep doing this and treating this as a side effort as part of the org at Cloudflare.

I think we'll, we'll talk more about this on Cloudflare TV.

Thanks everyone for watching.