Introduction to Machine Learning
Presented by: Katrina Riehl
Originally aired on July 9, 2022 @ 4:00 AM - 5:00 AM EDT
Machine Learning has changed the operation of modern business. In order to stay competitive, modern businesses have adopted powerful data science teams that apply machine learning to common problems within their organization. In this session, Katrina Riehl will dive into the basics of machine learning and how it applies to the growth of Cloudflare in the future.
Transcript (Beta)
Hey everyone, I just want to take a minute to thank you for tuning in to my little introduction for machine learning.
My name is Katrina Riehl. I'm leading the data science team here at Cloudflare, which is part of the larger business intelligence team.
I've been at Cloudflare for a little over a year and my field is artificial intelligence.
So we're going to walk through a little bit of machine learning, how we're using it on my team, and how we're applying it to the future of Cloudflare.
I do want to mention that if there are any questions that pop up during this session, please feel free to email livestudio@cloudflare.tv and I'll save a little time at the end to take questions and answer anything that's on your mind.
And with that, I'm going to go ahead and share my screen and walk through what we're doing here.
So, as I said, this is going to be a pretty whirlwind tour of algorithms that learn.
And I do want to mention that this is, you know, this is a huge subject and condensing it down into an hour is almost impossible.
So just bear with me, and we'll get through some of the highlights, and hopefully you'll have a much better understanding of machine learning and how it applies to data science teams.
I do want to start off by saying what a data scientist is, and there are a lot of different definitions for data scientists out in industry right now.
You may hear "data scientist" applied to anyone working with data. Some people have it as a more technical role.
Some people have it as more of an analyst role.
But the way I look at it, it's really the cross section of all of these different areas, where we do a little bit of programming.
We do a little bit of data engineering, and we do a little bit of the mathematics and statistical side.
And then we also really need to be able to come up to speed quickly on domains in order to apply these techniques to real business problems.
And so I do want to make sure that we're level set on some of the core vocabulary that I'm going to be using here.
As I mentioned before, my field is artificial intelligence, and I've been working in this field for about 15 years. However, machine learning is actually just a subset of what I would call artificial intelligence.
And when we talk about machine learning, we're really talking about a particular set of algorithms and models that learn by themselves.
And so it's actually a very small subsection of the larger area of artificial intelligence.
And more importantly, I think sometimes we need to remind ourselves that machine learning is actually classified as a weak AI.
And what I mean by that is that it doesn't work very well in the face of situations or any kind of data that it hasn't seen before.
Since we are using past data in order to predict what happens in the future, it doesn't work all that great when conditions pop up that we haven't seen before.
But I do want to say that, even with those limitations, this is an incredibly powerful tool. You're probably used to seeing it in your personal assistants and all of the little pocket devices that we have that are answering questions for you.
It's in everything from recommender systems to targeted emails that you may or may not want to receive, right? And I do want to mention that the recent adoption of machine learning in industry really has been facilitated by the fact that we have big data.
Now we have really cheap storage, where we're able to hold large amounts of information, and we also have better access to compute. We may be able to do this in the cloud, and we have better processors, so we're able to crunch through this data much, much better.
And we're also able to operate these models much more efficiently when we put them out into a production system.
But I do want to mention the three particular types of machine learning that we're primarily using here at Cloudflare, which are unsupervised learning, supervised learning, and reinforcement learning. Each of these is a slightly different area. You've probably heard of other parts of machine learning.
Right now in particular, I think a lot of people are very interested in deep learning, which is much more neural network based. But to start out with where we are right now in our data journey here at Cloudflare, these are primarily the areas that we're focusing on.
So in unsupervised learning, what I'm talking about is that we don't have a signal.
We don't have a base source of truth.
And what I mean by that is that we can't really tell whether or not we've gotten this right. We're looking for, you know, customers who look like each other.
We're looking for groups of information that have things in common with each other, but we would actually need to characterize what those groups are and how they're being used.
You see this quite a bit with compression, feature elicitation, recommender systems, and targeted marketing,
where we're coming up with competitive sets.
We're coming up with segments of customers that we want to look at.
And then we're able to use these different cohorts, if you want to think of them that way, as a way for us to characterize who our customers are.
On the other side, if we look at supervised learning, we really do have a signal for whether or not we've classified this correctly.
So you see here that there are two different sections for classification and regression.
In the case of classification, we're really looking for a yes/no answer or some sort of class that we're trying to predict.
So if you look at fraud detection, for example, we have a signal: yes, it's fraud, or no, it's not fraud. We're able to compare our results to that truth and determine whether or not we got it right.
On the regression side, we're looking at a continuous number.
So that would be, for example, forecasting our yearly revenue, or looking at how our number of requests may increase over time.
If it's a number that we're putting out on a particular scale, then we're talking about a regression. And in that case as well, we have truth to compare it to eventually: if we predict something like revenue, we can see how much revenue we actually got and know whether or not we were able to hit the mark.
Reinforcement learning is not really something that we're going to use quite as much, but I've used it quite a bit with things like robot control systems.
You can think of it as having a target score, or some way to describe success.
So in the case of a robot that's trying to learn how to stand up, you may have some sort of equation that allows you to express knees over feet, hips over knees, shoulders over hips, head above shoulders, and be able to tell whether this robot is actually standing up. This may or may not be something that we're able to use, but I do want to mention it because it is a possibility that we'd be able to do something like that for a more active problem.
And then here at Cloudflare, I want to talk a little bit about some of the machine learning models that we're already talking about doing. So we have churn models.
Can we figure out which of our customers are going to leave us? It's always a little bit easier for us to hang on to our customers as opposed to acquiring more customers.
So can we predict whether or not they're going to leave? Can we predict our revenue?
Can we predict what our costs look like?
Can we look to see how these different margins relate to each other?
So is there some way we can predict whether or not we can increase revenue while keeping our costs the same? Those are the kinds of things that we would also be able to do with machine learning.
I also want to mention I did talk a little bit about fraud detection.
So being able to tell whether or not somebody is using a credit card that's actually theirs in order to pay us.
Are we getting a lot of fraud on our website?
And then I also did want to mention sentiment analysis. So we can even go out and look to see how people feel about Cloudflare. Is there some way we can scrape social media sites and see what people think of Cloudflare? What do people think about our new products? Or it would allow us to dig deeper into even some of our customer support tickets and determine how pleased people are with what we're doing.
Is there some way for us to look at the language that's being used and determine whether or not it's positive or negative?
And then also, I did want to mention lead scoring.
So in sales or marketing, can we determine which prospective new leads are going to lead to a customer base, or can we predict the value a customer would be able to get from our Cloudflare products?
Those are the kinds of things that we want to be able to do: taking in data, mining it, looking for these patterns, and creating models that are able to predict the future.
Those are the kinds of things that we're looking to do. And so with that, I did want to talk a little bit about unsupervised machine learning.
I talked about it before that we may want to look at clusters of customers, for example.
And this is a pretty simplified example. You can see pretty clearly that there are three clusters here that look pretty cohesive; they're relatively round.
But they're also within a two-dimensional space, right? This gets a little bit more interesting when we're talking about a set of features where it's harder to quantify whether or not we've come up with a good clustering, and when we can't look at it in just a two-dimensional space.
So for something like this, if you can imagine extending it out into a lot of different dimensions, it may be a little bit more challenging for us to come up with clusters.
But with that, I did want to exit out for just a second and talk a little bit more about the actual clustering algorithms that we use.
And with that, I'm going to start with something pretty simple.
It's pretty intuitive for people to understand, and that's the k-means algorithm. In k-means, what we do is pick the number of clusters that we want to produce, and then we assign the different instances to each one of those clusters.
And so you'll see this a little bit more as I walk through this code.
I do want to mention that I'm obviously using a Jupyter Notebook. This is an open source, notebook-based, I'm sorry, browser-based execution environment.
This is pretty common, by the way, in data science teams, which like to use Jupyter Notebooks in order to look at data and explore data.
And I think you'll see why, because it's very interactive.
It can be very visual, and we're able to look at data pretty quickly.
And then also, for presentation purposes, or for making sure that people understand the process of what we're doing, this can be a really great tool.
So I'm also using the Python programming language. I'm not going to get into a lot of the details of Python.
But the scientific stack of Python is really helpful to us.
So you'd see a lot of these packages come up over and over again. Things like NumPy, Matplotlib, SciPy.
Scikit-learn is another one you'll see. Pandas.
These are all packages that are really, really popular within this area. So here, all I'm doing is setting up my environment.
I'm bringing in all of the packages that I'm going to need in order to actually execute a K-means algorithm.
And with that, I think the best way to start is, you know, let's create some data.
So with that, executing this little... So scikit-learn, this is the package that I'm using.
And then I'm making a bunch of blobs. Right. And so I'm just sampling randomly and creating a certain number of points.
I think that you can pretty easily see that here we have four pretty intuitive blobs that show up, which makes sense, since we did say that we wanted four different areas to be sampled.
You can see pretty quickly, this is one, another one here, another one here, and then another one here.
So when we're applying K-means to that, we want to be able to capture each one of these little blobs of information.
And so with that, let's go ahead and pull in K-means.
We already know that we've sampled from four different centers.
So it's pretty easy for us to pick our K. Sometimes it's a little bit more difficult for us to pick the K or the number of clusters that we want.
But in this case, it's pretty easy since it's manufactured data. But with that, we go ahead and fit the data into the K-means clustering algorithm.
And then we're going to predict each one of the points to see which cluster each one of these points ends up in.
So we're predicting on the same things that we're fitting with, which is something that we can talk about a little bit later.
But in this case, we can see where the clusters wound up. So as I execute this one, you can see those four clusters pop up pretty quickly.
Right. And so now we have four relatively stable clusters that we're able to use to characterize things.
So now if we had a new point come in that we had never seen before and that was not used to fit this model, we might be able to look and see which cluster it fits into.
Which one of these does it most look like?
And we can see that just by looking at the distance from that particular new instance to the centers of each one of these clusters and determining which one it belongs to.
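The notebook cells themselves are not part of the transcript, so the following is only a minimal sketch of the steps described above using scikit-learn. The sample size, the spread of the blobs, the random seeds, and the coordinates of the new point are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Manufactured data: points sampled around 4 centers (sizes and seeds are assumptions).
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# We know the data came from 4 centers, so K = 4 is an easy choice here.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
kmeans.fit(X)

# Predict on the same points we fit with: one cluster index per point.
labels = kmeans.predict(X)
centers = kmeans.cluster_centers_  # one row per cluster center

# A point the model has never seen is assigned to the closest center.
new_point = np.array([[0.0, 4.0]])  # hypothetical coordinates
new_label = int(kmeans.predict(new_point)[0])
distances = np.linalg.norm(centers - new_point, axis=1)
nearest = int(np.argmin(distances))
print("predicted cluster:", new_label, "nearest center:", nearest)
```

The last lines compute the distance from the new point to each center by hand, which is the same rule `predict` applies.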
And so with that, I do want to mention a little bit more about the actual algorithm itself.
So in this case, let's say that we're looking at something like this where we do have our, you know, four clusters here.
And then here I'm going to choose the number of clusters that we want to look at.
In this case, it's not very interesting when we look at just one cluster.
In that case, everything belongs to the one cluster. So that's not exactly interesting.
But let's do something like six clusters. Right. So in this case, we're choosing six centers that we're going to start with as we start moving through this data.
There are a lot of different theories about how you should initialize this. You can just take the first three points.
You can take a random set of points.
There are all sorts of things that you can do here in order to initialize the algorithm.
But as we move through each step, we're going to look at each one of these points and see which of the centers it's closest to, using something like a Euclidean distance or another distance metric to determine how close it is to each one of these centers.
And then we'll update the center of each new cluster that's formed by all the points that are closest to that center.
So you can see as I move through each one of these frames.
First, I'm looking to see, you know, here are the points.
Then let's go ahead and, you know, look at all the distances across there.
We're going to update the centroids to those particular areas that we have created from our first round.
And then we have our next set. We're going to keep iterating through this.
So we're going to get better and better and better as we move across here. And we're going to move things, move these clusters over and over again.
You can see points as they're moving into these different clusters, over and over again, until our centroids stop moving.
So we're going to keep doing this until it stabilizes. So now you can see that the centroids aren't moving nearly as much.
So there's not really any way for it to keep updating. So this is where we generally stop.
Right. We can keep going, but it's not going to go anywhere. Right. So at that point, we've now created six clusters.
When you look at this data, you can see pretty clearly that six may not have been the right number of clusters to use here.
The orange and blue clusters right here are maybe a little too close.
So we may want to try with a different number of clusters, so initialize that K with a slightly different value in order for us to get better clusters.
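The assign-and-update loop just walked through can be sketched in a few lines of NumPy. This is an illustrative version, not the code behind the animation: the function names `assign` and `kmeans_sketch`, the random initialization, the Euclidean distance, and the toy data are all assumptions.

```python
import numpy as np

def assign(X, centers):
    """Index of the nearest center (Euclidean distance) for every point."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain k-means: assign points to the nearest center, then move the centers."""
    rng = np.random.default_rng(seed)
    # Initialization: k distinct points picked at random (one of many possible schemes).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(X, centers)
        # New center = mean of the points assigned to it (keep the old one if it is empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop once the centroids have stopped moving.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign(X, centers)

# Four hypothetical blobs in two dimensions, clustered with K = 6 as in the walkthrough.
rng = np.random.default_rng(1)
true_centers = np.array([[0, 0], [5, 5], [0, 5], [5, 0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in true_centers])
centers, labels = kmeans_sketch(X, k=6)
print(centers.shape, np.bincount(labels, minlength=6))
```

Asking for six clusters on four blobs forces some blobs to be split, which is the "too close" effect described above.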
But with that, I did want to move over to a little bit more interesting subject.
So when we look at K-means, I'm going to scroll down here, and let's talk a little bit about color compression, because this is a pretty fun application of something like K-means.
So in the case of a picture, right, in this case I'm just using a sample image that's available here.
It has nothing to do with anything. It's just a particular picture that's there.
So if we look at the picture itself. It's this beautiful picture and it has a lot of different colors in it.
Right. So if we're storing this picture in a three-dimensional array, we would have something that looked like the height, the width, and the RGB values for this entire picture.
So that's what our three-dimensional array is going to look like. That's how we look at the picture.
You can see that here. So I'm going to look at the shape here.
So 427 by 640, and then the three for the RGB values. And so here we're going to reshape it.
So we're going to flatten the picture first. And with that, it's basically 427 times 640, so we get a much, much larger number, and then we still have our RGB values.
So now we've flattened out the picture and we're looking at these 273,000 points and our three dimensions.
So from here we're going to pull in k-means again.
Don't worry too much about the difference between mini-batch k-means and k-means for now, but generally what's happening here is that we want to reduce the number of colors to 64. So out of this huge number of colors that are in the original image, can we reduce it down to 64?
Once again, we reshape our image.
We're going to apply k-means.
We're going to reassign each one of the pixels to the new cluster that it's a part of.
And then we want to see whether or not we have an image that still looks like something recognizable.
So this is just going to take a second or two.
But here you can see the original image with all 16 million possible colors, which is quite nice, but reducing it to 64 colors here, we didn't lose a ton of fidelity.
We can still understand what the picture is, and we've compressed it quite a bit.
So this is something that k-means and other clustering algorithms are used for quite a bit.
And like I said, this is a pretty simple algorithm to understand, but I think it gives you an idea of what we do with clustering algorithms in general.
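The color compression steps can be sketched as follows. The sample picture from the talk is not reproduced here, so this sketch substitutes a random stand-in array with the same 427 by 640 by 3 shape; the scaling to the range 0 to 1 and the seeds are also assumptions.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Stand-in for the sample picture: random pixels with the shape quoted in the talk.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(427, 640, 3), dtype=np.uint8)

# Flatten to one row per pixel (427 * 640 rows, 3 columns) and scale to [0, 1].
pixels = image.reshape(-1, 3) / 255.0

# Cluster the pixel colors into 64 groups.
kmeans = MiniBatchKMeans(n_clusters=64, n_init=3, random_state=0)
labels = kmeans.fit_predict(pixels)

# Replace every pixel with the center of its cluster, then restore the image shape.
recolored = kmeans.cluster_centers_[labels].reshape(image.shape)
print(pixels.shape, "distinct colors after compression:", len(np.unique(labels)))
```

With a real photograph in place of the random array, `recolored` can be displayed next to the original to compare fidelity.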
With that, I'm going to move back over to my presentation.
And... it's going to present. And here we move on to the supervised side of machine learning.
So I mentioned before the unsupervised side. With k-means, you can see that we didn't have some sort of target we were going for.
We didn't know which cluster each one of these points was supposed to fit into.
There's no ground truth in that case, but with something like supervised machine learning, we do know.
We know exactly what class something is supposed to belong to, because we have past data that tells us. So if we're able to look at past data and all these different instances, we know whether or not, for example, somebody has converted.
And a really great example to start off with, I think, is a decision tree.
I think most people are familiar with them. And it gives a good idea of what I mean when I talk about a model.
So a decision tree would be a very weak model.
It's a weak estimator here. And in this case, you can see pretty quickly some of the shortcomings of something like this: we really only have a small number of features we're using, and it may or may not really capture what we're trying to do here.
But in this case, you can see this example here of a decision tree where we're trying to predict conversion, and we just ask ourselves a series of questions, right?
So as a new lead comes in, for example, we ask ourselves each one of these questions at each level, starting from the top.
So is the company size greater than 100? If it is, we move over to the true side.
If it's not, we move to the false side. Over on the true side, we then ask, is their revenue greater than $10 million? If it is, we move to the true side, which is going to give us the answer "will convert."
On the false side, it gives us "will not convert."
On the other side, I do want to point out that we're using a different feature for the next set of questions that are on the right hand side.
So in this case, the company size is not greater than 100. We now look to see whether or not they use Salesforce.
And so if they do use Salesforce, we move on to the next question: is the revenue greater than $2 million? If it is, that's more likely to convert.
If it's not, then they won't convert, and so on and so on.
So you get an idea of how a decision is made in some sort of automated fashion.
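Written out as code, the tree on the slide is just nested questions. This is a sketch under assumptions: the function name `will_convert` and its parameter names are made up for illustration, revenue is taken to be in millions of dollars, and the branch for smaller companies that do not use Salesforce is not described in the talk, so it is left unanswered.

```python
from typing import Optional

def will_convert(company_size: int, revenue_musd: float,
                 uses_salesforce: bool) -> Optional[bool]:
    """Walk the example tree from the top; revenue is in millions of dollars."""
    if company_size > 100:
        # True side of the first question: ask about revenue over $10 million.
        return revenue_musd > 10
    if uses_salesforce:
        # False side, Salesforce users: ask about revenue over $2 million.
        return revenue_musd > 2
    # This branch is not spelled out in the talk, so no prediction is made.
    return None

print(will_convert(company_size=500, revenue_musd=12.0, uses_salesforce=False))
print(will_convert(company_size=50, revenue_musd=1.0, uses_salesforce=True))
```

A learned decision tree has the same shape; the difference is that the questions and thresholds are chosen from the training data instead of being written by hand.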
And with that, I'm going to break out again and walk through this a little bit more. And in this case, I'm also going to start moving into an example called random forest. In a random forest, we take a huge collection of all of these different trees
and figure out whether or not it gives us a better answer.
And you'll see that in a regression forest.
If we take something like 100 or 200 or even 300 trees, can we get a better idea of whether or not this is a robust model?
Are we getting better answers? Is our accuracy looking better?
So once again, I'm using the Jupyter Notebook here and I'm also using Python.
I've already initialized it here and brought in all my packages, which should look pretty familiar since we used them in the last notebook. And then from here, I already kind of walked you through a decision tree.
Here's another example of one. And then we're going to do the same thing that we did with the clustering example.
But in this case, like I said, we already know what the clusters look like. We already know what each blob looks like, and we know what the classes are.
So we have truth about which group each point should belong to.
So in this case, we've manufactured some data, and then we're just looking to see what it looks like.
And we already know that this is one area.
This is another area. This is another one. So you can think of these as the class of red, the class of purple, the class of blue, the class of yellow.
Those are the different ones that we've created. And it's in the same vein as what we did before.
Let's go through a little interactive example. What we're looking at here is what happens as we add depth to the decision tree.
So as we move through each one of these areas, our depth would be one if we're looking here, two if we come here, three if we go here.
Right. So as we add depth to our decision tree, we want to see how that changes how something is being classified. So starting with a depth of one.
It's a binary tree, so it can only have two different classes that come out of it.
You can see that half of them get classified as red and the other half are getting classified as purple, and obviously all of the ones that are blue and all the ones that are yellow are misclassified.
We already know that because we have ground truth. So as we add more and more depth to the tree.
We're adding more and more questions that are helping us make a decision.
We can see how this changes. So just moving to a depth of two.
It's a little bit better. Right. We can see that now we're classifying these as red, we're classifying these as yellow, and we're classifying these as blue.
We've lost our purple.
Right. So now all of these are being misclassified pretty badly. Let's move to another depth here.
So now we're at three, and we've captured more of the purple.
So we're doing a little bit better. Our accuracy is getting better, but we still have several points that are being misclassified. So as we add depth, we want to see whether or not we can capture those. So a depth of four, a little bit better.
But we're also seeing this little strip right here. Anything that's getting classified from this space here is pretty far away from the area that we really want to cover for purple.
So this may or may not be a great classifier to use. And then with five here, we can see it's tightened up quite a bit.
But in this case, we also have these little tiny areas that are being used to capture one or two points, and that can be a sign of overfitting. And that's pretty common with a decision tree, to overfit the data, especially if you're using one particular decision tree.
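The effect of depth can be checked numerically as well as visually. The sketch below is an assumption-laden stand-in for the interactive widget: it uses four manufactured blobs with made-up sizes and seeds, and it reports accuracy on the training points only, so it shows how closely each tree fits the data rather than how well it generalizes.

```python
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

# Four manufactured classes (sample size, spread, and seed are assumptions).
X, y = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=0)

train_accuracy = {}
classes_predicted = {}
for depth in range(1, 6):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y)
    pred = tree.predict(X)
    train_accuracy[depth] = float((pred == y).mean())
    classes_predicted[depth] = len(set(pred.tolist()))
    print(f"depth={depth} classes={classes_predicted[depth]} "
          f"train accuracy={train_accuracy[depth]:.3f}")
```

A depth of one gives a single split and therefore at most two predicted classes, which is why two of the four colors are entirely misclassified at that depth.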
And here I'm going to show you real quick how those things can change.
So what we're doing here is we've created the decision tree classifier and we're just using slightly different parts of the data.
So we're slicing the data slightly differently to see how the classifier changes and what kind of effect that has.
So taking the first 200 points, we get the first one and looking at the last 200 points, we get the second one.
And you can see pretty quickly.
I had to increase my font. So it's a little harder to see than I would like, but you can see that these are classifying things slightly differently.
Some of the differences you can see pretty quickly: this purple area right here is not here.
We have this little strange stripe of red that's capturing this one point, and that's not in this one.
And those are the kinds of things that happen when you're sampling your data.
Even though this is the same problem we're trying to solve, we've wound up with two radically different decision trees that may or may not actually give us the right answer if we were to expose them to new data.
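One way to put a number on how different the two trees are is to train one on the first 200 points and one on the last 200, then compare their predictions over a grid covering the feature space. This is a sketch with assumed data sizes, seeds, and grid resolution, not the notebook's own code.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

# Manufactured data (sizes and seed are assumptions).
X, y = make_blobs(n_samples=400, centers=4, cluster_std=1.0, random_state=0)

# Same kind of model, two different slices of the same data.
tree_a = DecisionTreeClassifier(random_state=0).fit(X[:200], y[:200])
tree_b = DecisionTreeClassifier(random_state=0).fit(X[-200:], y[-200:])

# Evaluate both trees on a regular grid covering the data.
xs = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
ys = np.linspace(X[:, 1].min(), X[:, 1].max(), 100)
xx, yy = np.meshgrid(xs, ys)
grid = np.column_stack([xx.ravel(), yy.ravel()])

# Fraction of the grid where the two trees give different answers.
disagreement = float(np.mean(tree_a.predict(grid) != tree_b.predict(grid)))
print(f"trees disagree on {disagreement:.1%} of the grid")
```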
So with that, I'm going to move on to the random forest idea.
So this, like I said, is an ensemble method. And what we mean by that is we take a lot of weak estimators and we put them together.
There's a pretty seminal paper from quite a few years ago at this point that showed that a set of weak estimators, when put together, can classify better than a single strong estimator, which is a little counterintuitive.
But basically, by taking a majority vote, which is what we're talking about here, we're able to get better results, and we're not going to overfit our data quite as badly as we saw in the example with the decision trees here where, like I said, we're fitting to one or two points.
That may or may not be a good way for us to fit a model.
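The voting idea can be sketched by hand: train several shallow trees on different random samples of the data and let each point's class be decided by the majority. This is a simplified illustration, not scikit-learn's random forest implementation (which also randomizes the features considered at each split and averages class probabilities); the number of trees, the depth, and the seeds are assumptions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

# Manufactured data with classes 0 to 3 (sizes and seed are assumptions).
X, y = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=0)
rng = np.random.default_rng(0)

# Train 25 shallow trees, each on its own bootstrap sample of the data.
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
    trees.append(DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[idx], y[idx]))

# One row of votes per tree, one column per point.
votes = np.stack([tree.predict(X) for tree in trees])

# Majority vote: the most common class in each column.
majority = np.array([np.bincount(col, minlength=4).argmax() for col in votes.T])
print("accuracy of the vote on the training points:", float((majority == y).mean()))
```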
So moving over to random forest. Let's create some data, and then we're going to initialize our random state, which is going to determine what subset of the data we're using to create each one of those little tiny trees, and then we're going to fit it.
Oh, actually, we're going to start by looking to see how this changes as we interact with it.
So what we're doing here is showing, as we move through the different parts of the training data, how it affects our classifier. So if I initialize it to zero, I end up with this one.
As I move through all of these different values, you can see how radically different each one of these classifiers can be, just depending on how I sample the data.
That's not really a great thing for us. It means that it's probably not going to generalize very well when we try to apply this to new data.
It's probably not something that is going to help us all that much if we're trying to predict future behavior or anything that we might want to generalize to a larger population.
So with that, let's add the random forest. In this case, like I said before, we're going to initialize the number of estimators we want to use.
We're going to initialize our random state, and then we're going to see how it changes the classification. And here you can see a much smoother transition between each one of these areas. The small little areas that capture individual points are not quite as big as in the other one.
So you don't have these stripes that move across, which are more likely to result in error.
So it's fit the data a whole lot better.
And just to show you that, as I change this number, we can look to see how it changes.
That is not a big difference right here. But let's say we move it to something like 50. Right, not quite as effective.
Let's move it down to something kind of crazy like 20. And here, not quite as good as well, but still way better than what we were talking about before. And then breaking it all the way down to 10, once again, more degradation, not quite as accurate, but still not that bad.
So even adding a small number of these weak estimators together still gives us a better answer than we would get from a stronger estimator.
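A rough way to repeat this comparison without the plots is to score a single tree and forests of different sizes with cross-validation. The data, the five folds, and the seeds below are assumptions, and the exact scores will depend on them, so no particular ranking is claimed here.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Manufactured data (sizes and seed are assumptions).
X, y = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=0)

# Baseline: one decision tree, scored on held-out folds.
scores = {
    "single tree": cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()
}

# Forests with 10, 20, 50, and 100 estimators, as in the walkthrough.
for n in (10, 20, 50, 100):
    forest = RandomForestClassifier(n_estimators=n, random_state=0)
    scores[f"forest, {n} trees"] = cross_val_score(forest, X, y, cv=5).mean()

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```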
Or you can also just look at it as each one of these estimators casting its vote.
It averages out the little bits of error that you would see. And so moving to regression, as I mentioned before. What we were looking at before was trying to predict a class.
So in that case, purple or blue or yellow or red.
And in this case, let's try to figure out a number that we're trying to predict. And for that, we're going to use the random forest regressor.
So instead of outputting a class like we saw before, we're going to output a number. So in the random forest regressor, instead of taking the majority vote, we're going to take an average of the values that are produced by the set of trees.
So here, let's generate some data. And you can see this here: we're just taking a couple of sine waves, adding in some noise, and then plotting it.
This is a pretty messy set of data, so fitting this should be pretty interesting.
If you were to fit something like a linear regressor to this, a line, you can see that there are a lot of points that we're not really going to capture. But if we use something like a random forest model, we're much better able to get in there and look at all these different areas.
So you can see that the sine waves start to appear.
So these are the values that are output by the regressor that we just trained on this data here.
And it's doing a much better job of categorizing, I'm sorry, characterizing the data that we're looking at.
And in some cases, it even seems to be falling within the error bars pretty well.
So once again, the sine wave is showing up again and then we can use that to predict new data that comes in.
So if we were to have some sort of new data that came in with a new X value, can we determine the Y value that we should output?
And so we look it up here and follow along this red line, and that would be the value that we output. All right, so I was going to move down real quick.
I'm sorry. I didn't mean to do that.
And we're going to talk about this for some handwritten digits. Any of you that are familiar with OCR will know this one.
That's something that, you know, tries to recognize characters.
So optical character recognition is what I'm talking about here.
So it's really taking a picture of the different characters and then trying to determine what it is.
And so with this, we're going to pull in a bunch of different pictures of digits.
So zero to nine is what we're looking at, and we load that from the sample data sets that are already part of scikit-learn. And then we can see how much data we're looking at.
So we've created our data set.
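For readers following along without the notebook, this is a minimal sketch of loading the digits sample data that ships with scikit-learn and checking how much data there is. It assumes the session used `load_digits`, which the transcript does not name explicitly.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Each sample is an 8x8 grayscale image of a handwritten digit.
print(digits.images.shape)

# The same pixels flattened to one row of 64 features per image,
# which is the form the estimators are trained on.
print(digits.data.shape)

# The ground-truth labels: the digits zero through nine.
labels = np.unique(digits.target)
print(labels)
```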
So let's set this up so we can take a look at our data. It's a little bit harder to see, but we're going to look at the truth data compared to the image that we're looking at.
So here you can kind of see how this might look like a zero, and that is the truth label for it.
One here. That could look like a one, two, a little bit fuzzier, three, and so on.
So four, five, six, seven, eight, a little harder to characterize, nine, zero again, as we move through each one of our samples.
But once again, like I said, as we try to train this algorithm, we're going to feed in this data and the algorithm is going to learn what is a one, what is a two, what is a three. And then when we take a new image and apply it to that model, it's going to predict which one of these digits it actually is.
So here, let's go ahead and create our decision tree classifier, and let that go.
And what we're doing here is looking at an accuracy score. We're creating a train/test split, and what that means is we're going to hold out a group of our data.
And that's what we're going to test our estimator on.
So if we've trained a model using all of this training data, right, then we want to hold out a group of data and see whether or not we're able to predict what those are with any amount of accuracy.
This will determine, you know, whether or not this is something that's going to work for us.
Right. And in this case, you can see here, with a decision tree classifier, we're fitting it and we're predicting from it.
You can see the accuracy score is about 83.78%, right? That may or may not be good enough for our application, depending on what the cost is for us missing the mark on this one.
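The hold-out procedure described here can be sketched in a few lines. This is an approximation of the demo rather than its actual code: the 25% test size and the random seeds are assumptions, so the printed accuracy will not necessarily match the figure quoted in the session.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()

# Hold out a quarter of the images; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# Fit on the training portion only.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Predict the held-out images and compare against their true labels.
y_pred = tree.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"held-out accuracy: {accuracy:.2%}")
```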
It may or may not be something that's actually applicable. And this is something that I talk about quite a bit, especially as we're trying to solve business problems with classifiers: is this enough? We know that models are going to get things wrong.
It's almost impossible for us to create machine learning models that are going to be 100% accurate, because that would mean that our training data has captured everything the model has seen and ever will see, right? It's very, very difficult to have a training data set that is that comprehensive and has that much information in it.
So our accuracy can only go so far. And even with 99% accuracy,
there's still that 1%, one out of every 100, that we're going to get wrong. So risk mitigation becomes a really big part of what we do.
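The arithmetic behind that point is simple enough to write down. The helper names and the volume and cost figures below are hypothetical, chosen only to show how an accuracy number turns into an expected number of mistakes and a rough cost.

```python
def expected_errors(n_predictions: int, accuracy: float) -> float:
    """Expected number of wrong predictions at a given accuracy."""
    return n_predictions * (1.0 - accuracy)


def expected_cost(n_predictions: int, accuracy: float, cost_per_error: float) -> float:
    """Rough expected cost, assuming every error costs the same (a simplification)."""
    return expected_errors(n_predictions, accuracy) * cost_per_error


# Hypothetical volume: 10,000 automated decisions at 99% accuracy.
errors = expected_errors(10_000, 0.99)

# Hypothetical cost of 25 currency units per wrong decision.
cost = expected_cost(10_000, 0.99, cost_per_error=25.0)
print(round(errors), round(cost))
```

In practice different kinds of error rarely cost the same, so a single cost per error is only a starting point for the risk conversation.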
And so is 83.78%, you know, is that good enough?
Is that able to help somebody with their workflow? Is that enough automation for us to make, you know, good decisions?
Is that something that's going to help us as we're moving forward, or is it creating too much noise?
Right. But we also have to consider the fact that humans are also not perfect, right? Are humans going to be able to predict that something is going to convert, or that something is the way that it is, you know, with 83.78% accuracy?
Maybe, maybe not. We really don't know.
We have to take those sorts of baselines and, you know, initial analyses and initial models that we produce and see whether or not we can improve on them.
Can we add more data that's going to give us more information, so that we're able to get better accuracy?
So that's a lot of information about accuracy, but I do want to point that out, because in some cases
you may be fine with 60% accuracy, because it still makes your problem so much easier.
It still lets people do their jobs faster.
It's still able to increase that margin
I was talking about when I started the talk. But in other cases, maybe it's not.
Maybe the cost of getting it wrong is just too high, and there's a dollar amount associated with that.
Or we run the risk of, you know, making somebody angry and insulting them, and they may not want to stay with us as a result, because our interaction with them is so negative. So, looking at this accuracy score, we may offer different ways for us to handle what the model is telling us. And I tell people this all the time.
It's a pretty famous quote that came out of the statistics field, usually attributed to George Box, decades ago at this point:
the whole idea that all models are wrong,
but some are useful. And I repeat that all the time. So whether we can create models that are useful is the target of what my team is trying to do. And with that being said, I also want to mention that there are different statistics that we can look at. So here, this is something that's pretty common in data science teams as we're looking at machine learning models, something called a confusion matrix.
So here, since we do have ground truth data, we have the predicted label that came out of our model,
and then we have the true label of all the data that we've fed into the model to see how we've done.
So the important part here is this diagonal line, you know, down and across.
This tells us how good we are at predicting each one. So if it truly is a zero, how many did we predict were actually zeros?
And then you can see here, off the diagonal, what we got wrong, basically.
So for our predicted label of zero, we can see here, for example, a true label of three, right? We have some threes that, you know, definitely look like zeros to our model, or even fours. Or in this case we have something that we know is a zero,
and then you look over here and see, oh, we predicted it as a two, or we predicted it as a four or six or seven.
And these are really important numbers for us to know, because they give us an idea of what our true positive rate is
and what our false positive rate is. Those are the things we use to judge the model.
And once again, this goes back to risk mitigation.
If we have a huge number of false positives, so people that we're categorizing as having a particular class
that they don't have, is there a cost associated with it, and is it an acceptable amount of risk for us to take?
And so those are the kinds of things that we do in our automated decision making systems.
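A confusion matrix and the per-class rates mentioned above can be computed with scikit-learn and NumPy. This is a sketch under assumptions (the digits data, a default decision tree, fixed seeds), not the code shown on screen.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0
)
y_pred = DecisionTreeClassifier(random_state=0).fit(X_train, y_train).predict(X_test)

# scikit-learn's convention: rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_test, y_pred)

# The diagonal holds the correct predictions for each digit.
correct = np.diag(cm)

# True positive rate per digit: of the images that really are digit k,
# what fraction did we label k?
true_positive_rate = correct / cm.sum(axis=1)

# False positive rate per digit: of the images that are NOT digit k,
# what fraction did we wrongly label k?
false_positives = cm.sum(axis=0) - correct
actual_negatives = cm.sum() - cm.sum(axis=1)
false_positive_rate = false_positives / actual_negatives

print(cm)
print(np.round(true_positive_rate, 2))
print(np.round(false_positive_rate, 3))
```

Reading one row shows where a given true digit ends up; reading one column shows everything the model called that digit, right or wrong.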
With that, I'm going to pop back over to my slides and present again,
and talk a little bit about that over- and underfitting that I talked about before.
So one of the big trade-offs that we're making when we're trying to fit a model is the trade-off between bias and variance.
So in this case, what kind of bias has been introduced into our data set? Can we measure it, right?
Can we tell what sort of bias we have in there?
Do we have a sampling error? Do we not have enough data? Do we have, you know, all of these different things that can be introduced into data? Versus variance: are we able to see how far off we would be from the true values?
So what is the difference between what has actually been used to train, what we predict, and what the actual truth is?
Right. So I know this is a little bit confusing, but I do want to mention this that we're constantly in the position where we have to balance these two things.
In order to make sure that we're getting a good model. So it's not quite as simple as us picking some machine learning model off the shelf.
And applying it.
We really need to know how to sample our data appropriately, clean our data appropriately, and set up features that are going to be powerful and give us good separability, or give us really good signal.
We don't necessarily want a bunch of features that all represent the same thing. We want to make sure that we're capturing a truly representative set of the data in order to fit our model.
But at the same time, we don't want that data to be the only thing the model is able to handle. How well does it generalize, right? So if we're fitting our model to every single one of these different points, like we saw with the decision tree,
is that overfit?
Have we tailored it way too much to the data that we have? In that case, we have overfit our data.
It's not going to generalize well. And in both of these cases, we're going to see our accuracy suffer as a result when we take new data and the model has to somehow perform in the face of this brand new data that's coming in.
And so with that, I just want to mention that this can be a little bit tricky. In particular, I want to point out here the training error, which is this black dotted line right here.
So as we deal with different models and we increase the complexity of the models,
you can see that, even with the decision tree example, as we increase the complexity of the model,
so we add more depth to that decision tree, right, we can see that our error is going down on our training set.
It looks better and better and better. It looks like we're getting really, really good accuracy.
All of our training data looks really, really good.
And it's, you know, this looks theoretically like it's a really, really great model.
However, when we actually take new data and we apply it to it and we're actually trying to predict what's going to happen.
We can see that our test error is going up and up and up and up and up.
So this can be a very rude awakening. When we see our training error go down and down and down.
And we think that we have a really great model, but in fact we overfit our data.
And our actual testing error is going to go up and skyrocket.
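The gap between training error and test error can be reproduced by growing a decision tree to different depths. The sketch below is an illustration under assumptions (digits data, the particular depths, fixed seeds) and is not the figure from the slides. On this particular data set the test error may flatten out rather than climb steeply, but the training error still drops to zero while the test error does not.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0
)

# Increasing depth means increasing model complexity.
# None lets the tree keep splitting until every leaf is pure.
depths = [1, 2, 4, 8, 16, None]
train_error, test_error = [], []

for depth in depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # Error is one minus accuracy on the respective data set.
    train_error.append(1.0 - tree.score(X_train, y_train))
    test_error.append(1.0 - tree.score(X_test, y_test))

for depth, tr, te in zip(depths, train_error, test_error):
    print(f"max_depth={depth!s:>4}  train error={tr:.3f}  test error={te:.3f}")
```

Comparing the two columns is the practical check: a model whose training error keeps improving while its held-out error stalls or worsens has started memorising the training set.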
So going back to what I was talking about before, you know, that model may not end up being particularly useful.
It may not end up actually fulfilling the business requirements of what it is that we're trying to do because it just doesn't generalize well to the new data that we're going to be throwing at it.
And that can, you know, happen,
like I said, for a number of reasons: we're not using the right model, we're not sampling our data right,
or we're adding more and more,
we're tuning more and more variables in our model to the point that, you know, it's grabbing on to every tiny little bit of signal that's in our training set.
And so we're constantly trading off these different concerns to see whether or not we can get the best model possible.
And we try to be very, very methodical and very careful about it to make sure that we're producing a model that we can actually use and that really helps with decision making.
And with that, I also do want to mention a little bit about machine learning bias.
Since this is something that you see in all data, as we move through our data journey here at Cloudflare, I think that you'll see that, you know, these types of bias really stand out.
You can see them pretty quickly.
And some of them are, you know, much more nefarious than others. I just want to mention here that there are many, many forms of bias, but these are some of the ones I wanted to really point out, just because these apply mostly to machine learning, and especially since we are trying to find representative populations.
A lot of these are applicable to population size, but I do want to mention, you know, you'll see this quite a bit even in analysis, with something like confirmation bias, where you have an idea of what the answer should be and the data is not telling you what you want, what matches your intuition.
It's really, really easy for us to say, oh, there's a problem with the data.
Oh, there's something wrong with the way that we're looking at this data. Or, you know, we're more likely to accept data that happens to confirm what we already intuitively think about the problem domain.
And that can be a really important one for us to check ourselves on as we're looking at new data sets. But here I did want to mention, you know, selection bias.
Right. This is a really important one.
So if we're selecting points out of our data set that do not really reflect the real-world distribution of what that data really looks like, in its totally, you know, unfettered, unmitigated, wild distribution, then we're going to have some bias in our machine learning model.
You can see that that happens quite a bit with sampling error.
And since a lot of times we're talking about big data, it can be really tempting
to sample down our data, and depending on how we sample it down,
we may, you know, just randomly not get the points that are necessary for us to really fit a model to that data.
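One common way to guard against losing the important points when sampling down is to stratify the sample on the label. The session does not say which technique the team uses, so this is a sketch of one option only, with made-up data: 1,000 rows of which 5% belong to a rare class.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data set: 50 rare-class rows (label 1) and 950 common rows (label 0).
y = np.array([1] * 50 + [0] * 950)
X = np.arange(1000).reshape(-1, 1)

# Keep 100 rows. stratify=y asks for the same class proportions in the sample
# as in the full data, so the rare class cannot vanish by bad luck.
X_sample, _, y_sample, _ = train_test_split(
    X, y, train_size=100, stratify=y, random_state=0
)

rare_fraction_full = y.mean()
rare_fraction_sample = y_sample.mean()
print(rare_fraction_full, rare_fraction_sample)
```

Stratifying only protects the proportions of the column it is given; bias along any other dimension of the data still has to be checked separately.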
Another one is reporting bias. This occurs when the frequency of events, properties, and outcomes in the data does not actually reflect the real-world
frequency, and you see this quite a bit with different systems where people have to manually enter data.
So do they take the time to actually, you know, add that bit of information, which may or may not be so important that
it actually changes the outcome of the decision making system? And automation bias, you know, on the other side: we do have a tendency to favor results that are generated by automated systems,
Because they're, they're just much more standard, they're more defined, they're easier for us to parse.
It makes it a lot easier for us to do our jobs when a machine is creating data for us.
But, you know, it may not always be possible.
There may need to be some free flowing text.
I think the best example that I can come up with is our customer support tickets, where it may be too hard for us to characterize in advance what the problem is that we're looking at.
Is a fixed set of categories really the whole set of problems that we're ever going to face?
Or do we need to provide a space for people to actually tell us more about what's going on with them and why they need help.
And then the group attribution bias is another one I wanted to mention.
That's that tendency to generalize what is true of individuals to an entire group.
So, for example, in our clustering example, let's say that we took one or two points, not an average of points, not a representative sample of points, but just one or two.
And then we just assume that the characteristics of those one or two represent the entire group in that cluster.
That can create a lot of problems. We don't have enough data to know whether or not those two are able to characterize everything in the group.
Whereas what we can say is that the group overall can characterize those points, which I know is very confusing.
Right. So the relationship is only in one direction. It doesn't go in both.
And so we have to be really, really careful about how we look at those groups and how we apply that data across them. And then with that, I do want to mention a little bit about what a data science team does.
I think that, especially out in industry, you know, data science teams are becoming a lot more ubiquitous.
It's quite likely that a lot of people are going to have to interact with the data science team over time.
And so I love to kind of walk through what the process looks like.
And I know that each one of these boxes makes it seem like this is a beautifully defined, perfect process that moves from one box to the next without any kind of problems involved.
Right. But that's not necessarily the case either.
First, we like to start with understanding what is the business problem that we're trying to solve.
Can we quantify it? Do we know whether or not we're going to be successful?
What is the KPI that we're trying to affect? What is the, you know, what is the ground truth?
How do we know we were successful?
And can we quantify the data itself? Do we understand whether or not we are able to actually solve this business problem?
And is it a good problem to apply machine learning to? And with that, I do want to mention that machine learning is not a silver bullet.
It's a ton of work, actually. And so we have to make sure that we're picking the right problems where we get the most bang for our buck.
And that's a really important one, where, you know, some people think of it as like, oh, it's just so easy for us to just pull all this data and we're going to train up a model and like, bam, we have the answer.
That's not the case. We're going to do a lot of tuning.
We're going to do a lot of cleaning. We're going to do a lot of analysis.
And we're going to be dealing with gobs and gobs of data and we're going to try and put it in an automated form.
And that can cause, you know, some surprises for people when they realize just how much work is involved here, which these other boxes try to characterize, right? We're collecting data.
We're trying to understand our data.
We're trying to prepare our data and take out things like dropouts and outliers.
We're trying to understand how we can sample our data.
We're trying to understand better how much signal we have in our data. And we'll spend a lot of time with it.
And it's actually pretty funny, having run data science teams for a long time now.
Most data scientists become so familiar with the data that they can tell you almost everything about it.
It's pretty cool actually to watch that happen.
But they'll start to notice really quickly, you know, the customers that pop out to us or the instances that pop out to us really quickly or the ones that come up or the cases that pop up over and over and over again.
Believe it or not, moving on to the next box. The data modeling part is usually the fastest part, which is not, you know, the way things used to work.
Having been in this field for so long, there was a point in time when we had to spend quite a bit of time writing our own algorithms, designing our own algorithms, trying to figure out how we were going to capture all of the signal.
And now we have these really powerful libraries like scikit-learn and XGBoost, you know, and on the deep learning side we have incredibly powerful algorithms that allow us to train models really, really quickly.
So we don't have to write the code over and over and over again.
And then also model evaluation, which is a little bit more automated now so we don't have to write functions, you know, for those confusion matrices and accuracy statistics over and over again.
And then we move over to the deployment part, which is the automation piece.
So we have a pipeline where you can pull in more data, get scores quickly and bring them out so that people are able to make decisions quickly.
And then the iteration process. What we see when we put machine learning models out into production systems is they can drift.
And so over time, you'll see them become less and less accurate because people's behaviors change.
Conditions change, things move forward. We may have new products.
We may have new external conditions. I think that all of us can agree that our lives have changed, you know, quite a bit because of the pandemic.
How is that reflected in the data?
Do we see an effect or do we not? And so we may need to iterate on the model and retrain on new data that comes in.
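A drift check of the kind described can be as simple as comparing accuracy on recent labelled data with the accuracy measured at deployment. The function name `needs_retraining`, the tolerance, and the toy data below are all hypothetical; this is a sketch of the idea, not a description of Cloudflare's pipeline.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier


def needs_retraining(model, X_recent, y_recent, baseline_accuracy, tolerance=0.05):
    """Hypothetical check: flag the model when accuracy on recent data
    falls more than `tolerance` below the accuracy measured at deployment."""
    recent_accuracy = accuracy_score(y_recent, model.predict(X_recent))
    return recent_accuracy < baseline_accuracy - tolerance, recent_accuracy


# Toy "old world": small values mean class 0, large values mean class 1.
X_old = np.array([[0.0], [1.0], [2.0], [3.0]])
y_old = np.array([0, 0, 1, 1])
model = DecisionTreeClassifier(random_state=0).fit(X_old, y_old)
baseline = accuracy_score(y_old, model.predict(X_old))

# Toy "new world": behaviour has changed and the relationship has flipped.
X_new = X_old.copy()
y_new = np.array([1, 1, 0, 0])

drifted, recent = needs_retraining(model, X_new, y_new, baseline)
if drifted:
    # Iterate: retrain on the data that reflects current behaviour.
    model = DecisionTreeClassifier(random_state=0).fit(X_new, y_new)

retrained_accuracy = accuracy_score(y_new, model.predict(X_new))
print(drifted, recent, retrained_accuracy)
```

A check like this needs fresh ground-truth labels, which often arrive with a delay, so real monitoring usually also watches the input data itself for shifts.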
And with that, we only have a couple minutes left.
So I'm going to ask now if there are any questions that have popped up on Live Studio. Let me pull up my thing here
And see if any questions have popped up. I do not see any that have popped up so far.
So give it a minute or two. I guess. But like I said, just to kind of wrap up, you know, before we move on to the next talk.
You know, machine learning is an incredibly exciting field and it's just kind of scratching the surface of how I think one day we're going to see artificial intelligence in general impact our lives.
We're going to see it, you know, creeping in more and more and more.
And I think we've even already started to see some of the effects of what's happening, as the landscape has changed so much and more and more data has become available. And so, you know, hopefully we'll get to a point where we can understand it better and we can make better decisions and hopefully make a positive difference, rather than, you know, just moving blindly into the world and, over and over again, making the same mistakes or making the same decisions in the face of bad data.
So with that, I'm going to go ahead and stop since I don't see any questions and really thank you for the time that you spent with me.
Like I said, I know this is a whirlwind. So I appreciate it and have a great day.
Thanks. Transcribed by https://otter.ai