10 Ways to Lie with Data
Presented by: Katrina Riehl, Chandra Raju
Originally aired on June 29, 2023 @ 10:30 PM - 11:30 PM EDT
Join Katrina Riehl and Chandra Raju from the Business Intelligence team as they discuss how bias can creep into data and create misleading statistics. As businesses become more reliant on data, it's important for people not to fall into these common traps when using it.
Transcript (Beta)
All right, everybody. We're getting started on our segment here. I'm Katrina Riehl, and this is Chandra Raju.
We're both from the business intelligence team at Cloudflare, and we're going to present on 10 different ways to lie with data, which I know is a great title, but at the same time, a very important topic.
Since both of us work with data all the time, we hear this, you know, from a lot of different people that, you know, data doesn't lie, right?
If I had a nickel for every single time I heard that, I would be a very, very rich woman, and I think Chandra would too.
Well, rich man, but at the same time, since we do work with data all the time, we do have to talk about some of the caveats when we're working with data.
And especially in this case, when we're looking at misleading data. So a lot of times when we look at data, we don't think about the origins of the data.
We don't think about the provenance or governance of the data or how it was generated.
And sometimes the way that data is put together, or the way that it's selected, can create some bias in how we are looking at the data and skew results in ways that may be misleading or even, sometimes, dangerous.
So with that being said, go ahead and we can get started and do some proper introductions.
Share my screen real quick.
All right.
So getting started. Like I said, my name is Katrina Riehl. I run the data science team here at Cloudflare.
And what that means is my team primarily focuses on machine learning.
So for those who aren't familiar with machine learning, we use algorithms that basically take in a huge amount of data and then we learn from it.
So the idea is to create a decision making system or a predictive system that will basically give us some insight into what's going to happen in the future.
So the idea is to mine the past in order to be able to say what's going to happen in the future.
So as you can imagine, something like bias or mistakes in the way that we look at the data can very much interfere with this process, which we'll get into in great detail in the next hour.
So with that, Chandra, can you please introduce yourself.
Thanks, Katrina. Thanks for the quick intro. Thanks so much.
I'm Chandra. I'm also part of the business intelligence team. I manage the data engineering team within BI.
Our team primarily focuses on building data pipelines.
When I say data pipelines, we bring in data from different data sources, like network traffic and all the different data we produce, and build data pipelines to enrich and transform the data and load it into our cloud data platform.
And then we make it available for teams like Katrina's data science team, the data analyst team, and our business partners.
So that makes this data topic more interesting. When Katrina shared this topic, I said, look, yeah, this is a great topic to have.
Then, how can the data engineering team help avoid bias in the data when the data science team consumes it?
How can we closely partner together so that we remove those biases, so that when we provide data, it comes in a more generic way that teams can consume to get the right insights?
So that's our goal.
So with that, I'd love to hear more from Katrina on the different biases, how we can share those insights and make this topic more interesting.
Yeah, absolutely.
And then, you know, I just want to add on to that: really, your team is ensuring the integrity of the raw data.
Right. So at every single step of the way in the pipeline, bias can be introduced.
So there's some amount of error that can happen at every single step as we move through data ingestion, and then cleaning data (you hear a lot about ETL), and analysis, and then data science. I think that at each one of those steps we can see how bias has an effect.
And so, that being said, the next thing I just wanted to mention is that we are going to take questions.
So please, if you do have any questions, for the live audience that's watching us right now, please submit your questions.
I think there's a button on the website for you to ask questions, but you can also send things via email to live studio at Cloudflare.tv, not .com, and then we'll save some time at the end in order to take some questions.
So, given that, I do want to talk a little bit about why bias is so important and why we need to talk about it.
As a data team. Obviously, we don't want to shoot ourselves in the foot. We don't want people to think like, oh, everything that we do, you know, the data is always wrong.
There's always some kind of problem with the data. We want a healthy skepticism.
People should be aware that these are biases that should be taken into account.
We want to talk about these things and make sure that people are aware of them.
To make sure that all of us are getting the data that we need in order to make good decisions for our business.
And so when we go through each one of these steps, I just want to keep that lens on it: the reason that we're bringing all of these things to light is so that people understand the kinds of things that we're dealing with on our team in order to provide those better results.
And so one point I just want to make before we dive into all of these different things is that a lot of times data teams are in the position where we have to deliver bad news.
Okay. People may not like the data. They may not like what it's telling them. But we have to make sure that we maintain the integrity of the data and we have to make sure that we're providing results that are really useful to people and telling the truth.
And so these are all of these different factors that we take into account in order to do that.
And particularly in the data science area. I like to talk about it because we do have the problem where we're trying to predict the future.
So in these cases, if we're mining the past, if we're looking at all of this past data in order to understand what happens in the future, we can only predict the things that have happened before.
So that means that we have the situation where we could be propagating bias.
We could be in a position where we are repeating patterns that we may not necessarily want to repeat in the future.
But there has to be a certain amount of, you know, humanity that's brought into this whole thing, and a certain amount of novelty that is brought into this whole process, in order for us to be able to move forward as a business and as a society, as machine learning touches more parts of our lives.
And so one example I really like to talk about when we do talk about bias is scoring for repeat offenders in the criminal justice system. It's one in particular that shows how this can be a very detrimental process for people that are in vulnerable positions.
Right. We don't want to create a situation where we're preying upon the vulnerable.
And so what we see in a lot of different states, they've adopted machine learning models that will score whether or not somebody is going to be a repeat offender while they're in the criminal justice system.
And one of the problems that we have here is that even though these models have only maybe about a 20% accuracy, these predictions and scores influence how justice is carried out.
So at every step of the way, whether it's a bail amount, whether or not to press more charges, or whether a sentence is going to be longer or shorter, there are a lot of different things that can be influenced by the score that's coming out, right?
And even with a 20% accuracy, is that good enough for us to be able to use it to make those kinds of predictions?
And what we're finding is that these prey upon minority and underrepresented groups at a higher rate.
So in this case in particular, what we're seeing is that there's almost a two to one ratio where the scores are affecting minority populations rather than majority populations.
In 2014, former Attorney General Eric Holder actually brought this up as being a major problem in our justice system.
And it's something that has to be accounted for. But at the same time, it comes back to, okay, if we're going to use a machine learning approach like how do we account for these things in the past.
And how do we, you know, account for the bias that's creating these repeated processes in the future.
So with that rather long introduction.
Let's talk about some of the different things. And so our first topic is selection bias.
Alright, so these actually go in order of the ones that I've seen the most often.
So this is the number one bias that I see in most of the data sets that I work with, and have throughout my entire career.
So in this case, what we're doing is we're selecting data in some way, right, whether it's a group of individuals or a group of data.
Somehow we're systematically excluding parts of the population.
Right. And so when we do that, there can be also systematic errors that are associated with that.
And so there are a lot of different examples of this, and I just want to point out that one of the ones that I see, particularly in data science, is with HR systems.
This is another one I want to talk about. So, as with the bias that we were talking about before with the criminal justice system, in human resources systems we see this all the time.
We're looking at a group of our population that has been used in order to determine successful employees.
Right. But what this does is ensure that we always end up with the same kind of employees that we had in the past.
Right. So if we have issues with diversity and inclusion.
We're likely to repeat those over and over and over again because we're not getting novelty in the pool of candidates that are coming into the company.
And, you know, this may not necessarily look like a business case, but at the same time, it is, because what we see is that when we have more diverse populations on teams, we have better representation and we have better ideas coming in about our customers.
And so that's been shown over and over and over again: the diversity of your team helps increase the diversity of the population that you're looking at in order to collect data about your customers.
And so there's a real value in creating that that diversity inside of your system.
So that's just one of those kinds of examples. And with that, I just kind of want to turn it over to Chandra because I know this is something that you see quite a bit as well.
Yeah, that's a great explanation. I agree, selection bias is one of the key topics in how a data science model gets biased and how that has an impact.
I'd like to start with a thought about any data.
What you do with the data is not the most important thing, right?
What kind of data you pick to build your model is more important. Sometimes we say, okay, I have a lot of data, let me use that.
But is it the right data to pick? Does it cover everything I want?
That's where data science looks at everything and picks what makes sense to build a model output, and makes sure that model output is usable for whatever use case you want to implement.
I think that's where selection bias plays a very key part in how this can impact the outcome of the model.
I'd like to take an example: online articles. We have a lot of events going on.
This is an election year, so people are going to vote.
There are a lot of online articles coming in, through Facebook or social media or different news channels, but they're all biased. Each one has their own way of representing the news, the way they think is right.
So, are we doing some research on our side to pick the right news, so that I'm not biased toward one specific medium?
Let me do my research and pick the right data set, so that I know what the right data set to use is.
That way I understand what I want to do, rather than trusting somebody to influence me and say, okay, select this data, then you will be doing good.
So my emphasis would be on spending more time on research.
When we talk about the data science team and the data engineering team, how do we partner together and work with different business stakeholders and data SMEs, the people who live and depend on data, to do some research and know what the data is about before even starting any work? That way we are clear that we're going in with an open-minded approach to look at the data and bring in whatever we need for our model to be successful.
Absolutely. I'm glad that you mentioned social media.
Actually, I do have some statistics around that that I wanted to mention. Even for Facebook, for example, if you're mining Facebook and you're seeing all these news articles that are popping up in your feed.
I think it's really important for people to understand that they've done studies and about 7% of the users on Facebook actually provide 50% of the content on Facebook.
Right. So that's a very real number that you can cite and look at.
And if you think about that, right, 7% of the population on Facebook. So we have the problem, and actually this gets into our next topic, which we'll talk about in a minute.
People self select in order to be on this platform in the first place, which is a type of selection bias.
Right. And then on top of that, we have just a tiny percentage of the people who are choosing to be on the platform providing most of the content.
So it's not a good representation of the world, which I think we've all seen over and over and over again.
There are hundreds of millions of examples, especially in this election cycle right now, for people to talk about the different bubbles that occur.
And like I said...
Go ahead.
No, no, right, to the point you're saying: humans can do it.
They're not stupid. It's that we get too lazy to do the research.
We don't spend time to do the research.
We think everything is available, so we'll consume it and go with it. I think that's the kind of mentality that we all need to remove. Maybe do more research, know more about the data, what you want to do with it, and how you want to consume it.
Yeah, absolutely. And, you know, like I said, you know, selection bias is kind of a pretty broad topic.
And I think one of the ones that I did want to point out was self selection bias as well, which I mentioned when talking about social media platforms, where people select that they want to be on the social media platform.
So in this case, self selection bias is when participants choose whether or not to be a part of the group that's being used as a population.
Right. This may or may not actually represent the population that you're trying to analyze.
Right. And in a lot of cases, it doesn't. Not to pick on marketing as much as I probably do.
But this is where we see a lot of examples of this, right? Even at Cloudflare, we have people who are subscribing to newsletters, or looking at our white papers, or spending time on our website, things like that.
They may be bought into Cloudflare: they are loyal customers who are part of everything that we do, they are responsive to our emails, and they're actively engaged with us.
Right, sending out a questionnaire to that group of people, who select themselves to be loyal customers of Cloudflare, may not be the best way for us to understand our entire population of customers, or potential customers, for Cloudflare.
Right. So it's very easy for us to draw conclusions from that group of people that may not generalize to the wider population.
Right. And, you know, I think that that's also something that you've probably seen as well.
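As a rough illustration of the questionnaire example above, here is a minimal Python sketch. Everything in it is invented for the example: the population, the 0 to 10 satisfaction scores, and the assumption that happier customers are more likely to respond. It only shows the general mechanism of self selection, not any real Cloudflare data or process.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: one satisfaction score (0 to 10) per customer.
population = [min(10.0, max(0.0, random.gauss(6.0, 2.0))) for _ in range(100_000)]

# Assumption for illustration: the chance of answering the survey grows with
# satisfaction, so respondents select themselves into the sample.
self_selected = [s for s in population if random.random() < s / 10.0]

# A random sample of the same size, drawn without regard to satisfaction.
random_sample = random.sample(population, len(self_selected))

true_mean = mean(population)
self_selected_mean = mean(self_selected)
random_sample_mean = mean(random_sample)

print(f"whole population:   {true_mean:.2f}")
print(f"self-selected:      {self_selected_mean:.2f}")
print(f"random sample:      {random_sample_mean:.2f}")
```

Under these assumptions the self-selected average comes out noticeably higher than the population average, while the random sample stays close to it.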
Right. Agreed. I know we work with the marketing team in a lot of ways on how we inform customers when upselling a product, or how we want to communicate information about a product to customers.
It goes to a certain set of people. Okay, they use this product, so this is the right candidate I can target. I would say it's based on the use case, how it could evolve.
We've been working with different stakeholders to see how we can expand that to a broader audience, with an A/B testing kind of method.
Okay, I can pick one set of audience who are real users, and maybe another set of audience who are separate, not part of that group, and see how that evolves.
I think a lot of the biases you've talked about can be removed that way: working with the stakeholders, explaining how this is impacting the overall objective, and giving our partners that perspective helps us lay out what options are available and make their business case a success.
Yeah, I'm actually really glad that you brought up A/B testing.
So, for people who aren't familiar with that, we create an A/B test by selecting one group that gets some factor, the A group, and then you have another group that doesn't get that factor, the B group, right?
You have to make sure that you have a steady population and then you introduce an effect.
So in this case, say we have a targeted marketing email.
We want to see how it affected this total population.
Right. And we have to make sure that this is representative, and then we can actually measure the total effect of this new factor that we've introduced into the system.
So it is very scientific, and it's something that's very measurable from a mathematical standpoint.
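To make the A/B idea concrete, here is a small Python sketch of a simulated email test. The customer IDs, group sizes, and response rates are all made up for the illustration, and the two-proportion z-test is just one common way to measure the effect, not necessarily the method the speakers' team uses.

```python
import random
from statistics import NormalDist

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical customer IDs. Shuffling before splitting gives a random
# assignment, so the two groups should be comparable before the email is sent.
customers = list(range(20_000))
random.shuffle(customers)
group_a, group_b = customers[:10_000], customers[10_000:]  # A gets the email

# Assumed "true" response rates, invented for this simulation only.
RATE_WITH_EMAIL = 0.12
RATE_WITHOUT_EMAIL = 0.10

hits_a = sum(random.random() < RATE_WITH_EMAIL for _ in group_a)
hits_b = sum(random.random() < RATE_WITHOUT_EMAIL for _ in group_b)

rate_a = hits_a / len(group_a)
rate_b = hits_b / len(group_b)
lift = rate_a - rate_b  # measured effect of the email

# Two-proportion z-test using a pooled standard error.
pooled = (hits_a + hits_b) / (len(group_a) + len(group_b))
std_err = (pooled * (1 - pooled) * (1 / len(group_a) + 1 / len(group_b))) ** 0.5
z_score = lift / std_err
p_value = 2 * (1 - NormalDist().cdf(abs(z_score)))

print(f"A: {rate_a:.3f}  B: {rate_b:.3f}  lift: {lift:.3f}  p-value: {p_value:.4f}")
```

The random assignment is the important part: it is what lets the difference between the groups be attributed to the email rather than to who chose to be in each group.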
And I'm glad that you brought that up because it's one of the best ways, especially with selection bias and self selection bias.
It's one of the best ways for us to mitigate that. And we'll definitely talk a little bit more, if we have time, about mitigation strategies as we move along. But especially for machine learning models, A/B testing is essential in order for us to understand what's going on, especially when we work with models that are black boxes, right, where we can't really rely on our intuition. But I can talk about that one forever.
So I'll move on to our next bias.
So the next one I want to talk about is confirmation bias. And this is another one that I see just constantly.
In this case, and it can be very innocent, people are looking for data that supports their intuition.
Right. They want something that is in favor of whatever they think is going on.
Right. And so it's selecting for the evidence that supports it and ignoring the evidence that does not support it.
Right. And that can take a lot of different forms, where you remove that data or say, okay, that doesn't apply here, but it skews your data quite a bit.
Right. And it will create analysis that really doesn't jibe with the actual world, and you end up with conclusions that are not going to be helpful when you're looking at the total population.
Yeah, if you want to take it from there.
That's a very nice way to put it. You know, we go in with a predetermined assumption.
Okay, this is the expected outcome.
How can I bring that outcome to the person who asked me for that information?
I would say we've been doing more of a transactional, reporting kind of thing, rather than asking how whatever data we give can make sense.
Okay, what is the methodology? Are we doing it the right way? How do I do my research to make sure this is the right approach?
Does what I've been doing give me the intended result, or do I come up with a different outcome that opposes whatever I've been asked to show?
I think it's more of a data-driven approach, rather than one based on, okay, I want to implement a cool technology, I want to get this thing done with whatever data I have.
In a data-driven culture, I know the problem and I know what is expected, but how do I prove whether that is correct or something else is correct?
How do I bring in both sides of the picture, rather than just implementing it my way and not thinking about the actual outcome, what is really coming out of the data?
I think you're right. Like you said, with the assumption you're always blindsided by what you wanted, and you don't get a complete picture of what the data is telling you.
Exactly. And this actually happens a lot.
Also, I was mentioning that a lot of times we deliver bad news.
Right. You know, the data is telling us this, and then somebody's like, no, no, no, that can't be, you need to go back.
But people don't take the effort to say, okay, this can be bad news.
We say, no, no, we should convey what they expected. Let's get them what they wanted, rather than taking those extra steps and proving our capability.
Okay, can we tell them that there's bad news? How can we share more information about it?
Yeah, that's why I'm saying data culture is very key.
Yeah, so a data-driven culture will be very crucial for getting away from this bias.
Yeah, I completely agree. And there are actually some red flags that I use in my team that I tell people about when the danger of confirmation bias starts coming up.
So I'm going to share those really quick. So, at the beginning of the project.
If you hear your stakeholders say something like, do you have evidence that supports blank.
If you hear, find me data which shows blank, right, that is a huge red flag. That is one where you have to be really, really clear that you need to find objective criteria in order to be able to do this analysis and look at this data. It cannot be a foregone conclusion that you then work backwards from.
Right. And then in the middle of a project, if you hear somebody say, okay, what if we just restricted the data to the last three months.
Or what if we restricted the data to just 24 hours, or what if we just looked at the United States only, right, those are the kinds of things that are also red flags that possibly some confirmation bias is coming in.
Sometimes it's appropriate to make those restrictions, by the way.
But it's a red flag to watch out for if you are offering that advice because the data doesn't give you the result that you want.
And then at the end of a project, one of the big red flags I ask people to look for is, can you recast the data in a different way?
If people just completely out of hand dismiss your conclusion and say, hey, can you redo this so that it says this, right, obviously that is a huge red flag of, hey, no, we have some confirmation bias going on here.
And the integrity of our team actually suffers if we were to bend to that kind of pressure. And so it's a big deal.
So I have a follow up question. If at some point you're working on a project and confirmation bias has kicked in, how do you detect that?
How do you say, okay, I know there's confirmation bias, I was able to detect it at that point?
What kind of methodology do you use?
Well, that's actually a really great question.
We look at a lot of different things in order to detect bias, just in general.
So this applies to all of the different biases that we talk about here, but repeatability of tests is a big one.
Right. So if we do take a randomly sampled set of data and we randomly sample a second time and we come up with a wildly different result.
We can probably say that there is some sort of bias there. Right. We look to see what is the consistency as we move through the data.
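The repeatability check described here can be sketched in a few lines of Python. The data below is synthetic (an invented latency-like metric), and the sample size and number of repeats are arbitrary choices for the illustration. The idea is simply to draw several random samples and see whether the estimates agree to within roughly what sampling noise would explain.

```python
import random
from statistics import mean, stdev

random.seed(2)  # fixed seed so the sketch is repeatable

# Hypothetical metric for 50,000 records (for example, a latency in ms).
records = [random.expovariate(1 / 120) for _ in range(50_000)]

SAMPLE_SIZE = 2_000
REPEATS = 20

# Re-run the same estimate on independent random samples.
estimates = [mean(random.sample(records, SAMPLE_SIZE)) for _ in range(REPEATS)]

# How far apart are the repeated estimates, compared with the standard error
# that random sampling alone would be expected to produce?
spread = max(estimates) - min(estimates)
expected_se = stdev(records) / SAMPLE_SIZE ** 0.5

print(f"estimates range from {min(estimates):.1f} to {max(estimates):.1f}")
print(f"spread: {spread:.1f}, expected standard error: {expected_se:.1f}")
```

If the spread were many times larger than the expected standard error, that would be a hint that something other than random sampling is driving the result.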
Right. And that's a really, really big one, especially with confirmation bias. We also make sure that our A/B tests are powered appropriately.
We can't come up with statistical significance if we don't have enough of the population, or we don't have enough of the effect, for us to be able to actually measure it.
Right. And so if we don't have a properly powered test, then we know that there's a bias there, because we never reached statistical significance.
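One common way to check whether a test is "powered appropriately" is to estimate the sample size needed before running it. The Python sketch below uses the standard normal-approximation formula for comparing two proportions. The function name, the 10% and 12% response rates, and the default significance level and power are assumptions chosen for the example, not values from the talk.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-proportion test.

    Uses the normal approximation:
        n = (z_{1 - alpha/2} + z_{power})^2 * (p1(1 - p1) + p2(1 - p2)) / (p1 - p2)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical scenario: detect a lift from a 10% to a 12% response rate.
n_needed = sample_size_per_group(0.10, 0.12)
print(f"customers needed in each group: {n_needed}")

# A smaller effect needs a much larger population to detect.
n_small_effect = sample_size_per_group(0.10, 0.11)
print(f"for a 10% to 11% lift: {n_small_effect}")
```

With these inputs the first case needs a few thousand customers per group, and halving the effect size pushes the requirement up by roughly a factor of four.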
Right. But I mean, there are a lot of other different things, especially with confirmation bias.
That's one in particular, like I said, where you may want to ask yourself, did somebody influence me by asking one of these questions?
Right. Am I sure of this? If I slice and dice this population differently, am I going to come up with a totally different conclusion? That's one of those things that you look at.
Right. Yeah, absolutely. So moving on to number four, repetition of erroneous data.
This is one of my favorite ones, actually, because I think every company I've ever worked for has a certain amount of folklore associated with it right Where you hear numbers that are repeated over and over and over and over again until they sound like they're correct.
Sometimes these numbers may change and people just don't realize it.
Maybe there was some sort of mistake in the methodology that was used to create the numbers and it just gets repeated so often that everybody takes it as fact.
Or there were assumptions made in the calculation of that number that no longer apply.
Right. And so one of my favorite examples of this one: I think everyone has heard that humans only use 10% of their brain.
Right. I think that everyone has said that over and over again.
I've heard it everywhere. And I've heard it used to justify everything from, obviously ESP is real, to, some people are obviously smarter than others if they somehow use more than 10% of their brain.
But there's absolutely no fact to this whatsoever.
There's no data that supports that at all. It's actually even really hard to figure out who said that in the first place.
It's just one of those things that's passed on and repeated so often that everyone just takes it as fact.
Whereas neurobiologists and neuroscientists have done PET scans and MRIs and fMRIs and all of these different tests that show consistently that humans use 100% of their brains all the time.
Okay. And so where this folklore came from and why it's repeated so often is a mystery.
Right. But it's taken as fact all the time. And so when we talk to our customers, when we talk to people in our company, sometimes people will just come back and say, okay, well, here's this random fact that we all take as true.
Right. It's really hard to argue with that.
First off, because so many people, the majority, are repeating it. But at the same time, it's good for us to question those assumptions from time to time and make sure that we are all on the same page.
So, I think that's one of the key challenges.
There's a lot of data where we don't even know, at some point, whether there's a data error.
Okay, if there's a data error, do we spend time to investigate the data?
Do we spend time to talk to our data SMEs or data producers who provide the data?
Again, it's timelines. We say, oh yeah, this data is there. Okay, let's go and pull it.
So we get the approval and make the data available to your team.
But what is the data? How is it produced? Do we see any data drops? Do we see any calculation that was done incorrectly?
Are we capturing all the data we wanted? Is our data misrepresented in some way so that it doesn't suit data science use cases?
At that level, I agree, we don't spend that much time looking into the impact.
I think there are some patterns that can be developed.
Looking at more use cases, maybe we can see what ways there are to detect this pattern of erroneous data, and how that can be included in a process when we bring in data for any consumption, so that it can be avoided.
It's good to hear your perspective on how it impacts the model.
Now I feel the importance. Okay, what are the steps we have to take as a data engineering team to make sure that it's not impacting any of your data science modeling work?
Yeah, that's actually, that's a good point.
Also, I'll talk about this in the context of another false assumption, or false data, that is repeated over and over again.
And it's in the methodology of how this statistic is calculated.
So I think everyone has heard that in former generations life expectancy was shorter.
Right. So, I mean, that's just one of those things that is just taken as fact, and it's true.
The average life expectancy was definitely shorter than it is today. Right. However, if you look at the methodology of how that number is being calculated, you have to question that, because the minimums and the maximums of lifespan have not actually changed for humans.
So, you know, from zero to 110 or whatever, these minimums and maximums have not changed.
However, the average life expectancy takes into account infant mortality rates.
Right. So infant mortality rates have changed quite a bit over the years.
And so if you have that many zeros dragging down your average, it makes it look like, you know, we weren't living quite as long.
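A quick worked example shows how much the zeros matter. The ages and mortality counts below are invented purely for the arithmetic and are not historical data: the adults in both cohorts live exactly as long, and only the number of infant deaths differs.

```python
from statistics import mean

# Invented adult ages at death; their average is 70.
adult_ages = [62, 65, 68, 70, 72, 75, 78]

# Hypothetical cohorts: the same adults repeated ten times, plus infant deaths
# recorded as age 0. Only the number of infant deaths differs between them.
historical = [0] * 30 + adult_ages * 10   # 30 infant deaths among 100 people
modern = [0] * 1 + adult_ages * 10        # 1 infant death among 71 people

print(mean(historical))                    # 49
print(round(mean(modern), 1))              # about 69
print(mean(a for a in historical if a > 0))  # 70, adults only
print(mean(a for a in modern if a > 0))      # 70, adults only
```

The headline average jumps by about twenty years between the two cohorts even though nobody who survived infancy lived any longer, which is why keeping the underlying data next to the summary statistic matters.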
Right. And a lot of different conclusions can be taken from that.
So the reason I bring this one up is that, in order to mitigate that, it's so important for us to be able to capture data from the intermediary steps, right? We actually have to have a clear picture of each one of these different things.
So if we do have a calculated statistic that we take as fact, it's important for us to capture that, but we also need to capture all the underlying data that was used in order to calculate it, because some assumptions may change.
And those are assumptions that are really important for us to be able to draw conclusions in the future about what we're looking at.
Right. And so from data engineering perspective, you know, I'm sure you've heard.
Well, I know you've heard from me when you're like, what kind of data do you want to keep and I'm like, all of it.
Like, of course. Like, what are you asking me, of course I want all of it.
Right. Everything right Right. Well, I am understand that critical is when I looking at researching a more of a bias.
It clearly says like Get every data, data science can figure out what data they want.
If they need it, we can then use it in a way it makes meaningful information.
That's what we want to like get the data, all the data wanted in a way that's useful for data science.
Absolutely. What you talked about is the right way to say like, okay, if we know there's an error at some point, what level of investigation like data producers can do and what kind of investigation that data consumers can do and how we can mitigate that error in this data.
Yeah, absolutely. And so, yeah, I will continue to do that, by the way, where every time you ask me how much data we want,
I always say as much data as you've got for as long as you can keep it.
That's because we store more and we save more, right?
Right. We have plenty of data. So, number five.
Let's move on to cause-effect bias. And this is a really, really important one because this happens all the time. You know, the big takeaway here is that correlation does not imply causation.
And so the example I'm going to bring out here is one that I think is really easy for anyone to understand: there is a very real mathematical correlation between ice cream consumption and drowning deaths.
Okay.
Like, that's something that you see all the time, right? They are absolutely mathematically correlated, and they're directly correlated: as ice cream consumption goes up, drowning deaths go up.
Right. However, what this does not take into account is that this happens in summertime, right, when the temperatures are hot and people tend to eat more ice cream. They also tend to go to swimming pools and watering holes and, you know, they tend to swim more often in hot weather.
Right. So that's really the underlying factor. However, it would be ridiculous for us to say, like, oh, drowning deaths have, you know, gone up,
so obviously our ice cream consumption is going to go up. Or, wow, our ice cream consumption has gone up,
so obviously drowning deaths are going to go up as well.
Right. So we're missing the actual causation here as opposed to the correlation, and causation and correlation will not always follow each other.
It's just an observation, if that makes sense.
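The ice cream example can be sketched in a few lines of Python. Everything here is simulated under assumptions made purely for illustration: temperature drives both series, and neither series affects the other. The coefficients, noise levels, and helper names (`pearson`, `residuals`) are hypothetical. The sketch shows a strong raw correlation that largely disappears once the effect of temperature is removed from both variables.

```python
import math
import random

random.seed(0)  # fixed seed so the simulation is repeatable


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def residuals(y, x):
    """What is left of y after removing a straight-line fit on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


# Assumed data-generating process: temperature is the only real driver.
n = 2000
temperature = [random.uniform(0, 35) for _ in range(n)]
ice_cream = [2.0 * t + random.gauss(0, 5) for t in temperature]
drownings = [0.5 * t + random.gauss(0, 3) for t in temperature]

raw = pearson(ice_cream, drownings)
adjusted = pearson(residuals(ice_cream, temperature), residuals(drownings, temperature))

print(f"raw correlation:                  {raw:.2f}")
print(f"correlation after removing temp.: {adjusted:.2f}")
```

The second number is a partial correlation. It only helps if the confounder was measured in the first place, which is the same reason for keeping as much underlying data as possible.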
Yeah, I agree. Yeah, I think that's a very good example.
I had a very real-time example of how that impacts everyday life.
I look at loyalty programs. We get a lot of loyalty members,
and they have a spending pattern. So you base all your analysis on loyalty members, who spend more,
and on what features or products you can sell more of. But if you look at the bigger picture, loyalty members are always going to buy your product often, whatever you do.
So you can't keep targeting them and making products suited to their needs, because they're going to buy four times, five times anyway, and it just looks like good marketing was done.
How about the other audiences who are not members? They can still buy. How do we target them?
So that way, you always look at, okay, this is the effect,
this is the outcome I get from them. But you don't look at the broader picture of, okay, people who are not loyalty members can still influence your spending pattern, and how that can be considered.
I think that is what this bias does.
I think I see a lot of impact from marketing use cases. Marketing and sales.
I hate to pick on them. But that's really where we see a lot of this kind of stuff.
Right. And I just want to point out that what you're talking about is also self-selection bias.
Right. So a lot of examples fall into multiple biases. Right. Because with people who self-select to be in a loyalty program,
right, you're not looking at a representative sample.
Right. But at the same time, you do see loyalty members, you know, spending at higher rates, but they may have anyway.
Right.
And you also see that with discounts, right? You may target a particular group with discounts and say, okay, we have a higher conversion rate after we gave this discount, but they may have been people who would have already converted.
Right. So whether they're in the loyalty program or they're current customers that we know are very loyal customers or something like that,
we may not have needed to give that discount to them in order to make them convert.
It's really, really easy to, you know, take that correlation and assume that one actually influenced the other, right?
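One way to check whether a discount actually caused conversions is to compare against a holdout group from the same segment that did not get the discount. The Python sketch below uses invented counts and a hypothetical helper, `conversion_rate`. It is an illustration of the arithmetic under those assumptions, not a description of any particular company's experiment.

```python
def conversion_rate(conversions, customers):
    """Fraction of customers who converted."""
    return conversions / customers


# Invented numbers for illustration only.
discounted_loyal = conversion_rate(300, 1000)   # loyal customers who got the discount
holdout_loyal = conversion_rate(280, 1000)      # similar loyal customers, no discount
overall_baseline = conversion_rate(100, 1000)   # all customers, no discount

# Naive reading: compare the discounted group with everyone.
naive_lift = discounted_loyal - overall_baseline

# Fairer reading: compare with similar customers who were not discounted.
incremental_lift = discounted_loyal - holdout_loyal

print(f"naive lift:       {naive_lift:.2%}")
print(f"incremental lift: {incremental_lift:.2%}")
```

With these assumed numbers, most of the apparent effect comes from who was targeted rather than from the discount itself.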
Are you frozen?
Okay, I think I just lost Chandra. Just for everyone at home,
I realize that we are having some pretty bad storms here in Austin, Texas.
So we're having some Internet and power issues, but bear with us.
Hopefully Chandra will be back in just one second.
But on this same topic. I do want to mention that you can also have the opposite effect, by the way, where it's possible to have causation without correlation, which I know is very, very difficult for people to understand.
Right.
And so, you know, the easiest example for people to understand is if you're looking at one instance, right? You only have one data point that you're looking at.
Obviously, there's nothing else that you can compare it to.
So there is no correlation.
Right. But it's also possible that you're missing something you need in order to actually see that correlation; you're not looking at the right factors.
Right. And so it is quite possible to have causation without correlation, which I think is even a little bit harder for people to understand, because in a lot of ways, even in machine learning, we kind of assume that correlation implies causation, and different machine learning models especially propagate that. So sometimes that is a very easy assumption for us to make, but it's one that we have to be very, very careful of, and we have to make sure that we don't have circular logic going on where cause and effect are confused.
All right. Welcome back, by the way. Just a little fun over here in Austin, Texas today.
So I'm moving on. Let's talk about number six.
So sponsorship bias. Right. This is very much related to confirmation bias.
I just want to go ahead and call that out. Right. But you see this a lot. When sponsors come in and they expect a certain result, right, they have a tendency to try to influence the outcome.
And I think the most famous example of this that people are aware of is the tobacco industry, right, where they were trying to suppress data that showed that passive smoking
led to cancer. Right. And so, I mean, there were all these congressional hearings about it.
There were a lot of things that came out of this huge cover-up and scandal and all these other things like that.
But this is very common, right? You hear about, you know, whistleblowers who say that data is being suppressed or that data is being ignored.
And you also see this, by the way, with companies. A lot of times they ask people to sign non-disclosure agreements. They'll suppress the data around something and then silence everybody, you know, under threat of legal action if they do say something.
And so this one is a little bit, you know, more harsh.
I think in general cases, I would even extend this into some of the more malicious, predatory examples I was talking about, or systemic racism that gets perpetuated, where you do have sponsors who will suppress or distort data in order to support whatever outcome or whatever conclusion
they're trying to push.
So I agree.
Like, that's a very good example. I think I had another very good example, like we were discussing earlier, about documentaries. We watch a lot of documentaries.
We think documentaries share some real pain, like what people are facing in their day-to-day life, or different scenarios that happen in the environment.
And then we don't realize, okay, is the documentary sponsored by a general research group, or is it being sponsored by a company or a firm that has a political motivation behind the documentary?
So that kind of influences the message that's getting passed on by the documentary.
So that is a very real possibility. Do your research before you watch a documentary: find out the source of the data and which research group was producing the documentary, and check whether the message coming from them is an authentic one or not.
One other is, like, a silly example, I would say. I won't say it has an impact.
I like to follow Tesla RT. That's a site that tracks all the Tesla products and everything.
Right. You see a lot of buyers who look towards Tesla products and Elon Musk.
They have a big fan following for Elon Musk. So every message is always geared towards promoting the Elon Musk figure: in the EV world he changed everything, and Tesla is good. But still, you see that influence come in depending on which forum you look at the data in, because that forum is sponsored by people who are Elon Musk fans. So for each forum you look at,
you always look at it from the perspective of, okay, that person is biasing the message because he's getting sponsored by a site that wants to influence the message.
Right. Yeah, absolutely. And then, you know, I think that also kind of leads into the next thing I was going to talk about, which is that one of the best ways to try to mitigate for this is actually to do double-blind studies, right, that are more objective, run by an outside sponsor or something like that, right, some outside agency. But that also doesn't actually stop the sponsor from suppressing those results, right? They can still skew the sample.
They can still skew the results. They can still suppress the results. But you can have at least a little bit more confidence in the results
if you have something like a double-blind study.
Right. But that's, I always think that that's really interesting.
And this one, like I said, can sometimes be a little bit more insidious, right? So definitely one to keep an eye on is who's giving you that content, and definitely question things, especially things you read on the Internet.
Right. Is it sponsored content, or is it, you know, from an impartial source? Check your sources. You hear that all the time
these days: make sure that you're checking your sources.
It's just so important to understand where that data is coming from and whose voice you're listening to.
So with that being said, we'll move on to number seven.
The omitted variable bias. And this one is related once again to cause-effect, I just want to mention that.
Right. But the idea here is that you're missing some amount of data, or you're missing some variable that is actually the relevant variable that is creating the effect.
Right. And so this is something that in machine learning, you know, to repeat, what we're doing is we're looking at past data in order to be able to determine what's going to happen in the future.
If we don't have a data source that actually is giving us the real factor, right, that's helping to make that decision,
We can still create a model, by the way.
Right. We like we can create models out of junk data all day long.
Right. You know, I harp on this all the time too: it's never about the code.
It's always about the data. Right. We can produce any kind of model you want with any kind of crazy data that you have out there.
But the idea is for us to be able to craft models in such a way that they give meaningful results and generalize well, right, and are actually useful to the business problem that we're trying to solve.
But what can end up happening is that we can create a model based on all of this past data that is very accurate,
in fact, right, and we can still have missed the variable that is actually causing that outcome.
Right. Just because we have all of these data sources in here doesn't necessarily mean that those are all of the data sources that are going to be necessary
in order to actually understand what happened and why the decision was made.
And I think this is really hard for people to understand, because in other cases, this data can still be, you know, either one step or two steps away from the real data or the real cause or the real variable that you're looking for.
So you can still probably, you know, find that signal.
Right. And you can still even create a model that is accurate. But it doesn't necessarily mean that you found the thing, right, that a human would understand as the actual variable that caused that decision to be made.
Does that make sense.
It's a little... Yeah, no, no. That makes sense. Like, I know that a key variable, or a set of variables, that we omit can skew the results completely and, like I said, bias whatever decision we want to take.
I had a very funny example of how we look at data.
Right. I go to Costco, and Costco's checkout lanes are huge.
When I go and stand in line, there are a lot of lanes, right?
I pick which lane to go to, which lane goes quickly. Like, okay, let me pick the lane I want.
So I look at a few things, right? Okay, how full the baskets are, how long the line is,
what the average number of people standing in each line is.
And then I make a decision and go there. Then when I go and stand in line, I realize, oh my god, this line is taking longer.
Why did I come to this lane?
Then I realize, okay, the person who's doing the billing is a trainee, so he takes a long time to do it.
So I omitted that one variable, because that's human.
We don't look at everything. We look at the information that is available at that point, without looking any further, and so it takes longer.
That's kind of me.
I think we humans interpret the data in a way where we look at the key variables that we think are important for making the decision, and we omit others, and that impacts the process. It starts from there and impacts the complete pipeline or data modeling process, whatever your team is engaged in.
Yeah, I love this example, by the way, because everyone can relate to it.
First off. But second off, I just want to point out that the data that you did look at would, in probably 90% of the cases, in fact help you determine which line would go faster.
Right. So that's sort of what I'm talking about: the accuracy of that kind of model using that data would probably be pretty good.
Right. But it doesn't necessarily mean that we found the variable that caused that effect.
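The point that a model can be accurate and still miss the real driver can be illustrated with a small simulated regression in Python. All of the data below is generated under assumptions chosen for illustration: the outcome depends on two variables, `x1` and `x2`, that are correlated with each other, and the model is only given `x1`. The coefficients and the helper `fit_line` are hypothetical.

```python
import random

random.seed(1)  # fixed seed so the simulation is repeatable


def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (slope, intercept, r_squared)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Assumed data-generating process: y = 2*x1 + 3*x2 + noise,
# where x2 (the omitted variable) moves together with x1.
n = 2000
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.8 * a + random.gauss(0, 0.3) for a in x1]
y = [2 * a + 3 * b + random.gauss(0, 0.5) for a, b in zip(x1, x2)]

# Fit using only the observed variable x1.
slope, intercept, r_squared = fit_line(x1, y)

print(f"estimated effect of x1: {slope:.2f} (true direct effect is 2)")
print(f"R^2 of the x1-only fit: {r_squared:.2f}")
```

The fit looks good, yet the estimated effect of `x1` absorbs the effect of the missing `x2`. If the relationship between `x1` and `x2` ever changes, the model's predictions would degrade even though nothing about `x1` itself changed.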
Right. And so, exactly what you're talking about. That was based on my experience of, okay, how it happened.
So my assumption gets outdated quickly.
Say I looked at it again six months later with the same outlook.
But people join and new variables get added, and I don't update my logic to use the new variables.
So that has an impact. I need to keep refreshing my algorithm, looking for new data and adding new variables, to make sure the model gets me as close as possible.
I just want to make sure I cut it down to two minutes of standing in line.
Right. But I just want to, you know, point out that what you're talking about here also means that these are factors where, you know, you don't know what you don't know. Like, how would you know that there's a trainee right there?
Right. How would you identify that? And one of the, you know, more business kind of examples here that I think is really appropriate is talking about something like churn.
Right. So we have churn models that allow us to determine who's going to move off of our platform, who the customers are that are going to be leaving us, right? And we can create a churn model with like an 80% accuracy based on the data that we have.
Right, which is great. That's fantastic. That's good accuracy for a churn model.
It's something that we would want out in the wild and we would want to be able to use in order to make business decisions.
However, let's say that something happens that is completely beyond our control.
Right. Like, for example, one of our competitors goes bankrupt.
Right. And suddenly, all of these people either, you know, fly to our platform or are suddenly wooed away by another platform right that offer similar services.
And so this huge market event happens that we have no way of predicting whatsoever.
Right. That is something that we would never be able to detect and never be able to incorporate into our machine learning model, no matter how hard we tried.
Right. And so that is an omitted variable.
That's something that we would never be able to capture in our, in our past data.
So I just wanted to mention that as another example I think people can relate to.
Yeah, that's a very good example.
And so number eight. Let's move on to survivorship bias.
So this one I think is, you know, pretty evident just by the name, right? There's a tendency to look at, you know, the survivors.
So if there are criteria that are, you know, imposed upon a data set,
those that make it through all the different criteria are the concentration of people that we're going to look at in order to understand, you know, what is happening or what the effect is.
Right. So this is a type, once again, of selection bias, but I wanted to call it out specifically because it's one that's a little bit harder for people to understand, because it seems natural to exclude data that you have no visibility into.
Right. And so it's really, really easy to say, like, oh, well, we have a sampling error, or, you know, we applied these criteria, or something like that.
But if you don't even have that data available whatsoever,
it's completely invisible to you. That would be a survivorship bias.
Right. And so the example I really like to use for this one, by the way, is when they were doing some sort of analysis on the effects on people in the Northeast when Hurricane Sandy hit.
Right.
It was really, really easy for them to pull data from Twitter, from Facebook, from social media, from all of these different things.
Right. Where they felt like they were getting a representative sample, but they really weren't.
Right. They were getting the people who still had electricity.
They were the people who still had Internet access.
They were the people who still had, you know, infrastructure to support them being able to give this kind of information.
They're the ones who survived.
Right. They're the ones who made it through the storm and were able to give information. The people who were hardest hit were the ones that lost everything, and they were completely left out of these studies.
Right. And unfortunately, this is also another one of those examples where we end up losing the effect on underrepresented populations that tend to be harder hit, especially people
at poverty levels of income, who are more vulnerable to those sorts of events happening.
So, like I said, it's an important one for us to be aware of.
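The Hurricane Sandy example can be sketched with a toy data set in Python. The households, damage scores, and the rule that heavily damaged households cannot post online are all invented for illustration; none of this is real storm data. The sketch only shows how sampling the people who are still able to report understates the overall impact.

```python
# Invented population: 1,000 households with a damage score from 0 (none) to 9 (total loss).
# Assumption for illustration: households with damage of 7 or more have no power or
# Internet, so they never show up in social media data.
households = [
    {"damage": damage, "online": damage < 7}
    for damage in range(10)
    for _ in range(100)
]


def mean_damage(rows):
    """Average damage score over the given households."""
    return sum(r["damage"] for r in rows) / len(rows)


everyone = mean_damage(households)
visible = mean_damage([r for r in households if r["online"]])
missing = sum(1 for r in households if not r["online"])

print(f"true average damage:           {everyone}")
print(f"average among those who posted: {visible}")
print(f"households never observed:      {missing}")
```

In a real study the last number is exactly what you cannot compute from the collected data, which is why the missing group has to be estimated from some other source.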
So, yeah. That was a great example. I know in all these cases you really only get partial information; it's tough to get the full picture.
I think I had a very similar real-time example. We look at case studies; a lot of people look at case studies.
But what we always read are the cases which were successful, right? The ones we know are successes, published by big universities, Harvard or Carnegie Mellon or whoever publishes them.
But do we look at the picture of, okay, which cases used the same methodology but failed, or were not published, or were not made known to the broader audience? How does that get considered in our analysis?
We don't do that. I think that's kind of saying, okay, whoever was successful, they survived,
and they get included as part of the analysis, while whichever use cases were failures are not included as part of the analysis. That kind of skews whether, okay, this is the right process or data to use for me.
Yeah, absolutely. Yeah, perfect example. Let's see, we have about eight minutes left.
So we're going to kind of try to pick this up a little bit. Number nine.
I want to talk about the Hawthorne effect. And this is something that I think people are aware of: people act differently when they know they're being monitored. So if they know that somebody's eyes are on them,
they do act differently than when they think nobody's looking, and this is actually something that happens a lot.
I'm going to pick on sales a little bit, right? If they tell people that, you know, we're trying to collect data in order to understand performance,
right, some people may make the assumption, for example, that that means that they're going to change the baseline
of, like, what is the acceptable amount of sales.
So they've actually seen in studies that when they tell sales organizations that they're watching them and they're going to be, you know, changing this baseline, they'll slow their sales on purpose in order to affect that new baseline.
So it won't hurt them, right? It would be harder for them to do their job.
They don't want it to go up. Right. It's not serving them at all,
you know, amongst themselves, to move that baseline up.
Right. So they've shown that, you know, they can measure it, in fact: when people know that they're being looked at,
the sales may decrease, and when people think
no one is looking, it'll stay at a constant rate, right, which is usually higher. And so that's just one of those things you see over and over and over again.
Even when we're trying to get feedback for machine learning models, when people know that we're going to be looking at the outcome and at what the feedback is, they tend to give different answers than they would just off the top of their head, or they may not even realize that's an answer that we would be interested in.
Right. And then just real quick.
I did want to mention observer bias. It's related to the Hawthorne effect, but this is from the observer's point of view: observers can inject their own bias into a problem.
Right. So if they're looking at a situation and they're trying to determine what's happening,
they're bringing their own bias into the process, right, of what is happening and what they are able to observe.
And, you know, the perfect example here is that we can't necessarily read people's minds.
Right. So we have to make assumptions about what people are thinking. But on the other side of that,
you probably see this on feedback forms, right, where we define the criteria of, like, okay, is it this? Is it, you know, data not available?
Is it, you know, a bad instance? We'll predefine what something is.
And then when we have no criteria that fit the situation that we're in, you know, everything just goes into not applicable or, like, I don't know.
Right. That is an observer bias, right, because we, as an observer, have already decided:
these are the defined effects, or these are the defined answers that we will accept, and everything else gets pushed out.
Right. And so, also going back to selection bias, it's much, much more likely that all of those "don't know" or "unable to answer" or whatever,
all of those things are going to get selected out of your analysis or your training set.
Right.
So it's really, really important for us to be able to, you know, take that bias out as well.
And that's a hard one to do, because it is hard to operationalize machine learning models, by the way, and sometimes, to make it easier to get feedback,
we do have to predefine those things. It's very, very hard for people to understand what feedback is necessary.
We can't just have, you know, a free-form text field where people are putting whatever they want, because we'll never figure out what it is.
But with that, we have a few minutes left.
I did want to, you know, I promised that we would stop for a minute for questions.
So I'm going to stop sharing my screen for just a moment and see if we have any questions.
No, it looks like we have no questions.
So, um, I will go back to this presentation, if Chrome will let me, and share my screen again, since we do have a couple more minutes.
There is another kind of bias that I did want to talk about if we had time.
And since we do. I'm going to go ahead and get into it.
Since we have a couple of minutes, I do want to talk a little bit about unconscious bias, actually.
So I'm going to skip over mitigating bias, because that's actually a huge topic that we could probably spend another hour talking about. But unconscious bias is one that I think is really important for us to talk about.
This is not necessarily something that, you know, directly impacts analysis.
Right. It's not necessarily something that directly impacts machine learning. But our unconscious biases, the kind of stereotypes that we carry around with us that we're not even aware of, can lead to any of these other biases that we're talking about.
Right.
So I do want people to be aware of this one that there are unconscious biases that all of us carry with us.
And there's not any shame around this. It's just true, as part of our human nature.
Everyone has them. And so the first step in order to be able to mitigate this is for you to recognize what your biases are.
Right. And so I did want to at least mention that Harvard does a huge study in this area on implicit biases.
So I encourage everyone to go out there and take some of the bias tests that they have out there.
You'd be amazed at your results.
You know, you may just think, okay, I don't have this problem, or, oh, of course I would act this way in this situation.
And you find out very quickly that your brain actually has some bias underneath the surface that you're just not aware of.
But in order to change that, you have to be aware of them first.
So don't be afraid to find out what your biases are.
Definitely go out, check out that website. See if there's, you know, anything that surprises you there.
But with that, I just want to thank everybody for taking the time to listen to our entire presentation. And Chandra, I'll let you, you know, say your goodbyes.
Thanks for really giving me the opportunity to join you for this discussion.
It's a very interesting topic. I learned a lot in the discussion.
I hope other listeners also learned more on this topic, and we'll make this data available to you for the modeling work that you're doing.
All right.
Well, thank you so much and looking forward to working with you more. All right.
Have a good day, everybody.