Open Source Happy Hour
Presented by: Katrina Riehl, Andy Terrel
Originally aired on April 15, 2021 @ 8:30 PM - 9:30 PM EDT
Join Katrina Riehl, the head of our data science team at Cloudflare, along with her long-time colleague and guest, Andy Terrel. Katrina and Andy are both members of the Board of Directors for a non-profit organization called NumFOCUS. We provide support for open-source projects in the scientific computing landscape, especially the data science arena. We will talk about open-source, what it is, the challenges, the successes, and how open-source interacts with the for-profit business world.
Transcript (Beta)
Welcome to Cloudflare TV land. I just wanted to welcome everybody to our open source happy hour.
A warm welcome to everybody out there in Cloudflare TV land. And just to let you guys know, this is actually going to be a happy hour.
So please go ahead, if you want, and take a moment to go grab a frosty beverage.
We'll wait here for you. I'm actually totally kidding. We're going to keep talking, but you should still go get something to drink if you want it.
Or a cup of tea or something like that.
And please join us. We're going to have a nice conversation, just a nice relaxed conversation about open source and data science.
And this is my guest, Andy Terrel. Really, really glad you were able to make it.
And I'll let Andy go ahead and introduce himself. Thanks, Katrina. Appreciate it.
So, yeah, my name is Andy Terrel. I'm chief data scientist at Rex. It's a company that sells real estate on the Internet without as many fees as you might pay normally.
But I'm also the president of NumFocus, which kind of puts me in the role of the open source advocate.
And NumFocus, we help sponsor a lot of the open source data science platforms that people use or programs like NumPy, SciPy, Pandas, things like that.
And been doing that for about eight years. Very cool.
And then just to introduce myself, I'm Katrina Riehl. I'm leading the data science efforts here at Cloudflare.
And I've had a long history with Andy. Really, really glad he was able to make it.
We've worked together. We've worked together at NumFocus.
I should also mention that I'm on the board of directors for NumFocus, so we work together in that sense still.
I'm actually the treasurer of NumFocus.
So with that being said, we're both going to talk a little bit about NumFocus real quick.
Can you go ahead and just introduce a little bit more, like what NumFocus does?
Yeah. So NumFocus has a mission to promote scientific open source software and education about that software.
And we're really here to build better software for scientists so that the scientific community can continue to advance.
I think a lot of what we started with was a bunch of folks who had built out software in academia and realized there wasn't a way to fund it under the traditional academic model.
Since then, I mean, like this was basically nine, ten years ago that we started the organization.
There's been more models that have kind of come up, and Sloan and Moore Foundation have funded whole ideas of getting data science into the university models.
But at the time, there was basically no way to get funding for working on an open source project.
You got funding to write scientific papers and things like that.
So we were like, you know, it kind of stinks that we can't pay, you know, the people who built NumPy a paycheck and that they have to go work for somebody else and do this on the side.
So we started a way to, like, start getting money directly to the project.
And through that, I think we're running, you know, about a $4 million budget every year pushing corporations to give.
We run events here. It's called PyData. You're wearing our T-shirt.
And it's an education effort to get more people understanding the tools and understanding you can use these tools in production.
I guess in some sense, like, now, like, everybody uses them in production, but back in, you know, 2010, it was kind of still thought, like, oh, no, you have to have, like, IBM's tools or something like that.
But now, like, most of the data science tools are open source, and they're built by academics who were doing it on the side.
And now, like, it's reputable to take money from corporations or from grant-making institutions.
And we end up having on the order of, like, 30 or 40 independent programmers working on our projects.
We just serve as the fiscal layer to make sure that we can raise the money for them in an appropriate way.
Absolutely.
And we're going to dive into that just a little bit more. But you definitely hit on a couple of things.
We do have two different levels of affiliation with our organization.
We have fiscally-sponsored projects, and we also have affiliated projects.
And maybe you can kind of tell the – actually, I'm going to give you a pop quiz.
Can you name all of our fiscally-sponsored projects? I'll help you.
Oh, no. Geez, I need – I think the first question was, where are we drinking?
That was an easy one. Okay, no, we did skip over that part. Let's go back to that.
So what kind of cocktail goes with open source, Andy? Well, I was making – I heard about – so on Kai Ryssdal and Molly Wood's podcast, Make Me Smart, I think, or whatever, they presented me with a batanga, which is apparently like a rum and Coke except it's tequila and Coke with lime and salt.
Although I made it wrong, and without the salt, it doesn't work nearly as well.
So batanga without the salt, it's all right.
All right. What do you have? Well, I just decided to keep it simple.
I've got some Austin Beerworks Peacemaker. It's an anytime ale, keeping it local, Austin Beerworks.
And so that's what I am enjoying on this Friday. It's been a week, so let me tell you.
If you go to Austin Beerworks, you've got to ask for the Mr. Falco, where they mix – I think it's Peacemaker and their black – it's their black ale.
Really good mixed beverage there. Anyways. Okay, all of our projects.
All right, so let's see here. We have NumPy, which I think was the project we started with, but they actually didn't sign for a long time because sometimes you have to – and going back to that fiscally sponsored versus affiliated project, the way I kind of put it is you have to have a little bit of governance to take money, and NumPy had governance, but they didn't have the signatures kind of put together, so it took them a long time to actually sign it.
And so it's like, oh, yeah, NumPy, they're the first ones.
But actually our first one, I think, was Jupyter.
So Jupyter is another one. They were the first ones to receive funding.
Thank you, Microsoft. They kind of got us up and launched, and they've grown very successful from that.
We have in that – actually, you might not know this, but we just signed Scikit-Learn today.
But they're fiscally sponsored under the grantor-grantee model, not the full fiscal sponsorship, but they joined today.
So even the list on the website is wrong. Then we have Pandas, for sure.
They were early on. Then we have – let's see here. We have Julia, which is the whole language, but we also have JuMP in there.
We have SymPy. They were an early one.
Let's see here. You've got to take over now because I've got – because there's like Shogun.
I don't know if they're affiliate or – the problem is which ones are affiliate and which ones aren't.
Like you love them all. That's the trick, right?
So conda-forge, that's an important – right? conda-forge, great. Okay.
Okay. They're pretty – they actually were affiliate for a long time, and then they finally like blossomed enough.
I mean that's another typical trend is you'll start at one company or a couple of people.
And when you're just a couple of people, like are you really – like do you really need governance?
Do you need fiscal sponsorship, legal advice, all that stuff?
You probably don't, but you still want to be like part of like an organization that can give that to you.
And Bokeh is like one of our – they started with a few folks at Anaconda and then started growing, and now the team is quite bigger and it spans NVIDIA and Microsoft and Mozilla.
And so it became one that like grew.
And speaking of NVIDIA, Dask. That's a really big one right now.
Yeah, they just – they're pretty new too, yeah. They were affiliate for a long time.
Although Dask is a fun one because is it NVIDIA or is it – because now there's Saturn Cloud and Coiled.
So it's having its moment. So Ray had it, like we're talking about open source, like I think Ray is more the Apache world.
We don't have to talk about Apache. But yeah, so now Dask is having their moment where they're exploding in companies around these open source projects.
scikit-image. That's an important one. I think that's enough. Let's go over to the affiliated projects.
I just kind of want to give a shout out to all of these people that are, you know, working on these things.
Oh, man, I pulled it up and now there's so many.
We missed xarray, Blaze, MathJax. MathJax is a much older project than we are.
Yeah, it's true. But they joined. They're like – they were at the American Mathematical Society and they're like, yeah, we found out you all actually do software and we just thought we'd fit a lot better over here.
Okay. Yeah, that was while I was still on the board.
It's all good. Yeah, so on the affiliated side, we've got Cython.
I'll start off there. I thought Scikit-Learn was over on affiliated side, but it sounds like they're on fiscally sponsored now.
As of like hours ago.
Yeah, that's fantastic. Like literally it came across. I'm like, ah, finally signed that one.
All right. Cool. Let's see. Dash. That's a good one. Yeah.
Gensim is one that my team uses pretty often for natural language processing and topic modeling.
GeoPandas.
GoNum is on here. We were talking about like a lot of times people come to us and they say like, oh, you're just Python.
I'm like, well, no, I mean we have R and Julia and now we even have Go in here.
Yeah, I think that's a really important thing to talk about just because the scientific community, right, the whole point of this is to open up science, right, and make it more accessible and to have people working together.
And so being tied to one language isn't really a good philosophy, right?
And so I like the fact that we represent a lot of different areas.
And I think that people forget sometimes that we're not just Python based.
I like Python a lot though. I just want to share the love of everybody else and like how awesome Python is.
So I agree with you. We need all the languages, but they just need to know how awesome Python is too.
Well, sure. How long have you been programming in Python?
Since 2004, I think. I guess it was really my second language.
It was like the first language I really like did big programs in, right? There was like some C and MATLAB and some other things before that.
But, yeah, back, I think it was Python 2.2.
Was it 2.2? Yeah, I think it was like 2.2. And like I don't like you go back and people fight about like 3.0, but I mean, you don't remember like 2.2, like things crashed without you knowing like why, like there was a stack trace in some part of the C interpreter that like just really wrecked your day.
Most people I know are like 2.5 and beyond. I don't know very many who are like before 2.5 these days.
You know one, right? Yeah, I mean a few, right? I actually started programming in Python in 1998, believe it or not.
Oh, wow. Yeah, I got a lot of people.
You, Guido, and Travis, and I think that's it. Exactly. The first Python had 15 people at it, right?
Wow. David Ascher, Guido, and a bunch of other people, right?
So 15 people in the corner of a Marriott in D.C. That must have been an awesome time.
Yes. But, yeah, the point being that, I mean, Python obviously being an open, it's an open language, right?
So it's an obvious candidate for open source.
I think one thing that people kind of don't talk about a whole lot is that when you do have these open communities, this is really built on the backs of graduate students, right?
So this is how I got involved, by the way. It's like, yeah, graduate students for the win, right?
That's how most people get involved, right?
I had a whole two extra years of grad school just because I was the one running the code, you know?
It was great. I get to make $20,000 a year, two more years in a row, and no publications.
You should have street cred. What is that worth to you?
I cried a little last week when an intern asked me why it took me so long to get my PhD.
I was like, oh, thank you. Yeah, thanks for reminding me. That's funny because I actually feel like I got through my PhD faster because I was able to prototype things so quickly in Python.
Well, my PhD code was in C++, so maybe that was an issue.
Oh, that's poor. We had a SWIG binding. I mean, you remember SWIG, right?
Of course I do. David Beazley. Oh, my. So, yeah, that was a lot of fun, SWIG days.
Yeah, absolutely. But, yeah, kicking around old times. I definitely remember, sometimes I tell people these stories about way back when we were using these things, back when we were importing Numeric and having to write our own C extensions because there was, you know, nothing else that we could do, or sometimes even assembly or Fortran extensions, in order to get speed out of Python.
And so now we have these incredibly powerful libraries that I think have, you know, kind of changed the ecosystem quite a bit, right?
Especially sometimes I guess where I'm going with this is that I wanted to talk a little bit about the fact that even as an open source project, you know, there is governance and there is also, you know, I think a really nice attention brought to bringing, you know, clean APIs and clean documentation and making things very user friendly.
And I feel like especially in the scientific software community, it's actually brought more people into the area and it's made it more accessible for people to do this kind of work.
Yeah, and I think when I was in grad school, like, you know, we'll pretend it was just a few years ago, even though it wasn't.
There was definitely a sense, there's a rite of passage in running all these systems, right?
And so I remember I was like invited to like, oh, you don't have to teach classes anymore.
You can just run like you're so like, I guess whatever was happening was the sysadmins were figuring out who was actually like doing a lot of stuff on their computers.
And like, I had taken over the Condor pool and was running on all the other grad students computers at the night and all that stuff.
And like, Andy, you're going to come up and be a sysadmin.
You don't have to teach, you can just be a sysadmin sort of thing.
And so like, then you were like in and you like, they were like all these neckbeard open source contributors.
I think that was like kind of an old school view of open source software.
Like you had to be in the club. And I think that if there's anything that's evolved in the last, I don't know, 10 years since, since that time, oh, I just gave away all of them.
But yeah. So in the past decade, open source has really embraced an inclusivity movement.
And, you know, yeah, I didn't know any, I don't know, like in the day, right?
Like today's the day.
Right. So there was like no black developers growing where I was at. And now like you look at, there's a whole community of like black developers inside, you know, whether it's NumFocus or the ecosystem, and you see tons of contributors in Africa, and actually China's coming up in contributors as well.
Like it used to be just like neckbeards in their basements.
Right. But, but that includes, but I think that's a direct result of that governance.
I think that, you know one thing I've found over the years is like people will talk about diversity inclusion and they're like, why aren't we more diverse?
Why aren't we all inclusive?
And like, I remember asking this question. I actually was at a happy hour at the SciPy conference that one year when I was running it, and I might've had a few too many drinks.
They didn't have a batanga, but a few beers. And I was like, why, why don't we have more diversity?
They're like, Hey Andy, have you ever asked, have you ever just like sat down and asked like 10 black people to come?
I'm like, no. Like why not? Like, I'm like, okay, well I don't know. Cause I don't know where to find them.
Like there's no coders. I'm like, well just, it doesn't matter how you find them.
Just go find them and ask. And so what we found over time is that, you know, number one, you feel like, oh, they're not coming in, but you're not asking them to come in.
And a lot of the reasons they weren't even there is because they didn't feel welcome enough to even show up.
Because if you imagine, I mean, jumping on a mailing list and you barely have any time to do anything.
Fernando Perez, like he was an early founder of NumFocus. And he likes to talk about the privilege of being an open source developer.
Cause I mean, you were privileged to have all this extra time that you could give away honing your craft.
And now like open source developers have been kind of put in an ivory tower and cherished, like, oh, look at all the great things they're doing.
But back in the day it was just like, we were just bored.
We had extra time really. Or we just loved it so much.
But by helping teams develop a code of conduct governance, it really makes it clear that, Hey, this is a safe place to come work.
This is a safe place to come give your ideas.
And there's rules about how we can give you feedback and why we will give you feedback.
And like, that kind of structure requires a movement.
It's not, it's not as easy just to make people do it.
And like one thing that we've done for is like, we have 50 codes basically now and coming and helping people understand like, look, code of conduct, I get it.
They're not exactly what you want. Like, there's problems with them, but at the same time, you know, do you want a less inclusive place or a more inclusive place?
Do you want to have something where the rules are stated or unstated?
Because when the rules are unstated and somebody doesn't play by the unstated rules, it's very hard to say, please start playing by the rules.
Right. And so in that, I think that the governance has come up a lot and like, how do you make decisions?
Who, how do you even define, like who a core contributor is?
How do you value contributions that are not just code, like people who report bugs or fix documentation, things like that?
And so in many ways I see NumFocus as, like, my civil rights platform as well.
Like I want science, I care about scientists having the best tools, but I also care about making it very inclusive because I believe ideas come from everywhere.
And if you don't include our underrepresented groups and the majority of the world, really, then you're just going to have like very fragile software and very fragile ideas.
And so in my mind, when I was in my first year at NumFocus – and so they go hand in hand; it was a NumFocus mission.
Absolutely.
And I just wanted to add on to that a little bit because, you know, I mean, I've already revealed how old I am, right.
Joining into an open source community, especially for me, you know, that was a time when there weren't a whole lot of women in computer science, frankly.
I was the only woman in my entire Ph.D. program.
This was a community that I found some people who were incredibly accepting and very inclusive and that were very happy to work with me and, you know, have patience for my gajillion questions that I had all the time.
And then there were people that weren't, right, and you didn't feel very welcome, right?
And that goes back to a lot of different things.
But we didn't have codes of conduct back then. I just want to point that out.
But at the same time, the community, I think, did a relatively good job of policing itself.
I'm really glad that we do have codes of conduct now.
I'm pretty sure that you and I have both had to enforce codes of conduct at PyData events before.
But I'm really, really glad that we have them in place.
And it's, you know, I think it goes a long way into creating that environment, right?
It's not just a, you know, open source software community. It's also just an open science.
The idea of keeping things open for everyone is a philosophy that I think jibes quite well with diversity and inclusion.
But another thing I was going to add to that was that, you know, I was on the diversity and inclusion subcommittee as part of NumFocus.
And one of the big complaints that we would receive sometimes, also from underserved areas, is that we would host events at these like super flashy, you know, big companies, right, that were generally in areas that were not even close to the neighborhoods that we were trying to serve in some of these events, right?
And there was also, you know, a certain amount of just discomfort at the idea of going to, you know, Facebook headquarters or, you know, Microsoft headquarters or things like that because it wasn't a part of the community.
And I always thought that was kind of interesting that, you know, the idea of like opening it up and making people feel comfortable is not just about invitations.
It's also about creating environments where people feel like they can be more comfortable as well.
Yeah, totally. We should do more in like public libraries.
That would be awesome. It would totally be awesome. I would love that. They don't have as good a cappuccino.
I mean, I tell you what, you go to the Microsoft downtown, like the cappuccino machine, it's hard to beat that.
So there's a little bit of that going on, you know.
Oh, sure. Absolutely. But yeah, since you did bring up governance, you know, this is one thing that I did want to get into is that I think governance is a really important part of, you know, in order to be accepted into NumFocus, by the way, you have to be, you know, judged on certain criteria.
So one of them is, you know, making sure that you have a governance model, making sure that you have a code of conduct, making sure that you're serving the scientific community.
You know, people are going to be nice, right, and inclusive.
And so I do want to talk a little bit more about governance because I think sometimes, you know, from the outside, some people don't understand how these projects can be so successful with just a loose collection of people, right.
But it's not really as loose as it may seem from the outside, right. So can you talk to us a little bit more about governance and how that works?
Yeah, I think that I always like to compare a few different projects. Like you have SymPy, which is a symbolic mathematics library.
You have Pandas, which is a library, a data manipulation library, which is used by tons and tons of people.
And you have like Jupyter, which is a whole ecosystem, like both a visual system, but also a server, a running server and, and so on.
They've got so much stuff, they have a parallel computing system in there, they have a security protocol.
So everything is like, it's become the kitchen sink of like data science.
But the three of them, right, really vary in how they make decisions.
Jupyter has, you know, a steering committee, they have, I think they have 150 repos now.
But they have, you know, subcommittees within there.
And then they have like conference committees and a whole organizational structure, which is funny, because I think they often – you know, I also get a lot of complaints from Jupyter folks, like, oh, it moves so slow.
And I'm like, yeah, but look how many awesome things you've got.
And then you have like SymPy, where basically their governance is written in a half-page document, which is like, yeah, Aaron controls who's committing, and here's the requirements he gives.
It's not quite that, but, you know, here's our base, how you become a committer.
And here's how basic decisions are made. We have no money, and we have no, like infrastructure.
So yeah, we don't have to decide anything about that. And then you have pandas, which is kind of in between, where they, you know, they have a server team, they have a finance team, they have a few, and they have a technical committee, and they're building relationships, but they're also dealing with things like an ecosystem that's growing, like you have GeoPandas and pandas-datareader and a whole bunch of different things out there that are building on top of them.
And they're like, how do we make protocols? So they're kind of at what I would call the protocol-building stage, if you will, but they're still very much a library that you kind of hands-on use.
I guess I liken it very much to like, just like how businesses work, right?
Like you have your – I don't know if everyone knows how, but you have like LLCs, which, like, you want to start a business, you start an LLC, you send off 200 bucks to the state government, and bam, you're a business, you get a book to put meetings in.
But nobody's ever going to check in unless you try to sell the thing, versus the C Corp, where you have to have a board and monthly meetings and very rigorous kind of discussion about what you're doing.
And then if you go public, there's all these other things. Just like that, as you grow, and you have more people and more assets and more resources available, you end up building more checks and balances inside your governance system. That kind of seems right.
And for the most part, what we end up doing is we have our annual summit where we kind of come together and say, hey, what works, what doesn't work. They borrow ideas, they evolve their governance.
We have some templates we've written. I think Sam Bryce, who was at Two Sigma, he was a visiting fellow for us, he wrote a whole templating system for building out governance.
It's on our blog. And so I think there's a lot there. But it's definitely kind of clear that these days, you can't just like have nothing.
You just put it, put it on GitHub, yay, we're an open source project.
So there's a whole thing, right?
So yeah, but I do think it's important also for people to understand that, I mean, even if you become, you know, part of NumFocus, like, we check to make sure that these basic things are in place, but we don't impose any kind of, you know, governance on it, it really is up to the project to make sure that they're functional.
But I do think that NumFocus also plays a role in continuity, right?
I mean, one of the worst things that can happen to an open source project is that everyone is sick of not getting paid, and they go on, and just, you know, do something else.
And there can be, you know, ugly handoffs, there can be some pretty strange forks that happen.
And you can also have, you know, major API changes for people who are reliant on the code, right?
As well, projects that die, right, and which is a huge risk for companies that are trying to support open source, or feel like they've made an investment in a project to suddenly have everybody that they're kind of counting on just kind of disappear, right?
So, which is worse, like, Python 2.7 dying, or, you know, Docker and like Rocket, like kind of madness, right?
It's, yeah, the open source ecosystem, it's funny, like, whenever my team kind of starts building a new thing, and they're like, Oh, look, we have this, this new thing on the Internet that we found.
I'm like, okay, so how many GitHub stars do they have?
How many contributors? When was the last patch? They're like, Why?
No, no, it just works. I'm like, No, it doesn't. It might work today.
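The questions Andy asks his team here (how many GitHub stars, how many contributors, when was the last patch) can be sketched as a small checklist. This is only an illustration: the `ProjectStats` structure, the `health_warnings` function, and every threshold are hypothetical choices made for this example, not anything NumFocus or the speakers prescribe, and the numbers are entered by hand rather than fetched from any service.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ProjectStats:
    """Hand-collected facts about a dependency (hypothetical structure)."""
    name: str
    stars: int
    contributors: int
    last_patch: date


def health_warnings(stats, today, min_stars=100, min_contributors=3,
                    max_idle_days=365):
    """Return a list of warnings; an empty list means no red flags.

    The default thresholds are illustrative assumptions, not a standard.
    """
    warnings = []
    if stats.stars < min_stars:
        warnings.append(f"{stats.name}: only {stats.stars} stars")
    if stats.contributors < min_contributors:
        warnings.append(f"{stats.name}: only {stats.contributors} contributor(s)")
    idle_days = (today - stats.last_patch).days
    if idle_days > max_idle_days:
        warnings.append(f"{stats.name}: no patch in {idle_days} days")
    return warnings


if __name__ == "__main__":
    # Made-up numbers for a made-up library.
    shiny = ProjectStats("shiny-new-thing", stars=12, contributors=1,
                         last_patch=date(2019, 3, 1))
    for line in health_warnings(shiny, today=date(2021, 4, 15)):
        print(line)
```

A check like this cannot say whether a library is good. It only flags the "it might work today" cases that deserve a closer look before a team depends on them.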
But will it work, like, a year from now, when your server's on fire? Which reminds me of the way y'all are blocking repo.anaconda.org.
I thought I was gonna bring that up.
You like, we found this out yesterday, like, I thought we were gonna talk about that, Andy.
My sysadmin put Cloudflare's malware thing, and like, all of data science is malware now.
It's probably true. But you know, I'll point that out.
So CloudflareTV. Yeah. No. Did you file a ticket? How do I do that?
Is there a governance package? You could file a ticket, but yeah, I do believe that came across my, my computer yesterday as well.
Nice. So I mean, yeah, that just shows you how there is actually a much larger ecosystem, like your open source project, like, if it gets consumed by conda or PyPI or npm packages or whatnot.
I mean, like, I got to mention left-pad the other day. Whenever JavaScript developers are badmouthing Python, I'm like, oh, yeah, yeah, we stink. Left-pad, left-pad.
We all make mistakes, but this whole thing needs an ecosystem to, like, actually support it.
And I think we – we wrote a paper, or a blog post, really; you know you're an academic when you call a blog post a paper. But, you know, so we wrote a paper on the levels of open source, or levels of technology in general, and you kind of think of it just like a city. You have houses, which might have one or two people in them, and you have maybe an apartment complex, which will have hundreds of people or hundreds of families in it.
And then you might have a street, which would have lots of housing complexes, and they'll have a hospital on the street and whatnot.
And the infrastructure needed to service the whole street versus just a house is hugely different.
And the same way, in our ecosystem, we as a community, NumFocus, I like to say we're kind of that middle layer, not the bedrock. Like Linux is, you know, it's open source; can you imagine if Linux went down?
I mean, Linux, they have a whole foundation, like, it's, you know, a $200 million foundation that's really running all the decisions around how our infrastructure works at the base server layer.
And NumFocus, we tend to sit on top of that. And then, you know, every two-person GitHub library could bubble up someday and become the next Heartbleed, right?
So it's, yeah, absolutely. And I think that that's also something I was going to ask about was, I always get this question from a lot of, you know, especially new developers, or graduate students who like to, you know, that are a little masochistic, I guess, they always wonder how they can take their closed source, you know, projects, and convert them into open source projects, right.
And so I think, you know, maybe talking a little bit about like, what are some of the characteristics of a healthy open source project?
And how would you make that transition?
And just do rm .git, like, post it, right? Then we just say that's not how you do it.
So I think the first thing I would say in reaction to that is that there's a difference between posting your code as open source and creating an open source community.
A good example is, when I was at University of Texas, we had Kazushige Goto, who wrote the GotoBLAS library. He had recently left, and actually had his office, which was great, like the master's office. And the fastest library for linear algebra at the time was written by one person, kind of just a marvel of engineering.
So it was, kind of, 90,000 lines of assembly code, right?
So like, all right, well, who's going to support this?
And we like went around the computing center and like, yeah, not me, not me.
And so I was like, well, why don't we just slap a license on and see where it goes?
So we did that. We basically went through our process of asking, like, who owns this, and went to the technology commercialization office, or whatever it is wherever you're at.
Like, I think a lot of developers or people don't realize this. Like, when you're a developer, often your time is bought.
You know, so if you're doing something at work, the one who owns the copyright to that is your employer.
So first off, did I write this?
And like, all right, so who owns the copyright? So the copyright owner is the one who's going to get the license.
So you have to go figure that out.
So we figured that out. We slapped a BSD license on it and we put it on the Internet.
Great. Turns out a group in China took it and developed OpenBLAS. And their first thing to do was just fix it so it, like, actually ran on a new processor.
Because the way the whole thing works is, like, it's pinned to the processor, and that's how it's so fast.
And so like, the test suite wouldn't even work for like the latest Intel processor or something.
And so, you know, there's a group, and they wanted it to be fast.
They wanted their supercomputer to be faster or like be as fast as possible.
They pull off GotoBLAS and start using it, and they start giving back to the community.
And, you know, at this point, like now, the funding, the CZI funding that we had was actually going to NumPy.
NumPy is a major user and they depend upon it.
So they're actually contributing patches to OpenBLAS.
But by building it up as a little bit more of a community, accepting patches, and then talking with the community a lot more, it actually blossomed as something that is usable by a lot more folks.
And I mean, different folks take different kind of approaches.
That's one approach of like a single, like it's probably pretty typical of like data science libraries where there's a few people building something and kind of going out.
But then you have like the TensorFlow model, for example, which is like a concerted effort by a company to come in and like make a better product.
And they want people to use that product in order to use their hardware, their software, or their systems or whatnot.
So you have TensorFlow where Google comes in and spends a lot of money making a very nice place.
I think IronPython was an example from the old days, where Microsoft was trying to make us use IronPython so that we would do more programming on Microsoft systems.
And in that, I think that there's kind of that community-driven, like, very organic approach, grassroots, if you will, and that kind of really company-supported, mandated approach.
And I think both work in a certain capacity, but they also do different things.
I think, for example, one of our projects is Theano.
And for the large part, TensorFlow is a re-implementation of Theano.
And Theano had a hard time supporting all the people who were using it, right?
So they said, look, here's TensorFlow and all the deep learning stuff is taking off.
And we don't want to have to support those folks. We want to write papers.
We want to figure out the new technologies. And so, hey, let's deprecate Theano and we're just going to use TensorFlow.
And now you have kind of like a whole explosion of more tensor-based deep learning libraries.
I think the other side of it is like companies use this.
And I guess what I would kind of call a big movement in the early open source days was to open source something and get some ownership between companies so that you didn't have to pay a license fee.
And the first one of these was, like, Apache, right? The HTTPD library, right?
And so, you know, that was built, it was funded in large part by folks like IBM so that, you know, Microsoft IIS would kind of go away, right?
And the strategy worked really well, right?
It works really well to get people to jump onto your library by giving it away free.
But these days, like I think the projects we work with, we want to share scientific ideas.
You can write down a mathematical proof or you can like give somebody a library to play with it and show it.
And even today, you're seeing much more of a trend towards, like, you have to give me something that shows that your software works, at least in the data science world.
I'm not sure it's completely true in all domains of science, but for us, it's like it's not true unless you actually have a model that does it, right?
Yeah, absolutely. You touched on a couple of things. I just want to make sure that the audience understands like, so CZI, we've talked about that a couple of times.
That's the Chan Zuckerberg Initiative, right? So that's a lot of funding for open source projects, and CZI grants go out.
NumFocus helps administer those grants and make sure that open source projects get funded.
And then also we mentioned the Moore Foundation.
So that's the Gordon and Betty Moore Foundation. They do a lot of funding for different projects, which I think is really cool because, I mean, we do have these different sources of funding that are coming in.
Because you do have, you know, places like, you know, our platinum sponsors, for example, like, you know, Microsoft and Facebook and all these other people that are making investments in open source.
You also have nonprofit money coming in to help fund open source.
But in the same vein, you know, you're not going to become a millionaire, right, like writing open source.
So I think that sometimes, I think it's really important sometimes to talk about like, how do projects bring in money, right?
I mean, we mentioned a couple of different things. But, you know, how, like, I think every project in and of itself kind of decides like, who's going to get paid?
Or what are the initiatives? Or what are the features that we can actually pay for?
And what are the other ones that are just kind of, you know, things people can pick up and contribute, and so on, and how you build that community around it.
Yeah, it's funny, because this is a good point you're bringing up.
And I think of like private foundations, like Moore and CZI and Sloan, these are all private foundations that give out money.
So, like, Gordon Moore founded Intel and used that money to start the Moore Foundation, and with Chan and Zuckerberg, Facebook money went to the Chan Zuckerberg Initiative.
And, I mean, it's kind of going back to, like, Carnegie. Carnegie, a long time ago, started Carnegie Mellon and those sorts of things, and the Carnegie Foundation.
So those private foundations have a particular, like, strategy, which is to go in and give money. The Gates Foundation is a good example of this: they wanted to eradicate some diseases, right, I think malaria.
And they saw like, look, if we like put in a bunch of capital, this is achievable in a like, five year time horizon or whatnot.
And we can actually go out and do it.
And so maybe there's some social good, it's public benefit, and so on.
And when we started NumFocus, NumFocus is a 501(c)(3), which is different from, like, the Linux Foundation, which is a 501(c)(6).
So c6, so yeah, IRS tax code, whoo, need some more alcohol to talk about that.
I am the treasurer. It's all good. Go ahead.
The tax code has, like, six or seven different types of organizations you can have. The c3 is the public charity and the c6 is a trade group, right?
So often, you'll have trade groups that come together to pool money, to build a common pool of resources under a charity status, but as a trade group, and then they can do certain activities, such as sell things on top of it, sell services, and things like that.
And their money can come from all private institutions.
We're a public institution, which means that anything that's not relevant to our mission has to be declared differently and all that sort of stuff.
And there's always legal stuff, but it also means that we can get money from the public and they get a tax deduction.
But we have to get a lot of our money from the public; we can't get all of it from private sources.
And so there's, like, different strategies of how you employ capital to do good work.
And so with, like, corporations, we knew they had a bunch of money they were willing to give away to things.
It turns out, I don't know that they have as much money for public charities and software as they do for private, like, trade groups. I didn't know much about that law back then.
But now it's kind of like, okay, that's interesting.
But on the flip side, we can apply for grants from the government much easier, right?
So we can apply for, like, NSF-type grants and whatnot, from the research organizations, from public money.
And so there's kind of a balance there of what you can and can't do.
And you get into this space and you realize, like, all right, so how are you going to make money as a developer?
Okay, so academics, they go off and get grants, and they get their name kind of on a big library, and they get well known for doing that.
Right? Companies will get, you know, sometimes it's just public charity.
And often, it's like, they want that library to work, because they have a risk on top of it.
And you look at, like, Tidelift, the company that's, like, selling insurance on the question of how your open source is going to be supported, and they use that to give back to the communities and things like that.
And so like, there's a whole risk aspect to like giving to these organizations, that's kind of what we're tapping into.
Then there's companies like, you know, Docker, which like, oh, we're going to build an open source thing, build a community, and then launch a company on top of it.
For us, like a lot of folks, you know, who work with us, they're just doing grants from either private institutions or NSF type institutions, and receiving basically a paycheck, it's not usually crazy money or anything.
It's usually just, you know, like what your paycheck would be, but you get to do the thing you love, and you get to do it from your home and things like that.
And I think most open source developers I know, they love to write this type of software.
And it's hard to find a job that like actually pays you to do what you love sometimes.
And so I think for them, you know, it's making an impact on science. I mean, I guess I'm cited, I'm on the author list of the SymPy paper, and I see all the people using our code.
It's like, oh, like, yeah, I'm cited for water buffalo research. That's interesting.
How did I help water buffaloes, right? So it's like some interesting kind of back and forth, you see the impact you make.
But to get paid, like, there's a whole range.
And then there's the other kind of bit: if you want to make a quick million bucks, I guess, the way is to write an amazing software library, and then get yourself consulting on it.
Right. So I think there's a lot of people I know who do that.
Yeah, and NumFocus and Apache kind of do serve as this, where we'll be the fiscal sponsor of the actual code.
So companies will be like, oh, great, Pandas is here and Dask is here. And Julia, there's another one.
Oh, but we can hire Julia Computing or Coiled or Saturn or all these companies to do work on those libraries.
But the library itself is open. So we can continue to use it.
And in some sense, companies see that as insurance against kind of, what would you call it, consultant-ware, ransomware?
I know, both of us have had jobs like that, right, where we have companies paying us to work on open source software, right?
And sometimes it's a hard sell, right? To, like, convince somebody like, oh, hey, this money that you're paying us, we, you know, you're going to benefit from this library, but so is everyone else.
I was once paid by an oil company, and I signed a paper that said I couldn't tell what library was being used or which company, and in fact they wouldn't even let me touch a computer, right?
It wasn't Zoom back then, but they basically had, like, a phone call and a screen.
They were streaming to me and, like, saying, all right, how do we fix this problem?
I'm like, Oh, my goodness. All right, quick buck, I guess.
Sure, why not? Well, in that case, they were a little bit afraid to even like, announce that they were using this library.
And like, yeah, every oil company uses this library.
Why are you guys being so cagey? Like, no, no, no, nobody does this.
Okay, sure. Like, one place that I worked at that I'm not going to name, but you can probably guess was like that as well.
So they didn't want any known affiliation with any projects whatsoever, even though some of us were even active contributors to projects.
So I think that goes back to, like, the evolution of the whole ecosystem: now you have a set of governance rules.
And the company can trust that they can put money into you.
I think a lot of why the first money came to Jupyter and, you know, NumFocus was just, you know, Fernando and Brian were very compelling people to support, and they saw their vision.
And now Jupyter has had some products built on top of it.
And there's like tons of Jupyter products all over the map.
And, you know, Microsoft wanted to like make them successful and then make them successful with Microsoft.
And so this was a really good way that they could give to the projects and help that project succeed, but it kind of came down to trust: we trust a few individuals, and now we build a system and you can trust the system, and so on and so forth.
Yeah, absolutely.
And so you just mentioned Jupyter, and I did kind of want to pivot a little bit over to the data science arena.
Because I think that data science in particular has benefited from the open source community, probably more than anything, right.
So just to bring us up a level. This is like the rodeo arena or the Mad Max Beyond Thunderdome arena.
I want to make sure I've got it. Does it matter?
I don't know. Rodeo, they all walk out. I'm just gonna say Thunderdome.
Thunderdome, all right, all right. PyTorch wins. But yeah, it's just, you know, the stack is completely ubiquitous in data science, right? Everyone uses, you know, Jupyter, NumPy, SciPy, Pandas, Matplotlib.
They also use R.
We're okay with those people, but they do exist. Oh, sure. On some teams, but it's not ours.
When you can tell me how to put an R model into production, then we can talk. You call Hadley over at RStudio and he does it for you, right?
Unacceptable. Okay, let's... But anyhow, you know, I think that, you know, data science is a newer field, right, that became kind of a bigger deal over time and is now, you know, kind of in its heyday, where everyone seems to want to be a data scientist and you see all these boot camps popping up and all these things like that.
It's built on this, you know, pretty deep stack of open source software that you know most people are familiar with.
And I find it kind of interesting at this point that, you know, it was actually really hard to get adoption for Python tools back in, like, the early 2000s, right? Like, incredibly hard.
In fact, and now you know it's all over the place.
But what do you think? I mean, especially for the data science arena, where do you see that going, and how do you see that relationship with open source developing over the years?
Yeah, I mean, it is kind of interesting to see how it's evolved, to the point where, like, yeah, you can take a 10-week boot camp and come out as, like, a senior data scientist.
Right. It's a little bit frustrating when you're like, yeah, I took too long to get a PhD, and you did 10 weeks and you think you know everything there is about the field. But at the end of the day, a lot of times those folks, you know, they learn a lot of stuff and they can put a model into production very often, because the tools have become pretty easy to use.
Now, when the model goes wrong, they'll struggle.
And so what I'm seeing a lot of is that the skill set is moving to a higher level.
And so to be a data scientist.
It's not just about getting code out. It's around, you know, understanding the models and really diving deeper into the actual problems you have to solve.
And actually, I see the space going in different ways.
And it's not really clear exactly where it's going to end up. A good example: I was talking to a friend in finance and said, like, you know, I really need something that does more backtesting on my models that are in production.
And he's like, yeah, I mean, the bank has something like that, because if we don't, we lose billions of dollars.
I'm like, oh yeah, and then I get, like, three emails that day. I'm sure that there's a bot listening to me, because, like, DataRobot released,
I think they have Humble AI, which is, like, their new monitoring system.
Then you have, I think it's, oh, Domino.
I think they have the Domino Model Monitor. And there's a whole bunch of these systems that are now being built on top of those open source codes, because now you can't just hire... I mean, I still put my models in Flask and, like, kind of own the whole code myself. And inside my team, I think the intern group actually revolted when I tried to make them do that this summer. I had 20 interns and, like, three of them came to me like, this is ridiculous.
How are we supposed to do all this? I'm like, oh yeah, it used to be that's what you had to learn to be a data scientist. You had to be able to put your model into production on a Kinesis or Kafka stream and be responsive and log everything and all that sort of stuff. And the field's kind of whittled down to just getting the model and evaluating the model and things like that.
Which I think is great because it means more people can do it.
And it means that there's more adoption of the systems, but like the meta systems like the monitors, the benchmarks, all those sorts of things are kind of being consumed by the platforms.
And so you're seeing the same thing that happened with like AWS where AWS is amazing.
We have great new SaaS services.
I don't have to think about Linux and managing hard disk failure and things like that.
But at the same time, like, well, what if I need to get my own kernel and, like, mod the kernel?
Like, how do I do that on AWS? And so it becomes harder to modify, because the skill set of doing that is, like, dwindling.
And so in that, I see a lot of, like, as your tools become ubiquitous and everyone has them, it becomes commoditized, and then people lose the skills of, like, modding the lower-level software.
Absolutely true. And, you know, even going further back than that.
I mean, you know, my, my PhD was a concentration in artificial intelligence before we even had all these libraries with all these, you know, implementations of algorithms.
We actually had to design the algorithm and implement it too.
Right. So the engineering cost on data science has just gone, you know, way, way, way down.
We still have the data wrangling part. And like you said, the deployment part that are now our big headaches, but the modeling part like that's a whole lot easier for us.
Right. Yeah, so an intern asked us, why would I ever iterate over 1000 items in an array?
I'm like, oh, no. Algorithms. Yeah, it becomes a special skill.
And I think that's the thing: it used to be, like, yeah, algorithms were just what you did; like, every one was new.
I heard Tony Hoare give a talk. He's the Hoare of Hoare triples, and he's the inventor of quicksort.
So, how did you find quicksort? He's like, well, I wrote down the first thing I could think of for sorting, and that's bubble sort, and that was n squared.
And then I wrote the second thing down I could think of, and it became quicksort. You're like, okay, so you're just, like, conversant in algorithms all day, every day, and you just iterate through, and these sorts of things just become pretty natural. And, like, having a pivot and pivoting around things was just a natural skill that you had to have as an algorithm writer. And if you think about how registers worked and how you shift around bits in memory or on a linear disk, you had to be thinking like that all the time.
So I think that's a skill that's being lost.
Which, you know, but at the same time, I was gonna mention, like, oh yeah, things like GPT-3 come out and now I don't even need a JavaScript developer, because we can just, like, tell it to make our layout for us.
I love you JavaScript developers. I just, you know, we have to keep up the, you know, back and forth. Sure, we got to keep it spicy.
But just real quick, a little time check.
I did want to leave a few minutes for questions and, believe it or not, we're kind of winding down on our happy hour.
So I'm just going to check in real quick and see if we have anything that's come in.
But not seeing anything right now.
But in that case, actually, do you mind talking a little bit about GPT-3?
And so this is OpenAI's new system that just came out, and it's pretty freaking cool, by the way.
Do you feel comfortable kind of talking more about it.
Well, it's just, I mean, not more than just saying, like, you know, it's a new language model that, you know, is what people are talking about.
I guess the other day someone mentioned you need more ethics in AI, and, like, every time one of these language models
comes out, it's like, oh my gosh, it's scary how good it is at generating human-like content. And so GPT-2 is another one where already you see people building programs out of it.
It could build a whole JavaScript layout, or could build a whole Google search engine off of it.
I think I saw a medical diagnosis piece of it. So very quickly, we're coming to the part where you can give it a bit of information and some corpus, like all of GitHub, and then you can describe what you want to build, like a button with a widget.
And it'll actually spit out the text that makes that thing happen, or answers the medical question, whatnot.
And, like, I have no idea.
Like, I haven't had the time to jump in and actually see the model or understand the architecture or anything like that.
But it's one of these deep learning language models.
I think it's an evolution of GPT-2 from last year, and I believe the writers of that had major ethical qualms about it coming out.
They didn't want to even release the model until they had a way to detect if somebody was using the model in the deep fake kind of realm.
And so, anyway, the best way to detect if something's fake is to use the algorithm that says whether it was generated, and things like that.
And so once they had that kind of worked out, they kind of released it.
I believe that GPT-2 came from Google, but maybe it came from OpenAI.
I don't know, because I know 3 is for sure from OpenAI.
Yeah, yeah, totally. So even there, OpenAI is a fun one where, like, it was a nonprofit, then it turned into a company.
And so, I don't know exactly why they did that, but all of a sudden you see, like, all right, all these companies are making things that look like humans very quickly.
So it is an interesting time we live in. You know, in my world, I do a lot of ads and personalization of ads, and it's like, oh, great.
I can actually have an ad that is personalized to you in plain written English, on a server, that was generated very quickly without having someone pick up the phone or something.
And so it's super applicable to everything people do every day, but it's kind of a new shot at this, and we'll see how well it goes. One of my friends is a philosopher who writes about deep learning.
He sent it over. He's like, oh my god, what's going on here? Explain to me what this translation layer is. I'm like, oh no.
Don't answer that email. He's a good guy.
He wrote a bunch of things about how crows think, and crow-like animal cognition and whatnot.
And then he started getting into deep learning.
I'm like, oh, when you get to read a philosopher talking about your work.
He's sitting there like I don't think that's what's going on.
But okay, man. We actually only have a couple minutes left. I did want to at least take a couple minutes to talk a little bit about PyData.
And what that is, just so people know: PyData is the educational arm of NumFocus, and the main thing that I think most people are familiar with are the PyData events.
So we do live events where we get speakers and the community together, and we all talk about, you know, things people are working on, new libraries, all sorts of things like that, bringing developers and users together and sharing knowledge. And unfortunately, in the age of the pandemic,
we are not able to do any of our live events right now, which is really sad, because they're really a lot of fun and I always learn a lot.
Andy and I are usually, you know, speakers at at least a few of them.
So that's kind of fun, getting to know the community. But we do have a PyData Global event that's coming up this fall.
It will be online.
And so I think that's an opportunity for everyone to, you know, especially communities that don't normally get to connect.
I think it might be a great opportunity, I mean, because we have, you know, a PyData Poland and a PyData LA and a PyData Austin and a PyData New York.
This will be a chance to bring everybody together from all these different communities.
So that we can all share knowledge there. But you want to say a couple words about that before we wrap up.
Yeah, PyData is great. The idea of the conference really started with Travis Oliphant and Peter Wang scratching the itch of trying to get Guido to put good support for compilers and external libraries into pip, and I don't know what was pip back then.
But yeah, so it became an event.
We talked to Guido. He's like, oh, you guys have real problems. It spawned a whole community out of it.
And now we have events all over talking about all things Python and Julia and R. Awesome.
Well, I just want to thank you so much for joining me.
It's been really, really nice. I hope that you get to, you know, experience Cloudflare TV some more.
But with that, I just want to wrap up and thank everyone for watching and I hope you enjoyed your open source happy hour.
Right, thanks again.
Bye.