Open Source Happy Hour
Presented by: Katrina Riehl, Andy Terrel
Originally aired on July 31, 2020 @ 5:30 PM - 6:30 PM CDT
Join Katrina Riehl, the head of our data science team at Cloudflare, along with her long-time colleague and guest, Andy Terrel. Katrina and Andy are both members of the Board of Directors for a non-profit organization called NumFOCUS. We provide support for open-source projects in the scientific computing landscape, especially the data science arena. We will talk about open-source, what it is, the challenges, the successes, and how open-source interacts with the for-profit business world.
English
Interviews
Transcript (Beta)
Welcome to Cloudflare TV land. I just wanted to welcome everybody to our open source happy hour.
A warm welcome to everybody out there in Cloudflare TV land. And just to let you guys know, this is actually going to be a happy hour.
So please go ahead, if you want, and take a moment to go grab a frosty beverage.
We'll wait here for you. I'm actually totally kidding. We're going to keep talking, but you should still go get something to drink if you want it.
Or a cup of tea or something like that.
And please join us. We're going to have a nice conversation, just a nice relaxed conversation about open source and data science.
And this is my guest, Andy Terrel. Really, really glad you were able to make it.
And I'll let Andy go ahead and introduce himself. Thanks, Katrina. Appreciate it.
So, yeah, my name is Andy Terrel. I'm chief data scientist at Rex. It's a company that sells real estate on the Internet without as many fees as you might pay normally.
But I'm also the president of NumFocus, which kind of puts me in the role of the open source advocate.
And NumFocus, we help sponsor a lot of the open source data science platforms that people use or programs like NumPy, SciPy, Pandas, things like that.
And been doing that for about eight years. Very cool.
And then just to introduce myself, I'm Katrina Riehl. I'm leading the data science efforts here at Cloudflare.
And I've had a long history with Andy. Really, really glad he was able to make it.
We've worked together. We've worked together at NumFocus.
I should also mention that I'm on the board of directors for NumFocus, so we work together in that sense still.
I'm actually the treasurer of NumFocus.
So with that being said, we're both going to talk a little bit about NumFocus real quick.
Can you go ahead and just introduce a little bit more, like what NumFocus does?
Yeah. So NumFocus has a mission to promote scientific open source software and education about that software.
And we're really here to build better software for scientists so that the scientific community can continue to advance.
I think a lot of what we started with was a bunch of folks who had built out software in academia and realized there wasn't a way to fund it under the traditional academic model.
Since then, I mean, like this was basically nine, ten years ago that we started the organization.
There have been more models that have kind of come up, and the Sloan and Moore Foundations have funded whole ideas of getting data science into the university models.
But at the time, there was basically no way to get funding for working on an open source project.
You got funding to write scientific papers and things like that.
So we were like, you know, it kind of stinks that we can't pay, you know, the people who built NumPy a paycheck and that they have to go work for somebody else and do this on the side.
So we started a way to, like, start getting money directly to the project.
And through that, I think we're running, you know, about a $4 million budget every year pushing corporations to give.
We run events here. It's called PyData. You're wearing our T-shirt.
And it's an education effort to get more people understanding the tools and understanding you can use these tools in production.
I guess in some sense, like, now, like, everybody uses them in production, but back in, you know, 2010, it was kind of still thought, like, oh, no, you have to have, like, IBM's tools or something like that.
But now, like, most of the data science tools are open source, and they're built by academics who were doing it on the side.
And now, like, it's reputable to take money from corporations or from grant-making institutions.
And we end up having on the order of, like, 30 or 40 independent programmers working on our projects.
We just serve as the fiscal layer to make sure that we can raise the money for them in an appropriate way.
Absolutely.
And we're going to dive into that just a little bit more. But you definitely hit on a couple of things.
We do have two different levels of affiliation with our organization.
We have fiscally-sponsored projects, and we also have affiliated projects.
And maybe you can kind of tell the – actually, I'm going to give you a pop quiz.
Can you name all of our fiscally-sponsored projects? I'll help you.
Oh, no. Geez, I need – I think the first question was, where are we drinking?
That was an easy one. Okay, no, we did skip over that part. Let's go back to that.
So what kind of cocktail goes with open source, Andy? Well, I was making – I heard about – so on Kai Ryssdal and Molly Wood's podcast, Make Me Smart, I think, or whatever, they presented me with a batanga, which is apparently like a rum and Coke except it's tequila and Coke with lime and salt.
Although I made it wrong, and without the salt, it doesn't work nearly as well.
So batanga without the salt, it's all right.
All right. What do you have? Well, I just decided to keep it simple.
I've got some Austin Beerworks Peacemaker. It's Anytime Ale, keeping it local, Austin Beerworks.
And so that's what I am enjoying on this Friday. It's been a week, so let me tell you.
If you go to Austin Beerworks, you've got to ask for the Mr. Falco, where they mix – I think it's Peacemaker and their black – it's their black ale.
Really good mixed beverage there. Anyways. Okay, all of our projects.
All right, so let's see here. We have NumPy, which I think was the project we started with, but they actually didn't sign for a long time because sometimes you have to – and going back to that fiscally sponsored versus affiliated project, the way I kind of put it is you have to have a little bit of governance to take money, and then NumPy had governance, but they didn't have the signatures kind of put together, so it took them a long time to actually sign it.
And so it's like, oh, yeah, NumPy, they're the first ones.
But actually our first one, I think, was Jupyter.
So Jupyter is another one. They were the first ones to receive funding.
Thank you, Microsoft. They kind of got us up and launched, and they've grown very successful from that.
We have in that – actually, you might not know this, but we just signed Scikit-Learn today.
But they're a fiscally sponsored grantor-grantee project, not the full fiscal sponsorship, but they joined today.
So even the list on the website is wrong. Then we have Pandas, for sure.
They were early on. Then we have – let's see here. We have Julia, which is the whole language, but we also have JuMP in there.
We have SymPy. They were an early one.
Let's see here. You've got to take over now because I've got – because there's like Shogun.
I don't know if they're affiliate or – the problem is which ones are affiliate and which ones aren't.
Like you love them all. That's the trick, right?
So conda-forge, that's an important one, right? conda-forge, great. Okay.
Okay. They're pretty – they actually were affiliate for a long time, and then they finally like blossomed enough.
I mean that's another typical trend is you'll start at one company or a couple of people.
And when you're just a couple of people, like are you really – like do you really need governance?
Do you need fiscal sponsorship, legal advice, all that stuff?
You probably don't, but you still want to be like part of like an organization that can give that to you.
And Bokeh is like one of our – they started with a few folks at Anaconda and then started growing, and now the team is quite a bit bigger and it spans NVIDIA and Microsoft and Mozilla.
And so it became one that like grew.
And speaking of NVIDIA, Dask. That's a really big one right now.
Yeah, they just – they're pretty new too, yeah. They were affiliate for a long time.
Although Dask is a fun one because is it NVIDIA or is it – because now there's Saturn Cloud and Coiled.
So it's having its moment. So Ray had it, like we're talking about open source, like I think Ray is more the Apache world.
We don't have to talk about Apache. But yeah, so now Dask is having their moment where they're exploding in companies around these open source projects.
scikit-image. That's an important one. I think that's enough. Let's go over to the affiliated projects.
I just kind of want to give a shout out to all of these people that are, you know, working on these things.
Oh, man, I pulled it up and now there's so many.
We missed xarray, Blast, MathJax. MathJax is a much older project than we are.
Yeah, it's true. But they joined. They're like – they were at the American Mathematical Society and they're like, yeah, we found out you all actually do software and we just thought we'd fit a lot better over here.
Okay. Yeah, that was while I was still on the board.
It's all good. Yeah, so on the affiliated side, we've got Cython.
I'll start off there. I thought Scikit-Learn was over on affiliated side, but it sounds like they're on fiscally sponsored now.
As of like hours ago.
Yeah, that's fantastic. Like literally it came across, I'm like, ah, finally signed that one.
All right. Cool. Let's see. Dash. That's a good one. Yeah.
Gensim is one that my team uses pretty often for natural language processing and topic modeling.
GeoPandas.
Gonum is on here. We were talking about like a lot of times people come to us and they say like, oh, you're just Python.
I'm like, well, no, I mean we have R and Julia and now we even have Go in here.
Yeah, I think that's a really important thing to talk about just because the scientific community, right, the whole point of this is to open up science, right, and make it more accessible and to have people working together.
And so being tied to one language isn't really a good philosophy, right?
And so I like the fact that we represent a lot of different areas.
And I think that people forget sometimes that we're not just Python based.
I like Python a lot though. And it's like I just want to share the love of everybody else and like how awesome Python is.
So I agree with you, we need all the languages, but they just need to know how awesome Python is too.
Well, sure. How long have you been programming in Python?
Since 2004, I think. I guess it was really my second language was like the first language I really like did big programs in, right?
Like there was like some C and MATLAB and some other things before that. But, yeah, back I think it was Python 2.2.
Was it 2.2? Yeah, I think it was like 2.2. And like I don't like you go back and people fight about like 3.0, but I mean you don't remember like 2.2, like things crashed without you knowing like why, like there was a stack trace in some part of the C interpreter that like just really wrecked your day.
And so like, yeah, everybody, like most people I know are like 2.5 and beyond.
I don't know very many are like before 2.5 these days. You know one, right?
Yeah, I mean a few, right? I actually started programming in Python in 1998, believe it or not.
Oh, wow. Yeah, I got a lot of people. You, Guido, and Travis, and I think that's it.
Exactly. The first Python had 15 people at it, right?
Wow. David Ascher, Guido, and a bunch of other people, right? So 15 people in the corner of a Marriott in DC.
That must have been an awesome time. Yes. But, yeah, the point being that, I mean, Python obviously being an open, it's an open language, right?
So it's an obvious candidate for open source. And, you know, I think one thing that people kind of don't talk about a whole lot is that, you know, when you do have these open communities, you know, this is really built on the backs of graduate students, right?
So this is, you know, how I got involved, by the way.
It's like, yeah, graduate students for the win, right? Like we got to, you know, that's how most people get involved, right?
I had a whole two extra years of grad school just because I was the one running the code, you know?
So it's like, it is great. I get to make $20,000 a year, two more years in a row and no publications.
You have street cred. What is that worth to you? I cried a little last week when an intern asked me why it took me, why it took me so long to get my PhD.
I was like, oh, thank you. Yeah. Thanks for reminding me. That's funny because I actually feel like I got through my PhD faster because I was able to prototype things so quickly in Python.
Well, my PhD code was in C++, so maybe that was an issue.
Oh. That's poor. We had a SWIG binding. I mean, you remember SWIG, right?
Of course I do. David Beazley. Oh, my. So, yeah, that was a lot of fun.
SWIG days. Yeah, absolutely. But, yeah, kicking around old times. I definitely remember sometimes I tell people these stories about way back when we were using these things and back when we were importing Numeric and having to write our own C extensions because there was, you know, nothing else that we could do, or sometimes even assembly or Fortran extensions in order to get speed out of Python.
And so now we have these incredibly powerful libraries that I think have, you know, kind of changed the ecosystem quite a bit, right?
Especially sometimes I guess where I'm going with this is that I wanted to talk a little bit about the fact that even as an open source project, you know, there is governance and there is also, you know, I think a really nice attention brought to bringing, you know, clean APIs and clean documentation and making things very user friendly.
And I feel like especially in the scientific software community, it's actually brought more people into the area and it's made it more accessible for people to do this kind of work.
Yeah. And I think when I was in grad school, like, you know, we'll pretend it was just a few years ago, even though it wasn't.
There was definitely a sense, there's a rite of passage in running all these systems, right?
And so I remember I was like invited to like, Oh, you don't have to teach classes anymore.
You can just run like you're, you're, you're so like, I guess whatever it was, it was happening was the sysadmins were figuring out who was actually like doing a lot of stuff on their computers.
And like I had taken over the Condor pool and was running on all the other grad students computers at the, at the night and all that stuff.
I'm like, Andy, you're going to come up and be a sysadmin.
You don't have to teach.
You can just be a sysadmin sort of thing. And so like, then you were like in and you like they were like all these neckbeard open source contributors.
I think that was like kind of an old school view of open source software.
Like you had to be in the club.
And I think that if there's anything that's evolved in the last, I don't know, 10 years since, since that time, Oh, I just gave away all of them.
But yeah. So in the past decade, open source has really embraced an inclusivity movement and, you know, yeah, I didn't know any, I don't know, like in the day, right?
Like today's the day, right? So there was like no black developers growing where I was at.
Right. And now like you look at, there's a whole community of like black developers inside, you know, whether it's NumFocus or the ecosystem, and you see tons of contributors in Africa, and actually China's coming up in contributors as well.
Like it used to be just like neckbeards in their basements.
Right. But, but that includes, but I think that's a direct result of that governance.
I think that, you know, one thing I've found over the years is like people will talk about diversity and inclusion and they're like, why aren't we more diverse?
Why aren't we all inclusive? And like, I remember asking this question.
I actually was at a happy hour at the SciPy conference that one year when I was running it and I might've had a few too many drinks.
They didn't have a batanga, but a few beers.
And I was like, why, why don't we have more diversity?
They're like, Hey Andy, have you ever asked, have you ever just like sat down and asked like 10 black people to come?
And I'm like, uh, no. Like, why not? Like, I'm like, okay, well, I don't know.
Cause I don't know where to find them. Like there's no coders and like, well just, it doesn't matter how you find them.
Just go find them and ask.
And so what we found like over the time is that, you know, number one, like feel like, Oh, they're not coming in, but you're not asking them to come in.
And a lot of the reasons they weren't even there is because they didn't feel welcome enough to even show up.
Because if you imagine, I mean, I'm jumping on a mailing list and you barely have any time to do anything.
Fernando Perez, like, he was an early founder of NumFocus.
And he likes to talk about the privilege of being an open source developer.
Cause I mean, you were privileged to have all this extra time that you could give away honing your craft.
And now like open source developers are, I mean, kind of like put in an ivory tower and cherished, like, Oh, look at all the great things they're doing.
But back in the day, it was just like, we were just bored and we had extra time really.
Or we just loved it so much. But by helping teams develop a code of conduct governance, it really makes it clear that, Hey, this is a safe place to come work.
This is a safe place to come give your ideas. And there's rules about why, how we can give you feedback and why we will give you feedback.
And that kind of structure requires a movement.
It's not as easy as just making people do it.
And like one thing that we've done is, we have 50 codes of conduct basically now, and coming and helping people understand, like, look, codes of conduct.
I get it. They're not exactly what you want. There's problems with them, but at the same time, you know, do you want a less inclusive place or a more inclusive place?
Do you want to have something that where the rules are kind of stated or unstated?
Because when the rules are unstated and somebody doesn't play by the unstated rules, it's very hard to say, please stop playing by the rules.
Right. And so in that, I think that the governance has come up a lot and like, how do you make decisions?
Who, how do you even define, like who a core contributor is?
How do you value contributions that are not just code, like people who report bugs or fix documentation, things like that.
And so in many ways, I see NumFocus as my civil rights platform as well.
Like I want science, I care about scientists having the best tools, but I also care about making it very inclusive because I believe ideas come from everywhere.
And if you don't include our underrepresented groups and the majority of the world, really, then you're just going to have like very fragile software and very fragile ideas.
And so in my mind, they go hand in hand. Absolutely.
And I just wanted to add on to that a little bit because, you know, I mean, I've already revealed how old I am, right.
Joining into an open source community, especially for me, you know, that was a time when there weren't a whole lot of women in computer science.
Frankly, I was the only woman in my entire PhD program.
And, you know, this was a community that I found some people who were incredibly accepting and very inclusive and that were very happy to work with me.
And, you know, have patience for my gajillion questions that I had all the time.
And then there were people that weren't right. And you didn't feel very welcome.
Right. And that goes back to a lot of different things, but we didn't have codes of conduct back then.
I just want to point that out. And, but at the same time, the community, I think did a relatively good job of policing itself.
I'm really glad that we do have codes of conduct now.
I'm pretty sure that you and I have both had to enforce codes of conduct at PyData events before.
So, but I'm really, really glad that we have them in place and it's, you know, I think it goes a long way into creating that environment, right.
You know, it's not, it's not just a, you know, open source software community.
It's also just an open science.
It's the idea of keeping things open for everyone. It's a philosophy that I think jibes quite well with diversity and inclusion.
And but another thing I was going to add to that was that, you know, I was on the diversity and inclusion subcommittee as part of NumFocus.
And one of the big complaints that we received sometimes from underserved areas was that we would host events at these like super flashy, you know, big companies, right.
That were generally in areas that were not close even to like the neighborhoods that we were trying to serve in these, in some of these events.
Right. And there was also, you know, a certain amount of just discomfort at the idea of going to, you know, Facebook headquarters or, you know, Microsoft headquarters or things like that, because it wasn't a part of the community.
And I always thought that was kind of interesting that, you know, the idea of like opening it up and making people feel comfortable is not just about invitations.
It's also about creating environments where people feel like they can be more comfortable as well.
Yeah, totally.
We should do more in like public libraries. That would be awesome. It would totally be awesome.
I would love that. They don't have as good a cappuccino. I mean, I tell you what, you go to the Microsoft downtown, like the cappuccino machine.
It's hard to beat that. So there's a little bit of that going on. Oh, sure.
Absolutely. But yeah, since you did bring up governments, governance, you know, this is one thing that I did want to get into is that I think governance is a really important part of, you know, in order to be accepted into NumFocus, by the way, you have to be, you know, judged on certain criteria.
So one of them is, you know, making sure that you have a governance model, making sure that you have a code of conduct, making sure that you're serving the scientific community.
You know, people are going to be nice, right.
And inclusive. And so I do want to talk a little bit more about governance because I think sometimes, you know, from the outside, some people don't understand how these projects can be so successful with just a loose collection of people.
Right. But it's not really as loose as it may seem from the outside.
Right. So can you talk to us a little bit more about governance and how that works?
Yeah. And I think that I always like to compare like a few different projects.
The ones I kind of compare: you have SymPy, which is a symbolic mathematics library.
You have Pandas, which is a library, a data manipulation library, which is used by tons and tons of people.
And you have like Jupyter, which is a whole ecosystem, both a visual system but also a running server and so on.
They've got so much stuff. They have a parallel computing system in there.
They have security protocols and everything.
Like it's become the kitchen sink of like data science. But the three of them, right.
Really vary in how they make decisions. Jupyter has, you know, a steering committee.
They have, I think they have 150 repos now, but they have, you know, subcommittees within there.
And then they have like conference committees and a whole organizational structure, which is funny.
Cause I think they often, I get a lot, you know, I also get a lot of complaints from folks, like Jupyter folks.
It was like, Oh, it moves so slow. And I'm like, yeah, but look how many people and awesome things you've got.
And then you have like SymPy, where basically their governance is written in a half-page document.
And it was just like, yeah, Aaron controls who's committing.
And here's the requirements he gives.
It's not quite that, but it's, you know, there's a, here's a, here's our base, how you become a committer and here's how basic decisions are made.
We have no money and we have no like infrastructure. So yeah, we don't have to decide anything about that.
And then you have pandas, which is kind of in between, where they, you know, they have a server team, they have a finance team, and they have a technical committee and they're building relationships, but they're also like dealing with things like an ecosystem that's growing.
Like you have GeoPandas and pandas-datareader and a whole bunch of different things out there that are like building on top of them.
And they're like, how do we make protocol?
So they're kind of what I would call like the protocol building stage, if you will.
But they're still very much a library that you kind of use hands-on.
I guess I liken it very much to like, just like how businesses work, right?
Like you have your, I don't know if everyone knows how, but you have like LLCs, which like, you want to start a business?
You start an LLC, you go, you send off 200 bucks to the state government and bam, you're a business.
It takes, you get a book to put meetings in, but if nobody, nobody's ever going to check in unless you try to sell the thing versus like the C Corp, which you have to have a board and monthly meetings and like very rigorous kind of discussion about what you're doing.
And then if you go public, there's always other things just like that, as you grow and you have more people and more assets and more resources available, you end up building more checks and balances inside your governance system.
That kind of seems right. And for the, for the most part, what we ended up doing is just having a, we have our annual summit where we kind of come together and say, Hey, what works, what doesn't work?
They borrow ideas, they evolve their governance.
We have some templates we've written. I think Sam Bryce, who was at Two Sigma, wrote a bunch of, he was a visiting fellow for us.
He wrote a whole templating system for building out governance.
It's on our blog. And so I think there's a lot there, but it's definitely kind of clear that these days you can't just like have nothing.
Usually you just put it, put it on GitHub. Yay, we're an open source project.
No. So there's a whole thing, right? So, yeah. But I do think it's important also for people to understand that.
I mean, even if you become, you know, part of NumFocus, like we check to make sure that these basic things are in place, but we don't impose any kind of, you know, governance on it.
It really is up to the project to make sure that they're functional. But I do think that NumFocus also plays a role in continuity, right?
I mean, one of the worst things that can happen to an open source project is that everyone is sick of not getting paid and they go on and just, you know, do something else.
And there can be, you know, ugly handoffs.
There can be some pretty strange forks that happen. And you can also have, you know, major API changes for people who are reliant on the code.
Right. As well, projects that die, right. And which is a huge risk for companies that are trying to support open source or feel like they've made an investment in a project to suddenly have everybody that they're kind of counting on just kind of disappear.
Right. Which is worse, like Python 2.7 dying or, you know, Docker and like Rocket, like kind of madness.
Right. It's, the open source ecosystem.
It's funny. Like whenever my team kind of starts building a new thing and they're like, Oh, look, we have this, this new thing on the Internet that we found.
I'm like, okay. So how many GitHub stars do they have? How many contributors, when was the last patch?
And they're like, well, no, no, it just works. I'm like, no, it doesn't.
It might work today, but will it work like in a year from now when your server's on fire?
And which reminds me of the way y'all are blocking repo.anaconda.org.
I thought I was going to bring that up. You like, we found this out yesterday. You're like, I thought we were going to talk about that Andy.
My sysadmin put in Cloudflare's malware thing, and like all the data science is malware now.
It's probably true, but you know, I'm going to point that out.
So Cloudflare.tv. Yeah. No.
Did you file a ticket? Uh, how do I do that? Is there a governance package? You could file a ticket, but yeah, I do believe that came across my, uh, my computer yesterday as well.
So, so I mean, yeah, in that just shows you how there is actually a much larger ecosystem, like your open source project.
Like if it gets consumed by conda or PyPI or, um, npm packages or whatnot.
I mean, like, uh, I got to mention left-pad the other day, when JavaScript developers were bad-mouthing Python.
So I'm like, Oh yeah, we are. We stink. left-pad, left-pad.
We all make mistakes, but this, this whole thing needs an ecosystem to like actually support it up.
And I think, um, we, uh, one of the councils, we wrote a paper, or a blog post, really.
You know that you're an academic when you call a blog post a paper, but you know, uh, so we wrote a paper on the, like the levels of open source or levels of technology in general.
And you kind of think of just like a city, you have like houses, which might have one or two people in them.
And you have maybe an apartment complex, which will have hundreds of people or hundreds of families in it.
And then you might have a street, which would have lots of housing complexes and they'll have a hospital on the street and whatnot.
And the, the, uh, infrastructure needed to service the whole street versus just a house is hugely different.
And the same way in our ecosystem, uh, we as a community, NumFocus, I like to say we're kind of in that middle layer, not the bedrock. Like Linux is, you know, it's open source.
Can you imagine if Linux went down?
Um, I mean, Linux, we have a whole foundation, like it's, um, you know, a $200 million foundation that's really running like all the, all the like decisions around how our infrastructure works at the base server layer.
And then NumFocus, we tend to sit on top of that.
Um, and then like a lot of like, you know, every two person, GitHub, uh, library could bubble up someday and become the next Heartbleed.
Right. So it's, uh, Absolutely. And I think that that's also something I was going to ask about was, um, I always get this question from a lot of, um, you know, especially new developers or graduate students who like to, you know, that are a little masochistic, I guess.
They always wonder how they can take their closed source, um, you know, uh, projects and convert them into open source projects.
Right. And so I think, you know, maybe talking a little bit about like what are some of the characteristics of a healthy open source project and how would you make that transition?
And just do rm .git and like post it.
Didn't we just say that that's not how you do it. Oh, dang it. So I think the, the, the first thing I would say, react to that is that you, um, there's a difference between posting your code as open source, um, and creating an open source community.
Um, a good example is, when I was at, uh, the University of Texas, we had, uh, Kazushige Goto, who wrote the GotoBLAS library. He had, um, recently left, and I actually had his office, which was great.
Like the master's office. Um, and, um, the fastest library for linear algebra at the time, it was written by one person, kind of just a marvel of like engineering.
Um, but it was kind of 90,000 lines of assembly code.
Right. So like, all right, well, who's going to support this.
And we like went around the, the computing center and like, yeah, not me, not me.
And so I was like, well, why don't we just slap a license on and see where it goes.
Um, so we, we did that. We basically went through our process of asking, like, who owns this, and went to the, um, technology commercialization office, or wherever you're at.
Like, I think a lot of developers or people don't realize this.
Like, as a developer, often your time is bought, you know?
So if you're doing something at work, whoever owns the copyright to that is your job, that is, your employer.
Um, so first off, did I write this? And then, all right, so who owns the copyright? Because the copyright owner is the one who's going to grant the license.
So you have to go figure that out.
So we figured that out. We slapped a BSD, BSD license on it. And we put it on the Internet.
Great. Um, turns out, uh, a group in China, um, took it and developed OpenBLAS.
And so now, um, and their first thing to do was just fix it, so it, like, actually ran on a new processor, because the way the whole thing works is, like, it's pinned to a processor so it's all fast.
And so like the, the test suite wouldn't even work for like the latest Intel processor or something.
And so, you know, there's a group, they want it to be fast.
They want their supercomputer to be faster or like be as fast as possible.
They'd pull off GotoBLAS and start using it.
And then start giving back to the community. And, um, they, uh, you know, at this point, like now, uh, the funding, the last CZI funding that we had, um, it was actually going to NumPy, and NumPy is a major user and they depend upon it.
So they're actually, uh, contributing patches to OpenBLAS.
Um, but by building it up as a little bit more of a community accepting patches and then talking with the community a lot more, it actually blossomed as something that is usable by a lot more folks.
Um, and I mean, different folks take different kind of approaches. That's one approach of like a single, like a, it's probably pretty typical of like data science libraries where there's a few people building something and kind of going out, but then you have like the TensorFlow model, for example, which is like a concerted effort by a company to come in and like make a better product.
And they want people to use that product in order to use their hardware, their software, or their systems or whatnot.
Um, so you have TensorFlow where Google comes in and, and spends a lot of money making a very nice place.
I think, uh, IronPython was an example from, uh, from the old days, where Microsoft was trying to make us use IronPython so that we could program more on Microsoft systems.
Um, and in that, I think that there's kind of that community-driven, like, very organic approach, grassroots, if you will, or that kind of really company-supported, mandated approach.
And, um, I think both work in a certain capacity, um, but they also do different things.
I think, for example, one of our projects is Theano, and for the large part, TensorFlow was a re-implementation of Theano.
And Theano had a hard time supporting all the people who were using it.
Right. So they said, look, here's TensorFlow, um, and open, you know, all the deep learning stuff is taking off and we don't want to have to support those folks.
We want to write papers.
We want to figure out the new, new technologies. And so, um, hey, let's, let's deprecate Theano and we're just going to use TensorFlow.
And now you have kind of a, like, a whole explosion of more, uh, tensor-based deep learning libraries.
I think the other side of it is like companies use this. And I guess, and what I would kind of call like a big movement in the early open source days was to open source something and get some like ownership between companies so that you didn't have to, um, pay a license fee.
And like the first one of these was like Apache, right.
The, um, HTTP, uh, httpd library. Right. And so, you know, that was built, it was funded in large part by, like, folks like IBM.
So that, you know, Microsoft IIS would kind of go away.
Right. Um, and the strategy worked really well, right.
It works really well to get people to jump onto your library by giving it away free.
Um, but these days, like, I think the projects we work with, we want to share scientific ideas, and, like, you can write down a mathematical proof, or you can, like, give somebody a library to play with it and show it.
And, like, even today, you're seeing much more, like, the trend towards, like, you have to give me something that shows that your software works, at least in the data science world.
I'm not sure it's completely true in all domains of science, but for us, it's like, it's not true unless you actually have a model that does it.
Right. Yeah, absolutely. I, you touched on a couple of things.
I just want to make sure that the audience understands like, um, so CZI, we've talked about that a couple of times.
That's the Chan Zuckerberg Initiative. Right.
So that's a lot of, um, funding for open source projects and, um, CZI grants go out.
Um, NumFOCUS helps administer those grants and, uh, make sure that, uh, open source projects get funded.
And then also we mentioned the Moore Foundation. So that's the Gordon and Betty Moore Foundation.
They, um, do a lot of, um, funding for different projects, which I think is really cool because I mean, we do have these different sources of funding that are coming in because you do have, you know, um, places like, you know, our platinum sponsors, for example, like, you know, Microsoft and Facebook and all these other people that are making investments in open source, but you also have nonprofit money coming in to help fund open source.
And, um, but in the same vein, you know, you're not going to become a millionaire, right.
Like writing open source. So I think that, uh, sometimes, uh, I think it's really important sometimes to talk about like how, how do projects bring in money.
Right. I mean, we mentioned a couple of different things, but, um, you know, how, like, I think every project in and of itself kind of decides like, who's going to get paid or what are the initiatives or what are the features that we can actually pay for?
And what are the other ones that are just kind of, you know, things people can pick up and they're going to contribute.
And so, and how you build that community around it.
Yeah. It's funny because I didn't even like this point is a good point.
You're bringing up, um, I think of, like, private foundations.
So, like, Moore and CZI and Sloan, these are all private foundations that give out money.
So, like, um, like Intel, like, Gordon Moore founded Intel and used that money to start, um, the Moore Foundation, and Chan and Zuckerberg, Facebook, to the Chan Zuckerberg Initiative.
And, um, I mean, it's kind of going back to, like, Carnegie. Carnegie, a long time ago, started Carnegie Mellon and those sorts of things, um, and the Carnegie Foundation.
So they, those private foundations have a particular like, uh, strategy, which is to go in and give money.
Just like the Gates foundation is a good example of this, of like, they wanted to eradicate some diseases.
Right. I think malaria, um, and they saw, like, look, if we, like, put in a bunch of capital, this is achievable in, like, a five-year time horizon or whatnot.
And we can actually go out and do it.
And so maybe there's some social good, this public benefit and so on.
Um, and when we started NumFOCUS, NumFOCUS is a 501(c)(3), which is different from, like, the Linux Foundation, which is a 501(c)(6).
So C6, I said, yeah, IRS tax code.
Ooh, need some, need some more alcohol to talk about that. I am the treasurer.
It's all good. Go ahead. The tax code like has like, like six or seven different types of organizations you can have.
And the C3 is the public charity and the C6 is a, is a trade group, right?
So often you'll have trade groups that can come together to pool money, to put in, uh, to build a common pool of resources under a charity status, but as a trade group, and then they can do certain activities such as, like, uh, sell, sell things on top of it, um, sell services and things like that.
Um, except their money can all come from private institutions.
We're a public institution, which means that anything that's not relevant to our mission has to be declared differently and all that sort of stuff.
And there's always a legal side, but it also means that we can get money from the public and they have a tax-deductible donation.
Uh, and, um, we can, but we have to get a lot of our money from the public.
We can't get all of it from private sources. And so there's, like, different strategies of how you employ capital to do good work.
And so in that like corporations, we knew they had a bunch of public money.
They were willing to give away to things.
Um, it turns out I don't know that they have as much money for public charities and software as they do for private, like trade groups.
Um, which, I didn't know much about that law back then, but now it's kind of like, oh, okay.
That's interesting. Um, but then there's, but on the flip side, we can apply for grants from the government much easier.
Right? So we can apply for, like, NSF-type grants and whatnot as a research organization, from public money.
And so there's kind of a, a balance there of what you can and can't do.
And you get into the space and, um, you realize like, all right, so you're a, how are you going to make money as a developer?
Okay. So academics like they go off and get grants and they get their, their name kind of on a big library and they get well known for doing that.
Right. Companies will get, you know, sometimes there's just public charity and, um, often it's like they want that library to like work because they're, they have a risk on top of it.
I mean, you look at, like, Tidelift, which is the company that's, like, selling insurance on the question of, your open source, how is it going to be supported?
And they use that to give back to the communities and things like that.
And so like, there's a whole risk aspect of like giving to these organizations.
That's kind of what we were tapping into. Um, then there's companies like, you know, Docker, which like, Oh, we're going to build an open source thing, build a community, and then launch a company on top of it.
Um, for us, like a lot of folks, you know, who work with us, they're just doing grants from either private institutions or NSF type institutions and receiving basically a paycheck.
It's not usually crazy money or anything. It's usually just, you know, like what your paycheck would be, but you're going to do the thing you love and you're going to do it from your home and things like that.
And I think most open source developers, I know that they love to write this type of software and, um, it's hard to find a job that like actually pays you to do what you love sometimes.
And so I think for them, you know, making an impact on science.
I mean, I see all like, I guess I'm cited.
I'm on, I'm on the list of the SymPy paper, and I see all the people using our code, and it's like, oh, yeah, I'm cited for water buffalo research.
That's interesting. How did I help water buffalo? Right. So it's, like, some interesting kind of back and forth.
I mean, you see the impact you make, um, but to get paid, like, there's a whole range. And then there's the other kind of bit, like, yeah, if you want to make a quick, quick million bucks, I guess, it's to write an amazing software library and then go sell consulting on it.
Right. So I think there's a lot of people I know who do that, who got into that. And NumFOCUS and Apache kind of do serve as this, where we'll be the fiscal sponsor of the actual code.
Um, so companies will be like, Oh, great. Pandas is here. And, and Dask is here.
Um, Julia, there's another one. Um, oh, but we're going to hire Julia Computing or Coiled or Saturn or all these companies to do work on those libraries.
Um, but the library itself is open so we can continue to use it.
You know, that's, in some sense, uh, companies see that as insurance against, uh, kind of, what would you call it, consultant-ware, ransomware type, I guess.
Um, I know both of us have had jobs like that, right.
Where we have companies paying us to work on open source software.
Right. Um, and sometimes it's a hard sell, right. To like convince somebody like, Oh, Hey, this money that you're paying us, we, you know, you're going to benefit from this library, but so is everyone else.
Right. I was once paid by an oil company, and I just signed a paper that said I couldn't tell what library they were using or which company, and in fact, they wouldn't even let me touch a computer.
Right. They, they basically did, it wasn't Zoom back then, but it was like, they had, like, a phone call and a screen.
They were streaming to me and like saying, all right, how do we, how do we fix this problem?
And I'm like, Oh my goodness. All right.
Quick buck, I guess. Sure. Why not? Well, in that case, they were a little bit afraid to even like announce that they were using this library.
And I'm like, yeah, every oil company uses this library.
Why are you guys being so cagey? They were like, no, no, no, no.
Nobody does this. Okay. Sure. Sure. One place that I worked at that I'm not going to name, but you can probably guess was like that as well.
So they didn't want any known affiliation with any projects whatsoever, even though some of us were even active contributors to projects.
I think that goes back to like the evolution, the evolution of like the whole ecosystem is like, now you're, you have a set of governance rules and the company can trust that they can put money into you.
I think that a lot of, like, why the first money came to Jupyter and NumFOCUS was just, you know, Fernando and Brian were very, you know, compelling people to, like, support, and they saw their vision for Jupyter.
Now, like, Jupyter has had some products built on top of it.
And there's, like, tons of Jupyter products all over the map.
And you know, Microsoft wanted to like make them successful and then make them successful with Microsoft.
And so this was a really good way that they could give to the projects and help that project succeed.
But it kind of came down to like trusting.
We trust a few individuals and now we build a system and you can trust the system and so on and so forth.
Yeah, absolutely.
And I think it's kind of, so you just mentioned Jupyter, and I did kind of want to pivot a little bit over to the data science arena, because I think that data science in particular has benefited from the open source community probably more than anything.
Right. So just to bring us up a level, this is like the rodeo arena or the Mad Max beyond Thunderdome arena.
I want to make sure I've got, does it matter?
I don't know. Rodeo, they all walk out. I'm just going to say Thunderdome.
PyTorch wins.
But yeah, it's just, you know, this, the stack is completely ubiquitous and in data science, right?
Everyone uses, you know, Jupyter, NumPy, SciPy, Pandas, Matplotlib.
They also use R. We're okay with those people, but they do exist.
Oh, sure. On some teams, but it's... Not ours. When you can tell me how to put an R model into production, then we can talk, but we'll...
You call Hadley over at RStudio and he does it for you, right? Unacceptable.
Okay. But anyhow, you know, and I think that because, you know, data science is a newer field, right?
That became kind of a bigger deal over time and is now, you know, kind of in its heyday where everyone seems to want to be a data scientist and you see all these boot camps popping up and all these things like that.
It's built on this, you know, pretty deep stack of open source software that, you know, most people are familiar with.
And I find it kind of interesting at this point that, you know, it was actually really hard to get adoption for Python tools back in, like, the early 2000s, right?
Like incredibly hard, in fact. And now, you know, it's all over the place.
And, but what do you think, I mean, especially for data science arena, like, what do you, where do you see that going?
And like, how do you see that relationship with open source developing over the years?
Yeah. I mean, it is kind of like interesting to see how it's like evolved and to the point where like, yeah, you can take a 10 week boot camp and like come out as like a senior data scientist.
Right. It's still, it's kind of like, but it's a little bit frustrating when you're like, yeah, I, yeah, it took too long to get a PhD.
And like, you did 10 weeks and like you think you know everything there is about the field.
But at the end of the day, like a lot of times those folks are actually, you know, they, they learn a lot of stuff and they can, they can put a model into production very often because the tools have become pretty easy to use.
Now, when the model goes wrong, they'll struggle.
And so what I'm seeing a lot of is that the, the skillset is moving to a higher level.
And so to be a data scientist, it's not just about getting code out.
It's around, you know, understanding the models and really diving deeper into the, the actual problems you have to solve.
And it actually kind of, I see the, what I, the space is going in different ways and it's not really clear exactly where it's going to end up.
A good example, I was talking to a friend in finance and I'm like, you know, I really need something that does more backtesting on my, on my models that are in production.
And he's like, yeah, I mean, the bank has, has like that, because if we don't, we lose billions of dollars.
I'm like, oh yeah.
And then I get, like, three emails that day. I'm sure that there's a bot listening to me, because, like, DataRobot released, I think they have Humble AI, which is, like, their new monitoring system.
Then you have, I think it's, oh, Domino.
I think they have the Domino Model Monitor. And, like, there's a whole bunch of, like, these systems that are now being built on top of those open source codes.
'Cause, like, now it's, you can't just hire, like, I mean, I still, I still put my models in Flask and, like, kind of own the whole code myself.
And so I was telling my team, it was like, I think I had the intern group actually revolt when I tried to make them do that this summer. I had 20 interns.
And like three of them come to me like, this is ridiculous.
How, how are we supposed to do all this? I'm like, oh yeah, it used to be.
That's what you had to learn to like be a data scientist. You had to be able to put your model into production on a Kinesis or Kafka stream and like be responsive and log everything and all that sort of stuff.
And, like, the, the field's kind of whittled down to just getting the model and evaluating the model and things like that, which I think is great because it means more people can do it.
And it means that there's more adoption of the systems, but like the meta systems, like the monitors, the benchmarks, all those sorts of things are kind of being consumed by the platforms.
And so you're, you're seeing the same thing that happened with like AWS where AWS is amazing.
We have great new SaaS services. I don't have to think about Linux and managing a hard disk failure and things like that.
But at the same time, like, well, but if I need to get my own kernel and like mod the kernel, like how do I do that on AWS?
And so it becomes harder to modify and change because the skillset of doing that is like dwindling.
And so in that, I see a lot of like as you become ubiquitous, your tools that everyone has, it becomes commoditized and then people lose the skills of like modding out the lower level software.
And so absolutely true.
And, you know, even going further back than that, I mean, you know, my, my PhD was a concentration in artificial intelligence before we even had all these libraries with all these, you know, implementations of algorithms.
We actually had to design the algorithm and implement it too.
Right. So the engineering cost on data science has just gone, you know, way, way, way down.
We still have the data wrangling part.
And like you said, the deployment part that are now our big headaches, but the modeling part, like that's a whole lot easier for us.
Right. Yeah. I had a front end intern ask us why would I ever iterate over a thousand items in an array?
I was like, um, oh no. Algorithms, like, yeah, it's like, it becomes a, it becomes a special skill.
And I think that's the thing that used to be like, yeah, algorithm was just what you did.
Like every, every one was new.
So I heard Tony Hoare give a talk. He's the Hoare of Hoare triples, and he's the founder of quicksort.
So I asked him, how did you find quicksort? And he's like, well, I wrote down the first thing I could think of for sorting, and that's bubble sort.
And that was N squared. And then I wrote the second thing down I could think of and it became quicksort.
You're like, okay. So you're just like conversant in algorithms all day, every day.
And like, you just like iterate through them. And like these sorts of things just become pretty natural.
And like having a pivot and like pivoting around things was just a natural skill that you had to do as an algorithm writer.
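For readers who have not met the two algorithms named in this story, here is a minimal Python sketch of both. It is an illustration only, not Hoare's original code: the function names are made up for this example, and picking the middle element as the pivot is one common choice among several. Bubble sort compares neighbours over and over, which is the "N squared" behaviour mentioned above; quicksort picks a pivot, moves smaller values to one side and larger values to the other, and then sorts each side the same way.

```python
def bubble_sort(items):
    """Return a sorted copy by repeatedly swapping adjacent out-of-order pairs.

    Worst case is on the order of n^2 comparisons.
    """
    result = list(items)
    for end in range(len(result) - 1, 0, -1):
        swapped = False
        for i in range(end):
            if result[i] > result[i + 1]:
                result[i], result[i + 1] = result[i + 1], result[i]
                swapped = True
        if not swapped:
            # A full pass with no swaps means the list is already sorted.
            break
    return result


def _partition(a, lo, hi):
    """Hoare-style partition of a[lo..hi] around the middle element.

    Returns an index p such that every value in a[lo..p] is <= every
    value in a[p+1..hi].
    """
    pivot = a[(lo + hi) // 2]
    i, j = lo - 1, hi + 1
    while True:
        i += 1
        while a[i] < pivot:
            i += 1
        j -= 1
        while a[j] > pivot:
            j -= 1
        if i >= j:
            return j
        a[i], a[j] = a[j], a[i]


def _quicksort_in_place(a, lo, hi):
    """Sort a[lo..hi] in place by partitioning and recursing on both halves."""
    if lo < hi:
        p = _partition(a, lo, hi)
        _quicksort_in_place(a, lo, p)
        _quicksort_in_place(a, p + 1, hi)


def quicksort(items):
    """Return a sorted copy; on average about n log n comparisons."""
    result = list(items)
    _quicksort_in_place(result, 0, len(result) - 1)
    return result
```

Calling `quicksort([5, 3, 8, 1, 9, 2])` or `bubble_sort([5, 3, 8, 1, 9, 2])` gives back the same values in ascending order; the difference is in how much work each does as the input grows.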
And if you think about how registers worked and how you had to shift around bits on a, in memory or on a linear disc, you had a, you had to be thinking like that all the time.
And so I think that's a skill that's being lost, which, you know, but at the same time, like, I was going to mention, like, oh yeah, then you have, like, things like GPT-3 come out, and now I don't even need the JavaScript developer, because we can just, like, tell it to, like, make our layout for us.
I love you JavaScript developers. I just, you know, we have to keep up the, you know, back and front.
Sure. We got to keep it spicy. But I just real, real quick on a little time check.
I did want to leave a few minutes for questions.
We're kind of, believe it or not, we're kind of winding down on our, on our happy hour.
So I'm just going to check in real quick and see if we have anything that's come in.
But not seeing anything right now, but in that case, actually, do you mind talking a little bit about GPT-3?
And so this is OpenAI's new system that just came out, and it's, it's pretty freaking cool, by the way.
Do you feel comfortable kind of talking more about it?
Well, it's just, I mean, not, not more than just saying, like, you know, it's a new language model that, you know, is what people are talking about. I guess the other day someone mentioned you need more ethics in AI, and, like, every time one of these language models comes out, you're like, oh my gosh, it's scary
how good it is at generating human-like content. And so GPT-2 is another one, and, like, already you see people building programs out of it that could build a whole JavaScript layout or could build a whole Google search engine off of it.
I think I saw a medical diagnosis piece of it. So very quickly, we're coming to the part where you can give it a bit of information and some corpus, like, like all of GitHub, and you can then describe what you want to build, like a button with a widget.
And it'll actually spit out the text that makes that thing happen, or answers the medical question and whatnot.
And, like, I have no idea, like, I haven't had time to jump in and actually see the model or understand the architecture or anything like that.
But it's one of these deep learning language models.
I think it's an evolution of GPT-2 from last year.
And I believe the writer of that had major ethical qualms about it coming out.
They didn't want to even release the model until they had a way to detect if somebody was using the model in the deep fake kind of realm.
And so, like, and it's just like anything, so the best way to detect if something's fake is to use the algorithm that says whether it was generated, and things like that.
And so once they had that kind of worked out, they kind of released it.
And I believe that GPT-2 came from Google, but maybe it came from OpenAI.
I'm not, I don't, I thought it was OpenAI, because I know GPT-3 is for sure from OpenAI.
Yeah. Yeah, totally. So even there, OpenAI is a fun one where it was like, it was a nonprofit.
Then it turned into a company.
And so like, even though the space is like, I don't know exactly why they did that, but all of a sudden, like you see like, all right, companies are making things that like look like humans very quickly.
And so it is an interesting time we live in and it'll be, you know, in the world of, I know I do a lot of ads and personalization of ads and like, Oh great.
I can like actually like have an ad that is personalized to you in plain written English on a server that was generated very quickly without having someone to pick up the phone to or something.
And so it's super applicable to everything people do every day, but it's kind of a new, a new shot at this, and we'll see how, how well. I've already had, one of my friends is a philosopher who writes about deep learning.
He's sent it over. It's like, Oh my God, what's going on here?
Explain to me what this, what this translation layer is.
I'm like, Oh no, I don't know. Don't answer that email. He's a good guy.
He wrote a bunch of, a bunch of things about how crows think and giving crow like animal cognition and whatnot.
And then he started getting into deep learning.
I'm like, Oh, when you get, when you read about a philosopher, read a philosopher talking about your work, you just sit in there like, I don't think that's what's going on, but okay, man.
Okay. We actually only have a couple of minutes left.
I did want to at least take a couple of minutes to talk a little bit about PyData and, and what that is.
Just so people know that PyData is the educational arm of, of NumFOCUS.
And the main thing that I think most people are familiar with are the PyData events.
So we do live events where we get speakers together and the community together.
And we all talk about, you know, things people are working on, new libraries, all sorts of things like that, bringing developers and users all together and sharing knowledge.
And unfortunately, in the age of the pandemic, we are not able to do any of our live events right now.
Which is really sad, because they're really a lot of fun.
And I always learn a lot. Andy and I are usually, you know, speakers at at least a few of them.
So that's kind of fun.
And getting to know the community, but we do have a PyData Global event that's coming up this fall that will be online.
And so I think that's an opportunity for everyone to, you know, especially communities that don't normally get to connect.
I think it might be a great opportunity that, like, I mean, 'cause we have a, you know, a PyData, you know, Poland and a PyData LA and a PyData Austin and a PyData New York.
This will be a chance to bring everybody together from all these different communities so that we can all share knowledge there.
But you want to say a couple of words about that before we wrap up?
Yeah, I think it was great.
It was a, I mean, the conference really started with Travis Oliphant, Peter Wang, scratching the itch of trying to get Guido to put good support for compilers and external libraries into pip.
And I don't know if it was pip back then, but yeah, so it became an event.
We talked to Guido. It's like, Oh, you guys have a real problem.
He spawned a whole community out of it. And now we have events all over, talking about all things Python and Julia and R.
Awesome. Well, I just want to thank you so much for joining me.
It's been really, really nice. I hope that you get to, you know, experience Cloudflare TV some more, but with that I just want to wrap up and thank everyone for watching.
And I hope you enjoyed your open source happy hour.
All right. Thanks again.