Summary
One of the most persistent challenges faced by organizations of all sizes is the recording and distribution of institutional knowledge. In technical teams this is exacerbated by the need to incorporate technical review feedback and manage access to data before publishing. When faced with this problem as an early data scientist at Airbnb, Chetan Sharma helped create the Knowledge Repo project as a solution. In this episode he shares the story behind its creation and growth, how and why it was released as open source, and the features that make it a compelling option for your own team’s knowledge management journey.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host as usual is Tobias Macey and today I’m interviewing Chetan Sharma about Knowledge Repo, an open source framework for managing documentation for technical users.
Interview
Introductions
How did you get introduced to Python?
- EE + CS/AI + Stats degrees
- Airbnb working on ML models
- Knowledge Repo itself
Can you describe what Knowledge Repo is and the story behind it?
- We started seeing interviewees use IPython notebooks and thought they were great
- Wanted to push more people to use notebooks, but they weren’t very shareable or vettable
- Existing notebook hosting services weren’t very good, and weren’t built for people who aren’t data stakeholders; rendering was especially poor with images and annoying cell blocks
- Made a simple post processor to remove cell blocks, push the images to S3, and host on Flask (see the sketch after this list)
- Once we were pushing notebooks into a GitHub repo for hosting on a Flask app, so many things became possible
- Review cycles
- Shareability / collaboration features
- Indexing / searching
- Concurrently, great work was happening on developing internal R packages / Python libraries to provide consistent, branded aesthetics
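A minimal sketch of the kind of post processor described above: render an executed notebook to markdown, push the extracted images to S3, and rewrite the links. This is illustrative rather than Airbnb's actual code; the bucket name and URL layout are assumptions, and it presumes nbformat, nbconvert, and boto3 are installed.

```python
# Illustrative sketch only, not the Knowledge Repo project's implementation.
import boto3
import nbformat
from nbconvert import MarkdownExporter

BUCKET = "example-knowledge-assets"  # hypothetical bucket name

# Read the executed notebook and render it to markdown; nbconvert returns
# the markdown body plus a dict of extracted binary outputs (the images).
nb = nbformat.read("analysis.ipynb", as_version=4)
body, resources = MarkdownExporter().from_notebook_node(nb)

# Push each extracted image to S3 and point the markdown at the hosted URL.
s3 = boto3.client("s3")
for name, data in resources.get("outputs", {}).items():
    s3.put_object(Bucket=BUCKET, Key=name, Body=data)
    body = body.replace(name, f"https://{BUCKET}.s3.amazonaws.com/{name}")

with open("post.md", "w") as f:
    f.write(body)  # ready to commit to the posts repo and render in Flask
```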
What are some of the approaches that teams typically take for recording and sharing institutional knowledge?
- Copy and paste to Google Docs and Slides
- Facebook was using Facebook photo albums
- Untrustworthy, not discoverable, divorced from the code
What are the unique requirements that are introduced when attempting to record and distribute learnings related to data such as A/B experiments, analytical methods, data sets, etc.?
- Reproducibility is a big one
- Making sure the learnings are trustworthy (good data? no bugs?)
- Distributing widely, across the org and across time
- Experimentation
- Experimentation is at the end of a research-design-build-measure cycle; strategic analysis often comes before it
- Capturing all of the context
Can you describe how the Knowledge Repo project is architected?
- Repositories: a store of posts, most commonly a github repo
- Markdown as the original lingua franca; eventually a KR-specific “KR post” concept (which is still basically markdown)
- Post processors
- Convert whatever upstream file to markdown / a KR post (Jupyter notebooks, R Markdown, and markdown were the original ones); a sketch follows after this list
- Handle images and other large assets, usually pushing them to cloud storage
- Evolved to handle PDFs, Google Docs, Keynotes
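The post-processor layer can be pictured as a small registry keyed by source format. The real project's API differs, so treat this as a sketch of the abstraction rather than its actual interface:

```python
# A sketch of the dispatch idea, not the project's real API: each upstream
# format only needs a function that yields markdown (the lingua franca).
from pathlib import Path
from typing import Callable, Dict

def convert_markdown(path: Path) -> str:
    # Markdown is already the target format; pass it through untouched.
    return path.read_text()

def convert_ipynb(path: Path) -> str:
    # Strip cell delimiters by rendering through nbconvert (see the earlier
    # sketch); image extraction and upload would also happen here.
    import nbformat
    from nbconvert import MarkdownExporter
    body, _ = MarkdownExporter().from_notebook_node(
        nbformat.read(str(path), as_version=4))
    return body

CONVERTERS: Dict[str, Callable[[Path], str]] = {
    ".md": convert_markdown,
    ".ipynb": convert_ipynb,
    # ".Rmd", ".pdf", ... would register here as the system evolved
}

def to_knowledge_post(path: Path) -> str:
    try:
        return CONVERTERS[path.suffix](path)
    except KeyError:
        raise ValueError(f"No post processor registered for {path.suffix}")
```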
What were the motivating factors for making it available as an open source project?
- It was such a common problem. Even incredibly sophisticated data teams at Uber, Facebook, etc. were begging us to share the system.
What is the workflow for creating, sharing, and discovering information in an installation of Knowledge Repo?
- Create a GitHub repo for hosting strategic analysis
- Use the KR script to create a stub/template for whatever format you’re working in
- Do your work in Jupyter, etc.
- Instead of using the git commands (git add), use the knowledge commands (knowledge add), which are basically the git commands with post processors attached
- Follow typical GitHub workflows
- See the result in the hosted knowledge repo app (a sketch of the round trip follows after this list)
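For concreteness, the round trip might look like the following, shelled out from Python for illustration. The subcommand names follow the open source project's README as best recalled (init / add / preview / runserver), but flags can vary by version, so verify against your installation; the repo path and post names are hypothetical.

```python
# A sketch of the CLI round trip; verify subcommands against your version.
import subprocess

REPO = "./company_knowledge"  # hypothetical local clone of the posts repo

def kr(*args: str) -> None:
    # Thin wrapper that fails loudly if any step errors out.
    subprocess.run(["knowledge_repo", "--repo", REPO, *args], check=True)

kr("init")                                     # one-time: initialize the repo
# ... do the analysis in Jupyter, producing retention_deep_dive.ipynb ...
kr("add", "retention_deep_dive.ipynb",         # post-process and stage, like git add
   "-p", "projects/retention_deep_dive")
kr("preview", "projects/retention_deep_dive")  # render locally before review
# after the usual GitHub review cycle merges the post:
kr("runserver")                                # serve the hosted web app
```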
What are some of the options available for extending or customizing an installation of Knowledge Repo?
- More post processors! Google Docs, presentations, UX research: anything can be brought into KR with a simple post processor that turns it into markdown/images/PDF (see the sketch after this list)
- Tying the system to your internal data tools, for example an experimentation system like Eppo, or whatever you use for marketing campaigns
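As one example of how small such an extension can be, here is a hypothetical converter (names invented, not shipped with KR) that turns a folder of exported slide images, the Keynote / Google Slides case mentioned above, into a single markdown post body:

```python
# Hypothetical extension in the spirit described above; not part of KR itself.
from pathlib import Path

def slides_to_post(slide_dir: str, title: str) -> str:
    """Turn a folder of exported slide PNGs into one markdown post body."""
    lines = [f"# {title}", ""]
    for img in sorted(Path(slide_dir).glob("*.png")):
        lines.append(f"![{img.stem}]({img.as_posix()})")  # one image per slide
    return "\n".join(lines)

# e.g. print(slides_to_post("ux_research_export", "Q3 UX Research Readout"))
```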
If you were to start over today, what are some of the ways that you might approach the solution to knowledge management differently?
- Think of it more holistically: a knowledge base should capture every knowledge-generating work stream (experimentation, UX research, etc.), not just data analysis
What are the most interesting, innovative, or unexpected ways that you have seen Knowledge Repo used?
- UX research
- Writing up a guide to acqui-hiring
- Demonstrating the capabilities of a Python or R framework
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Knowledge Repo?
- Strategic analysis needs to be elevated; it leads to paradigm shifts
- Organizational problems are helped by tools like KR, e.g. promotion cycles
- Meeting people’s tools/workflows where they are is powerful
When is Knowledge Repo the wrong choice?
Keep In Touch
Picks
- Tobias
- Chetan
- Underrated cooking ingredients: chickpea flour, butter-fried kimchi (in grilled cheese, nachos)
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle-tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. And do you want to get better at Python? Now is an excellent time to take an online course. Whether you're just learning Python or you're looking for deep dives on topics like APIs, memory management, async and await, and more, our friends at Talk Python Training have a top-notch course for you. If you're just getting started, be sure to check out the Python for Absolute Beginners course. It's like the first year of computer science that you never took, compressed into 10 fun hours of Python coding and problem solving.
Go to pythonpodcast.com/talkpython today and get 10% off the course that will help you find your next level. That's pythonpodcast.com/talkpython. And don't forget to thank them for supporting the show. Your host as usual is Tobias Macey, and today I'm interviewing Che Sharma about Knowledge Repo, an open source framework for managing documentation for technical users. So, Che, can you start by introducing yourself? Yeah. Hey, Tobias. Yeah. My name is Che.
[00:01:51] Unknown:
I'm the founder and CEO of Eppo, a next-gen experimentation platform. Although today we'll be talking about my previous experience, where I was at Airbnb and led the team that created a system called the Knowledge Repo. So, you know, we know each other a little bit. I kinda worked across the stack in data science, from machine learning to analytics and then data tools including Knowledge Repo, and then kind of went to a bunch of earlier stage companies, most recently Webflow, kind of running the same playbook, being the first or second data scientist, seeing how impact happens.
Kinda cool to see that type of technology is starting to
[00:02:27] Unknown:
get more and more attention. And do you remember how you first got introduced to Python?
[00:02:33] Unknown:
My background was electrical engineering undergrad, and then I took a lot of CS and AI classes, so that kinda naturally exposed me a little bit to Python. But at Airbnb, my first 2 years were on production machine learning. A lot of that world happened in Python even then, so I started doing a lot of Python stuff. And then towards the end of Airbnb, when I started working on the knowledge repo and open sourcing it, I got to learn a little bit around, you know, application development in Python.
[00:03:01] Unknown:
And so in terms of the knowledge repo project itself, as you mentioned, you helped to create it while you were at Airbnb. It's an open source project that anybody can use and apply for their own purposes. But I'm wondering if you can share some of the story behind how it came to be, some of the early days of using it at Airbnb, and some of the motivations for releasing it as an open source project?
[00:03:25] Unknown:
Yeah. Absolutely. So, you know, when I think of the knowledge repo, I think of it as really a platform for magnifying the impact of strategic analysis. And when I say strategic analysis, it's the type of research and investigations that might suggest, like, radical or, you know, powerful changes to an organization, such as a whole topic area that's underserved or some lever for impact that's underutilized. And I think that, especially the way the data landscape has evolved, there's been a steady march towards an emphasis on reporting.
Just saying, we want core metrics, and you can slice them by a couple things. And the good news is, as that stuff gets much more automated, it leaves room for analysts to do really strategic things, such as, you know, figuring out if there's some marketplace liquidity problem or figuring out if there's some opportunity for impact in some novel product area. But the knowledge repo had much humbler beginnings at Airbnb. It really just started out from us growing the Airbnb data science team. We were probably, let's say, 8 or 10 people at the time. And we started noticing that some of our interviewees would do their interview with IPython notebooks. Back then, it was called IPython.
And when we saw it, it just kinda was a nice experience. It was, like, very easy to follow what they did. The charts were all in there. The code was right in there. Like, if we're trying to review their work, it was a really nice environment. And so at an off-site (you know, we'd have these periodic data team off-sites), the head of data at that time, Riley Newman, was just saying, like, I think this is a really good way of doing data work. I would love for our team to do it more holistically. But we just kept noticing that while IPython is a good project for an individual doing work, it was just not very well suited for organizations, because it was hard to share. It was hard to collaborate.
So we started thinking around what we could do to just be able to share an IPython notebook. That was the basic thing: we want to share an IPython notebook. And the answer to that was, well, we can get it on GitHub, and that might be shareable. But what do you need to do to get it on GitHub? There's a little bit of post processing. You gotta get the images somewhere, because you don't wanna put the images on GitHub. And so we just made a light wrapper, put the images on S3, did a bit of post processing, and put them on GitHub. So that was the v0 knowledge repo: IPython notebooks on GitHub. From there, once we started realizing there were just a lot of interesting things we could do, we whipped up this quick little Flask app that was like, okay, let's render the things on GitHub so that, you know, you have these good URLs you can share, it doesn't require GitHub accounts to view, and it kinda ingests this underlying set of posts.
And once you go that route, suddenly the floodgates open. Right? Suddenly it's like, we can do really interesting social features. We can add the idea of metadata, of subscriptions, a bunch of other things. So it kind of started out from the humble roots of post processing IPython notebooks onto GitHub, and then it turned into a much wider workflow.
[00:06:35] Unknown:
Given the time frame, this was on the order of, what, 7 or 8 years ago?
[00:06:40] Unknown:
Yeah. This was, like, 2013, 2014. So yeah.
[00:06:44] Unknown:
About 8, 9 years ago. And for people who are listening now, when you say, I wanna be able to share a notebook, they say, well, you know, there's a half dozen different approaches to that. There's Binder. There's Google Colab. There's Hex. There are I don't even know how many different products out there now. But at the time that you were working, none of those things existed yet. Jupyter notebooks were still fairly nascent. I don't think it had even been called Jupyter yet. And so I'm wondering if you can talk to some of the complexities that you had to deal with in order to be able to manage the rendering of these notebooks, particularly given that notebooks are generally a very interactive experience, but my guess is that you were looking mainly to just be able to say, this is the rendered notebook with the analysis embedded inline, and some of the technical choices that you had to make to be able to make that a tractable problem?
[00:07:41] Unknown:
Yeah. And it was kinda cool. I would say that, like, we were able to make some pretty good progress. You know? Like, it was complex but doable. I would say the main things we had to figure out were, one, when you have large assets such as images or videos, whatever, you gotta put them somewhere, and, like, that place can't be GitHub, so cloud storage is a natural solution. Then, with IPython notebooks, the underlying artifacts weren't really designed to be read directly. Like, if you just put an IPython notebook up on GitHub, you'll see all these, like, markdown cell brackets and stuff like that that make it very hard to read, and so we had to kind of post process and take those out.
And then the last thing is we had to just make sure that the notebook actually runs through. Because as you might have seen even today with Jupyter notebooks, it's quite easy to run a few cells at the top, then run some cells at the bottom, then change the ones at the top, and suddenly there's a code fork within the same notebook. So what we would do is basically require that notebooks ran through. Like, you could just say, run the notebook, and it had to work before you actually uploaded it. The other pieces of the complexity were actually below the foundation of the knowledge repo. We benefited from a lot of other work streams that were happening at the time. So, for example, being able to work with data warehouses, where all the data lives, from a notebook environment: you have to make these SQL connections, you gotta manage that, and it would have been a bit annoying. But, actually, in a separate work stream,
some people had developed R packages and Python packages that were specific to Airbnb, so that a data scientist could just use them and automatically be hooked up into our, you know, Hive and Presto. So that was an enormous benefit. And then the other big benefit was there was already underway this effort to centralize and vet a set of canonical tables. So there would now be one table that was Airbnb bookings, and it was owned by the data engineering team, and you could just kinda always assume it was there. So from a replicability standpoint, if you were going back to the knowledge repo and you wanted to run one of those notebooks, it wouldn't be the case that the underlying table was just gone because it was an ephemeral thing. There were now these kinds of long-standing assets.
So there were a bunch of pieces that had to be in place for the knowledge repo to really take off that, you know, we fortunately saw kind of all developing at the same time.
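The "notebook must run through" gate described here can be approximated with today's tooling. A minimal sketch, assuming nbformat and nbclient are installed; this is not the tooling Airbnb actually used:

```python
# A sketch of a "runs top to bottom" check using nbformat + nbclient
# (an assumption, not Airbnb's actual tooling).
import nbformat
from nbclient import NotebookClient

nb = nbformat.read("analysis.ipynb", as_version=4)
client = NotebookClient(nb, timeout=600, kernel_name="python3")
client.execute()  # raises CellExecutionError if any cell fails in order
nbformat.write(nb, "analysis.ipynb")  # persist the freshly executed outputs
```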
[00:10:04] Unknown:
Separate from the complexities of dealing with Jupyter and IPython Notebooks, the other interesting angle to this is just the overall problem of being able to collect and share and disseminate institutional knowledge. And there are any number of different systems out there for that. And I'm curious if you could talk to some of the approaches that teams might typically take for being able to approach that problem and some of the specific requirements that are introduced when you're dealing with a technical team?
[00:10:37] Unknown:
Yeah. Absolutely. So it was actually a really fascinating thing with knowledge repo: when we developed this workflow, we felt pretty good about it, and we started just getting in touch with other data teams in the Valley at the time. So this was Uber's heyday, Dropbox, Lyft, all these others, and we kinda just went and talked to their data teams and asked, like, how do you handle the sharing problem? And it was really fascinating, because some of the most sophisticated data teams in the world were doing some, like, very, very basic things around knowledge sharing that none of them were happy about. Like, everyone was like, this is terrible. But the things you might see are just, like, screenshotting things into Google Docs, maybe kinda copy and pasting the code below that.
I remember at Facebook at the time, the way they did knowledge sharing was to take data analyses and put them in Facebook photo albums. So they would just, like, copy charts into that, and you can kinda carousel through it. You know, they were really into dogfooding it. I've seen people, some of them, using kind of Confluence wikis. There's kind of a variety of just using the existing collaboration tools. But the problem that kept happening there is that if you're gonna do something like work off a strategic analysis, or in general anything in the data world that is affecting decision making, trust is really this underlying foundational piece that needs to be established. And so you get into this question of, like, what does it take to trust an analysis?
Well, one, you should make sure it's reproducible, because if it cannot be replicated, then, you know, you can't really get in and make sure it's good. Two, we would hope that someone looks it over, because as I'm sure most of your audience knows, if data work is bad, it doesn't just lead to something crashing or some, like, very visible thing that is bad. It's just wrong. It's just wrong in ways that are subtle. Like, you threw out some data when you really shouldn't have, and that naturally changes the result. And so just getting some sort of review process on top was really helpful. You know, the sort of stuff I think data people sometimes lose sight of is that aesthetics and consistency and brand and an easily shareable URL, there's a certain usability and aesthetic side, actually continue to breed trust. So in our own Python packages, we, like, created color palettes so everyone could be using the same Airbnb color palette and the same kinda line width, all this sort of stuff.
So, yeah, what we'd see in a lot of other companies is they would be putting this sort of work in, like, really random places, in ways that could never be found again. They would be in some Google Drive in a place where, unless you had the link, you'd never find them again. When you found them, you had no idea if it was good or bad, whether it was, like, a scratchpad where someone dumped something or whether it was actually an important artifact that should've been passed around. You couldn't replicate it, because the code was inevitably forked from the result.
What ended up happening is people would just start over.
[00:13:34] Unknown:
And so in terms of the use cases and workflow of applying knowledge repo, introducing it to a team, getting everybody onboarded, and then sort of growing its utility beyond the bounds of a single team, I'm curious if you can talk to some of the design elements that you built in to make that an enjoyable experience, some of the ways that the early days helped to produce feedback about ways that you could improve the project, and ways that you were able to popularize it within Airbnb and beyond?
[00:14:12] Unknown:
The way I think around pretty much every tool in getting buy-in is: what is the amount of friction to adopt the tool, and what are the benefits of adopting the tool? And we worked very hard on both sides. One of the things that I think was a really crucial decision (and this kinda came out of the founding story, as I mentioned) is that we met people at the tools they wanted to use. We didn't ask them to change the tools they wanted to use. They wanted to work in Jupyter and IPython notebooks, in R Markdown, and we just let them use those things. So that was really helpful, because if it doesn't require a radical change in workflow, that's great. The other thing is that we made the CLI toolkit, all the command line commands, essentially mirror images of the git commands.
Instead of doing git add, it was just knowledge add. And that sort of process also was, like, a very small change in your workflow, basically minimal change in your workflow. It was just stuff you were gonna do anyway. That really helped. And then on the other end, to make it more powerful: once you did your work, you had this thing that really could broadcast around the org. It could reach people on the other side of the building. It could reach people in the future. The stuff you did just had this inordinate effect. In addition to, like, literally more eyeballs and shareability, just having a URL to share around, it gave visibility up and down the organizational hierarchy, where, like, you know, data science managers had so much better of an idea of what their team was doing.
They could share it up and down their chain. Just so much more visibility on what you were doing. So those were kinda, like, the direct ways of getting adoption: from a bottom-up standpoint, it was a very comfortable toolkit that didn't involve much workflow change; from a top-down standpoint, you just got so many benefits from distribution, as well as visibility to people who handle promotion cycles and stuff like that. And in terms of the actual implementation,
[00:16:13] Unknown:
I'm curious how you architected the system to be able to make it maintainable: being able to handle the useful workflows around maybe access control or the sharing capabilities, being able to possibly version entries, being able to understand sort of what the publication workflow looks like, and just all of those aspects of the technical underpinnings of the system?
[00:16:39] Unknown:
Yeah. So this is, again, another piece of leveraging tools where they are. You know, there's a great tool out there for handling version control and review cycles and all that, and it's called GitHub. So I think there's a version of this story where we try to handle much more of that internally, but it was just like, look: GitHub is great at review cycles. Notebooks are great for an isolated working environment and lead to very reproducible work. We're just gonna essentially provide the glue to make this workflow all work. From a version control standpoint, there were some things we had to do around images, like avoiding overwriting previous ones if you want true version control.
But by and large, it was we got a lot of leverage by just utilizing the existing tools.
[00:17:27] Unknown:
In terms of the evolution of the project, I'm curious how much of the initial ideas and assumptions and design have survived to where it is now and just some of the ways that it has
[00:17:44] Unknown:
grown and changed in terms of its scope or capabilities from when you first started working on it? Definitely grown a lot. When we first started working on it, again, it was just IPython to GitHub. But I think the way it evolved before we started open sourcing it was to just lean into known product levers for increasing distribution. Like, so much of it was just, how do we get more eyeballs on these things? And for that, I remember one of our inspirations was looking at Quora, who had all these, like, tags and, you know, searches, search by author, search by topic, whatever. You know, I'll say that with the idea of tagging, I learned a lot: it's hard to get good governance on tags. We had to build all these features to, like, batch edit tags and curate them. And to this day, I'm not confident in tagging as, like, a very great organizational principle. But things like search, just having a search bar where you could go through and see all the stuff and easily find it, where everything you see is a vetted, reproducible thing: that was really good. Another thing was being able to see all the work by a given person. That was really helpful, because you could see, like, you know, what did this person do, or who are all the people who worked on Airbnb in China or something like that. So this sort of, like, being able to query the system to say, has anyone worked on this,
was in general a really powerful thing. And then other things we did were, again, very, like, kinda common sense stuff. Like, we created weekly email digests of, like, what is some great work that came out recently. We let people subscribe to authors or topics or whatever, so when something went out, it could be delivered through their email. And that actually had another interesting effect: because we could email people these things, suddenly they could read analyses on their phones. And that was also a really nice thing, because if you're some person who's pretty busy and then you're just kinda in between things and you just kinda wanna read this interesting article the data team produced, that's really cool. You know, if you think of how a lot of explainers or news articles ever get disseminated, you're probably reading them not when you're, like, sitting down at the laptop doing work. You're probably reading them in idle time. So that was a nice way to kinda get more and more distribution.
[00:19:56] Unknown:
Another interesting aspect of this project is that it has survived so long, particularly in the face of a lot of the evolution in the data ecosystem, where a lot of this sort of collaboration and discoverability of the different data assets and analyses is being subsumed by things like data catalogs and dedicated platforms for this collaboration. And I'm curious what you think are some of the ways that the knowledge repo kind of stands on its own, and what allows it to survive in this ecosystem where there are so many more options available for collaboration,
[00:20:34] Unknown:
notebook sharing, data discovery, things like that. I'll say the things that I think help knowledge repo out are that, one, you know, it's not like a deep, complex thing. In some ways, it's just this simple, elegant thing that, you know, gets you a lot of these features you want. The other thing, I think, is that the fundamental abstractions are pretty well thought out, where the idea is you have this thing that you want everything to end up in, and by and large, it's markdown. Like, for the longest time, the end result was a markdown post. Although now, in the latest iteration, it's like an actual post concept inside of the system.
But where things come from, like, can vary. All that matters is that you have a post processor that ends up in a certain format. So with the IPython notebook, the post processor, you know, takes out those markdown cells and, you know, puts images up on cloud storage if you want. Same thing with R Markdown. But we also built a tool where, like, if you have a Keynote presentation, as a lot of our UX research team was doing presentations, you can kinda just post process it, turn it into a bunch of images, and then you can now have a knowledge post that's a big set of images or whatever and lets you sift through that stuff. So it's just the simple thing of saying: there are places where people do their original work that are native to them, that fit their persona, that generate these knowledge artifacts. We have post processors that let them fit the knowledge repo's expected format, and that's kind of it. The whole point is just to bring things into a common standard, and I think being open source kinda helps with that. For people who are
[00:22:14] Unknown:
adopting knowledge repo, what are some of the options for being able to extend and customize its capabilities and workflows?
[00:22:23] Unknown:
The big one is just being able to ingest from different places. Because the thing is just, like, you have these different workflows that generate knowledge artifacts, and you wanna get them in a common place. Out of the box, it kinda comes supporting, like, the R Markdown and IPython things, but anything that can be turned into essentially markdown or text is usable. So, for example, if you have some internal tool that people work out of for doing this sort of work, if you just have some sort of, like, print function or whatever, you could extend it and say, okay, this stuff is gonna end up in knowledge repo. I saw that people had made a kind of Google Doc ingestor, where if you have some sort of Google Doc thing, you can get it in there, and then it's, like, indexed, and it has authors on it and, you know, tags or whatever.
That's all there. So I think the great thing about it is that your team has whatever ways that they are doing work and generating strategic analysis, and then you have this kinda common paradigm to just bring them under one roof. So in some ways, it's so agnostic around how you produce knowledge that it ends up being pretty flexible and extensible for various companies.
[00:23:30] Unknown:
In terms of the open sourcing aspect of it, you mentioned that it was something that happened later on in its life cycle. I'm curious what the motivating factors were for that decision, and some of the work that was necessary to be able to extract out some of the assumptions that were specific to Airbnb and make it something that is generally applicable and consumable outside of those organizational boundaries?
[00:23:55] Unknown:
The open sourcing thing was a really fascinating story. So as I mentioned before, we spoke to all these other companies to figure out how they were doing work. And we did it just because we wanted essentially to write a blog post and kinda share how we think about the problem and just have a kind of more informed view. And what we found was that, like, wow, people hate their current solutions. And they all were like, wait, can you please open source this? Like, we would love to use it. And so there was just such good, like, I don't know, product market fit or whatever, there was just so much demand, that we were like, this is a great opportunity to do something really cool.
And, you know, I think in general at Airbnb, one of my favorite parts about the institution is that, like, we embraced opportunities to do something really cool. That's why I think you see so many great projects spinning out of there, like Airflow, whatever: everyone in that building was, like, very entrepreneurial and thought of Airbnb not just as a place to, like, make the company's stock price go up or whatever. It was like, this is a place to do cool stuff. And so there was a lot of openness to, like, putting open source tools out there, putting blog posts out there, just kinda doing original things. The thing I definitely learned while open sourcing knowledge repo, though, is that if you, as an author of a project, want to open source something, it's a lot easier when you know going into it that you wanna open source it. The amount of retrofitting to get something to be generalizable is pretty daunting. So it involved, like, a lot of late nights. And I think a lot of major credit has to go to, you know, the other people on the team: people like Matthew Wardrop, who I think is at Netflix now; Nikki Ray, who's kinda working on education stuff; and Dan Frank, who's back at Coinbase. It was a really, really great team that I loved working with to be able to do that retrofit.
[00:25:37] Unknown:
Looking back at it now, if you were going to start the same project with the same overarching goals of being able to manage sharing and distribution of knowledge and discoverability of analyses, what are some of the ways that you might approach the solution differently, and what are the pieces that you think have stood the test of time that you would keep?
[00:26:00] Unknown:
I would say it was, like, such a small step function to get to the vision of what it ended up being. Right? So, like, we didn't think of knowledge in this very holistic sense that I do now, which is that a true knowledge base just captures all the things that the organization has learned. And from there, you'd say, you know, what have we learned? Who has worked on this sort of stuff? And when you think of it in that sense, there are just a lot of things that generate knowledge out there. Like, I'm working on an experimentation platform right now, and I actually think that experimentation is one of the biggest knowledge-generating work streams out there. And, you know, we are going to be building something knowledge-specific for experimentation as a part of that. Learning by doing.
There are also UX research teams who are doing essentially very similar things as the people doing strategic analysis, but in just a different kind of paradigm. We built, like, again, a solution for them to take presentations and put them on there, but, you know, it wasn't, like, a first class citizen the way we imagined the project. Yeah, there are a lot of ways you generate knowledge, and so I would say that thinking in this kind of more holistic way would be something I would think a lot around. In some ways, I kind of wonder: there are so many companies out there building tools that generate insights, and if there was some kind of common medium to bring them together, that would be a cool thing, and maybe that's sort of the promise of Knowledge Repo. Like, you know, maybe you're an email marketing tool that, like, figures out what really good campaigns are, or maybe an experimentation tool, or a collaborative notebook. If there was some way to house this stuff together in a true knowledge base, that ends up being really powerful.
But, you know, I think knowledge bases are like, they happen as a result of great tooling that generates a lot of knowledge, and then you wanna index it somewhere. So I think there's a bit of a chicken-and-egg thing.
[00:27:46] Unknown:
And in terms of the applications of the knowledge repo, what are some of the most interesting or innovative or unexpected ways that you've seen it used?
[00:27:55] Unknown:
Yeah. At Airbnb itself, people really embraced it and started writing up a lot of things in there. I remember one person, a guy who's a very senior manager there, was like, I just tried to acqui-hire a company; here's what that process was like. It turned into kind of a general documentation of it. That was outside of the scope of data analysis. Another thing we started seeing, and this is actually very common, was just demonstrating a Python framework or R framework. So, here are some cool visualizations I made with this new tool; it was sort of like an explainer around different frameworks. And then, as I mentioned before, UX research: people were like, look, UX research generates knowledge in the same way that data science does. Like, as a matter of fact, interweaving them is a really powerful thing.
Another thing we experimented with was trying to come up with kind of curated meta-analyses. So things like: we spent a year investigating this topic, we generated all these artifacts, so what do we know? Like, at a high level, given all that, where are we? So we experimented with that a little bit and used it as the basis of an off-site. So, yeah, there's been quite a lot of that. Like I said, you know, the promise of a knowledge base is that it can consume a lot of different types of knowledge that get generated, and it was cool seeing people experiment in that direction.
[00:29:11] Unknown:
In terms of your own experience of working with and building and using the knowledge repo, what are some of the most interesting or unexpected or challenging lessons that you learned in the process?
[00:29:21] Unknown:
See, I'm not sure if it's a challenge as much as it's an opportunity: that strategic analysis I mentioned before. I think it is a very, kinda, underutilized, under-credited thing, where data science teams (which I think are increasingly becoming analytics engineering teams) are so focused on serving basic data points to people that investigations into growth loops or second order effects, or things that just haven't been modeled by data yet (like, you know, we don't have the tables underlying it; we're just doing this investigation to do it), start to become invisible, and then they become harder to resource.
Whereas at Airbnb, for example, I still remember this guy, Bar Ifrach, on the marketplace team. He did all these investigations into the price preferences of customers, compared them to the prices of listings, and demonstrated that customers have really dynamic price preferences that change by season, while these hosts were just holding static prices that never changed across the year. And that basic research turned into the broader, like, pricing recommendation algorithm of Airbnb. And to this day, there are still a lot of people working on that problem. That was a very big paradigm shifting thing. Another thing we started investigating was, at one point, Airbnb had this issue where giving hosts the ability to say yes or no to guests led to some discriminatory behavior.
And that led to a whole work stream around, like, how do we make host acceptance rates balanced by race? And that was also something that required a lot of deeper research to figure out how we could do it, and then that was followed up with experimentation. So I think that data teams should think a little bit around: no one has a view of the underlying story that's only visible through data except for data teams, and so you should see what you can do to make space to do these underlying strategic analyses, because that's actually what leads to, like, very, very large scale change.
[00:31:24] Unknown:
For people who are interested in being able to collect and store and categorize and distribute their learnings and general institutional knowledge, what are the cases where knowledge repo is the wrong choice and they'd be better served with, you know, a wiki or one of these notebook sharing tools or a data catalog or things like that?
[00:31:45] Unknown:
Wikis and stuff like that are really good if you have some living doc, that's what I'd say, like an onboarding guide. It's the sort of thing where, like, you don't really care about what was true in the past. You only care about what is the current way to do things. So, you know, we had people contribute those sorts of things to knowledge repo, but I think that if you want to just see, like, what are the current workflows, what should I do, then, you know, wikis, that's pretty much what they're designed to do. So I think that's pretty good. The other thing is that the knowledge repo is really focused on being able to link the process of generating knowledge with the output of generating knowledge. And sometimes you just don't care about process. There are various things you put in Google Docs, like, here's a product spec or whatever I just wrote up. So, yeah, I think there are some things around, like, what is the level of context you need to understand a piece of knowledge? What do you need to understand around the way it was manufactured?
The trust in it, was it reviewed or not? Like, if you don't need it reviewed, if you don't need to have visibility into the process, and you don't need any broader context, then, yeah, there are a lot of off-the-shelf tools.
[00:32:50] Unknown:
In terms of the future of the knowledge repo, I recognize that you're not as heavily involved in it these days, but I'm curious what you see as some of the potential future or useful directions that it could go or ways that you recommend people get engaged with and contribute to it?
[00:33:09] Unknown:
Yeah. Like I said, I think the best thing about the knowledge repo is it can be this tool-agnostic way to coalesce all the different knowledge in one system. For example, today we have all these different collaborative notebook tools. We have, you know, other data tools like Eppo doing experimentation, or things doing churn analysis, or people running marketing campaigns or whatever. An organization is really helped out by being able to centralize this stuff, and there are just a lot of situations in which putting it all into Google Docs and Notion or whatever is just not really the answer. So I think that there's a great future in companies kinda continuously utilizing the fact that it's open source to contribute new ways to get other pieces into the system. I think the pie-in-the-sky dream is that, you know, all of us data tool startups embrace that we're gonna use the same knowledge base for our output, and then organizations and customers win. But we'll have to see if it actually happens. In the meantime, it's a very accessible tool chain.
It's pretty easy to get up and running and add to it. I think at the very least, you can achieve that dream internally at your organization.
[00:34:16] Unknown:
Are there any other aspects of the knowledge repo project or the overall problem space of knowledge aggregation and distribution that we didn't discuss yet that you'd like to cover before we close out the show?
[00:34:27] Unknown:
Yeah. We touched on it very lightly, but I think it's really, really worth emphasizing. And this also comes out of, you know, my inspiration for doing the experimentation thing: once you're above the fold in terms of data work, where this is now stuff that's, like, very visible in the organization, where it's no longer part of the foundation or, you know, the reliability system but the stuff that is being presented to the org, data work officially becomes part of culture, a part of organizational problems as much as technological problems.
And when you're trying to solve for not just, like, what products we build, but also, like, how the way we're doing work fits the organization I'm in, then things like visibility, trust, distribution, all that really matters. You know, one of the things I mentioned was that the knowledge repo's story, in some ways, intertwines with the hierarchy starting to get established at Airbnb: the fact that there were so many data scientists, there were data science managers, there were, you know, product people, everything. And so being able, as, let's say, a director working on marketplace liquidity, to pull up some IC data scientist's work from across the org and say, hey, I think this actually means some really interesting things for us, maybe just talk to that person or whatever, is a really powerful thing. And then, you know, when it comes to rewarding people who actually uncover these diamonds in the rough, it ends up being a really powerful thing.
So there's a lot around, like: if you're trying to drive impact with data, there are people up and down the hierarchy, there are people across different skill sets (product, engineering, design, whatever), and being able to provide an experience that's accessible, indexable, searchable, and trusted just has a lot of second order effects in terms of the broader data team's brand.
[00:36:20] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. This week, I'm going to choose learning guitar. My son has been interested in that for a little bit and started taking some lessons, so I decided to get a guitar and start learning along with him. It's a fun thing to do, and it's always useful to be absolutely terrible at something. It's good to remind you that, while you may have things that you are adept at, there's always something that you're gonna have to learn from scratch. So for anybody who's looking for
[00:36:58] Unknown:
a hobby that is interesting and challenging, it's definitely worth picking up an instrument. And so with that, I'll pass it to you, Che. What do you have for picks this week? I love the guitar thing. Right? Because I have a guitar right behind me in this video; listeners won't be able to notice. But, yeah, I completely agree. I always think, with anything that you consume, whether it's food or music or whatever, it is so fulfilling to approach it as a creator and just, like, try your hand at actually creating music or whatever. So that's something I think a lot about with my kids: once they start, like, loving something, learn how to create it. And in that regard, for my pick, I'm gonna pick a different type of thing, which is cooking. Like, especially these days as a startup founder and young parent, I feel like I've appreciated cooking as a hobby much more.
And my recent kick is cooking with these 2 ingredients. One of them is chickpea flour, besan in Indian cuisine. You know, I used it a lot growing up for making pakoras and stuff like that, but I've started to realize it's just this really versatile thing. We make, like, chickpea flour omelettes and buffalo cauliflower with chickpea flour and stuff like that. Super good. And my other kick is using kimchi more widely. So kimchi grilled cheese, kimchi nachos; that was a big hit over the Super Bowl weekend, or whenever it was. So those are my latest picks.
[00:38:19] Unknown:
Awesome. Well, thank you very much for those recommendations. The chickpea flour omelette definitely sounds like a very filling and high protein breakfast. So thank you for taking the time today to join me and share your experience working with Knowledge Repo and your perspectives on the overall problem space of knowledge collection and aggregation and distribution. It's definitely a very worthwhile endeavor, so I appreciate all the time and effort that you and everyone that you've worked with have put into it, and I hope you enjoy the rest of your day. Alright. Thanks, Tobias. Thank you for listening, and don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Che Sharma and Knowledge Repo
Che's Journey with Python and Early Days at Airbnb
The Genesis of Knowledge Repo
Challenges and Solutions in Sharing Notebooks
Approaches to Institutional Knowledge Sharing
Design Elements and User Adoption
Technical Implementation and Evolution
Knowledge Repo's Longevity and Unique Value
Open Sourcing Knowledge Repo
Reflections and Future Directions
Innovative Uses of Knowledge Repo
Lessons Learned from Building Knowledge Repo
When Knowledge Repo is Not the Right Choice
Future Directions and Community Involvement
Final Thoughts on Knowledge Aggregation
Closing Remarks and Picks