Summary
Investigative reporters face the challenging task of identifying complex networks of people, places, and events gleaned from a mixed collection of sources. Turning those various documents, electronic records, and research into a searchable and actionable collection of facts is an interesting and difficult technical challenge. Friedrich Lindenberg created the Aleph project to address this issue and in this episode he explains how it works, why he built it, and how it is being used. He also discusses his hopes for the future of the project and other ways that the system could be used.
Preface
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. Go to podcastinit.com/linode today to get a $20 credit and launch a new server in under a minute.
- Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com
- To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
- Join the community in the new Zulip chat workspace at podcastinit.com/chat
- Registration for PyCon US, the largest annual gathering across the community, is open now. Don’t forget to get your ticket and I’ll see you there!
- Your host as usual is Tobias Macey and today I’m interviewing Friedrich Lindenberg about Aleph, a tool to perform entity extraction across documents and structured data
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by explaining what Aleph is and how the project got started?
- What is investigative journalism?
- How does Aleph fit into their workflow?
- What are some other tools that would be used alongside Aleph?
- What are some ways that Aleph could be useful outside of investigative journalism?
- How is Aleph architected and how has it evolved since you first started working on it?
- What are the major components of Aleph?
- What are the types of documents and data formats that Aleph supports?
- Can you describe the steps involved in entity extraction?
- What are the most challenging aspects of identifying and resolving entities in the documents stored in Aleph?
- Can you describe the flow of data through the system from a document being uploaded through to it being displayed as part of a search query?
- What is involved in deploying and managing an installation of Aleph?
- What have been some of the most interesting or unexpected aspects of building Aleph?
- Are there any particularly noteworthy uses of Aleph that you are aware of?
- What are your plans for the future of Aleph?
Keep In Touch
Picks
- Tobias
- Friedrich
- phonenumbers – because it’s useful
- pyicu – super nerdy but amazing
- sqlalchemy – my all-time favorite Python package
Links
- Aleph
- Organized Crime and Corruption Reporting Project
- OCR (Optical Character Recognition)
- Jorge Luis Borges
- Buenos Aires
- Investigative Journalism
- Azerbaijan
- Signal
- OpenCorporates
- OpenRefine
- Money Laundering
- E-Discovery
- CSV
- SQL
- Entity Extraction (Named Entity Recognition)
- Apache Tika
- Polyglot
- spaCy
- LibreOffice
- Tesseract
- followthemoney
- Elasticsearch
- Knowledge Graph
- Neo4j
- Gephi
- Edward Snowden
- DocumentCloud
- Overview Project
- VeraCrypt
- Qubes OS
- i2 Analyst's Notebook
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So check out Linode. With 200 gigabit private networking, scalable shared block storage, node balancers, and a 40 gigabit public network, all controlled by a brand new API, you've got everything you need to scale up. Go to podcastinit.com/linode today to get a $20 credit and launch a new server in under a minute. And visit the site at podcastinit.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. And go to podcastinit.com/chat to join the community and keep the conversation going.
[00:00:50] Unknown:
Registration for PyCon US, the largest annual gathering across the community, is open now. So don't forget to get your ticket, and I'll see you there. Your host as usual is Tobias Macey. And today, I'm interviewing Friedrich Lindenberg about Aleph, a tool to perform entity extraction across documents and structured data. So Friedrich, could you start by introducing yourself?
[00:01:08] Unknown:
Yeah. Hi. My name is Friedrich Lindenberg. I'm a programmer. I work with a group called the Organized Crime and Corruption Reporting Project, which is a network of investigative journalists across, I think, 50 countries. And I build technology that tries to enable reporters to do better digging into documents and data.
[00:01:27] Unknown:
And do you remember how you first got introduced to Python?
[00:01:30] Unknown:
Oh, yeah. In university, I did an internship at a search engine company. It was a Java shop, and I was supposed to write Java. But basically, my team lead took pity on me and said, here's a book on Python, you should also learn a real programming language. And so I spent about a month learning Python. It was really fun.
[00:01:50] Unknown:
And so can you start by giving a bit of an overview about what the Aleph project is and how it got started and your motivation for building it?
[00:01:59] Unknown:
Yeah. Sure. So I'm really interested in how technology can help reporters, and especially investigative journalists, do their job better. And so I've been spending the last couple of years working for different reporting networks and teams of journalists, trying to work on investigations but then also build infrastructure technology that would help facilitate those investigations in the future. And so a couple of years back, I was working on an investigation about the profile of Australian mining companies in Africa, and we were trying to download every single filing of a mining company in Australia.
And it turns out that that's, like, 400 gigabytes of documents, and they're all scanned, and you have to run OCR, and you have to make them searchable. And so I started working on this toolkit. Initially, it was called the document sifting toolkit, and then I decided a nicer name was in order, so I called it Aleph.
[00:02:53] Unknown:
And I know that in the documentation, the name is actually from a science fiction story. So I don't know if you wanna give a bit of the background for how you came across that as the naming choice for this tool.
[00:03:05] Unknown:
It's a little bit of a pun. So, basically, the Aleph in the short story is an object, or a place, in Buenos Aires, underneath a staircase somewhere in a normal house. And if you lie down on your back and look at it at just the right angle, then you can see every place on earth with perfect clarity. So it's this magical point that gives you access to all other points in the universe. And in a way, it's also a little bit of a pun on Tolkien and the palantír.
[00:03:33] Unknown:
You mentioned that you spend the majority of your time working in investigative journalism and helping to build tooling and resources for people working in that field. So can you discuss a bit about what it means to be an investigative journalist as opposed to someone who does, I guess, sort of more standard journalism, if there is such a thing?
[00:03:52] Unknown:
I think the distinction between an investigative journalist and a normal journalist is sometimes a little bit fuzzy. But the people that I work with are mostly people who do cross-border reporting, where they're looking into large-scale corruption. So the countries we look at are often run by autocrats or by organized crime groups that essentially take over the country and try to extract as much money and wealth from it as they can, obviously to the massive detriment of the populations.
So whether you're talking about a country like Azerbaijan, which has massive oil wealth and is basically ruled by four families or so, or whether you're looking at several African countries where the same principle is in play, what we're trying to do is dig into: okay, who actually is in power, how are they using that power to extract wealth, and how are they then hiding that wealth, mostly through international offshore finance.
[00:04:49] Unknown:
And in terms of the skill set for investigative journalists, are they generally somewhat technically adept, or is it just a broad range of skill sets as with any other discipline?
[00:05:02] Unknown:
I think they are mostly extremely persistent people. So they're people who, if you give them 2,000 pages of printed paper, will just sit down, crack open a bottle of red wine, go through each page of it, and make sure that they read the footnotes. And so as data users, they're obviously really interesting people. Also, there are only about 3,500 people in the world, I would say, who do this type of reporting. So it's a very small group, which is also a privilege when you're making software for them. But in terms of technical skill, obviously very few of them are versed in using, let's say, command-line tools or SQL databases. I think there are a few that have trained themselves to write SQL queries, but that's more rare.
I think many of them have a very broad skill set, from accounting to interview techniques to a little bit of data analysis, and some do a little bit of scraping as well.
[00:06:01] Unknown:
Given that their work requires such a broad range of interactions, working with these various data sets and trying to follow a given story or trail, I'm curious how Aleph fits into their workflow, and what some of the other tools are that they might use alongside Aleph, either to serve different purposes or to supplement the work that they're doing with the data they can retrieve from the Aleph system?
[00:06:23] Unknown:
Yeah. I think the fundamental activity is always that you have some idea, some lead about what you think might be a connection, and then you're trying to work out that connection, and find other connections, let's say between a politician in one country and a piece of real estate in another country, or a company in another country, or a procurement contract. So you're trying to make these connections, and then you're trying to prove these connections in a way that you can actually defend as being irrefutable. And so often this turns out to be a problem about getting the right documents and being able to make the right connections between these documents and datasets.
But it also turns out to be a very collaborative problem. So it's pretty routine for us now to have 60 or 70 different reporters from maybe 30 or 40 countries in one investigation. And so a lot of the tools that we actually use are around collaboration. So we have wiki systems, we use Signal as a secure messenger, and tools like that that help us coordinate everyone and make sure that everyone is working on the same strand. There are also a bunch of really interesting other websites. So for example, OpenCorporates is a really cool offering, where they try to make accessible as many company databases as possible.
And, obviously, there's the normal toolkit. I think OpenRefine, which is a former Google product, is used quite a lot. And then there's standard stuff like Excel.
[00:07:54] Unknown:
Outside of investigative journalism, given that Aleph is useful for creating these entity graphs and doing entity extraction on structured and unstructured data, I'm curious if there are any instances of its use outside of the primary problem domain that you built it for that you have seen or that you're aware of.
[00:08:16] Unknown:
Yeah. I think it's actually a really interesting tool for other domains as well. Obviously, there are a lot of business domains that are related to investigative journalism, such as anti-money-laundering precautions or, in a legal context, e-discovery, where lawyers have to go through massive heaps of documents. And so, actually, I'm kind of curious. We've seen a few cases where people are interested in using it, a few cases where people actually adopted it, but I'd really like to see more pickup from commercial use cases. My quiet dream is that one day there might even be a company that offers consulting services around it. The group that I work for is a nonprofit. We're really bad at capitalism, but I think we've built a really solid product. And so I wish there was a company that would offer commercial support for it, and maybe then make some money off it and reinvest that into development.
[00:09:12] Unknown:
And so can you describe some of the major components of Aleph and how the system fits together in terms of the software and system architecture?
[00:09:23] Unknown:
Oh, yeah. Sure. So I think Aleph is primarily a search tool, right? Everything feeds into a search index. In the first iteration of Aleph, what we did is we built a document search tool. So we would take documents of different formats, PDF files, emails, all that kind of material, try to get that into a search index as quickly as possible, and then provide a user interface to query it. Then what we realized is that, actually, a lot of the data that we wanted people to have access to isn't in documents, but it's in some kind of structured format, whether it's a CSV file or a SQL database. And so we added the notion of an entity graph to it, so that we now have our own domain language where we talk about companies, people, procurement contracts, sanctions, all these kinds of entities that come up in our investigations.
And what's happened now, which is kind of interesting, is that the architecture has been turned on its head, where basically everything becomes an entity, and even documents themselves become entities. So a document is just a special type of the same thing that's also a company or a company director or a contract. For a journalist, what that ends up meaning is that they just type in a term that they're interested in and find out all the entries, whether it's a document or a company record, that match it. In terms of what the process looks like, basically we have two different ETL processes for this reason. So we have one that takes structured data and basically just transposes it into our domain model, into our ontology. And then we have a second process that obviously has to account for all these different file types.
And what's really interesting is that with both of these processes, you can pick out what I'm calling selectors, essentially. Right? So little data points like phone numbers, email addresses, names of people and companies, IBAN bank numbers, stuff like that that is actually really useful in terms of providing the connectivity between the material, so that an email from one leak can connect to a document in another leak and a company record in a third dataset just by means of having the same phone number in it.
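As a rough sketch of that "selector" idea, and not Aleph's actual code: the regexes below and the phonenumbers-based validation are illustrative choices for pulling out values that can link records across datasets.

```python
# Sketch: extract "selectors" (emails, IBANs, phone numbers) from raw text
# so that documents from different sources can be joined on shared values.
import re

import phonenumbers  # pip install phonenumbers

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")
PHONE_RE = re.compile(r"\+?[\d\s().-]{7,20}")


def extract_selectors(text, default_region="US"):
    selectors = set()
    selectors.update(("email", m.lower()) for m in EMAIL_RE.findall(text))
    selectors.update(("iban", m) for m in IBAN_RE.findall(text))
    # Validate phone-looking strings so random digit runs don't link documents.
    for match in PHONE_RE.finditer(text):
        try:
            number = phonenumbers.parse(match.group(), default_region)
        except phonenumbers.NumberParseException:
            continue
        if phonenumbers.is_valid_number(number):
            selectors.add(("phone", phonenumbers.format_number(
                number, phonenumbers.PhoneNumberFormat.E164)))
    return selectors
```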
[00:11:32] Unknown:
And in terms of the documents and the data that Aleph uses for being able to extract this information and form these connections, I'm curious what the different supported data formats are, and some of the challenges of being able to transform some of these document types, like, as you mentioned, scanned PDFs, into something that is workable and that you can process more easily as just a data artifact?
[00:12:03] Unknown:
Yes. We're obviously trying to support a broad range of document types. There's a really good toolkit from Apache Tika that I think a lot of people use, but we've decided that we also want to provide really good preview capability. And since that's not that easy with Tika, we've basically ended up doing our own Python-based content extraction toolkit. And the basic idea behind that is that we're trying to reduce as many document types as we can to a set of fundamental document types. Right? So for example, any Word document or WordPerfect file that you give us, we'll convert it to a PDF and then have one PDF extractor that takes out the content from that. Similarly, if you give us an Excel file, what we do is essentially turn it, conceptually, into a folder of CSV files, one for each sheet in the Excel file. A big thing for us, obviously, is emails. So we have silly numbers of email formats that we support, whether it's Outlook, Thunderbird, or whatever other thing we see in our source data. Then also media support, where we don't actually do very much yet with audio and video files, but that's something that I'd really like to explore in the future.
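A hedged sketch of that "one CSV per sheet" reduction, using openpyxl rather than whatever Aleph's ingestors actually use; paths and library choice are illustrative.

```python
# Sketch: reduce a spreadsheet to a "fundamental" type by writing each
# worksheet of an Excel workbook out as its own CSV file.
import csv
from pathlib import Path

from openpyxl import load_workbook  # pip install openpyxl


def excel_to_csv_folder(xlsx_path, out_dir):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    workbook = load_workbook(xlsx_path, read_only=True, data_only=True)
    for sheet in workbook.worksheets:
        with open(out / f"{sheet.title}.csv", "w", newline="") as handle:
            writer = csv.writer(handle)
            for row in sheet.iter_rows(values_only=True):
                writer.writerow(["" if cell is None else cell for cell in row])
```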
[00:13:17] Unknown:
One of the challenges with these different data formats and document types is that I imagine there's a certain amount of special casing for things that are more free-form, prose-style documents versus a bulk data dump that is already in a structured format, and trying to resolve those into the same systems and the same processes for being able to pull out interesting information?
[00:13:44] Unknown:
Yeah. I think there's a certain amount of structural difference. Right? So, for example, we also scrape a lot of websites, and then we get, let's say, every issue of a government newsletter, like the Federal Register. But then you have document leaks, and there it's obviously super important to preserve the structure of the leak as is, because often people have folder structures that are meaningful and would help our journalists analyze what's in there. One very practical problem that we have is that often we receive entire disk dumps from people, and then we need to filter out all the system files. Right? Because you don't want to import 17 copies of Windows into your system.
And more recently, I accidentally imported a copy of Spider-Man. So partly that's a preprocessing step, and partly it's also the system itself that tries to clean that up. There are other cases where it's just very hard to build a system that is efficient both in terms of importing, let's say, 20,000 really small email files, but then is also capable of importing a 36-gigabyte Outlook email dump archive, because these are obviously very different scaling needs, and that's something that we have a lot of issues with but are working towards resolving. What we're doing more and more is building out services that run alongside the main app. And so we have, for example, a document conversion service that's running standalone and converts documents from whatever format comes in into PDF. Or we have entity extractors that are running as services because they have a lot of memory needs.
And we're also trying to put more and more of the individual document extractors into their own kind of packages, so that if one of these file formats has some interactive component in it, it can't actually end up exploiting our system.
[00:15:41] Unknown:
And in terms of the actual entity extraction and entity resolution, that's a problem that I've always been fairly interested in: how you are able to infer that maybe a given phone number is associated with a particular person, and then being able to resolve different mentions of that person, particularly if they use their initials in one document, the full name in another document, reversing first and last names, introducing middle names, having nicknames for different people, and being able to resolve those into a single data object that you can then link into this larger graph to associate with maybe a company or a financial transaction or things like that. So I don't know if you can talk about how you manage that within Aleph, how much of that is automated versus having a human in the loop for being able to resolve these entities into a single data item, and things like that?
[00:16:37] Unknown:
Yeah. I think that's really where the structured data component of Aleph comes in and provides a lot of value. So for example, we try to load as many structured databases as we can get our hands on, whether that's company registries, procurement databases, sometimes even voter databases. And in many of these, obviously, you have these types of details: names of individuals, their national ID numbers in some countries, phone numbers, etcetera. So that's a really good reference set in some cases, where, when you then find a phone number in a document later on, you can refer back and disambiguate through those sets which person or company it was referring to. Obviously, we always have to be super careful with regard to how much interpretation we do automatically, because, fundamentally, what we're doing is providing evidence to journalists. So we don't want to do too much inference without telling the reporter, because otherwise we might give them a more solid piece of data than they can actually get from the source data. So that's a constraint that we always have. In terms of actual entity extraction, what we're doing is relying on open source entity extraction toolkits.
We're basically using two of them at the same time. We've got one thing called polyglot, which is a Python library that I think is unmaintained at the moment, unfortunately. But I think it was someone's PhD thesis, and it has an amazing range of supported languages. And because we work in a lot of countries that have their own languages, and often their own alphabets, that's a big concern for us. And then we also use spaCy, which is a beautifully engineered piece of software that provides really good support for European-style languages.
And one thing that's actually proven to be super useful as well is that we just run a bunch of regexes on all the documents that are coming in. Right? So for phone numbers, email addresses, bank identifiers, and stuff like that, the easiest solution is really to just run a regex and then run some kind of validation afterwards to see whether whatever you dragged in is a valid phone number or a valid email address. So that's how we take documents: basically, we send them through at least these three different stages of entity extraction, and then we try to clean it up, throw away the low-frequency matches, and keep only those that come up a few times, ending up with, I'm hoping, pretty clean entity taggings.
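A minimal sketch of the NER-plus-frequency-filter stage described here, using only spaCy (Aleph also runs polyglot and regex stages); the model name and the mention threshold are illustrative, not Aleph's settings.

```python
# Sketch: run spaCy NER over a batch of texts and keep only entities
# that are mentioned repeatedly, discarding low-frequency matches.
from collections import Counter

import spacy  # pip install spacy && python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")


def tag_entities(texts, min_mentions=2):
    counts = Counter()
    for text in texts:
        doc = nlp(text)
        for ent in doc.ents:
            if ent.label_ in ("PERSON", "ORG", "GPE"):
                counts[(ent.label_, ent.text.strip())] += 1
    # Throw away the low-frequency matches; keep entities seen a few times.
    return {ent: n for ent, n in counts.items() if n >= min_mentions}
```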
[00:19:13] Unknown:
What have you found to be some of the most challenging aspects of building Aleph, maintaining it, and trying to raise awareness and support it with these various journalists who are trying to do their day-to-day work?
[00:19:30] Unknown:
I mean, a lot of our real problems are about scale and about working in a nonprofit environment. So, obviously, we're working with a really small team. I think there are 3 other people working on Aleph. And so we have, like, one front-end person, one ops person, and some people who do multiple roles and other aspects as well. Also, for example, cloud computing is really expensive, and it's actually hard for an NGO to justify that cost. And, for me, this sounds kind of obvious in hindsight, but I'm kind of surprised by how quickly you get into actual scaling problems. I think at the moment we have around 20 to 25 terabytes of raw data indexed into the system.
That's already turning into a real challenge to run on our infrastructure and to serve out at a reasonable latency.
[00:20:24] Unknown:
And in terms of the overall flow and life cycle of the data that you use within the system, I'm wondering if you can just talk through the steps and the systems that are traversed when somebody goes from uploading a document into Aleph all the way through to surfacing it as part of a query. And like I said, just the different steps and systems that it goes through from the beginning through to its final use?
[00:20:55] Unknown:
Sure. So we try to do all of our data imports through our API at this point, so that there's a clear separation. A big concern for us, obviously, is security. So we need to make sure that everything goes through some layer of authentication and authorization so that nobody can access documents that they're not entitled to see. Then what would happen is we would send it through a package called the ingestors. That's a separate Python project in the same organization that we're running. And the ingestors then branch out. Basically, what they're doing is auctioning off the file, saying, hey, I've got a file, its filename extension is so-and-so, and these are its first 5 bytes; who wants to parse it? And then the parsers go and say, hey, I can do that. Then what happens is that, depending on what kind of thing you have, you might have to send it through a document conversion. So we actually keep LibreOffice running as a service, and we might also have to send it through OCR. So we're using Tesseract as an OCR service.
And then, at the end of that process, when the right kind of ingestor has been run, you end up with some kind of abstract model of a document, with all the text extracted. And then what happens is we send it through the analysis part. So, entity extraction, looking at dates and locations. One of the things that we're doing, actually, is taking location tags from the entity extractors and then sending them through GeoNames to see what country they might be in. And so we're essentially guessing what country a document is about. So if you have a document that mentions, let's say, Berlin 5 times, then we're gonna guess that this document might have something to do with Germany.
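A toy version of that country-guessing step; the place-to-country lookup below is a stand-in for the GeoNames data Aleph actually uses.

```python
# Sketch: map location tags from the entity extractor to country codes
# and treat the most frequent countries as the document's likely subject.
from collections import Counter

# Stand-in for a GeoNames-backed lookup.
PLACE_TO_COUNTRY = {"berlin": "DE", "munich": "DE", "baku": "AZ", "paris": "FR"}


def guess_countries(location_tags, top_n=2):
    counts = Counter()
    for place in location_tags:
        country = PLACE_TO_COUNTRY.get(place.strip().lower())
        if country:
            counts[country] += 1
    return [country for country, _ in counts.most_common(top_n)]


# guess_countries(["Berlin", "Berlin", "Paris"]) -> ["DE", "FR"]
```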
And so we're enriching the document more and more. And then, eventually, what we're doing is translating it into what we call followthemoney. And followthemoney is essentially the ontology that we're using for our data, describing it as an email object or as a tabular object or as a page document, which is the fancy way of saying PDF. And then it goes into Elasticsearch. So we're running a pretty nice Elasticsearch cluster, and when you're actually doing a query against it, most of the time the only thing that you're hitting is Elasticsearch. Basically, every user that's using the site has a number of roles associated with them: whether you're logged in, whether you're part of some investigative project, whether you're part of some organization that we have a special relationship with.
And given what kind of roles you have, you then kind of see different parts of the index and get different results for your searches.
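A hedged sketch of that query pattern: a full-text match that is always combined with a filter on the caller's roles. The index and field names are invented for illustration and don't reflect Aleph's real mappings.

```python
# Sketch: authorization expressed as part of the Elasticsearch query, so
# only documents readable by one of the user's roles can ever match.
from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")


def search_entities(query, allowed_roles):
    body = {
        "query": {
            "bool": {
                "must": [{"match": {"text": query}}],
                "filter": [{"terms": {"roles": allowed_roles}}],
            }
        }
    }
    return es.search(index="entities", body=body)


# results = search_entities("Berlin", allowed_roles=[1, 7, 42])
```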
[00:23:51] Unknown:
And is there any built-in support for being able to do aggregations or exports of information as the result of a search query, so that you can then maybe package that up for use by someone else who's taking part in a different stage of the investigation, or forward it to somebody as a piece of reference data to facilitate a conversation or something like that?
[00:24:16] Unknown:
Yeah. That's actually something we're working on actively right now. So what we have at the moment is streaming APIs for doing exports of data. But what I'm really interested in is how we actually facilitate a process where we're building investigation-specific knowledge graphs. So let's say you have an investigation where you're looking into, let's say, 20 politicians in your country. Then you would be able to run them all against our API, get tentative matches, and then go through an interactive process of reviewing the potential matches and confirming the identity of some of those matches.
And then you could even build that out step by step, so that you eventually build an even wider net of who these politicians are connected to, who the people they're connected to are connected to, etcetera. So you're building out this domain graph. At the moment, we have some command-line tooling that we're playing with for this. And one of the real problems is how you present this to a journalist, because at some point this information becomes honestly hard to squeeze into an Excel file or something like that, because it's very graphy by nature.
And then the question becomes, how do you actually present this graph information to normal users? There are a couple of tools like Gephi, which is an open source graph analysis tool, and obviously there's Neo4j for loading graph data, that we're exploring as possible back ends for this. And then the question is how people actually derive investigative insight from these kinds of graphs.
[00:25:50] Unknown:
And another question I have, as far as the people who are using the system, is about user experience design and feedback: making it easy for end users to do the job that they need to do using Aleph without having to have any sort of intimate knowledge of how the system operates, or do any sort of deep technical analysis of the relationships between the different entities and things like that?
[00:26:25] Unknown:
I mean, we're trying to keep the user interface pretty simple. So a lot of the more advanced graph functionality that we actually have internally we're not even exposing at the moment, because we haven't yet found an easy enough way of presenting it. It would be quite easy to just show one of those massive network diagrams and let people deal with it, but that's not the idea. At the moment, we're keeping it to a relatively simple search metaphor, and then we have some cross-linking and bulk cross-referencing functionality in there. That's actually a really convenient thing for reporters. So if you give us, let's say, a list of all the parliamentarians or oligarchs in your country, then we have the ability to run that against all the other datasets, and we give you these comparison sheets saying, hey, there are, like, 5 possible matches with the Panama Papers, and there are 20 matches with the land registry of New York, and all these different datasets, so that as a journalist you could just go through and say, yeah, this is interesting, or no, I don't care. And so we're slowly trying to build out the features so that the stuff that actually exists in our user interface makes sense to an investigative journalist relatively quickly.
And we're lucky in the sense that we're actually working in the same building as many of our users. And so often, if you produce horrible UX, you get shouted at over lunch, and that's really convenient. And then there's obviously another layer of more bespoke data analysis that requires some data wrangling and more involved data processing. And that's something where our team jumps in and becomes part of an investigation and helps do this on a specific data set, so that we're able to produce more specific kinds of reports, almost, based on the data and on some kind of probabilistic graph.
[00:28:23] Unknown:
And what have been some of the most interesting or unexpected lessons that you've learned or experiences that you've had in the process?
[00:28:34] Unknown:
I've been learning a lot about doing software development under time pressure. Investigative journalism is a thing where an investigation is going on, and sometimes there are external constraints on it, and stuff needs to happen at a certain velocity. And so it's not a great place to do development where you're saying, hey, we're doing this in quarter 3, or we're doing this in, like, 6 months. More often than not, it's better to come up with a good prototype and then solidify it as you go along. We've been trying to maintain good code quality in the face of that, but it's a really interesting, challenging environment, because often when you get a request for a new feature or a new piece of functionality from a reporter, that request is, hey, can I have that in the next week, rather than, can we do that in the next big release?
That's one really interesting thing. Another slightly ironic thing is that I found myself a very active consumer of Edward Snowden's documents. Reading all these different weird documents about how intelligence services actually design their software has proven incredibly useful, and I didn't know that I would be a consumer of Snowden documents in that sense. And so it's really interesting also trying to think about how you connect information in this slightly odd and very, very focused way that, I guess, only investigative journalists and intelligence agencies would do.
[00:30:06] Unknown:
And what are your plans for the future of Aleph? I know that when I was reading through some of the documentation in the wiki, it mentioned that you have some design ideas for the next major release, but I'm wondering if you can talk through a bit of that and your general plans for where you'd like to see Aleph go in the future, both technically and in terms of use cases.
[00:30:29] Unknown:
I think our general goal is to capture more and more of the reporters' actual activity on the site, because then we can make better recommendations to them as to what to look at next. So what I'd like to work on a lot is building out this idea of case management in the system itself, where, let's say, you're working on an investigation and you have 3 other colleagues working on it, maybe in different countries, and you are able to say, hey, I've got these documents here, I'm uploading them to a case file. I've also run these search queries on the larger corpus, and I've decided that in these search queries these are good results and these are irrelevant results, so that you're even able to share that type of analytical work that you've already done with your colleagues.
And then also that you're eventually able to share little sketches of, let's say, a network diagram of who's been involved in a particular case and how they are linked, or a timeline of how these different events took place. So a lot of this is UI work that we're facing, which we need to enable through the right kind of back-end support. In general, other than that, we have a lot of challenges, as I said, around scaling. A big concern for us is always security: how do we keep this system in a way that we can honestly tell our reporters that their information is safe in there.
We have a lot of leaked data that we really want to be responsible with, because it affects private individuals. And so, yeah, there are a lot of different concerns that we hope to essentially evolve the system on. But I think the biggest task for us going forward is really to build this into a more interactive tool, where people can say, okay, hey, this is what I know so far; computer, what do you think I should look at next? And the system really can provide these little subtle hints, like, you've got these two people in your investigation already, did you know that they have a company together in the Bahamas? That kind of cleverness, I think we've got the right data model for. We now need to find the right interaction patterns to actually show it to a user and get their feedback on it, so we can also train the system internally.
[00:32:46] Unknown:
And what are some of the other types of tools or systems that someone might use in place of Aleph, or that you would see as being in the same sort of problem space as what you're doing with Aleph? And if you can, do a bit of compare and contrast between how you're approaching the problem and how these other projects are, and some of the decision making that might be involved when somebody's selecting which system to use?
[00:33:13] Unknown:
Sure. I mean, there's a whole community of people, I think, making software around investigative reporting. I would say there's one big wing that is focused on security, and there's another big wing that's focused on data analysis. So within the data analysis wing, there's a project like DocumentCloud that's a really fantastic way of sharing documents within an organization and then also with your readers. There's a project called the Overview Project that is a really great way of analyzing messy, textual documents and finding patterns within them.
They're doing topic modeling and stuff like that. There's also a JS toolkit for mining leaks that's very similar to what we're doing. Then there's also the security set of tools that are required for doing investigative reporting. So there are a lot of tools like Signal Messenger, or VeraCrypt for encrypting files on your computer, even stuff like Qubes OS, where you're building an entire containerized operating system, that I think are really interesting for reporters to use. Unfortunately, they often still have usability issues, where people can only use them when it's really trouble time rather than on a routine basis.
And of course, there's also a third sector, which is, I don't know, industry products. Right? So there's IBM, and they have a thing called i2 Analyst's Notebook, which is really cool, but basically anyone who works in journalism or NGOs is priced out of it. I mean, there are tools that are used in industry by analysts, whether it's intelligence or banks or the security industry. And those are often very evolved, but the pricing model basically makes them completely inaccessible to our reporters. I think one of the things that we're really trying to do is make sure that any journalist in the world, whether they're working at the New York Times or at a 3-person citizen journalism initiative in Kyrgyzstan, can use these tools and can hold the powerful to account.
[00:35:16] Unknown:
And are there any other aspects of Aleph or the work that you're doing, and investigative journalism in general, that you think we should discuss further before we close out the show?
[00:35:28] Unknown:
I think you've got a lot of really interesting points. For us, obviously, the big point is working on this as an open source project. I'm really interested to hear what other people might be interested in using this for, and how we could amend or extend it to make it useful to other people. And also, we're definitely looking for other people to help us improve our product, whether it's on document import, on security, or on entity extraction. These are all really cool things to work on, and we'd love to get help from others.
[00:36:01] Unknown:
Well, for anybody who wants to follow the work that you're doing and get in touch, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. And this week, I'm going to choose a library called MechanicalSoup, which combines the Requests and BeautifulSoup libraries to make it easy to do automated browser interaction for non-JavaScript websites. It automatically has statefulness and session management built in, so that you can essentially treat your script as a browser object for traversing a website through something like a login, and then pulling down the HTML document and parsing information out of it. So I've been using it for a script that I wrote for pulling out information from the RSS
[00:36:50] Unknown:
feed for my podcasts to post in different places, and it made things a lot easier there. So definitely worth checking out. And with that, I'll pass it to you, Friedrich. Do you have any picks this week? Yeah. I mean, I'm super dependent on Python libraries and the Python community's ecosystem. So there are tons of libraries that we use. I just want to mention, as one random one, phonenumbers, which is fantastic; it's literally called phonenumbers. And it's fantastic in terms of data cleaning. It can tell you, for any particular number, whether it's a valid phone number in the country that it's in. So whether it's a phone number from Russia or a phone number from the US, it just knows; it has all the metadata in it to know whether it's valid.
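For reference, a minimal illustration of what the phonenumbers library does; the numbers below are made up for the example.

```python
# Sketch: parse raw phone number strings and check their validity.
import phonenumbers

num = phonenumbers.parse("+7 495 123-45-67", None)   # country inferred from +7
print(phonenumbers.is_valid_number(num))              # True or False
print(phonenumbers.region_code_for_number(num))       # e.g. "RU"

# Numbers without a country code need a default region to parse against.
us = phonenumbers.parse("(202) 555-0143", "US")
print(phonenumbers.format_number(us, phonenumbers.PhoneNumberFormat.E164))
```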
The second thing, called PyICU, is incredibly valuable to us. It's basically a binding to the ICU C library, and what it does for us is transliteration. So when we have text that's in, let's say, Arabic, or some Russian or Cyrillic text, we can just say, hey, transliterate that, and it'll just do that. That's part of Unicode, and it's really amazing.
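A small sketch of that transliteration use case, assuming PyICU is installed (the package imports as `icu`); the transform spec shown is one common choice, not necessarily the one Aleph uses.

```python
# Sketch: use ICU's Any-Latin transform to transliterate Cyrillic or
# Arabic text into Latin script.
from icu import Transliterator  # pip install PyICU

to_latin = Transliterator.createInstance("Any-Latin; Latin-ASCII")
print(to_latin.transliterate("Москва"))   # -> roughly "Moskva"
print(to_latin.transliterate("محمد"))      # -> a Latin rendering of the name
```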
And the final one that I just have to give a shout-out to is SQLAlchemy, because I've been hacking with SQLAlchemy for years now, and it's like a miracle of Python engineering. They have this amazing SQLAlchemy Core layer, and that's not even the ORM, but just database access.
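A minimal example of that SQLAlchemy Core style, with no ORM involved; the table, database URL, and data are invented for illustration, and the API shown is the 1.4/2.0-style expression language.

```python
# Sketch: SQLAlchemy Core — table metadata plus expression-language queries.
from sqlalchemy import (Column, Integer, MetaData, String, Table,
                        create_engine, select)

engine = create_engine("sqlite:///:memory:")
metadata = MetaData()
companies = Table(
    "companies", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String),
    Column("jurisdiction", String),
)
metadata.create_all(engine)

with engine.begin() as conn:
    conn.execute(companies.insert(), [
        {"name": "Acme Holdings Ltd", "jurisdiction": "BS"},
    ])
    rows = conn.execute(
        select(companies.c.name).where(companies.c.jurisdiction == "BS")
    ).fetchall()
    print(rows)
```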
[00:38:18] Unknown:
And it's, I think, one of the most versatile and most amazing Python projects that I know. Alright. Well, thank you very much for taking the time today to talk about the work you're doing with Aleph and investigative journalism. It's definitely a very interesting project, and it looks like it's serving a very useful problem domain and is very technically interesting. So I appreciate you taking the time today. Thank you for that, and I hope you enjoy the rest of your day. Thank you. You too.
Introduction and Welcome
Interview with Friedrich Lindenberg
Overview of Aleph Project
Investigative Journalism vs. Standard Journalism
Aleph's Role in Investigative Journalism
Applications of Aleph Beyond Journalism
Technical Components of Aleph
Challenges with Document Formats
Entity Extraction and Resolution
Challenges in Building and Maintaining Aleph
Data Lifecycle in Aleph
Exporting and Sharing Data
User Experience and Interface Design
Lessons Learned and Experiences
Future Plans for Aleph
Comparing Aleph with Other Tools
Open Source and Community Involvement
Picks and Recommendations