Biopython with Peter Cock, Wibowo Arindrarto, and Tiago Antão

Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. I would like to thank everyone who supports us on Patreon.

Your contributions help to make the show sustainable.

When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at www.podcastinit.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your next app.

Do you need to learn more about how to scale your apps or learn new techniques for building them? Pluralsight has the training and mentoring you need to level up your skills. Go to www.podcastinit.com/pluralsite to start your free trial today.

You can visit the site to subscribe to the show, sign up for the newsletter, read the show notes, get in touch, and support the show. To help other people find the show, please leave a review on iTunes or Google Play Music, tell your friends and coworkers, and share it on social media.

If you work with data for your job or want to learn more about how open source is powering the latest innovations in data science, then make your way to the Open Data Science Conference happening in London in October and San Francisco in November.

Follow the links in the show notes to register. Your host as usual is Tobias Macey, and today I'm interviewing Peter Cock, Wibowo Arindrarto, and Tiago Antão about Biopython, a suite of Python tools for computational molecular biology.

Peter, why don't you start by introducing yourself first?

Hello, everyone. My name is Peter Cock. I started using Python back in about 2003 or 2004, when I was introduced to it by a lecturer while I was doing a master's. Then, during my PhD, I began using Biopython, started contributing, and eventually ended up as one of the main developers on the project.

And Bo, how about you?

Hi, everyone. My name is Bo. I've been officially using Python since 2010, I think. I actually started out working in a microbiological lab, but then I gradually shifted towards a more programming role, and now I'm a scientific programmer working in the Netherlands. My contribution to Biopython officially started around 2011, if I'm not mistaken. That was my first commit, I think, and it's still going on today.

And Tiago, how about you?

Hello, all. My name is Tiago. I'm currently a research scientist at the University of Montana in the US. I started using Python around '96. My background is slightly different: I came from computer science to biology, whereas most folks, I think, go in the other direction. I did my first commit to Biopython, and Peter might remember this better, around eight or nine years ago, I think.

So to start off the conversation, I'm wondering if one or all of you could start by explaining what bioinformatics is and highlight some of the different areas of research that fall under that umbrella.

So a lot of us joke that bioinformatics is 90% file format conversion. We have lots and lots of computational tools that are designed to solve biological problems, but there tends to be an invent-your-own-file-format mentality, and there are rival file formats doing similar things slightly differently. So a lot of day-to-day work in bioinformatics is file format manipulation, and things like Biopython can help with file format parsing. That's one of its strengths and one of the reasons many of our users use Biopython and related tools.
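To make the "file format conversion" point concrete, here is a minimal sketch of one such conversion: turning FASTQ records into FASTA by dropping the quality lines. It is plain standard-library Python, not Biopython's own code, and it assumes the simple four-lines-per-record FASTQ layout; the function name `fastq_to_fasta` is made up for illustration.

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Copy records from a simple four-line FASTQ stream to FASTA.

    Assumes each record is exactly four lines (header, sequence, '+',
    quality). Returns the number of records written.
    """
    count = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of input
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = fastq_handle.readline().rstrip("\n")
        plus = fastq_handle.readline()
        quality = fastq_handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a simple four-line FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # FASTA keeps the title and the sequence, and drops the qualities.
        fasta_handle.write(">" + header[1:].rstrip("\n") + "\n")
        fasta_handle.write(sequence + "\n")
        count += 1
    return count


# Usage with in-memory text standing in for real files.
source = io.StringIO("@read1 demo\nACGT\n+\nIIII\n@read2\nGG\n+\n!!\n")
target = io.StringIO()
written = fastq_to_fasta(source, target)
print(written, "records converted")
print(target.getvalue(), end="")
```

In practice a library parser handles the corner cases this sketch ignores, such as line-wrapped records; in Biopython this kind of job is what the `Bio.SeqIO` module is for.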

And so it sounds like Biopython is primarily a glue between the multiple different data formats that are used by researchers and people who are doing the analysis in the field?

It's a fair summary. There's a lot more in Biopython beyond that: we have things for dealing with three-dimensional protein structures, and we have clustering code for more numerical work. But what a lot of people use it for, and what I use it for, you could describe as glue. Yes.

And do any of you know the full history of how Biopython first got started?

Not directly. It was started before any of the three of us were directly involved.

Do you remember the year, anyone? I think 1998.

Yeah, 1998. That was way back.

Back in the early days of bioinformatics, a lot of the work was done using Perl, because systems administrators were being roped in to help with projects, and they, at the time, preferred and used Perl. But there are other languages, and people who preferred Python wanted to do bioinformatics in Python, and that's how Biopython really started. The same goes for sister projects like BioJava and BioRuby that came along a little later. So BioPerl is, I think, very slightly older than Biopython. We started, yeah, about late 1999.

And so given the long history that Biopython has, I'm wondering how the project has evolved over that time to meet the changing needs in terms of both the research that's being conducted and the computational landscape that's available.

On a personnel level, we've had lots of people come and go. It tends to be that as people get promoted, they no longer have the time to dedicate to the project, whereas new trainee scientists coming in are more likely to be able to invest a considerable amount of time. So there's a turnover in personnel. And as you say, technology has moved on. The Python language itself has moved on: we're now on Python 3, and there's been a lot of maintenance and cleanup work to transfer us over from Python 2. And there are interesting new technologies as well, like making sure that Biopython works on PyPy. I'm sure some of your listeners will know that PyPy runs your Python code and optimizes it as it goes. PyPy is an interesting project.

Another thing that I would like to raise is how much the ecology of software on the Python side has changed. If you were doing this in the late nineties, your software package would be working alone, whereas nowadays you have a massive ecology of very good software that you can interact with, say NumPy, SciPy, and scikit-learn. So in terms of the ecology of software, it's more like teamwork with many, many other packages, whereas at the beginning it was really just a small, isolated project with very few scientific libraries or charting libraries. In that sense, I think there has been a massive revolution in terms of the whole Python ecology experience.

Yeah. I guess, since the nineties, you can say that the Python ecosystem has changed a lot. There are so many more Python libraries that you can use, and more specialized ones for processing various kinds of data these days. But if you look into the Biopython project, you can clearly see, I wouldn't say a remnant, but that this is a library that has been developed over quite a long period of time. I think the development started even before PEP 8 was introduced as the de facto standard. That was a long time ago. I think Git was not even around; the original versioning was done using CVS. Yes, CVS. So that was the era in which the Biopython project was started, I guess. But now it's completely in Git, hosted on GitHub, and we're modernizing parts of the library a lot. It's been going on for a while.

And on the science side, there have been plenty of changes too. From my point of view, the biggest has been what we call high-throughput sequencing. There is more and more DNA and RNA data coming through, and some of this is so high volume that you need specialist optimized programs to deal with it, but, again, there's still plenty of glue work for interoperability.

Yeah, indeed. I think maybe this also touches on one part of the library. It's not the major part, but we do have some C code, right? We do have some highly optimized functions written in C that users of the library can interface with from Python. That's also becoming more common these days: people write the parts that need a lot of performance in C, then write a Python wrapper around them, and then you access them via Python functions just as in a regular Python script.

A big point, I think, is exactly that: it has become a big data discipline. I know this is too much of a fancy word nowadays, but in this case it's very true. If you were working 10 years ago, a floppy disk would probably hold all the data that you would need. But nowadays we are talking about tens of terabytes of data that need to be processed and cleaned up, and it becomes something that demands the most of your computer skills to analyze the data that we get from DNA sequencers.

And I'd say that what's probably had a pretty large impact on the overall bioinformatics field is the recent capability to do gene sequencing and more complex analysis of the actual underlying biological material.

Yes. That's been a major change, but there's plenty going on on top of that, things like the visualization plugins in various international databases and online APIs for connecting resources. Plenty of change in many, many areas.

I would just like to make a point that probably should have been made some time ago, at the beginning. Bioinformatics is a field in which different people can do completely different things. What I specialize in is population genetics, which is completely different from what Peter does, and completely different from what Bo does. It's not just that we are dealing with different biological species; we are dealing with very different kinds of analysis tools and different problems. That makes the specialized computational biology work that we do very different from user to user, and there are different areas in bioinformatics. Some people compare across species. Other people look at specific genes. In my case, I look at data across populations of the same species, and so on and so forth. And there are people who work with proteins. It's a massive field, and those of us who do some of the research actually have very different skill sets on the biological side.

And I imagine that that must manifest as a bit of complexity in the Biopython project, trying to cover such a broad range of requirements?

Effectively, it's an umbrella of submodules, each of which would have had an original author and maintainer. So Tiago is in charge of the population genetics part of Biopython, for example, and it's semi-independent from the rest of Biopython. That works for the most part. The stuff that I do on the sequence side tends to be more connected, because sequence data crops up everywhere.

How does Biopython compare to the sibling projects implemented in Perl, Ruby, and Java?

So the ones you've named, Biopython, BioPerl, BioJava, and BioRuby, are all of a similar age and have overlapping community membership. Those four together set up what we call the Open Bioinformatics Foundation, which initially was a nonprofit company to look after the servers and domain names. It's still a nonprofit, but it's now part of Software in the Public Interest. We still have domain names, but now with fewer physical servers; it's mostly Amazon stuff. Anyway, there are other projects using the same naming scheme: there's a BioGo, I think someone's trying BioJulia, and there's a BioJS project. They have different biological areas of strength. Some are very broad, like BioPerl and Biopython. Others are much more focused; BioJulia, I think, is very strong on the visualization side. But they're not exactly competitors. We're all trying to do research in biology, and we will contribute ideas and test cases, and it's a fairly positive attitude, not like the tongue-in-cheek rivalry between fans of particular programming languages. At least that's the attitude I try and take with this.

I think in most major programming languages today you can indeed find a de facto bioinformatics library, but I think this also reflects the number and type of people working in bioinformatics, because you can easily find Python programmers, Ruby programmers to an extent, people who are familiar with Java, and so on. If you're coming from a different background, I think it's not that difficult to find a library that you can work with or even start contributing to.

So for somebody who's using Biopython, what does a common workflow look like when they're working with biological data?

That touches on an earlier part of the conversation: most people are doing very, very different things. Also, workflow has many meanings in the field. It can mean an automated workflow, like a shell script or a YAML file or an XML file that describes a reproducible, rerunnable, repeatable pipeline for an analysis, the kind of thing that you might include in a scientific write-up of your work. But I think there's no one-size-fits-all answer for what a typical day of bioinformatics analysis looks like.

Unless chaos is an acceptable answer. There is lots and lots of variation, as Peter is saying. It's completely different from user to user, and that sometimes is problematic, because one of the things that we are very concerned about is replication of results. Sometimes it's a bit frustrating that things are too ad hoc-ish, in a way. I think it's worthwhile mentioning a project here, the Galaxy project, which is implemented in Python 2, and which is an amazing project that allows you to have precisely that: reproducible and standardized workflows.

So what are some of the most interesting or innovative uses of Biopython that each of you is aware of?

That was probably the hardest question on your list. It crops up a lot behind the scenes. Good infrastructure doesn't get noticed until it breaks, and I think a lot of people use Biopython or BioPerl or whatever as infrastructure just to get something done. It isn't in itself flashy or exciting, so I struggle to come up with a good answer. I'm hoping one of my colleagues on the call can come up with a nice example.

I also cannot pinpoint an exact answer. Well, maybe several, but what I remember is that when I started writing some libraries, I started getting emails from people. They'd say, well, I've been using this script in-house, or we've been using your module in our web service, and can you make this fix for us? Can you make this feature for us? So it's not really prominent per se, in that you don't see it immediately. But once you touch the libraries, when you work on some feature, you usually get in contact with people who are actually working with it. Again, I think this reflects the various subfields, the various different parts of bioinformatics that people are working on.

I can actually give you some examples of applications that I've worked with, some very practical applications. For example, we have used Biopython and Python to help make decisions in Yellowstone National Park about bison, about culling bison, for example. Obviously there is a big part of that which is in the Python world but not Biopython per se. It's about simulating genetics and trying to get measures of the viability of populations as a function of their size. Then you can make management decisions about populations, and very practical decisions were made with regard to the size of the population of Yellowstone bison.

Another example, again in the field of population genetics, is tracing the path of diseases or of resistance to diseases. For instance, you can use Biopython and Python and these kinds of population genetics tools to trace how resistance to drugs for malaria, which is still one of the biggest killers in the world, tends to start in Southeast Asia and then mostly expand from there to the rest of the world.

Another example is when you get all these data from sequencers, again from the field of malaria. You can use all these methods and all these tools, and again it's not just Biopython, it's the whole ecology of bioinformatics software, to see, for example, in the case of mosquitoes that transmit malaria, which parts of the genome are involved in creating resistance to insecticides. With that information you can, for instance, see how effective a certain insecticide is and what effects it's having on the mosquito genome. So these are very practical applications. The infrastructure is there, and you need this infrastructure to help answer some of these problems.

That's a nice high-level answer from Tiago, with some high-profile examples. I was thinking more of little things, like people on the mailing list or on Twitter or on Stack Overflow who've had a problem and got a short solution from a community member using Biopython. That kind of thing happens a lot, and I like that. Little steps too.

Yeah. That's definitely one of the great things about open source software and open science in general: the ability for cross-collaboration that can have compounding effects on the quality and capabilities of the end result.

So what are some of the most challenging aspects of developing and supporting Biopython, both from a technological and a community level?

The very mixed environment that people run this software in. We try to support Windows, Linux, and Mac, and we occasionally have people on FreeBSD and other things as well. That has definitely got a lot easier in the last 10 years because of publicly minded companies like Travis CI or AppVeyor that provide essentially free continuous integration testing for open source projects. So we now have automated testing for the Linux platforms and for Windows. That's really great, because typically our developers will, for example, work on Linux, and they won't have easy access to Windows machines. So when a user says something's breaking for them on Windows, it's not always straightforward to work out why.

So that's one challenge. Another we've touched on is the age of the project. Some bits of the code are very old, and the people who originally wrote them are no longer actively involved. Over time we've updated things, we've done some spring cleaning, and we're trying to improve the unit test coverage, which helps greatly in that situation.

If I may add something on the technological level, I think it again has to do with the fact that we parse various file formats a lot. A lot of the file format specifications are actually not so well defined. Some of them are meant to be human readable, but they contain so much information that it's still useful to be able to parse them. So we write parsers, but then we still see corner cases where the specification is either not well defined or not well followed, and the parser breaks. Then we have to add a new test case and get the parser to handle it. We have a very big list of unit test cases, I think, exactly to tackle this problem. So it's been a challenge on that part, I think.

As an example of that, in the age of Unicode and different encodings, one of the more recent problems is strange characters in datasets where the encoding is not explicit. So should we interpret this as an e-acute, or is it invalid data, or what? Occasionally this happens even with big public databases, where they are curating user-submitted information and strange characters get in. It could be that someone wrote this up on a computer set to, I don't know, a Russian keyboard or whatever, and then this character has entered the database, and when you try to parse it on an American setup, it doesn't work. Unicode will solve this, but at the moment bioinformatics has plenty of legacy file formats, and encodings are one of the gray areas that are completely under-documented.
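The ambiguity described here is easy to reproduce. The sketch below decodes the same bytes under three encodings; the byte string is a made-up stand-in for a field from a file that does not declare its encoding, and the choice of Latin-1, Windows-1251, and UTF-8 is only an assumption about what a Western European, a Russian, and a modern setup might use.

```python
# Hypothetical field from a file that does not state its encoding.
raw = b"Caf\xe9"


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


as_latin1 = try_decode(raw, "latin-1")  # byte 0xE9 read as e-acute
as_cp1251 = try_decode(raw, "cp1251")   # same byte read as a Cyrillic letter
as_utf8 = try_decode(raw, "utf-8")      # not valid UTF-8, so None

for name, text in [("latin-1", as_latin1), ("cp1251", as_cp1251), ("utf-8", as_utf8)]:
    # ascii() keeps the output printable on any terminal.
    print(name, ascii(text))
```

All three readings are defensible for a parser that has no encoding declaration to go on, which is why the same record can work on one machine and fail on another.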

Another problem is the fact that some of the people developing these tools just learned to program, in whatever language, in order to develop that tool. For instance, the strangest thing I've ever seen was a supposed XML file which had comments, not XML comments, but comments in the middle. It was anything but a standard format. It's very common that the people who write tools are essentially researchers, not really programmers, and dealing with some of the formats, and the problems of those formats, can be really challenging, and sometimes not in a good way.

And what are some of the most exciting developments in bioinformatics and biology that have either happened in recent memory or that are in development, that each of you is keeping an eye on?

Personally, I would say the revolution in sequencing is really exciting, and it's still changing. Currently, sequencing machines give you quite small fragments of DNA. They've got longer and are now into hundreds of base pairs, but the latest sequencing technology should give us even longer reads of DNA, and this will change a lot of the analytical workflows. Before, a lot of them were like a jigsaw puzzle where you had to put together the tiny pieces to get the answer. The new sequencing machines are going to give you much larger jigsaw pieces, which will make certain tasks much easier. Also, these machines, the Oxford Nanopore in particular, are very small. They're portable. I think we're going to reach the point where, perhaps in five years, you will be able to sequence things literally in the field with your laptop hooked up, and that will open up an awful lot of interesting biological applications: for example, population monitoring, disease outbreak monitoring, agriculture, medicine. I think that's one of the most exciting developments in biology and bioinformatics at the moment.

Yeah, not much to add there. It's really the big thing. Where sequencing technology goes, we will have to follow, and some very interesting applications are probably just around the corner.

Yeah, I have to add that I have the same thing written in my notes: nanopore is probably one of the most exciting things happening in recent times. You get much longer reads, or much bigger jigsaw puzzle pieces, as Peter mentioned. But that's on the technology side, the part where the data is generated. On the practical side of bioinformatics, I have been, and still am, excited by the development of the Python ecosystem. I have the feeling it's become the de facto language for large parts of bioinformatics. One of the things that I'm interested in is how people are coming to a more standard way of structuring their data pipelines, or structuring their workflows. This has also been a problem in the field, I would say. Aside from having so many different file formats, we also have different ways of saying how a pipeline should run. But the community is now coming together more closely to agree on a standard and to say: this is my set of data and methods, you can run it on many different back ends, and this is how I define it so that everybody can understand.

Probably another exciting development, because lots and lots of data is being generated, is the fact that high-performance computing is becoming a thing, and the development of high-performance algorithms is now very important. Biology, which was not using much of this, suddenly needs a lot of expertise to be at the forefront of everything related to cluster computing, GPU computing, etcetera. And this is probably going to increase and increase as people are able to generate more data in a cheaper and cheaper way.

It's interesting from the point of view of the next generation that biology undergraduate students are now, and have been for some time, having to learn not just mathematics for statistics but also programming. Many universities now include substantial programming courses for biology students. So it's becoming a more numerate field, and that is exciting and important for the future.

Yeah. I also have the impression that in the field you can find everybody. The big umbrella is biology, but you can easily find people from computer science, biologists themselves, physicists, chemists, all sorts of statisticians. Everybody's coming together, and it's really reflected in the development, I think, now and in the coming years. So a lot of exciting things.

And one thing that I'd like to go back to is the increase in the availability of sequencing data. I'm assuming that's going to increase the overall size of the datasets that you need to process, so I'm wondering if you foresee the length of the sequences creating any additional complexities in processing them within Biopython, and whether that might require any sort of architectural changes or just refactoring in those sequencing libraries?

One of the main tasks with DNA and RNA sequencing is genome sequencing, and the interesting thing there is that as the technology improves, this should actually get easier. So it wouldn't surprise me if this becomes effectively a solved problem, and we wouldn't have to worry too much about dealing with lots and lots of jigsaw pieces to put together a genome. But, yes, input/output is a bottleneck, and Python is not the best fit, but then you can drop down to C or call into specialized command line tools for doing the analysis. Things like sequence lengths can be a problem. Some of the early file formats and tools in this area used a 32-bit limit on index sizes, and that works okay for humans but doesn't work very well for plants or even other animals like marsupials. So the nitty-gritty of the computer science does get involved, and sometimes it's handcuffing us.
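A rough illustration of the 32-bit problem, as a sketch only: the conversation does not say which formats are affected or whether their fields are signed, so a signed 32-bit integer is assumed here, and the two lengths are invented round numbers, not measurements of any real genome.

```python
import struct

SIGNED_32_MAX = 2**31 - 1  # largest value a signed 32-bit field can hold


def fits_signed_32(value):
    """True if value can be packed into a signed 32-bit integer field."""
    try:
        struct.pack("<i", value)
        return True
    except struct.error:
        return False


# Hypothetical coordinate sizes, chosen only to sit either side of the limit.
hypothetical_lengths = {
    "small_sequence": 250_000_000,
    "large_sequence": 3_000_000_000,
}

for name, length in hypothetical_lengths.items():
    verdict = "fits" if fits_signed_32(length) else "overflows"
    print(f"{name}: {length:,} bases {verdict} (limit {SIGNED_32_MAX:,})")
```

A format with such a field cannot record a position past the limit at all, so the only real fixes are a wider field or a different format.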

So for somebody who wanted to get involved in Biopython and start contributing, how much domain knowledge is necessary for them to get started? And are there any areas of the project that would benefit from somebody who's purely focused on the software engineering aspects of it?

Let me just say something. I'm a computer scientist by training. When I started this, I did a master's, and I ended up doing the course on population genetics by mistake, and I ended up doing population genetics now. It can really be overwhelming; the terminology and everything can be overwhelming. My suggestion is don't be put off by that. Don't be afraid. Currently there are many, many things to do, and some of them are a bit tedious. All this glue is probably not the most exciting stuff, but if you like that, that's great. But there is lots of exciting stuff occurring on the algorithmic side, because this deluge of data needs to be processed in an efficient and reliable way. You might need to understand a little bit of the problem, but after that it's mostly programming and statistics. And I would say that, especially when you are developing algorithms, statistics can be an even bigger bottleneck in terms of knowledge than biology itself. But, sure, there is lots of space nowadays for people who are interested in the field, and they can just do very interesting programming and computing and work on cutting-edge stuff to process this deluge of data, which is data that is ugly, that has errors, that needs to be corrected in very sophisticated ways in terms of statistics and programming. The data that we get from sequencers has errors, and there are lots of copies, and that makes for very exciting problems. It's not just glue; there is plenty of other stuff, and if you are able to get through the biological jargon, you will find really interesting stuff on the programming side.

So for people wanting to contribute to the project, I would normally ask them what area of bioinformatics they are working on and try to match them up to a particular bit of Biopython that matches those strengths. But we have people with very little or no biological background, and there are still plenty of things that they can do. If they have, for example, knowledge of C, they may be able to help improve the C bindings, or even do what you might call housekeeping work, like style improvements to better match the PEP 8 style guidelines, or docstrings, improving the documentation in the code. There are plenty of things that people can do without necessarily having any biological or technical domain knowledge. So a variety of contributors is great.

And one of the things that I'm curious about from your answer, Tiago: when you do have errors in the sequencing data, I'm wondering if that's ever due to actual mutations in the genome, and what some of the difficulties are in differentiating between legitimate mutations and bad sequencing?

That is a serious problem. But generally, no. The problem is that, because we are dealing with very, very small chemical structures, it's the sequencer that's not sure about the reads. The sequencer sends some metadata, which is normally the quality of the read as the sequencer sees it. Actually, one of the interesting applications is trying to find honest mutations, real mutations. It depends on the organism, but in most organisms the mutation rate is much, much lower than the error rate. So the two things get conflated, and you need to spend lots of time trying to work out what is a real mutation versus what is an error, especially because error rates are normally much higher than real mutation rates. But I think both Peter and Bo probably have something to contribute here, because they probably work as much as, or even more than, me on that subject.

I think it really depends on the quality of the incoming data, as in the raw biological sample, and also on the machine itself, because there is a plethora of machines today, and they all have different error rates and also different trade-offs, of course. I have the impression we can never really say with absolute certainty. This is where the statistics come in. You have to model these things in your calculations: you have to model the error rate, and maybe the underlying mutation rate. Then you come up with an answer, and you test your hypothesis that way. But so far it's something inherent, I think, that needs to be taken into account in all the downstream calculations.
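The per-base quality metadata mentioned here is commonly stored as Phred scores. The speakers do not name an encoding, so the sketch below assumes the widespread FASTQ convention in which each quality character encodes Q = ord(character) - 33 and the estimated error probability is 10^(-Q/10). The function name and the sample quality string are invented for illustration.

```python
def phred33_error_probabilities(quality_string):
    """Convert a Phred+33 quality string to per-base error probabilities."""
    probabilities = []
    for char in quality_string:
        q = ord(char) - 33  # assumed ASCII offset of 33
        if q < 0:
            raise ValueError(f"character {char!r} is below the Phred+33 range")
        probabilities.append(10 ** (-q / 10))
    return probabilities


# Made-up quality string: Q40, Q20, Q10 and Q0 in that order.
probs = phred33_error_probabilities("I5+!")
for q_char, p in zip("I5+!", probs):
    print(f"{q_char!r}: estimated error probability {p:.4f}")

# Expected number of wrong bases in this four-base read.
expected_errors = sum(probs)
print(f"expected errors in read: {expected_errors:.4f}")
```

Probabilities like these are what a statistical model would weigh against an assumed mutation rate when deciding whether a mismatch is a real variant or a sequencing error.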

And, occasionally, the error is actually from the computer. This isn't a Biopython example. It's an example from an assembly tool called MIRA, and the author had a weird bug report. DNA sequences are represented as strings of letters, and for some reason he was getting the wrong letter, and he couldn't work out where this mystery letter was coming from. It didn't match the input data. He eventually tracked it down to faulty RAM. So a cosmic ray or just chance had flipped a bit and changed the letter, and it was a problem in the computer itself that had corrupted the data. So there are plenty of things that can go wrong, and checking your code and your pipeline with positive controls and negative controls to make sure that what you're seeing is real is an important part of science on a computer.
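To make the bit flip concrete: in ASCII, the letters A (0x41) and C (0x43) differ by a single bit, so one flipped bit in memory is enough to turn one valid base into another. The sketch below is only an illustration of that idea, not a reconstruction of the actual bug. The helper name and the example sequence are hypothetical, and the checksum comparison is one possible control, assumed here for illustration.

```python
import hashlib


def flip_bit(sequence: str, position: int, bit: int) -> str:
    """Return a copy of an ASCII sequence with one bit flipped at the given position."""
    data = bytearray(sequence, "ascii")
    data[position] ^= 1 << bit  # XOR toggles exactly one bit of that byte
    return data.decode("ascii")


original = "GATTACA"
# Flip bit 1 of the second letter: 'A' (0b01000001) becomes 'C' (0b01000011).
corrupted = flip_bit(original, position=1, bit=1)

# The corrupted string is still a plausible DNA sequence, so nothing crashes.
# Comparing checksums of the input and of what was actually processed exposes the change.
original_digest = hashlib.sha256(original.encode("ascii")).hexdigest()
corrupted_digest = hashlib.sha256(corrupted.encode("ascii")).hexdigest()

print(original, "->", corrupted)
print("checksums match:", original_digest == corrupted_digest)
```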

And what are some of the most problematic limitations of Biopython and the overall bioinformatics landscape? And how do you work around them in your day to day?

I would say inconsistent file formats is one of the most problematic things, and test cases and wider user uptake is one of the best ways to work around that. The more people are using the code, the better tested it will be, and the more likely the corner cases will be found and reported and fixed. But against that, we have a rapidly changing ecosystem. The sequencing technology in particular is changing rapidly, and the file formats and standards themselves are evolving over time. So it's a moving goalpost, and to me, that's one of the biggest limitations.

So user feedback is how we can really work around that.

I have to agree on the file formats. I don't want to use annoying as the word, but it's very early in the analysis that you encounter these things sometimes, and you really have to take them into account. If you want to make a script or a program that can work in most cases, you have to take them into account. And often there's no way of knowing what kind of error is going to occur other than doing the analysis and expecting people to actually report back to you if they found something weird or unexpected, and then coming up with a way to circumvent that.

As an example, gene names: the people that discover a gene usually get to name it. And normally the names are fairly sensible. They're made up of letters and numbers, but you get special characters in there as well, like minus signs or angle brackets or quotes. And those tend to cause problems when they're embedded in larger file formats.

I don't want to say nice again, but I think it's sort of good in a way if your program errors out when it cannot parse something, because there are also some problems that don't. For example, there are many ways to represent a coordinate in various file formats. Some people use zero-based, some people use one-based counting, but these things don't trigger an error. Sometimes people don't conform to a file format that says it should be one-based and use zero-based, for example, and the analysis just goes through until the end, and then you don't notice the error until fairly late in the analysis process. So that's something that also needs to be in the back of the mind of most people working in bioinformatics today.

And that's the optimistic view, because I suspect many errors are actually not detected at all, because they are silent bugs, as Bo was saying. So I wouldn't be shocked if many of these analyses with important silent bugs actually get published, and nobody really detects the problem there.
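The zero-based versus one-based mix-up described above is a good example of a silent bug, and it can be sketched in a few lines. The example assumes two common conventions: one-based inclusive coordinates (as in GFF-style files) and zero-based half-open coordinates (as in BED-style files). The function names and the toy sequence are hypothetical.

```python
def slice_one_based_inclusive(sequence: str, start: int, end: int) -> str:
    """Extract a feature whose coordinates are 1-based and inclusive (GFF-style)."""
    if start < 1 or end > len(sequence) or start > end:
        raise ValueError(f"coordinates {start}..{end} are outside the sequence")
    return sequence[start - 1:end]


def slice_zero_based_half_open(sequence: str, start: int, end: int) -> str:
    """Extract a feature whose coordinates are 0-based and half-open (BED-style)."""
    if start < 0 or end > len(sequence) or start > end:
        raise ValueError(f"coordinates {start}..{end} are outside the sequence")
    return sequence[start:end]


sequence = "ACGTACGTAC"
start, end = 3, 6  # The same pair of numbers, read under two conventions.

as_one_based = slice_one_based_inclusive(sequence, start, end)
as_zero_based = slice_zero_based_half_open(sequence, start, end)

# Both calls succeed without any error, yet they return different subsequences.
print("1-based inclusive:", as_one_based)
print("0-based half-open:", as_zero_based)
```

Neither interpretation raises an exception, so the only defense is knowing which convention the file format specifies and testing against a feature whose sequence is already known.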

Yeah. That's a bit worrying indeed. And I think to this extent, we actually do a lot of validation in Biopython. Right? I think a lot of the code is sanity checks on whether the file formats that we're parsing are really what we think they are.

A good portion of the bug reports are people with weird files that don't follow the format, or where something's gone wrong. So, yes, Biopython tries to be quite strict on parsing things and catch errors early. And that's a design choice.

Mhmm. Yeah. In general, it's definitely better to fail fast and fail loudly so that people can account for that, and maybe potentially provide some override path if they say this isn't actually an error, it's something that I'm legitimately trying to do. But, yeah, being able to raise issues earlier in the process will definitely save a lot of time and effort.

Resist the temptation to guess, which is a philosophical difference. I think, for example, BioPerl, for their file format parsers, are more willing to do things like automatically detect the file format. Whereas in Biopython, we said, look, the user should know what it is, and if it doesn't match, that's usually a sign that there's a problem somewhere else.

Yeah. I think that's also alluding to one of the Zen of Python lines, I guess. Better be explicit than implicit, I think.

Exactly. Yes.
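The fail-fast, no-guessing approach can be sketched with a small strict parser. This is not Biopython's implementation, only a standard-library illustration of the design choice. The function name is hypothetical, and the FASTA handling is deliberately simplified. Biopython's own `Bio.SeqIO.parse` follows the same spirit in that the caller passes the format name explicitly rather than having it detected.

```python
import io


def parse_fasta_strict(handle):
    """Yield (title, sequence) pairs from FASTA text, raising instead of guessing."""
    title = None
    chunks = []
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # Blank lines are tolerated in this simplified sketch.
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:].strip()
            chunks = []
        elif title is None:
            # Sequence data before any '>' header: this is not FASTA, so stop loudly.
            raise ValueError(
                f"line {line_number}: expected a '>' header, got {line[:20]!r}"
            )
        else:
            chunks.append(line.strip())
    if title is not None:
        yield title, "".join(chunks)


# A well-formed FASTA record parses as expected.
good_file = io.StringIO(">seq1 example\nACGT\nTTGA\n")
records = list(parse_fasta_strict(good_file))
print(records)

# A FASTQ file handed to the FASTA parser fails on the first line instead of being guessed at.
wrong_format = io.StringIO("@read1\nACGT\n+\nIIII\n")
try:
    list(parse_fasta_strict(wrong_format))
    failed_loudly = False
except ValueError as error:
    failed_loudly = True
    print("Rejected:", error)
```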

So for people who are interested in getting involved in the field of bioinformatics, what are some of the bits of encouragement that you might offer, and maybe some reference material that they might look at to determine if that's a path that they want to go down?

The advice I give people is that they kind of have to learn to program, and the best choice of your first programming language would be whatever your peer group is using, so that you have people right there in the room that you can talk to and get help from.

So if the group happens to work in Python, great, start learning Python. If they all use Perl, then learn Perl. In-person feedback when you're getting started is really helpful. Once you build up some experience, you're going to be able to help yourself with online searching, and googling for error messages is a very practical way of learning to program. And unfortunately, you need to do that in bioinformatics as well.

I think the answer will vary slightly with regards to your background. So for instance, if you come from computer science, you might be overwhelmed by all the biology and all the biological problems. And, again, I think it's important that you find mentors or somebody that can help you navigate what is the enormous world of biology.

And you really need to have a conversation with researchers, who, by the way, probably need you, because there is a massive need for people to do that data analysis. And finding good mentors on the biological side is very important if you come from computer science, I think.

Yeah. I cannot overstate the importance of having mentors, like direct mentors that you have in your school or maybe in your lab, or also somebody online that you don't meet in real life very often, because there's just so much happening in bioinformatics and in so many different subfields: on the mathematics part, on the statistics part, on the technological side, on the computer science, the algorithms that people write to analyze this data. And you can easily get lost with so much development. You don't know which one to pinpoint, and you don't know which one to pay attention to more closely. So I think, yeah, having a mentor, having a peer group is probably the most important thing.

And I would also add: resist the temptation of writing everything on your own. That's a good way to learn, I think, if you want to do it on your own. But a lot of the tool sets, a lot of the libraries are getting mature, and it's not that difficult these days to find a collection of people or a collection of libraries that have been developed taking care of so many corner cases that you're probably not aware of, that you may have stumbled on if you had written your own tool. So that's really the importance of peer groups, I guess, because they can point you to the latest advances and the latest best practices to grow in the field.

And just to note also, there is some asymmetry in information on the Internet. If you come from computer science, you are used to googling for everything, and there is an answer. There are actually 10 answers, and the documentation is great. But if you then go into the research side, the amount of available documentation is much, much lower, and the communities are smaller.

So it's much more difficult if you don't have somebody to help you and guide you, because the amount of available information that really helps you is not as good as if you go in the other direction. Because, as we all know, if you want to learn programming, even if you are at home, there's lots and lots of stuff. But if you want to learn cutting edge biology, there is much less stuff around.

That's one of the nice things about being involved in Biopython and the wider community here. It's networking and meeting other like-minded people who may be able to help in practical things about how to do a particular analysis, what tools to recommend, or job opportunities, that kind of thing. I like the community here, and that's one of the reasons why I'm still actively involved many years later.

So are there any other topics that you think we should cover before we start to close out the show?

I'll probably just add that if you come into this field, you have to like multidisciplinarity and people with different thought patterns, because I imagine most listeners here are programmers or wanting to program. They will have to learn a little bit of biology. They will have to learn a little bit of statistics, and they might end up interacting with biologists, statisticians, biochemists, who have different mentalities and different ways of addressing problems. And if you like that, that's really awesome. But you have to be prepared that this is a multidisciplinary effort where you will, sooner or later, have to learn new stuff that you are not very comfortable with.

So for anybody who wants to follow the work that you are doing and the work that's being done on Biopython, I'll ask you each to add your preferred contact information to the show notes. So for anybody who wants to follow up, they can do that from there. And with that, I'll move us to the picks. My pick this week is a conference that's local to where I'm at in the Boston area. It's actually happening up in Lowell, Massachusetts.

It's called keep it low conf, and it's a DevOps oriented conference. So it's fairly small scale and low key, but it's trying to reimagine some aspects of the sort of common tech conference and make it a little bit more, I guess, cozy is probably the best way to describe it. So for anybody who's in the area and interested in that sort of general field of technology, I definitely recommend going and taking a look at that. It's run by one of the local community members, so it's definitely worth supporting his efforts there. And with that, I'll pass it to you. Do you have any picks for us today, Peter?

I could pick Jupyter Notebooks, formerly known as IPython.

For those that aren't aware, roughly speaking, it's a way of combining software and notes in markup languages in one document, a bit like Mathematica or MATLAB. So you can have your figures and your code for an analysis all in one place, and it's very nice as a teaching tool or for sharing an analysis with a colleague. And there are various things online that will exploit them. So if you just stick them up on GitHub, then it will render them in a read-only way, which is really neat, and I want to play with these more.

And, Bo, do you have any picks for us this week?

I think I'll go with conda, actually. That's sort of like a distribution channel for scientific packages these days. It's fairly similar to PyPI, in that I think it takes a lot of inspiration from PyPI. But it's actually becoming more used in bioinformatics today because it can deploy not only Python packages, but also other binary packages that are used often in bioinformatics. There is a bioinformatics-specific conda channel called Bioconda. If you want to distribute your packages, especially ones related to bioinformatics, I encourage you, I think I encourage everybody actually, to look at it and how they do things. It's not perfect 100% reproducibility, but it is very close, and it's very convenient to use and also very convenient to deploy packages there. So I think that would be my pick today.

Okay. And how about you, Tiago?

I have a few. If you want to develop in the JavaScript world, the Brython project allows you to write Python 3 directly in the browser. And it's a really, really interesting way of getting your feet wet in the web programming space if you are a Python programmer. On a nontechnical note, if you want to know where paradise is, that is Glacier National Park and Flathead Lake in Northwestern Montana. And it's probably the most beautiful place I've ever been, and I encourage you to visit it because, if you like nature, it's absolutely mesmerizing.

Well, I appreciate all of you taking the time out of your day to join me and talk about the work you're doing with Biopython and bioinformatics. It's definitely an interesting field that's seeing a lot of activity and innovation, so it's definitely a space I'm interested in keeping an eye on. I appreciate your time, and I hope you enjoy the rest of your days.

Thank you so much.

Thank you. Bye bye.

Thank you for organizing this.

