Summary
Bioinformatics is a complex and computationally demanding domain. The intuitive syntax of Python and extensive set of libraries make it a great language for bioinformatics projects, but it is hampered by the need for computational efficiency. Ariya Shajii created the Seq language to bridge the divide between the performance of languages like C and C++ and the ecosystem of Python with built-in support for commonly used genomics algorithms. In this episode he describes his motivation for creating a new language, how it is implemented, and how it is being used in the life sciences. If you are interested in experimenting with sequencing data then give this a listen and then give Seq a try!
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, node balancers, a 40 Gbit/s public network, fast object storage, and a brand new managed Kubernetes platform, all controlled by a convenient API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they’ve got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on great conferences. And now, the events are coming to you, with no travel necessary! We have partnered with organizations such as ODSC, and Data Council. Upcoming events include the Observe 20/20 virtual conference on April 6th and ODSC East which has also gone virtual starting April 16th. Go to pythonpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host as usual is Tobias Macey and today I’m interviewing Ariya Shajii about Seq, a programming language built for bioinformatics and inspired by Python.
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by describing what Seq is and your motivation for creating it?
- What was lacking in other languages or libraries for your use case that is made easier by creating a custom language?
- If someone is already working in Python, possibly using BioPython, what might motivate them to consider migrating their work to Seq?
- Can you give an impression of the scope and nature of the tasks or projects that a biologist or geneticist might build with Seq?
- What was your process for identifying and prioritizing features and algorithms that would be beneficial to the target audience?
- For someone using Seq can you describe their workflow and how it might differ from performing the same task in Python?
- How is Seq implemented?
- What are some of the features that are included to simplify the work of bioinformatics?
- What was your process of designing the language and runtime?
- How has the scope or direction of the project evolved since it was first conceived?
- What impact do you anticipate Seq having on the domain of bioinformatics and genomics?
- What have you found to be the most interesting, unexpected, and/or challenging aspects of building a language for this problem domain?
- What is in store for the future of Seq?
Keep In Touch
Picks
- Tobias
- Ariya
- Breakthrough documentary
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- Seq
- MIT CSAIL
- Bioinformatics
- LLVM
- Intermediate Representation
- MATLAB
- Moore’s Law
- BioPython
- Smith Waterman Algorithm
- Hamming Distance
- Pattern Matching in Functional Programming
- SIMD == Single Instruction Multiple Data
- Computational Genomics
- Phylogenetics
- Sequence Read Archive public data set
- Google Cloud Life Sciences
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With 200 gigabit private networking, node balancers, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, all controlled by a convenient API, you've got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models or running your CI and CD pipelines, they've got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute.
Don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey. And today, I'm interviewing Ariya Shajii about Seq, a programming language built for bioinformatics and inspired by Python. So, Ariya, can you start by introducing yourself?
[00:01:09] Unknown:
Sure. I'm a fourth-year PhD student at MIT CSAIL, working with professors Bonnie Berger and Saman Amarasinghe. Before Seq, I worked on various other genomics applications like genotyping and sequence alignment for a particular kind of third-generation sequencing data. And now, for the past couple of years, we've been working on this project. So that's sort of a little bit about me.
[00:01:33] Unknown:
And do you remember how you first got introduced to Python?
[00:01:37] Unknown:
So the first language I learned was actually Java in high school. And shortly after that, I stumbled across Python, and I think I kind of gravitated towards it because it seemed like a language that was really targeted at people who don't have a computer science background, from its design and everything. Its very simple, clean syntax and the large community kind of led me towards Python. So that's how I first encountered it.
[00:02:07] Unknown:
And so now you're working on this Seq project. I'm wondering if you can give a bit of a description about what it is and your motivation for creating it and how you first got interested in the area of bioinformatics.
[00:02:18] Unknown:
Sure. So I guess I'll just say a few words about what Seq actually is. At a high level, Seq is essentially a domain specific language for computational genomics and bioinformatics. Language wise, it's based largely on Python, so you can think of it as sort of a Python implementation, on top of which we've added a few domain specific features, types, and compiler optimizations. The key difference between Seq and CPython, for example, is that Seq compiles to LLVM IR with no runtime overhead. So the language itself is completely statically typed and, unlike CPython, we have no runtime type information or anything like that. As far as the motivation is concerned, really what motivated Seq was the fact that we were looking at these different genomics applications, and what we realized was that they're doing different things at a high level, but they're essentially reusing the same primitives. So they're operating on sequences, DNA sequences, for example. Those are strings of A, C, G, and T.
They're dealing with really big data. They're doing operations like sequence alignment, sequence indexing, hashing, stuff like that. So they're all sort of using the same set of building blocks. And that's what really motivated us to build a language that exposed these operations as primitives and a compiler that understood them and could do optimizations on them. We tried for a really long time to build a high level language that exposed these primitives as sort of coarse-grained building blocks that you could glue together in different ways. Ultimately, we spent, like, a year and a half trying to do that, and we essentially found that it was impossible. In bioinformatics, there's just too much. Yes, they're using these primitives, but there's so much stuff interspersed in between that to build a high level language that had just these building blocks was essentially impossible.
And I think a good analogy for that is MATLAB. Right? So MATLAB is like a language for linear algebra, but it's actually a low level language. Right? Because you can't express all the things that are relevant to a linear algebra application with just matrix operations alone. You need some low level infrastructure, and I think bioinformatics is really similar. So that's why we kind of settled on Python, mainly, again, because of its appeal to non-programmers and the large community that it has. But we still need the performance. Biological data is growing really fast, much faster than Moore's Law, and because of that performance is really critical in our domain. So that's why we took Python and reimplemented it from the ground up in a statically typed way with, again, no runtime overhead. So that's sort of the motivation for Seq.
In terms of bioinformatics itself, I think the reason that I gravitated towards that was just because, I don't know, I wish I had a more concrete answer, but it's a really cool field, I think. And it's a really concrete application of computer science. So I think that's sort of why I really like this field.
[00:05:21] Unknown:
Yeah. It's definitely an interesting area, and it's exciting to see some of the ways that computational power can be applied to the realms of biology to get some more concrete and in-depth understanding than what we've been able to do previously. Mhmm. Definitely. And do you have much experience in the past of implementing different languages and compilers, or is this one of your first forays into that space?
[00:05:49] Unknown:
So my research has always been basically bioinformatics, but on the side I've kind of had a few toy projects here and there that were compilers. So I have a little bit of knowledge, nothing really super formal, but just from personal projects that I've worked on in the past. But I'd say this is definitely the biggest compilers related project that I've been a part of.
[00:06:09] Unknown:
And you mentioned that in some of the other available languages and runtimes there weren't the necessary primitives specific enough to bioinformatics to be able to get the performance that you were looking for. And so you went down the road of creating this custom implementation, and you also targeted Python as the syntax for it. And I'm curious, for somebody who's already working in Python and maybe using something like the BioPython packages for doing some sort of bioinformatics processing, what might motivate them to consider migrating their work to Seq even if they already have an existing code base? And what are some of the potential challenges that they might face in that conversion?
[00:06:52] Unknown:
Sure. So one of our goals is basically to make the Python to Seq transition as seamless as possible. We're not quite at the stage where you can literally take an entire preexisting Python code base and run it in Seq. You'll probably need to change a few minor things here and there. But we're continuing to close the language gap, and we're hoping to get to that stage at some point in the near future. But if you have some small snippet of Python code, most of the time it should just work as is in Seq. About BioPython in particular, and maybe some other libraries just the same, I think one of the main reasons for switching would definitely be performance. Right? So BioPython is a pure Python library, and we've actually done a few benchmarks against it. And Seq is substantially faster, especially if you're dealing with these huge datasets. Now one thing I will mention that's interesting is we're actually implementing BioPython in Seq. BioPython has a preexisting API. We have a lot of the same functionality, and we're in the process right now of implementing BioPython's API using the primitives that are available in Seq. So, again, I think the point is to try to make that transition as seamless as possible.
But for someone who's really interested in performance, I think Seq would definitely be a pretty good alternative to BioPython. I think another important point when talking about the difference between a language like Python and a language like Seq is the metadata overhead, for lack of a better term. So if you think about a linear algebra application that does operations on a few really big matrices, most of the time is actually just spent in the linear algebra kernels, which are hand optimized C or Fortran kernels. And you really only have a few objects that exist in your Python application. In bioinformatics, or a lot of these genomics applications that we're interested in, the situation is really different: you have potentially billions of sequences that you're processing at any given time. So any metadata overhead or runtime type information or per object overhead adds up really quickly. So I think that was also one of the motivations to build Seq in a way that eliminated all of that overhead. So I think that's another reason why Seq might be a good choice for those kinds of applications.
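The per-object overhead described here is easy to see in CPython. The sketch below is only an illustration, not a measurement from the episode: it compares the two bits per base that an A/C/G/T read strictly needs with what `sys.getsizeof` reports for the same read stored as a Python `str`. The read is made up, and exact sizes vary by Python version and platform.

```python
import sys

# A made-up 32-base DNA read stored as an ordinary CPython string object.
read = "ACGTACGTACGTACGTACGTACGTACGTACGT"

# Two bits are enough to distinguish A, C, G and T.
payload_bytes = (2 * len(read)) // 8

# What CPython reports for the str object, header included.
object_bytes = sys.getsizeof(read)

# Everything beyond the raw information content is per-object overhead.
overhead_bytes = object_bytes - payload_bytes

print(f"{len(read)} bases: {payload_bytes} bytes of information, "
      f"{object_bytes} bytes as a Python str "
      f"({overhead_bytes} bytes of overhead)")
```

Multiplied across billions of reads, that fixed cost per object is the overhead being discussed.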
[00:09:13] Unknown:
And in terms of the actual code bases that people who work in bioinformatics are going to be writing and collaborating on. What is the general scope and scale of those types of projects in terms of the sort of complexity of the applications and the maybe number of different code files? Is it something where somebody writes a fairly straightforward script to brute force through these different sequences? Or is it something where people are going to be leaning heavily on libraries and wanting to integrate with something like web frameworks or other components of the ecosystem?
[00:09:48] Unknown:
Yeah. That's a good question. I think it varies a lot. There are cases where you're doing some kind of ad hoc analysis and you really only have a single script that you're worried about. There are definitely a lot of cases where you have some bigger project where you're not necessarily relying on external libraries, but the project itself is bigger and you have multiple source files or something like that. And then, like you said, there are definitely cases where you're relying on external libraries. Probably not web related, but definitely things like machine learning frameworks, for instance. So I think it varies a lot.
And I think because of that, it's really important for Seq to have interoperability with other frameworks, and that has been a priority for us. So I think it definitely varies a lot based on your concrete application.
[00:10:41] Unknown:
For the primitives that you are incorporating into Seq, you mentioned that there are some elements that are lacking in different implementations. And I'm curious what your process was for identifying what the useful primitives were and some of the algorithms to bake into the implementation of Seq to simplify the work of people working in bioinformatics.
[00:11:04] Unknown:
Sure. So I think that's actually been one of the hardest things to do because, again, we spent a really long time trying to design a language that isolated these primitives and just exposed them in a higher level language. And the reason it's hard is that, abstractly, you have these primitives like sequence alignment, for example, which would be like a Smith-Waterman dynamic programming algorithm, or some kind of sequence indexing, whether you use a hash table or an FM-index, which is also a very common data structure used in genomics applications. But when you actually look at implementations of these things, they vary so much from case to case that it's really hard to provide these things as fixed, coarse-grained building blocks.
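For readers who have not met it, the Smith-Waterman recurrence mentioned above can be sketched in plain Python. This is the textbook form with a linear gap penalty, not Seq's implementation; the function name and the scoring values (`match`, `mismatch`, `gap`) are arbitrary choices for illustration.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)   # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the alignment diagonally with a match or mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Or open/extend a gap in either sequence.
            up = prev[j] + gap
            left = curr[j - 1] + gap
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "ACGT"))
```

Real tools differ from this sketch in exactly the ways described above: affine gap penalties, banding, traceback to recover the alignment, and SIMD vectorization all change the structure of the loop.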
So, again, this was why we settled on a much lower level language. But having said that, there are definitely things that are essentially ubiquitous. So sequence types: again, if you're dealing with DNA, that's strings of A, C, G, T, or you can also have an N base there, which is an ambiguous nucleotide. Then k-mer types, which are also frequently used. Sequences are arbitrary length; k-mer types are fixed-length sequences of length k, and you can do things like 2-bit encode those. There are very common operations like reverse complement, for example, where you take a sequence and you physically reverse it, and then you swap A's and T's and C's and G's. That's a very common operation that's done on sequences in basically any genomics application.
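The two operations just described, reverse complement and 2-bit k-mer encoding, can be sketched in ordinary Python as follows. This is not Seq code or Seq's API; the function names and the A=0, C=1, G=2, T=3 bit assignment are assumptions made for the example.

```python
# Complement table; N (ambiguous base) maps to itself.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

# One possible 2-bit code per base (an arbitrary but common choice).
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def encode_kmer(kmer: str) -> int:
    """Pack a k-mer into an integer, 2 bits per base (raises KeyError on N)."""
    value = 0
    for base in kmer:
        value = (value << 2) | BASE_TO_BITS[base]
    return value


def decode_kmer(value: int, k: int) -> str:
    """Unpack a 2-bit encoded k-mer of known length k."""
    return "".join(BITS_TO_BASE[(value >> (2 * (k - 1 - i))) & 3] for i in range(k))


print(reverse_complement("AACG"))
print(encode_kmer("ACGT"), decode_kmer(encode_kmer("ACGT"), 4))
```

Applying `reverse_complement` twice returns the original sequence, which is the kind of algebraic identity a domain-aware compiler can exploit.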
Again, there are these string matching algorithms like Smith-Waterman or Hamming distance calculations. So these are the things that we thought were useful to implement in Seq. I think the other benefit of Seq is that it's a compiler, so we can actually do higher level optimizations that a library couldn't necessarily do. So for example, if you take reverse complement, reverse complement has various algebraic rules. If you take a sequence and you reverse complement it twice, then you get the same sequence back. So a compiler that actually knows about these things can exploit those algebraic rules and do optimizations that a library couldn't do.
[00:12:58] Unknown:
And then for the target audience of Seq, is it largely people who are working in the sciences who just need to be able to process the data that they're coming from? Or is it also common that there might be a set of programmers on staff who work with the domain experts to be able to understand the scope? And what challenges does that pose in terms of how to approach the interface and workflow for the people who are actually using Seq in their day to day work?
[00:13:51] Unknown:
Sure. So I think Seq as a language is in sort of a unique place because it, in a way, bridges the gap between a really usable language like Python and a performant language like C or C++. And because of that, if you're someone who is not necessarily an expert in programming, but is, let's say, a biologist or something like that, and you're trying to analyze some really large dataset, I think that's someone who we actually designed the language for. It's supposed to be something that someone who doesn't have a background in programming or software engineering can pick up relatively easily and use. But at the same time, someone who is, let's say, an algorithmist designing these really performance critical applications for sequence alignment or sequence assembly or what have you, that's also someone who I think could benefit from Seq, because we do a bunch of these domain specific optimizations that are pretty difficult to replicate by hand. So I think both sides of the spectrum are people who we were interested in targeting. Again, that's sort of a tall order, but definitely something that we've thought a lot about.
[00:15:03] Unknown:
For the overall workflow itself, what are some of the tooling elements that you have implemented to simplify the overall workflow, and how might the actual might one impact. But what are some of the other elements that might differ in terms of how people will approach building things with Seq versus what they were doing in Python?
[00:15:20] Unknown:
Sure. I think that's probably the area that we need to work on the most. For example, Seq has a debug mode that gives you nice stack traces and all sorts of things like that, just like Python does. But again, Seq compiles to LLVM, so the debugging process for Seq today is definitely more challenging than it is in Python. So I think that's something that we're hoping to work on in the near future, to give people the tooling and the debugging support that they're familiar with if they come from a Python background. I think the other interesting thing in that regard for Seq is that because it's a domain specific language, we could actually do things like domain specific debugging, for instance. So if you're processing sequences and you're aligning sequences, we could potentially implement something like a domain specific debugger that lets you visualize alignments, visualize sequences, and all that kind of stuff. So I think we're not quite where we want to be in terms of debugging and tooling, and the primary reason for that is really just manpower. We're just sort of limited on manpower. But again, that's something that we would definitely want to work on.
[00:16:37] Unknown:
And because of the fact that the syntax is very similar to Python, are you able to lean on any of the existing tools such as linters or static analysis that are available in the Python ecosystem and modify them to work with Seq as a sort of shortcut to get some of those tooling elements in? And then another aspect of the workflow and tooling is the question of being able to test the applications that you're developing with Seq. And I'm wondering what facilities you either currently have or are planning on.
[00:17:13] Unknown:
Sure. So in terms of existing Python tools, that's not something we've actually explicitly explored, but I think it's something that definitely makes a lot of sense. Again, the syntax is almost identical to Python. We've added a few extra language features like pipelines and pattern matching, so I think we might have to make some modifications to some of those existing tools that you mentioned, but that's definitely in the realm of possibility. For testing, we actually have a testing framework in Seq that is kind of similar to what it is in Python. Essentially, what you can do is annotate a function with the test decorator, and then you can have asserts in that function that won't terminate the program. They'll actually just fail the test if the assert fails. But I think it would be really nice to expand that and maybe even implement some of the testing facilities that exist in regular Python in Seq. And again, the fact that we haven't done that is really just because of manpower. We just haven't had anyone to work on that yet.
[00:18:16] Unknown:
For the actual implementation of Seq itself, I know that the syntax is, as you said, largely similar to Python. And I'm wondering how you approached the actual construction of the language and the compiler, and if you are able to leverage any of the elements of the CPython implementation or at least use it as a reference for things like the tokenizer or anything like that for being able to build the parser and the compilation, and some of the underlying architectural decisions that you've made as you have gone through implementing Seq? Sure.
[00:18:52] Unknown:
So a lot of the language design, thankfully, was essentially just dictated by Python. So Python behavior and semantics and, of course, syntax are basically all present in Seq. On top of that, like I said, we've added a few features like pipelines, for instance, which are a pretty natural abstraction for thinking about processing genomic data, and pattern matching, which you might find in other functional languages. The novel aspect of that in Seq is that we actually allow genomic pattern matching, so you can match sequences and you can have wildcard bases or stars, like a regular expression, stuff like that, and some of the new types that I talked about. We don't reuse any components of CPython. I think the parser is actually something that we could possibly reuse. Again, we'd have to add support for some of these other language features, but Python's parser is something that we could potentially build on. In terms of the runtime, our philosophy when designing the runtime was essentially just to make it as minimal as possible. So all of the core types like lists, dictionaries, and sets, and even a lot of the functionality of more primitive types like integers and floats and strings, are all implemented in Seq. So we have a small runtime library mainly for garbage collection and a few hand optimized, SIMD optimized functions for sequence alignment. But our goal has basically been to reduce that as much as possible and implement everything that we can in Seq itself. So that's sort of been our design methodology so far.
[00:20:25] Unknown:
Because of the fact that you are using the Python syntax as a reference, I'm curious how evolutions to the language are going to be reflected in Seq itself and what plans you have for being able to maintain future compatibility as the language evolves and any potential challenges that you anticipate as a result of that?
[00:20:52] Unknown:
Yeah. I think that is definitely a challenging problem. Like, as you make changes to Python, we'll sort of have to play catch up a little bit, at least for the short term. I think, again, your suggestion of using CPython's parser, for example, might actually help in that regard because it'll possibly make it easier to integrate some of these language changes. But I think, Seq being a separate language that's essentially independent from CPython, that's definitely a challenge to think about.
[00:21:24] Unknown:
Yeah. And you mentioned that when you first started approaching this problem, you spent a year going in a different direction before you ended up with the current direction of building Seq. And from the time that you actually started building this custom runtime, I'm curious how the overall scope or direction of the project has evolved and how much that differs from your original conceptions of it.
[00:21:38] Unknown:
Sure. So I think Seq today really can be used as a general purpose language. It's probably best suited for scientific applications that don't rely on Python's dynamic features, some of which we've had to get rid of. But if you have some scientific application that isn't dynamic in nature, I think Seq could be a good fit even if it's outside of bioinformatics. And, again, being a low level language, we've talked a lot about extending to other subfields of bioinformatics.
So right now, we're focused mostly on computational genomics, but, you know, there's more to bioinformatics than that. There's phylogenetics, for example, or population genomics. And I think SEEK is in a place now where we can potentially target those fields as well. And again, when we started, like I said, we started by thinking about a much higher level language. And I think if we had gone that route, it would have been much more difficult for us to expand to some of these other areas. And like I said, even in other domains outside of bioinformatics, I think SEEK can potentially be, be applicable. So at this point, again, our our domain of interest is still bioinformatics, but I think SEEK can definitely be a useful tool even even outside of that field. And so as far as the actual bioinformatics
[00:22:54] Unknown:
field, I know that when looking through the documentation for this project, it alludes to the fact that there is a lot of sort of messy code or inconsistent approaches to problem solving in terms of the way that the software is developed and challenges in terms of the speed of execution. And I'm wondering what impact you anticipate SEEK having on the overall domain of bioinformatics and genomics, and some of the standards that could be implemented in terms of the training of people who are working in those fields to improve the overall capacity for being able to run these analysis. And the impact that this increased speed has on their ability to perform meaningful research?
[00:23:36] Unknown:
Sure. So we hope to see a lot more tools and methods, being written in SEEK in the coming months years. And I think the benefit that would have is, not only would it sort of give everyone a unified framework for bioinformatics software development, but, optimizations and features that we add in future versions of the c compiler could even be applied retroactively to existing software. Right? So if someone writes piece of software today in c, and however many months down the line we add support for a GPU back end or FPGA back end, then our hope is that that software that was written today could just run as is on on those back ends or same for any other compiler, optimizations that we add. And I think, really, like, in an ideal world, Seek would allow bioinformatics software to sort of keep pace with the growing data that, again, is sort of really outpacing, you know, Moore's Law. And I think by 2025, it's predicted that genomic data will even have surpassed Twitter and YouTube data. So it's a really, really fast growing dataset, and I think Seek sort of gives us at least 1 tool to keep pace with that. And in terms of the datasets,
[00:24:45] Unknown:
do you find that there are a large volume of information that's available in the public domain for people to be able to do their own experimentation and test out seek with those datasets? Or is it something where a lot of the information is held in data sets by different companies working in the biotech industries or the pharmaceutical industries? And any challenge that you've seen in terms of being able to make Seek available and get it in the hands of people who are doing these types of research or any collaborations that you are
[00:25:19] Unknown:
either currently engaged with or seeking to be able to get that feedback to help evolve the language? Sure. So I think in terms of data, it's sort of a combination of of both. There's definitely a lot of publicly available data out there. So for example, there is, there are many databases. 1 of them is SRA, sequence read archive, and that has a ton of publicly available sequencing data. So I think the the there there's a lot of data out there. Yeah. In terms of collaborations, Google Cloud Life Sciences actually recently reached out to us to talk about running SEEK on the cloud. So that's something that, we're we're actually starting to work with them to develop a cloud back end for, for SEEK. And I think that's something that we're really excited about, especially, again, you know, as the scale of the data increases to have something like a distributed computing back end for SEEK,
[00:26:10] Unknown:
and in general, a compiler that can perform not only single machine optimizations, but optimizations that are relevant to a distributed computing environment. I think that that's a really, really powerful tool. So as you have been building the Seq language and working on improvements and experimentation with it and working with some of the end users, what have you found to be some of the most interesting or unexpected or challenging aspects of building a language for this problem domain and just some of the overall elements of language and compiler design? I think definitely. That's something I alluded to earlier. I think in terms of bioinformatics,
[00:26:44] Unknown:
identifying the core primitives and operations in bioinformatics and computational genomics is actually really, really hard. Again, because you could draw these very coarse-grained boxes around things like alignment or indexing or hashing and stuff like that. But when you actually look at the concrete implementations of these things, there are some subtle details, and those details actually have algorithmic implications. So it's really hard to identify those primitives. But I think we're on the right track in giving people a low level infrastructure to implement these things themselves. So that, again, was one of the motivations behind going with a lower level language rather than a higher level DSL. In terms of actual compiler design, for me, dealing with the generic types and duck typing of Python in a statically typed context has actually been really, really hard. What we're doing in Seq is essentially taking a Python program that's dynamically typed by nature and imposing a static type system on top of that, and that can lead to some really difficult to resolve corner cases. So on the programming language and compiler design side, that's actually been a really hard problem. And we're still continuing to close the language gap with Python. So there are still some cases that we're working on there. So that's sort of, I would say,
[00:28:03] Unknown:
on both sides, on the bioinformatics side and the programming language side. Those are sort of the two biggest challenges for me at least. Do you think that there are other problem domains that would benefit from having a similar runtime available to them? And do you think that there is just an overall benefit to having custom languages for some of these different research areas or different use cases versus having general purpose languages that are broadly applicable, but not necessarily optimized? And what do you see as the trade-offs and the overall spectrum of programming language availability for solving some of these interesting and challenging problems?
[00:28:43] Unknown:
So I think it varies a lot by domain. Bioinformatics is a unique domain in that a lot of practitioners are not software engineers or programmers or computer scientists by trade. So I think for bioinformatics, something like Seq, going the Python route and implementing something that behaves like Python and has the semantics of Python was really useful. In terms of general purpose languages versus domain-specific languages, I think both definitely have their use cases. I think with DSLs, it's important for interoperability with existing libraries and systems to be a priority. And again, this is something I mentioned earlier, but I think we don't wanna fall into the trap of not being interoperable with other systems, so that someone who uses Seq is unable to use, let's say, NumPy or TensorFlow or some of these other libraries that exist for Python. So I think DSLs are really good, especially if performance is critical.
But at the same time, I think interoperability needs to definitely be a priority, and that's something that,
[00:29:52] Unknown:
we've definitely had in the back of our minds as we've worked on Seq. So what do you have planned for the future of Seq in the near to medium term? And what are some of the overall impacts that you hope to have as it progresses?
[00:30:06] Unknown:
So I think in the short term, definitely, we're still working to close the language gap with Python. So right now, we have a unidirectional type checker. So in most cases we can do type deduction. If you have a equals 2 plus 2, for example, we can tell that a is an int. That's a super easy case. But Python actually has some more complicated cases. For instance, if you use lambdas in Python, lambdas aren't typed. So if you sort something, you just have, like, list dot sort and some lambda. You actually need bidirectional type checking to resolve the type of that lambda. So we're in the process of actually implementing that right now, along with some other things like optional types, which will allow us to deal with Nones, for example. In Python, you can assign None to anything; in a statically typed context that's a little bit more tricky. So that's another thing that we're working on. So just sort of closing the language gap a little bit more. Beyond that, we're working on a new intermediate representation that's a little bit higher level than LLVM IR. And our hope is that that will actually allow us to do a lot more Python-specific and domain-specific optimizations. So in Python, let's say, if you have a case where you're concatenating 3 strings, that's something that we could potentially recognize in this new intermediate representation and optimize. And on the bioinformatics side, again, that case that I mentioned where you reverse complement something twice, that's something that we can potentially catch if we have that IR as well. So that's another project we're really excited about. Another thing I alluded to earlier was different back ends like GPU and FPGA and seeing how those things interact with our various domain-specific optimizations.
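The typing cases described here can be sketched in plain Python. This is an illustration only: it is ordinary Python rather than Seq code, it is not taken from the Seq compiler, and the names `words`, `first_long`, and `reverse_complement` are hypothetical. The comments describe what a static type checker would have to work out in each case.

```python
from typing import List, Optional

# Easy case: the type of `a` follows directly from the two int literals.
a = 2 + 2

# Harder case: the lambda carries no annotations. The type of `w` is only
# known from the list being sorted, so the information has to flow from
# the call site into the lambda.
words = ["GATTACA", "ACGT", "TT"]
words.sort(key=lambda w: len(w))

# None case: the function returns either a string or None, which a static
# type system typically models with an optional type.
def first_long(seqs: List[str], min_len: int) -> Optional[str]:
    for s in seqs:
        if len(s) >= min_len:
            return s
    return None

# Domain-specific case: reverse complementing twice gives back the input,
# so a compiler that recognizes the pattern could drop both calls.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    # Complement each base, then reverse the result.
    return seq.translate(_COMPLEMENT)[::-1]
```

In standard Python all of this runs without any type information; the point of the sketch is only to show where a statically typed compiler needs extra inference.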
That's another thing we're very excited about exploring. And, like I mentioned, working with Google Cloud Life Sciences to run Seq on the cloud and seeing what we can do there. I think that opens up the door to a whole bunch of other domain-specific optimizations that a compiler like Seq could do. So those are sort of the projects that are ongoing right now that we're really excited about. And, you know, just getting more people involved in Seq, I think. So far, we've had a really limited number of people working on it. It's myself and the co-first author on our paper, Ibrahim Numanagić, who was a postdoc at MIT, and now he is a professor at the University of Victoria. So it's mainly been the two of us with undergrads, UROPs, working with us. So just getting more people involved, I think that's really something we're very excited about as well. For the future success and sustainability of the project, what are some of the risks that you think
[00:32:40] Unknown:
could pose a threat in terms of its future viability? And what are your thoughts in terms
[00:32:48] Unknown:
of the level of involvement that you're going to have once you have finished your PhD program? So for the first part of your question, I think I'll just have to come back to interoperability because I think that's such an important point. We really wanna make sure that if someone uses Seq, they're still able to use these other Python tools and libraries and frameworks that exist today. I think that's something where we're not quite at the place we wanna be right now. We have Python interoperability in Seq. So if you have some Python function that's pure Python, you can call it right now in Seq. And we do all of the marshaling between Python types and Seq types. But I think, you know, as we talked about, tooling and debugging support is something that we're actively working on. So in terms of viability, I think that's a really important aspect. In terms of my own involvement, I think this is a project where we have a huge laundry list of ideas and things we wanna explore. So I'm not yet 100% sure about what my future plans are gonna be, but I definitely envision working on this even after I complete my PhD. So I think this is definitely a long term project, and I'm really excited about it. So yeah. Are there any other elements of the Seq project itself or bioinformatics
[00:34:04] Unknown:
or language design that we didn't discuss that you'd like to cover before we close out the show? I think we did a pretty good job, to be honest, covering all our bases. Well, for anybody who wants to follow along with the work that you're doing or get in touch or contribute to the project, I'll have you add your preferred contact information to the show notes. And with that, I'll move us into the picks. And this week, I'm going to choose board games as a way to have something to do, particularly in these interesting times. So I definitely recommend taking a look at your board game closet or maybe contributing to it. One that I've been enjoying playing with my kids is a game called Labyrinth. And I'll also mention BoardGameGeek as a great site for being able to discover and read reviews on different board games as you're determining what new ones to add to your collection. And so with that, I'll pass it to you, Ariya. Do you have any picks this week? So I would have to go with this documentary that I recently watched called Breakthrough, which is,
[00:34:57] Unknown:
it's not a new movie. I think it was released in 2019, but it's just one that I happened to recently watch. It's about Jim Allison, who's a scientist whose work led to new cancer treatments, and he ultimately won the Nobel Prize because of it. He's a really unorthodox scientist, I would say, and the movie sort of shows his perseverance throughout his research career.
[00:35:19] Unknown:
And it was a really interesting movie. I'd recommend it. Well, thank you very much for taking the time today to join me and discuss the work that you've been doing with Seq. It's definitely a very interesting project that continues to grow and gain some adoption. So I appreciate all the work that you're doing there, and I hope you enjoy the rest of your day. Thank you so much for having me. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Ariya Shajii and Seq
What is Seq and Motivation Behind It
Challenges in Bioinformatics and Seq's Approach
Transitioning from Python to Seq
Tooling and Debugging in Seq
Impact of Seq on Bioinformatics
Future Plans and Challenges for Seq