Summary
Serverless computing is a recent category of cloud service that provides new options for how we build and deploy applications. In this episode Raghu Murthy, founder of DataCoral, explains how he has built his entire business on these platforms. He explains how he approaches system architecture in a serverless world, the challenges that it introduces for local development and continuous integration, and how the landscape has grown and matured in recent years. If you are wondering how to incorporate serverless platforms in your projects then this is definitely worth your time to listen to.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- And to keep track of how your team is progressing on building new features and squashing bugs, you need a project management system designed by software engineers, for software engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Podcast.init listeners get 2 months free on any plan by going to pythonpodcast.com/clubhouse today and signing up for a trial.
- Bots and automation are taking over whole categories of online interaction. Discover.bot is an online community designed to serve as a platform-agnostic digital space for bot developers and enthusiasts of all skill levels to learn from one another, share their stories, and move the conversation forward together. They regularly publish guides and resources to help you learn about topics such as bot development, using them for business, and the latest in chatbot news. For newcomers to the space they have the Beginners Guide To Bots that will teach you the basics of how bots work, what they can do, and where they are developed and published. To help you choose the right framework and avoid the confusion about which NLU features and platform APIs you will need they have compiled a list of the major options and how they compare. Go to pythonpodcast.com/discoverbot today to get started and thank them for their support of the show.
- You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- The Python Software Foundation is the lifeblood of the community, supporting all of us who want to run workshops and conferences, run development sprints or meetups, and ensuring that PyCon is a success every year. They have extended the deadline for their 2019 fundraiser until June 30th and they need help to make sure they reach their goal. Go to pythonpodcast.com/psf2019 today to make a donation. If you’re listening to this after June 30th of 2019 then consider making a donation anyway!
- Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
- Your host as usual is Tobias Macey and today I’m interviewing Raghu Murthy from DataCoral about his experience building and deploying a personalized SaaS platform on top of serverless technologies
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by giving a brief overview of DataCoral?
- Before we get too deep can you share your definition of what types of technologies fall under the umbrella of "serverless"?
- How are you using serverless technologies at DataCoral?
- How has your usage evolved as your business and the underlying technologies have evolved?
- How do serverless technologies impact your approach to application architecture?
- What are some of the main benefits for someone to target services such as Lambda?
- What is your litmus test for determining whether a given project would be a good fit for a Function as a Service platform?
- What are the most challenging aspects of running code on Lambda?
- What are some of the major design differences between running on Lambda vs the more familiar server-oriented paradigms?
- What are some of the other services that are most commonly used alongside Function as a Service (e.g. Lambda) to build full featured applications?
- With serverless function platforms there is the cold start problem, can you explain what that means and some application design patterns that can help mitigate it?
- When building on cloud-based technologies, especially proprietary ones, local development can be a challenge. How are you handling that issue at DataCoral?
- In addition to development this new deployment paradigm upends some of the traditional approaches to CI/CD. How are you approaching testing and deployment of your services?
- How do you identify and maintain dependency graphs between your various microservices?
- In addition to deployment, it is also necessary to track performance characteristics and error events across service boundaries. How are you managing observability and alerting in your product?
- What are you most excited for in the serverless space that listeners should know about?
Keep In Touch
Picks
- Tobias
- Raghu
Links
- DataCoral
- Perl
- Airflow
- Serverless Computing
- DynamoDB
- Aurora
- SNS
- SQS
- Lambda
- S3
- API Gateway
- EMR
- Apache Hive
- AWS Glue
- RedShift
- SnowflakeDB
- Hadoop
- Function As A Service
- Distributed Systems
- Conway’s Law
- SRE == Site Reliability Engineer
- Rollbar
- AWS Batch
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or you want to try a project you hear about on the show, you'll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, node balancers, and a 40 gigabit public network, all controlled by a brand new API, you've got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models or running your CI pipelines, they just launched dedicated CPU instances. In addition to that, they just launched a new data center in Toronto, and they've got one opening in Mumbai at the end of 2019.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of the show. And to keep track of how your team is progressing on building new features and squashing bugs, you need a project management system that can keep up with you, designed by software engineers for software engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of prebuilt integrations, and a simple API for crafting your own. With such an intuitive tool, it's easy to make sure that everyone in the business is on the same page.
Podcast.__init__ listeners get 2 months free on any plan by going to pythonpodcast.com/clubhouse today and signing up for a free trial. And bots and automation are taking over whole categories of online interaction. Discover.bot is an online community designed to serve as a platform-agnostic digital space for bot developers and enthusiasts of all skill levels to learn from one another, share their stories, and move the conversation forward. They regularly publish guides and resources to help you learn about topics such as bot development, using them for business, and the latest in chatbot news.
For newcomers to the space, they have the Beginner's Guide to Bots that will teach you the basics of how bots work, what they can do, and where they are developed and published. And to help you choose the right framework and avoid the confusion about which NLU features and platform APIs you will need, they have compiled a list of the major options and how they compare. Go to pythonpodcast.com/discoverbot today to get started and thank them for their support of the show. And you listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season.
We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have already been announced, and super early bird registration is available until July 26th, where you can get up to $300 off, or the early bird pricing for $200 off is available through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to pythonpodcast.com/conferences to learn more about this and the other conferences and take advantage of our partner discounts when you register.
And, also, the Python Software Foundation is the lifeblood of the community, supporting all of us who want to run workshops and conferences, run development sprints or meetups, and they also ensure that PyCon is a success every year. They have extended the deadline for their 2019 fundraiser until June 30th, and they need help to make sure they reach their goal. Go to pythonpodcast.com/psf2019 today to make a donation. And if you're listening to this show after June 30, 2019, then consider making a donation anyway. And you can visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And if you have any questions, comments, or suggestions, I'd love to hear them.
And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
[00:04:13] Unknown:
Your host as usual is Tobias Macey. And today, I'm interviewing Raghu Murthy from DataCoral about his experience building and deploying a personalized SaaS platform on top of serverless technologies.
[00:04:24] Unknown:
So, Raghu, could you start by introducing yourself? Hi, Tobias. Thanks for having me today. My name is Raghu Murthy. I'm the founder and CEO of DataCoral, and I'm glad to be here.
[00:04:35] Unknown:
And so I know that I've interviewed you about your work on DataCoral for the Data Engineering Podcast, but I'm curious if you can talk a bit about how you first got introduced to Python, and maybe just briefly talk about some of the ways that it gets used at DataCoral.
[00:04:53] Unknown:
Yeah. So, just a little bit about my background as well. I've been an engineer for quite a long time at companies like Yahoo and Facebook. For the first few years of my career, I was mostly working in C++ and Perl. I loved Perl because of how succinctly you could represent a lot of business logic, but it was also kind of a write-only language. If you wanted to write similar logic again, you'd just write it all over; you wouldn't try to reuse any code or anything like that. So that was kind of super interesting, but this was all at Yahoo, mostly. And then at Facebook, I got introduced to Python in the data infrastructure team, because a few of the tools that were being built on top of Hive and Hadoop, especially around the ETL side, were built in Python. I found the language to be, initially at least, cumbersome to work with. It was pretty opinionated about how code should look to the human versus just the computer, which was a pretty dramatic change.
Throughout my time at Facebook, of course, we were writing a lot of Java code for these big data systems, even C++ for some period. But Python was mostly around the tooling side. There was an ETL system called Databee, which was a Python configuration-driven ETL tool. The next one was called Dataswarm, which was also Python, and that was probably an inspiration for Airflow, which is a pretty popular open source ETL and DAG-management tool right now. At DataCoral, we use Python mostly for helping our data scientists write their own custom code to build out their transformations.
Frankly, at DataCoral we have just about gotten started using a lot more Python. Earlier, it was mostly simple tools. But once we decided that we wanted to support data scientists and allow them to do more than just SQL, Python was the obvious choice. And we're working on some pretty interesting and fun features that we hope to put out there soon. And
[00:07:04] Unknown:
so as I mentioned, we talked a bit about your work at DataCoral on the Data Engineering Podcast, and I'll add a link to the show notes for anybody who wants to go back and listen to that. But for anyone who hasn't listened to that conversation, can you give a brief overview of what it is that you do at DataCoral? Yeah. So at DataCoral, our goal is to make
[00:07:25] Unknown:
data scientists and data analysts, and even citizen data engineers, self-sufficient. We help automate data pipelines so people who know just SQL can actually build complex data flows using a SQL-like declarative language. These data flows end up getting compiled into serverless data pipelines, and those pipelines capture data quality metrics, data provenance, and things like that, which make it very easy for data scientists and data analysts to get a good sense of how their data is doing by just focusing on the business logic. And we've been around for a little over 3 years now. So that's, I guess, a good bit of overview.
[00:08:11] Unknown:
And so the majority of the technology stack that you have built at DataCoral is based on serverless technologies. And before we get too deep into the technical aspects of how you're leveraging those capabilities, I'm wondering if you can just share your definition of the term serverless, and the types of services and technologies that fall under that umbrella. Yeah. Definitely. So clearly, I mean, serverless does not mean that there are no servers. It just means that
[00:08:44] Unknown:
if you are using any serverless technology, you're not really thinking about anything that is related to servers: things like provisioning them to handle capacity, worrying about whether I'm paying for idle resources, worrying about whether whatever I provision is actually gonna work as my application scales, how do I deal with fault tolerance. If I'm not having to worry about any of those things by using a certain set of services, those services are what are serverless to me. In terms of the actual technologies themselves, if you think about it, a SaaS application is kind of the ultimate serverless technology for its user; you just use it. You don't worry about how it got deployed or how it got provisioned and so on. As a developer, if you're not having to worry about all the provisioning and stuff like that, then you're probably using a public cloud, or a cloud that offers a bunch of platform-as-a-service kind of services.
So if you're using a cloud like AWS, then you have databases like DynamoDB or Aurora. You may have streaming systems like Kinesis. You may have other services like SNS and SQS. Those are all, at least in my mind, serverless technologies for a developer. And, of course, the big one that we use a lot is AWS Lambda, and that is for compute. Right? The others offer different kinds of functionality beyond straight-up compute. And, of course, the other big one is S3, which is mainly for storage. And so you mentioned that Lambda
[00:10:19] Unknown:
gets used pretty heavily at DataCoral, but I'm wondering if you can talk through some more of the types of services and technologies that you're using to build your particular application stack, and how your usage of those technologies has evolved as they have matured and as more of them have become available over the past few years? Yeah. So when I first got started, I mainly looked at AWS Lambda and, of course, S3 for storage as the 2 main services. We also used things like API Gateway to be able to provide an events endpoint and things like that. And
[00:10:56] Unknown:
the goal, as I mentioned earlier, is for DataCoral to provide a way for people to specify end-to-end data flows: to collect data from different places, organize that data in different kinds of query engines, and even publish the transformed data into applications, into production databases, and so on. The way we have picked these serverless technologies has been by carefully looking at every single one of them. AWS, for example, keeps providing newer and newer of these PaaS services, so we have always looked at every single one of them to see what is the best possible way to use each of these serverless technologies to build a really robust end-to-end data infrastructure stack. I'll give you an example of how it has evolved. Earlier, we had stored data in S3, and we said, okay, we would like to make that data queryable directly from S3. Back then, there was EMR, and there was Hive on top of EMR. So we said, okay, we'll actually spin up a Hive metastore, stick all of the metadata into that Hive metastore, and then allow our customers to use EMR to query that data sitting in S3. But then about a year in, Amazon offered a service called the Glue Data Catalog, which is essentially the Hive metastore offered as a service by Amazon. The moment we looked at it, we said, why are we spinning up a database and a server to run this service? We might as well just use the Glue Data Catalog. So we have an opinionated view of how these data flows should be built out and what the interface should be for our users. And given the technologies that are available, we pick and choose the ones that we feel are the best way to provide a really robust service. Then, as and when newer ones come along, we are able to replace small parts of it while still providing the same service, and of course improving the robustness, or even the cost, of operating DataCoral itself. As you can imagine, there are several other examples, but I think this is a good one. And so
[00:13:10] Unknown:
for developers, they're typically used to thinking about their application design in the context of deploying to a server, or even something like a container, where there is some sort of resource through which they're able to gain access to different operating system primitives, or where you can have multiple services running adjacent to each other. And I'm curious how you have found your experience of building on serverless technologies to influence your overall approach to application architecture and systems design. Yeah. Definitely. So in terms of the
[00:13:49] Unknown:
application architecture itself, in general, what we are building is a distributed system. Right? There are huge amounts of data that we need to process, so you cannot do it all on one server. You have to split that computation, or that data storage and so on, across multiple machines. But we didn't wanna be in the business of managing those machines. When you think about data infrastructure, there are 3 distinct kinds of operations that we have identified in a data infrastructure stack, or in a data flow, if you will. You're collecting data from different places, and this operation of collecting data can, in some sense, be infinitely parallelized. What that means is I can break this piece of pulling data from different places into bite-sized chunks that can then be run inside of something like an AWS Lambda. But then you might end up in a situation where there's a transformation that is trying to process lots and lots of data, and the logic is pretty complicated. In those cases, we would just offload that into any of the data warehouses that are available. It could be a big-data query engine like Athena, or it could be Redshift, or it could be Snowflake. Those databases themselves do require some amount of cluster management, if you will, but that is necessary for complicated workloads. When you think about what an end-to-end data infrastructure stack does, a lot of it can be represented in terms of these bite-sized operations, and those are the ones that we have moved to serverless technologies.
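To make the bite-sized-chunk idea concrete, here is a minimal sketch, in Python, of how a collection job might be fanned out into small windows, each handed to its own Lambda invocation. The function and pipeline names here are hypothetical illustrations, not DataCoral's actual code:

```python
import json
from datetime import datetime, timedelta

import boto3

lambda_client = boto3.client("lambda")

def fan_out_collection(start: datetime, end: datetime, window_minutes: int = 5) -> None:
    """Split a collection job into bite-sized windows, each small enough to
    finish comfortably within a single Lambda invocation."""
    cursor = start
    while cursor < end:
        window_end = min(cursor + timedelta(minutes=window_minutes), end)
        # Asynchronous invoke: Lambda scales the workers out for us, so
        # there is no cluster to size or schedule.
        lambda_client.invoke(
            FunctionName="collect-slice",  # hypothetical worker function
            InvocationType="Event",
            Payload=json.dumps(
                {
                    "window_start": cursor.isoformat(),
                    "window_end": window_end.isoformat(),
                }
            ),
        )
        cursor = window_end
```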
And what that has also done is essentially freed us up from doing a significant amount of cluster management, which has pretty significant implications for the overall application architecture itself. If you're trying to provide a SaaS tool or a SaaS application, you're typically trying to build a multi-tenant architecture, where the same installation is able to handle multiple customers. When you think about building a SaaS application, in general, the way it is built out is you have one installation with all of your application code, and then you have multiple customers using that same installation. So all of your customer data is actually flowing into the same infrastructure.
This gives you things like the ability to scale across customers. It gives you economies of scale; you can reuse resources across different customers. Those are all the good things about a multi-tenant architecture. But that also means that you're having to deal with the vagaries of cluster management. Let me give you an example. Say you have a single installation of the application and multiple customers who are using it, and you can commingle all the data and all the processing. That's fine. And then your application starts blowing up, so you're getting more and more customers.
Now you cannot indefinitely increase the size of your installation, adding more compute and more storage all the time. At some point, you'll have to break it up into, let's say, 2 installations. And at that point, you have this decision of, okay, which customers are gonna be in one installation and which in the other? Play this forward and you just end up with a bin-packing problem: you have a small number of installations, and you have customers that you need to allocate to these different installations. And these customers may be growing or shrinking.
Now you're spending most of your time trying to organize customers within your installations, and there's a huge amount of automation that needs to get built out. I know this because this is literally the problem that I had to solve at Facebook back in the day, where we had one giant Hadoop cluster and multiple teams using it. We had to carve out pieces and move them into different data centers and so on. So for these kinds of SaaS applications, the ideal situation would be to have essentially one installation per customer, so that you can scale them up and down based on the customer's usage.
But that adds a huge amount of overhead around management and maintaining fault tolerance and so on. Now, when you think about a serverless architecture for your application, spinning up new installations is actually pretty straightforward. It doesn't take too much time. Each installation can do almost lights-out operations in terms of scaling. And you're not having to centralize all of your customer data into one place. So this allows somebody who has built an application in a purely serverless manner to, in fact, deploy isolated installations for each of their customers.
So you end up having multiple isolated installations instead of a single installation that is multi-tenant. Our belief is that serverless technologies have made it really possible to build these kinds of, if you will, private SaaS applications, where customers can decide to use different applications, but all of those applications run within their own environments, which means that no data has to leave their systems. So there's a pretty strong data security and data privacy argument to be made. And, yeah, as you mentioned, having multi-tenant services
[00:19:20] Unknown:
become increasingly difficult as you deal with customers that might be operating at orders of magnitude different scales from each other, because then you start dealing with issues of managing priorities for access to different resources within your environment, or ensuring that you have allocated enough capacity for certain customers and that they're not going to stomp on the customers that have lower requirements. But, as you mentioned, having everything delegated to serverless technologies, or being able to deploy directly to customer environments, removes having to even think about those different edge cases that are easy to overlook from the outside but incredibly painful once you start having to deal with them on your own.
[00:20:09] Unknown:
Yeah. Absolutely. One of the bigger challenges around building multi-tenant systems is this whole notion of a noisy neighbor. There could be spikes of activity by one or two customers, which then causes the entire cluster to get into a bad state. Then you have to spend a bunch of time nursing the cluster back to health. And, by the way, there'll be a lot of workloads that have piled up while the cluster was in a bad state, so you're then having to recover the whole thing.
But while you're recovering, there's potentially more data coming in. So you increase the size while you're recovering and then spin it back down, because you don't wanna pay for too many resources. This is literally the life of a really high-scaling
[00:20:57] Unknown:
kind of multi-tenant architecture application. And so you mentioned that a significant portion of your application relies on the AWS Lambda service, which has been referred to as a function-as-a-service platform, and there have been other incarnations of that paradigm in other cloud providers and in open source technologies. But I'm wondering what you have found to be some of the main benefits of targeting services such as Lambda and other function-as-a-service platforms,
[00:21:27] Unknown:
and what your litmus test is for determining whether a given project is a decent fit for something like that? Yeah. So when you think about a function as a service, let's define what that actually means. It just means that the application developer writes a piece of code, uploads it to the service, and instructs the service to run that function whenever certain events happen. The service is then able to figure out how many resources are needed in order to run that function. If the number of invocations increases dramatically, the service automatically scales up and down the resources needed to run those functions. Now, this is a very different philosophy for building distributed systems. The kind that I'm familiar with is these Hadoop-like systems, where you have a single distributed system, and that distributed system needs to handle workloads of very different types. You might have jobs that are processing multiple terabytes of data, so they run for a really long time.
Then you might have people coming in and wanting quick results from their jobs, so you have jobs with interactive latency requirements. There might also be jobs that have deadlines, as in, I want this job to finish by this time every day. Most of the time spent in building these kinds of distributed systems is in finding the most intelligent way of doing resource management and resource scheduling so that you can handle all of these different kinds of workloads.
That is what makes building such distributed systems incredibly hard, and also a lot of fun. But Lambda, and function as a service in general, took the opposite approach. What it told the developer was: I don't care what your workload looks like. Break it down into a shape and size that fits me, and if you do that, then I'll run it in a really robust manner, I'll make it super scalable, and it'll be cheap for you to run it that way. For me, coming from this background of doing cluster management and dealing with different kinds of workloads, this was a very refreshing change in, in some sense, the philosophy of a service. So what I started looking at, and this is also, I guess, the litmus test as you call it, is whether there are operations that can actually be broken down into these bite-sized chunks of work. If so, they are amenable to Lambda.
Even though Lambda offers this philosophy of representing any kind of workload in the shape that it wants, when you're building a distributed system there are certain things you have to handle that don't really go away. When you think about building a distributed system, very simplistically, you're thinking about 5 things. One is how you deal with configuration management: how you do provisioning and how you configure those jobs. Then there's resource allocation and resource scheduling.
And that is something that we have outsourced to Lambda. But once you run these jobs, you need to do a whole bunch of state management to make sure that the jobs are running in the right way, and that there is enough state maintained that orchestration becomes easy to build. And, of course, orchestration itself has to be built out: how do you make sure that you run one task after the next, and so on? And then finally, and this is something that most distributed systems don't really do a great job of, there is visibility. How exactly is the system behaving? How are workloads behaving inside of a given system? So if you're using Lambda, you know that the resource management and resource scheduling aspect is something you don't have to worry about. But the other 4 things, configuration management, orchestration, state management, and providing visibility, are things that you still have to take care of.
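As a rough illustration of those remaining concerns (a sketch, not DataCoral's actual framework), a handler might wrap its business logic with configuration, state, orchestration, and visibility hooks like this; the table name and helper function are hypothetical:

```python
import logging
import os

import boto3

logger = logging.getLogger()
logger.setLevel(logging.INFO)

# State management lives outside the function, since Lambda containers
# are ephemeral; the table name here is hypothetical.
state_table = boto3.resource("dynamodb").Table(os.environ["STATE_TABLE"])

def handler(event, context):
    # Configuration management: settings arrive via the event and the
    # environment, not via anything provisioned on a server.
    job_id = event["job_id"]
    state_table.put_item(Item={"job_id": job_id, "status": "running"})
    try:
        result = run_business_logic(event)  # the only part that is "ours"
    except Exception:
        # Visibility: structured logs flow to CloudWatch automatically.
        logger.exception("job %s failed", job_id)
        state_table.put_item(Item={"job_id": job_id, "status": "failed"})
        raise
    state_table.put_item(Item={"job_id": job_id, "status": "done"})
    # Orchestration: the recorded state tells the rest of the system what
    # to run next; resource allocation and scheduling stay Lambda's problem.
    return {"job_id": job_id, "result": result}

def run_business_logic(event):
    return "ok"  # placeholder for the real work
```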
[00:25:46] Unknown:
And Lambdas also, as you said, come with a certain number of constraints that require you to build your services to fit them. Some of those are things like the cold start problem, where the first time an event fires to load a particular function there might be some latency as that function gets warmed up and loaded into the cache. But there are also things like the execution time limits, and some of the other constraints placed on Lambda functions, particularly for cases where, as you were mentioning, there might be some task that needs to churn through a lot of data. I'm wondering what you have found to be some of the edge cases that you've run into with Lambda, some of the different ways that you've worked around them, and other services that you have found useful to leverage in conjunction with Lambda?
[00:26:40] Unknown:
Yeah. So like I mentioned, Lambda is very particular about how big a particular task can be. When it was first built, my understanding is that it was mainly meant for building applications where you send a request and then there's a response, and in order to produce that response there's a small piece of code that needs to run; that was Lambda. And typically you want responses there in, let's say, a few hundred milliseconds. Then the problems around cold start become really significant.
So cold start becomes a very big problem if you want your latencies to be on the order of milliseconds. But then it turns out that if it is not invoked through an API Gateway or something like that, Lambda itself can actually run for several minutes. That is what we have leveraged quite a lot. When we first got started, Lambda could run for up to 5 minutes, and my thinking was you can do some serious damage in 5 minutes in terms of processing. But, of course, you have to do a bunch of state management to figure out what is the next Lambda to run and so on. In fact, nowadays Lambda can run for up to, I think, 15 minutes, which is really good.
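As an aside, one widely used mitigation for the cold start problem in the low-latency request/response case (distinct from the micro-batch approach described next) is to keep containers warm with a scheduled ping. A minimal sketch, assuming a CloudWatch Events rule sends a marker payload every few minutes:

```python
def handler(event, context):
    # The scheduled rule invokes this function with {"keep_warm": true};
    # returning early keeps a container warm without doing real work.
    if isinstance(event, dict) and event.get("keep_warm"):
        return {"warmed": True}
    return do_real_work(event)

def do_real_work(event):
    return {"status": "ok"}  # placeholder for the real handler logic
```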
But at the same time, in terms of the overall architecture of our data flows, we chose a micro-batch processing model. A Lambda function that could run for a few minutes was a perfect shoo-in for what we needed to build: near-real-time, really robust data pipelines where you don't have to wait too long for your data. And if something fails, you don't have to go back several hours and restart everything; you're just redoing the last 5 minutes' worth of processing. So we actually leveraged the main limitation of Lambda, the amount of time it can run, as a core advantage for the overall architecture of the data pipelines we were building.
And then what we ended up doing was saying, okay, given that we have to do bookkeeping every 5 minutes, what is the right data model to have for that kind of processing? We've built out a whole way of thinking about this micro-batch processing model and how it impacts the way you wanna represent these data flows. I don't think we have much time to talk about that here; that's probably more for the data engineering side. But we have essentially leveraged both the strengths and the weaknesses, or I guess constraints, of Lambda to build something that works really well for our overall architecture.
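A minimal sketch of that kind of micro-batch bookkeeping, assuming a hypothetical DynamoDB watermark table; this illustrates the pattern rather than DataCoral's actual data model:

```python
import os
from datetime import datetime, timedelta, timezone

import boto3

# Hypothetical table holding one watermark per pipeline.
table = boto3.resource("dynamodb").Table(os.environ["WATERMARK_TABLE"])

def handler(event, context):
    """Process exactly one micro-batch per invocation, so a failure only
    ever replays the last window, never hours of work."""
    item = table.get_item(Key={"pipeline": "events"}).get("Item")
    if item:
        watermark = datetime.fromisoformat(item["watermark"])
    else:
        watermark = datetime.now(timezone.utc) - timedelta(minutes=5)
    window_end = watermark + timedelta(minutes=5)

    process_window(watermark, window_end)

    # Advance the watermark only after the window succeeds, so a failed
    # invocation gets retried from the same point.
    table.put_item(Item={"pipeline": "events", "watermark": window_end.isoformat()})

def process_window(start: datetime, end: datetime) -> None:
    pass  # placeholder: pull and process data for [start, end)
```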
[00:29:38] Unknown:
And so when you're building on cloud-specific technologies, so things like Lambda, or if you're relying on other services that Amazon might provide that are not something you can easily replicate in the local environment, it can be difficult to figure out how to set up your local development environment to make sure that what you're running locally for iterating on and testing your code is actually going to function as expected once you deploy it to the destination service. And so I'm wondering how you approach things like local development when you're building on these serverless technologies, and how you manage to ensure that you have a close enough approximation of what the functionality is going to be, rather than having to push to Lambda or wherever every time. If you're
[00:30:38] Unknown:
using a whole bunch of PaaS services, then trying to mock all of them to run locally in your environment, I mean, there are some libraries out there, but they're not really that robust. At least in our experience, we have found that the best way to know how your code is running is to actually deploy it in Amazon. But that said, we have layered our code in such a way that most of the business logic, which is what we actually need to build and where we spend most of our time, doesn't really require any Amazon services. Right? So let's just talk about Lambda.
Lambda gets invoked when an event triggers it. When you're actually writing that Lambda code, there is no reason for that code to know how that event actually triggered the function; in fact, that whole thing is abstracted away by Lambda itself. Of course, we have to deploy our software in AWS. But we have built a thin layer on top where we have defined an interface, and that thin interface is what we call our framework. The framework knows how to deal with all of the Amazon services, but it only provides the necessary information for the rest of the code to do the orchestration, the state management, and the actual business logic itself. And for the data aspects, like if we are reading and writing using S3, we end up just using local files to do a lot of that testing.
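A sketch of that kind of layering: the business logic depends only on a small storage interface, so it can run against local files in tests and against S3 once deployed. The class and function names here are hypothetical:

```python
from pathlib import Path
from typing import Protocol

import boto3

class ObjectStore(Protocol):
    def read(self, key: str) -> bytes: ...
    def write(self, key: str, data: bytes) -> None: ...

class S3Store:
    """Production implementation backed by S3."""
    def __init__(self, bucket: str):
        self.bucket = bucket
        self.s3 = boto3.client("s3")

    def read(self, key: str) -> bytes:
        return self.s3.get_object(Bucket=self.bucket, Key=key)["Body"].read()

    def write(self, key: str, data: bytes) -> None:
        self.s3.put_object(Bucket=self.bucket, Key=key, Body=data)

class LocalStore:
    """Local-files implementation for fast tests without AWS."""
    def __init__(self, root: str):
        self.root = Path(root)

    def read(self, key: str) -> bytes:
        return (self.root / key).read_bytes()

    def write(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

def transform(store: ObjectStore, in_key: str, out_key: str) -> None:
    # The business logic never imports boto3 itself, so this function can
    # be exercised locally and deployed unchanged.
    store.write(out_key, store.read(in_key).upper())
```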
But it has been a lot of trial and error. Our goal is to get to a point where 90 to 95% of the time that we are writing and testing code, we're able to test it out locally. The framework code itself, and the deploy-time code and all of that stuff, we do have to deploy in order to test. In those cases, we have broken up the entire platform into a bunch of microservices so that we're able to test each microservice individually.
Even though we are deploying stuff into AWS, that takes maybe tens of seconds instead of running immediately, and we typically try to restrict it to purely the framework code. But, again, this is not a solved problem for us; we are still learning how to make it much better. There are other tools that we looked at in the past, but they didn't offer the kind of flexibility that we wanted, especially around the deployment side, if you wanted to deal with all of the permissions and the roles that needed to get created and so on. So for those kinds of things, we have rolled our own. And in terms of the
[00:33:34] Unknown:
experience of building microservices, I know that one of the pieces of wisdom that I've come across as to whether or not it makes sense to use microservices is based on the structure of your organization, because in general software tends to replicate the communication patterns of your organization. And seeing as how DataCoral itself is still fairly early on, I'm sure you don't have hundreds of developers who each focus on one service at a time. I'm wondering what your experience has been in terms of building a microservice-style architecture, and how the serverless technologies might simplify that overall application paradigm given that you don't have to deal as much with the underlying infrastructure
[00:34:26] Unknown:
and deployment pains that go along with it? Yeah. So, I mean, at the overall architecture level, in terms of actually running this whole software, serverless clearly makes it a lot easier. But you have to think very carefully about what the interfaces are between these microservices, and whether those interfaces are clearly defined. The way we have tried to approach this is that we have a central, standardized way of doing state management and orchestration. That means all of these microservices talk the same language, if you will, in terms of how they communicate with each other. And for the microservices that we are building, we have standardized some interfaces. That's one of the main things that you have to do.
This is essentially the ultimate form of a service-oriented architecture, right, where you're saying that every little piece of functionality that you build can be invoked in a certain manner, and it publishes how it can be invoked and things like that. Does that make sense? Yeah. So when you think about building these microservices, there are 2 things that we always keep in mind. One is: can we standardize the interfaces of all of these microservices so that there's a shared metadata layer that has the complete state of your entire system, with all microservices communicating with that state store to figure out what they need to do? So that's number one, standardizing interfaces between microservices.
The second thing is this whole concept of separation of concerns. If you have standardized interfaces and you have said, okay, these types of microservices need to be focused on these types of functionality, then when you're trying to add an additional feature or build a new microservice that tries to combine the functionality of microservices that should actually be separated, we think really hard about where exactly that functionality should lie, so that we are not mixing up multiple pieces of functionality in a single microservice.
This is something that's done on a pretty case-by-case basis. But of late, we've been talking quite a bit about this notion of separation of concerns, where you're saying, why is this microservice having to talk to this other database, or whatever else, when you know that this other microservice owns all of the data that's going into that database? Those are the kinds of conversations that need to happen. As you mentioned, we are not at the scale where we need each team to build out their own services and not really have to worry about it. For now, we have tried to build out these frameworks on top of the serverless services, and then we're really thoughtful about how the services communicate with each other. So these microservices, as you can imagine, are kind of like the ultimate service-oriented architecture for your application.
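As an illustration of what standardized interfaces over a shared state store can look like, here is a sketch of a base class every microservice might conform to. The state-store methods are hypothetical stand-ins for whatever the shared metadata layer actually exposes:

```python
from abc import ABC, abstractmethod

class Microservice(ABC):
    """Every service speaks the same language: it reads pending work from
    the shared metadata store and writes results back, never reaching into
    another service's database (separation of concerns)."""

    def __init__(self, state_store):
        # `state_store` is a hypothetical client for the shared metadata
        # layer, assumed to expose pending_tasks() and complete_task().
        self.state = state_store

    @abstractmethod
    def handle(self, task: dict) -> dict:
        """Service-specific business logic for a single task."""

    def run_once(self) -> None:
        for task in self.state.pending_tasks(service=type(self).__name__):
            self.state.complete_task(task["id"], self.handle(task))

class Loader(Microservice):
    def handle(self, task: dict) -> dict:
        return {"loaded": task["id"]}  # placeholder logic
```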
But this is something that we continue to learn how to do better every day. And I'm also curious how you approach
[00:37:55] Unknown:
managing continuous integration, continuous deployment, and testing of the services that you're building in the serverless paradigm. Particularly given that you're deploying to customers' VPCs, I'm sure that compounds the level of difficulty as far as making sure that you have everybody synchronized across all the different versions, that you aren't potentially impacting customers at a point where they're not expecting an upgrade, and just managing that overall communication and expectation while trying to make sure that everybody is staying up to date with the latest version of what you're building and managing?
[00:38:36] Unknown:
Yeah. Absolutely. So when you have multiple isolated installations running into the tens or hundreds or thousands, SRE kind of takes on a new meaning. We in fact have an opening for a serverless reliability engineer, which is a way to think about not only how the CI/CD works, but also how you make sure that you are able to monitor all of these isolated installations without having to go to each one of them separately. Right now, as I mentioned, we try to build a lot of unit tests that can run locally, and those get run whenever new code becomes available for review and so on. On the integration side, we have built out a bunch of integration tests that can actually deploy into AWS. A lot of this is homegrown. We are actively working on making it more continuous and on getting much lower latencies for the deployments themselves.
But a lot of it has been homegrown. That's the reason we're looking for somebody who has done the SRE kind of role, but who can essentially rethink how it would work across multiple isolated installations of serverless services.
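One common way to keep the fast local unit-test loop separate from integration tests that actually deploy into AWS (a generic pytest pattern, not necessarily DataCoral's setup) is an opt-in marker:

```python
# conftest.py
import os

import pytest

def pytest_collection_modifyitems(config, items):
    # Integration tests create real AWS resources; skip them unless the
    # caller explicitly opts in. Register the "integration" marker in
    # pytest.ini to avoid unknown-marker warnings.
    if os.environ.get("RUN_INTEGRATION") == "1":
        return
    skip = pytest.mark.skip(reason="set RUN_INTEGRATION=1 to deploy to AWS")
    for item in items:
        if "integration" in item.keywords:
            item.add_marker(skip)

# test_deploy.py
@pytest.mark.integration
def test_microservice_deploys_and_responds():
    ...  # deploy one microservice into a sandbox account and exercise it
```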
[00:40:01] Unknown:
And you mentioned the challenge of monitoring the capacity and status of all of these different customer deployments. So I'm wondering what your approach has been in terms of metrics, monitoring, and just overall observability and alerting for the product that you're managing for your customers?
[00:40:22] Unknown:
So yeah. One of the decisions that we made very early on was to have a shared metadata layer for a given installation. Right? Because it's all isolated. That shared metadata layer allows us to have quite a good view of what's going on in the system. In fact, we make all of this state available to the data scientists so that they can see how their own data quality is, how fresh their data is, and so on. But when you think across different installations, we have mostly been able to centralize errors and things like that into a separate SaaS tool that is then able to alert us when things are going awry.
But that is still something that's a work in progress for us. We use this tool called Rollbar, which has been pretty good for us; we've been able to send errors from across different installations into one place, which gives us a single pane of glass across all installations. But being able to automatically remediate when there are errors is something that's still going into our platform. I don't believe there are tools out there for that yet, but I'd love to know about tools that would actually help us do that.
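A minimal sketch of that kind of centralization using Rollbar's Python SDK, with each isolated installation tagged via its environment; the environment variable names here are hypothetical:

```python
import os

import rollbar

rollbar.init(
    os.environ["ROLLBAR_TOKEN"],
    # Tagging each isolated installation lets one Rollbar project act as a
    # single pane of glass across all customers.
    environment=os.environ.get("INSTALLATION_ID", "unknown"),
)

def handler(event, context):
    try:
        return do_work(event)
    except Exception:
        rollbar.report_exc_info(extra_data={"event": event})
        raise

def do_work(event):
    return {"status": "ok"}  # placeholder for the real handler logic
```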
[00:41:39] Unknown:
In terms of the overall technology landscape for serverless platforms and capabilities: Lambda was one of the forerunners in terms of functions as a service, but as you mentioned, it's not just functions as a service that encapsulate what it means to be serverless. There are also things like SNS and SQS that you mentioned, or BigQuery from Google. And so I'm curious what you personally are most excited about in the overall serverless space, that you're keeping an eye on, or things that you would like to see become available to simplify some of the challenges that you're facing?
[00:42:18] Unknown:
Yeah. I mean, ultimately, it's about getting to more and more complex workloads that can all be automatically provisioned pretty quickly and run. One of the technologies that we're actually excited about, and that we are playing around with quite a lot, is AWS Batch. It essentially takes the Elastic Container Service and makes it really easy to spin up jobs that all look the same, or at least have the same configuration, and you can spin up many, many jobs very quickly. We have actually built a layer on top of AWS Batch that makes it look like a Lambda. So if you want to run some complicated piece of code that might take longer than 15 minutes, we can automatically funnel that into Batch. That's a capability that we are actively building out. And I think there'll be more and more such things where you can spin up containers, or maybe a cluster of containers, do the work that you need to do, and then have it all be turned down. I think there are many companies trying to do this, and I'm always on the lookout for these kinds of services that allow us to spin up large amounts of compute in an easy way and spin it back down when it's not needed.
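A sketch of what such a Lambda-shaped layer over AWS Batch might look like, routing by expected runtime; the function, queue, and job definition names are hypothetical:

```python
import json

import boto3

lambda_client = boto3.client("lambda")
batch_client = boto3.client("batch")

LAMBDA_LIMIT_SECONDS = 15 * 60  # Lambda's maximum execution time

def dispatch(task: dict) -> None:
    """Present one uniform 'run this task' interface: bite-sized work goes
    to Lambda, anything that could exceed the limit goes to AWS Batch."""
    if task.get("expected_seconds", 0) < LAMBDA_LIMIT_SECONDS:
        lambda_client.invoke(
            FunctionName="run-task",  # hypothetical Lambda function
            InvocationType="Event",
            Payload=json.dumps(task),
        )
    else:
        batch_client.submit_job(
            jobName=str(task["id"]),
            jobQueue="long-running",         # hypothetical job queue
            jobDefinition="run-task-batch",  # hypothetical job definition
            containerOverrides={
                "environment": [{"name": "TASK", "value": json.dumps(task)}]
            },
        )
```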
I think those are the kinds of things that would really make it easy for application developers to build much more sophisticated business logic without having to worry about all of the stuff around provisioning, capacity management, resource scheduling, and things like that. And are there any other aspects
[00:43:59] Unknown:
of your experience of building with serverless technologies
[00:44:03] Unknown:
or any of the other work that you're doing with DataCoral that we didn't discuss yet that you'd like to cover before we close out the show? No. I mean, just in terms of DataCoral itself, our goal has been to leverage as many of these services that the clouds have to offer as we can, to provide a really robust end-to-end data infrastructure as a service within our customers' environments. The approach that we have taken is to not be cloud-agnostic, as in only using the common denominator that's offered by all clouds and building out the rest ourselves. Instead, we have chosen to be cloud-best, as in we leverage any and every piece of technology that a cloud has to offer.
And we'll build the stack that is best in class for that cloud. All of the abstractions that we build on top will remain consistent, but whenever we move to a particular cloud, or start providing our service in a given cloud, we wanna make sure that we leverage everything that cloud has to offer. As these clouds start offering similar kinds of services, of course we can just reuse the code, but we'll write layers that allow us to build the abstractions that we need and then leverage whatever the clouds have to offer. So we're super excited about everything that we plan to build out in the near future, as well as the kinds of problems that we believe we'll be solving
[00:45:30] Unknown:
in the near future. Alright. And for anybody who wants to follow along with the work that you're doing or get in touch, I'll have you add your preferred contact information to the show notes. And with that, I'll move us into the picks. This week, I'm going to choose the movie Avengers: Endgame. I finally got a chance to watch that the other week, and it was quite enjoyable. I think they did a really good job of tying all the story lines together and bringing a lot of them to a close. I'm gonna avoid giving any spoilers, but suffice it to say that I had a good time; it's definitely worth a watch. And so, with that, I'll pass it to you, Raghu. Do you have any picks this week?
[00:46:10] Unknown:
First of all, thanks for not giving away any spoilers for that movie. I hope to watch it sometime soon. My pick this week is for the Golden State Warriors to actually win the NBA championship.
[00:46:24] Unknown:
Alright. Well, thank you very much for taking the time today to talk about your experience of building on top of and using these different serverless technologies, and also for sharing your experiences at DataCoral. So thank you for that, and I hope you enjoy the rest of your evening. Yeah. Thanks for having me again. Yeah. Absolutely.
Introduction to Raghu Murthy and DataCoral
Raghu's Journey with Python
Overview of DataCoral
Understanding Serverless Technologies
DataCoral's Application Stack
Impact of Serverless on Application Architecture
Benefits and Challenges of AWS Lambda
Local Development and Testing with Serverless
Microservices and Serverless
Continuous Integration and Deployment
Monitoring and Observability
Future of Serverless Technologies
Closing Remarks