Summary
Servers and services that have any exposure to the public internet are under a constant barrage of attacks. Network security engineers are tasked with discovering and addressing any potential breaches to their systems, which is a never-ending task as attackers continually evolve their tactics. In order to gain better visibility into complex exploits, Colin O’Brien built the Grapl platform, using graph database technology to more easily discover relationships between activities within and across servers. In this episode he shares his motivations for creating a new system to discover potential security breaches, how its design simplifies the work of identifying complex attacks without relying on brittle rules, and how you can start using it to monitor your own systems today.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- This portion of Python Podcast is brought to you by Datadog. Do you have an app in production that is slower than you like? Is its performance all over the place (sometimes fast, sometimes slow)? Do you know why? With Datadog, you will. You can troubleshoot your app’s performance with Datadog’s end-to-end tracing and in one click correlate those Python traces with related logs and metrics. Use their detailed flame graphs to identify bottlenecks and latency in that app of yours. Start tracking the performance of your apps with a free trial at pythonpodcast.com/datadog. If you sign up for a trial and install the agent, Datadog will send you a free t-shirt.
- You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to pythonpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host as usual is Tobias Macey and today I’m interviewing Colin O’Brien about Grapl, an open source platform for detection and response of system security incidents
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by describing what Grapl is and the problem that you are trying to solve with it?
- What was your original motivation to create it?
- What were the existing options for security detection and response, and how is Grapl differentiated from them?
- Who is the target audience for the Grapl project?
- How is the Grapl system architected?
- How has the design of the system evolved since you first began working on it?
- How much effort would it be to separate the Grapl architecture from AWS to migrate it to other environments?
- What have you found to be the benefits of splitting the implementation of the system between Rust for the system and Python for the exploration?
- What challenges have you faced as a result of working across those languages?
- What data sources does Grapl use to build its graph of events within a system?
- Can you talk through the overall workflow for someone using Grapl?
- What are some examples of the types of exploits that you can identify with Grapl?
- What are some of the most interesting, unexpected, or innovative ways that you have seen Grapl used?
- What are some of the most interesting, unexpected, or challenging lessons that you have learned while building it?
- When is Grapl the wrong choice?
- What do you have planned for the future of Grapl?
Keep In Touch
- insanitybit on GitHub
- @InsanityBit on Twitter
Picks
- Tobias
- Artemis Fowl book series by Eoin Colfer
- Artemis Fowl Movie
- Colin
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- Grapl
- Grapl Security
- SIEM == Security Information and Event Management
- Rapid7
- Metasploit
- Insight IDR
- Erlang
- DGraph
- Splunk
- Elasticsearch
- AWS Lambda
- Sysdig
- Sysmon
- AWS CloudTrail
- Guard Duty
- OpenFaaS
- AWS SQS
- DynamoDB
- PyO3
- Dropper Malware
- SSH Session Hijacking
- Vagrant
The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try out a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode today, that's L-I-N-O-D-E, and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For more opportunities to stay up to date, gain new skills, and learn from your peers, there are a growing number of virtual events that you can attend from the comfort and safety of your own home. Go to pythonpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today. Your host as usual is Tobias Macey. And today, I'm interviewing Colin O'Brien about Grapl, an open source platform for detection and response of system security incidents. So, Colin, can you start by introducing yourself?
[00:01:27] Unknown:
Sure. Thanks, Tobias. My name is Colin O'Brien. I am the CEO and founder of Grapl, where we're building a next generation SIEM, or detection and response system, that leverages a graph based and Python based approach to help defenders
[00:01:43] Unknown:
catch attackers faster. Do you remember how you first got introduced to Python? Oh, definitely.
[00:01:48] Unknown:
Yeah. I started off my career at Rapid7, and my first role there was sort of assisting one of their data science teams. So I wasn't a data scientist by any means. I was just an engineer, but they needed some help sort of building out and scaling out their research and the machine learning models that they were building. And, of course, they had chosen Python as their programming language of choice. So I sort of just dove in and started building out services for them. I got into Jupyter Notebooks, which the data scientists had been using to share information with each other and to generate reports. I had previously just been using languages like C and C++, mostly.
So it was actually a really different kind of language, and I was sort of blown away by the power of it and what you could do with some of the really top notch quality libraries like pandas and NumPy and those. Really easy to just get started with. And, yeah, I've been using it on the side even when my main job has taken me to other languages like Java for a long time now.
[00:02:55] Unknown:
And it's interesting that Rapid7, being the home of Metasploit, which is a famous framework for penetration testing but written in Ruby, also has such a strong presence of Python. I'm curious if there were any sort of language wars between Python and Ruby, or any issues of trying to figure out compatibility or communication patterns between Ruby code bases and Python at Rapid7?
[00:03:21] Unknown:
Sure. Yeah. Yeah. I mean, I think always in good natured fun, certainly. But Rapid7 employed a lot of different technologies. So Metasploit was in Ruby. The product that I was working on, InsightIDR, was in Java. The data science team was leveraging primarily Python, but they definitely used some R as well in there, a couple of Go projects, and actually even an Erlang project in there at one point as well. So Rapid7 really wasn't shy about picking the right tool for the job so long as everyone sort of got together and agreed on it. In terms of Metasploit and the data science teams, there was actually a lot of overlap in team members. But, you know, Ruby and Python are both pretty dynamic, powerful languages. And so I don't think anyone was pushing for one over the other when it just made sense.
But we did certainly have a consistent discussion about what the right tool for the job was.
[00:04:23] Unknown:
Now with your work on Grapl, it's, as you said, a modern approach to being able to do detection and analysis. Can you describe a bit more about what Grapl is and the problem that you're trying to solve with it?
[00:04:41] Unknown:
Yeah. Absolutely. So Grapl is a system that's designed to ingest data, usually event or log data. So, you know, process executions, network connections, AWS API calls, those sorts of things get sent up to Grapl. Grapl will process that information and translate it from logs and events into a graph format. Right? So really try to expose all of those connections between those logs. Just as one example, a process execution event might have a parent PID and a child PID, but it doesn't have much information about, you know, what that parent process was. Right? The relationship's sort of hidden and implicit in the log. Grapl takes that data out, makes it really explicit, processes and cleans the data, and then stores it in a centralized graph data lake, which is built on top of Dgraph.
At this point, Grapl will execute your attack signatures, what it calls analyzers. These are Python snippets, essentially, and what they're going to do is search that graph looking for malicious or suspicious patterns. Right? So maybe we don't expect two processes to execute together in the same process tree, or we've never seen this process call out to an IP address in Russia or something like that. And so you search around for those connections, the analyzers will find them, and Grapl will join all of that data up. Right? Because it's all in a graph, so it's sort of just expanding out and painting the master graph with risk scores and that sort of thing. And then you use a Jupyter notebook to investigate the graph and interrogate it. So you have a Python environment, a library that we've built, and you can pivot off of the graph, expand it out, visually look at it, and inspect properties of nodes. And so it's a very powerful end to end system for understanding what's going on in your various networks and really being able to dig in, pivot off of information, and get to the heart of what behaviors are going on there.
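As a concrete illustration of the log-to-graph translation described above, here is a minimal sketch in Python. The event fields, node shapes, and function name are invented for illustration and are not Grapl's actual schema or API.

```python
# Sketch of turning a flat process-execution log event into explicit
# graph nodes plus a parent/child edge. All field names are illustrative,
# not Grapl's real schema.

def log_to_graph(event):
    """Turn one process-execution event into two nodes and a 'children' edge."""
    parent = {"node_type": "process", "pid": event["parent_pid"]}
    child = {
        "node_type": "process",
        "pid": event["pid"],
        "image_name": event["image_name"],
    }
    # The relationship, implicit in the log, becomes an explicit edge.
    edge = (("process", event["parent_pid"]), "children", ("process", event["pid"]))
    return [parent, child], [edge]

nodes, edges = log_to_graph({"parent_pid": 1, "pid": 42, "image_name": "bash"})
```

Once many such events are merged, the shared identities of the nodes are what stitch separate log lines into one connected graph.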
[00:06:50] Unknown:
And in terms of the motivation for the project, what was it that drove you to build this entirely new system for being able to perform these analyses and detections? And what was lacking in the existing ecosystem of tooling for this problem space that necessitated you building this entirely new system?
[00:07:11] Unknown:
I think most systems out there, as an example, Splunk or Elasticsearch, generally work directly on raw log data. Right? So you can write queries that are sort of almost like regexes over a log, and that's good for very simple attacks. I think, especially, you know, 10 or 15 years ago, that was fairly reasonable. But attacker behaviors at this point tend to really exist and span across multiple different events. It's very unlikely that given a single event or even a single event source that you would be able to detect an attacker effectively without lots of false positives and that sort of thing. And so what I kept running into, especially at my last job at Dropbox where we had one of these more traditional systems, was that I wanted to find attackers. I wanted to look for suspicious process execution, see where the binary file had come from, see if that was created by another process. Right? They were very join heavy workflows.
And these time series databases, like Elasticsearch, really don't make joins easy or efficient. So, I mean, you can go to the Elasticsearch or the Splunk documentation, and they'll both say the same thing, which is that the complexity of those join commands is just prohibitively slow. In the case of Elasticsearch, they actually don't even provide a SQL-like join. They have a much more specialized version, which is also not very performant. And Splunk's join has a number of caveats, like only returning a certain number of results before it gives up, timing out after 60 seconds, very, very significant performance issues. You know, from my perspective, these systems were slowest at the thing that I wanted to do the most. I wanted to stop writing detections on, you know, properties of an attack, things like a process name or a file name. Those are things that attackers can change really, really easily. So if we tie ourselves to those, we're really setting ourselves up for failure.
I wanted to track the structure of an attack. Right? What does the attack look like at a fundamental level? And so that's where the graph really started to come in. I didn't like the query languages. Every single vendor has their own query language. They're all fairly strange. They make a lot of simple things very easy. But once you wanna do anything a little bit out of their comfort zone, things can get pretty rough with hidden performance issues and really large queries. Python was an obvious choice for me. It's, you know, a natural choice for the data science community. A lot of security people like Python; it's a very, very expressive language. And so merging this concept of graphs and structural attacks with Python was really the initial motivation for me to start exploring this space.
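The join-heavy pivot Colin describes (a suspicious process, to the binary it ran, to the process that created that binary) becomes a simple chain of edge lookups once the data is stored as adjacency rather than flat rows. A toy sketch, with made-up node identifiers and edge names rather than any real Grapl or Dgraph API:

```python
# Toy adjacency-based graph. The pivot "suspicious process -> its binary
# -> the process that dropped that binary" is two dict lookups here, where
# a row-oriented log store would need two joins. Node shapes are made up.

graph = {
    "proc:evil": {"binary": "file:dropper.bin"},
    "file:dropper.bin": {"created_by": "proc:word"},
    "proc:word": {"image_name": "winword.exe"},
}

def pivot(node_id, *edge_names):
    """Follow a chain of named edges from a starting node."""
    for edge in edge_names:
        node_id = graph[node_id][edge]
    return node_id

# Who created the binary that the suspicious process executed?
creator = pivot("proc:evil", "binary", "created_by")
```

The point is structural: each hop is constant-time once relationships are materialized as edges, which is exactly the workload that time series stores make expensive.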
[00:10:15] Unknown:
And you mentioned that you first began working on this while you were employed at Dropbox. I'm curious what the overall process or initial thoughts were on the nature of it being open source and any challenges that you faced in building the system in the open while working somewhere like Dropbox?
[00:10:35] Unknown:
I had actually started it just before I began my role at Dropbox. In between Rapid7 and Dropbox, I had about 6 weeks off. I spent most of that time building up an initial prototype of Grapl. When I got to Dropbox, there were really no issues at all. Grapl in no way competes with the company, and I think that was their only concern. If it were a file storage system or a collaboration service, I think there may have been more pushback, but I consistently worked with legal, got the clear from them, got the clear from my managers, and let them know if, you know, I was planning to do anything new with the project. And they were very supportive, which I really appreciated because I know there are a lot of companies where colleagues of mine work, and they just are no longer able to contribute to open source at this point. So Dropbox was very supportive of it. I was pretty happy with that. And the team themselves, we actually did publish a blog post about some of the work that myself and a colleague, Mayank, had worked on, which was to bring more of Python's capabilities to Dropbox's security team. So we built out an automation system. We built out a Jupyter Notebook service that we could use to work with different data sources that we had at Dropbox.
So there were a lot of good lessons learned while I was there.
[00:11:58] Unknown:
In terms of the end users of the Grapl project, who is the overall target audience, and what does the overall workflow look like for somebody who's interacting with Grapl after you've already ingested data into the system?
[00:12:13] Unknown:
I think today, we're targeting detection and response teams specifically, and generally, maybe a little bit more modern of a detection and response team. It's not always typical that members of the team would have Python experience or, you know, any programming experience. We're kind of targeting teams that have started to realize that that's becoming necessary if you wanna keep up with attackers. And I think that we'll find that more and more companies end up realizing that as time goes on. The use case is powerful, but minimal, I would say. So, really, there's a couple of ways to interact with Grapl. The most obvious is just putting new signatures into it. So these are implemented as a Python class.
You inherit from our analyzer base class, and you implement two methods, one which builds and describes a query. So this is gonna be something like a process query with children, and then you would describe the children with another process query. And Grapl will take that and run it against the master graph if there are any new processes that update. And then another method, which is going to be the response to what happens if that signature matches. And this is where you can do follow-up context gathering. You can ensure that by the time you actually investigate that signature, all of the necessary context has been added there. And that's really nice and simple. You can pivot off of the data right there in that method, pulling in, you know, the entire process tree, network connections, files, anything like that.
After that, if you've built up enough of these analyzers and, unfortunately, there's something going on in your network, the user interface for Grapl will sort the different systems or properties in your environment, like your users' laptops or your AWS accounts, based on which ones are the riskiest; it's entirely configurable. You go into work, you sit down, you say, I've got time for, say, five investigations today. You pick the top five riskiest systems in your network, open up a Jupyter notebook, and pretty much just start pivoting off of the data. And one of the cool things that we've done is that as you're pivoting off of the data in the Jupyter Notebook, you'll have, in a separate window, a live updating visual representation of the graph that you're interrogating. So, you know, you call a method like get children on a process in the Jupyter Notebook, and instantly in the other browser window, you'll see that that process has expanded with all of its child process nodes attached to it. So it's very interactive. It's really designed for iterative, exploration driven workflows.
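The two-method analyzer shape described above might look roughly like the following sketch. The class name, method names, and query representation here are illustrative guesses, not Grapl's real analyzer API.

```python
# Hypothetical sketch of the analyzer shape described above: one method
# builds a query describing a suspicious pattern, another responds to
# matches. Names and structures are illustrative, not Grapl's real API.

class SuspiciousChildAnalyzer:
    """Flag a word processor that spawns a shell."""

    def build_query(self):
        # A process query with a child process query, as described above.
        return {
            "node_type": "process",
            "image_name": "winword.exe",
            "children": [{"node_type": "process", "image_name": "cmd.exe"}],
        }

    def on_match(self, matched):
        # Attach follow-up context and a risk score so everything is
        # already in place by the time an analyst investigates.
        matched["risk_score"] = 75
        return matched

analyzer = SuspiciousChildAnalyzer()
hit = analyzer.on_match({"pid": 1234})
```

The split matters operationally: the query method runs automatically against whatever updated in the graph, while the match handler is where an engine would pre-fetch the surrounding process tree, connections, and files.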
[00:15:00] Unknown:
In terms of the actual architecture of the system, can you describe how it's implemented and some of the ways that the design has evolved from that initial prototype to where you are now?
[00:15:11] Unknown:
So one of the initial important goals with Grapl was to keep maintenance costs as low as possible. I've seen many security teams spending a lot of time and a lot of money just keeping their existing system up and running. This could be multiple full time people just making sure the system is up and running. One of my initial goals was to use as much serverless technology, as much managed technology, as possible. So Grapl deploys to AWS. All of the compute is in AWS Lambdas. The Jupyter Notebook is managed through SageMaker. The event pipeline is S3, SQS, these sorts of things. So it really focuses on keeping maintenance costs low.
The only exception there would be the Dgraph cluster, which is running in a Docker Swarm on EC2. So that's not fully managed for you, but Docker does a lot of heavy lifting, makes it easy to upgrade the underlying system, that sort of thing. So it really does focus on low operational cost. The pipeline is pretty traditional in terms of, like, ETL. So events come in, and a Lambda will transform them in some way. So parsers will turn a log into a graph, and then they'll emit that event so that the next Lambda can process it. So in general, the way this works is a parser will generate a graph. That graph goes through our sort of data cleaning Lambda, what we call the node identifier. This is gonna do some really, really cool stuff. Like, it'll figure out based on the metadata of that node what it really is. Right? So to give some explanation there, you could think of a process on your system, which has a process ID, but that process ID isn't actually unique. If that process terminates, the process ID will show up again.
It's just a matter of time, really. And so Grapl takes metadata, like the process ID, the asset, and the time that the process started and stopped, and resolves that to a unique identifier. And this happens for every single node that it processes. Anything always has an identity in Grapl. So these identified nodes get merged into the master graph database. This is a really simple service. It basically just performs an upsert into Dgraph. And this is actually one of the reasons why I chose Dgraph: it has really strong support for write performance and horizontal scaling of writes. Grapl has to ingest a lot of data, so that was super important.
Every time these updates happen, your analyzers trigger automatically and scan whatever the latest in the graph is. So they're very efficient. They don't have to scan the entire graph every time. They only ever scan what has updated. And that means that Grapl scales really, really nicely. In the entire system, as far as I can tell off the top of my head, I don't think we even have a single linear algorithm. Everything is either logarithmic or, in most cases, actually constant time.
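The node identification step described above (a process ID is reused over time, so a stable identity must combine it with the asset and the process's lifetime) can be sketched like this. The hashing scheme and field choices are assumptions for illustration, not Grapl's actual algorithm.

```python
import hashlib

# Sketch of the node-identification idea described above: a pid alone is
# reused after a process exits, so a stable identity combines the pid with
# the asset and the process's start time. The hashing scheme here is an
# illustrative assumption, not Grapl's real implementation.

def canonical_process_id(asset_id: str, pid: int, started_at: int) -> str:
    """Resolve volatile process metadata to a stable, unique identifier."""
    key = f"{asset_id}|{pid}|{started_at}".encode()
    return hashlib.sha256(key).hexdigest()[:16]

# The same pid on the same host at two different times gets two identities,
# so later events can be attached to the right process node.
first = canonical_process_id("host-1", 4242, 1_600_000_000)
second = canonical_process_id("host-1", 4242, 1_600_100_000)
```

This is the property that makes the upsert into the master graph safe: every incoming node resolves deterministically to the same identity, so merging is idempotent.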
[00:18:29] Unknown:
And what are the data sources that you're working with? Is it mostly just things like the Linux audit log, or do you also support things like pulling information from Sysdig running as a sidecar in a Kubernetes cluster, or things like system events from S3 or other cloud services?
[00:18:46] Unknown:
So today, what we support is Sysmon logs, and we're working on some AWS support as well. So that'll be CloudTrail, AWS Config, and GuardDuty. Grapl has a plugin system. So if you, say, had Sysdig configured on, you know, maybe, like, a Kubernetes cluster or something, you could build a parser using our plugin system and then send that data up to Grapl, and it would just understand, past the parser level, how to work with that data. So, really, if you can get it into that graph format, we can do anything with it. We do intend to build out a nice large suite of parsers and plugins.
But at the moment, we've really just been focused on honing the system itself, so making sure that the plugin system is robust and that the system can scale. That said, the plugin system has really come along very well. It's been redesigned in the last month or so, so it's really easy to tell Grapl, you know, I want to express this thing that you don't know about today. Right? So with AWS, for example, that's entirely implemented in the plugin system, and Grapl can still work with things like S3 buckets or IAM users because the plugins tell it how to do so. I mean, it's pretty straightforward.
[00:20:07] Unknown:
As far as the specifics of the AWS implementation, how difficult will it be to replace some of those with either similar paradigms from other cloud providers or just the sort of nonspecific abstractions of what that functionality is, such as maybe OpenFaaS for a Lambda replacement or just a generic queuing system as a replacement for SQS, and things like that?
[00:20:35] Unknown:
I think it shouldn't be too hard. So, I mean, just as sort of some evidence of that, there is actually a version of Grapl that you can run locally on your laptop, and it just runs in Docker containers. There's no Lambda environment or anything set up for it. That's because the system is pretty abstracted away from those underlying details like SQS or S3, that sort of thing. There's a library we've built where pretty much you just tell it, perform this computation, and then you plug in event sources and where you'll emit the events, and you can do the rest basically really simply. I actually built a version that could just run on the file system using inotify and other such things. That said, there is a little bit of a hurdle. We do use DynamoDB.
This is probably the system that would be hardest to move away from. It's just a really great key value store, and I'm not familiar with the equivalent in GCP. And I think in Azure, it's Cosmos DB, but I don't have the experience today to port that over easily. But, really, other than that, there's actually nothing that would be too hard to port over. It's all very abstract in a way. Both Python and Rust made that pretty simple to do.
[00:21:57] Unknown:
And as you mentioned, there is a split in terms of the implementation where you have some of the core elements in Rust, and you have Python in there as well for being able to handle some of the processing as well as being the end user interface for being able to do this exploration of the data. And I'm curious what you have found to be some of the benefits of splitting the implementation between those languages and any challenges that you found working across those boundaries.
[00:22:24] Unknown:
It's been a really interesting thing to build the system in what I guess you could call a more polyglot way. So we've got JavaScript, TypeScript, Python, and Rust, where Python and Rust by far dominate the code base, almost, like, 50/50 each. I think it's really just about choosing the right tool for the job. So something like a parser. Right? It's pure compute, something you could very easily spread across multiple threads. You want correctness. You want all of the nice things that a language like Rust gives you. But data exploration, and really iterative data exploration where you're almost in a REPL, right, and a Jupyter notebook is just a really fancy REPL, that's just Python's bread and butter, and it has tons of libraries for supporting these sorts of things. It's also a really nice dynamic language. So it's made the plugin system, and this whole idea that users can add their own code to Grapl, a lot easier because of Python's dynamic nature. So we've gotten a lot of value out of both of those, and I think going forward, we'll really be able to do a ton. I just saw actually recently the PyO3 library, which allows for Python to Rust FFI, so calling Rust directly from Python.
It got an awesome update. It's something I've investigated in the past. So we can sort of bridge that gap more and more between them, so it kinda feels like we're just using one tool rather than two different languages. But in the meantime, there are certainly some challenges. We have some code duplication. We have libraries that only exist in Python but don't exist in Rust. So a new service that needs some capability, even though maybe Rust would have been a better fit for it, we end up writing in Python, or vice versa for that matter. Sometimes, Python would have been faster to get up and running with, but Rust happened to have the library that we needed because we had just built it for one language. So it can slow things down in cases like that. I'd say initially, certainly, when those first base libraries had to be built twice, that was a bit of a time sink, but it's always faster building it the second time. I just ported one of our libraries over to Rust. It was a couple of days of work, whereas building the library the first time was probably about two weeks. So, you know, there is a cost, but I think it's worth it. The benefits we get by choosing Rust where it makes sense and choosing Python where it makes sense have been well worth paying.
[00:24:58] Unknown:
And in terms of community engagement, how has that split played with people who are interested in using and contributing to the system?
[00:25:06] Unknown:
That's a good point and probably one of the harder trade offs, because our end user, ideally, doesn't have to know a whole bunch of programming languages. And, really, the number of people who know Python and Rust and are in security, it's not a big group of people by any means. So for that reason, we do understand that there's gonna be a larger barrier to entry. This is where I'm really hopeful for that PyO3 FFI approach, where we can build up a system that looks like it's entirely Python. Right? Like, all Python on the top. And then under the hood, we can do the work of maintaining the parts that open source contributors probably aren't going to need to get their hands into anyways.
We can take that and put it into Rust, get all of that nice performance and stability, but the interface will be really nice and high level. So longer term, I think that's gonna be a great solution for it. But today, it does mean that the barrier to contribute is higher than I would like, certainly.
[00:26:19] Unknown:
This episode of Podcast.__init__ is sponsored by Datadog, the premier monitoring solution for modern environments. Datadog's latest features help teams visualize granular application data for more effective troubleshooting and optimization. Datadog Continuous Profiler analyzes your production level code and collects different profile types, such as CPU, memory allocation, IO, and more, enabling you to search, analyze, and debug code level performance in real time. Correlate and pivot between profiles and distributed traces to find slow or resource intensive requests. In addition, Datadog's application performance monitoring live search lets you search across a real time stream of all ingested traces from your services.
For even more detail, filter individual traces by infrastructure, application, and custom tags. Datadog has a special offer for Podcast.__init__ listeners. Sign up for a free 14 day trial at pythonpodcast.com/datadog. Install the Datadog agent and receive one of Datadog's famously cozy t-shirts for free. On the open source side of things, how are you approaching governance and management of the road map and sustainability of the project, especially since you also have a business that you're building on top of it?
[00:27:38] Unknown:
I think this is one of those potential points of friction, where you take a project that was really just a passion project and a side project, where I just wanted to use Grapl, and no one was building it, so I had to build it. And then, really, to build it, you need a solid business that can actually bring other people in to work full time on it. What we've ended up with is something that I'm pretty hopeful for. So the core of Grapl is totally open source, and that's never gonna change. You know, we keep it, I believe, dual licensed Apache 2.0 and MIT, so you can choose whichever one to fall under.
And then our plan is really to try to make a viable business out of either a managed version of Grapl, which we've been hard at work on, or through plugins and support. So not every plugin, by any means, will be licensed separately, but something like a GuardDuty plugin, for example, GuardDuty being a detection system in AWS, that's already, like, a paid feature for AWS. It's something only an enterprise would ever really be using. And so it makes sense to kind of have that be source available, but maybe not Apache 2.0. Maybe something that you either have to contribute to if you're using it, or work with us on some kind of deal that everyone can walk away from happily. Governance is something I'm really curious to experience as we grow this project out more and more.
You know, we don't wanna be dictators by any means. We don't want to fight people who are trying to make the product better. I think this is something that happens a lot with purely managed systems that also have an open source version: the company is incentivized to make the product harder to manage, right? Because they want to be the only ones who can do that, and we really don't want to be that at all. We're trying to make Grapl as easy to use for as many people as possible. My hope here is that by just being open about our roadmap, you know, we use a GitHub issue tracker, and we're planning to move more and more of our Kanban boards and that sort of thing into the open.
And, hopefully, by just engaging with the community, like with our Slack channel, which is publicly open as well, we can come to a place where everyone's just trying to do the right thing, right? Where we don't have to worry that someone's gonna come in and try to reimplement maybe one of those plugins, like the GuardDuty plugin, just to cause us trouble. If they wanna do it because it's better the way they're doing it, that's great. I would never wanna get in the way of that by any means. So my hope is that we can come to a good understanding with our community, where everyone's just trying to help companies and help people stay safer.
[00:30:31] Unknown:
In terms of the specifics of the security compromises that you're able to detect and surface: I know that a lot of the initial generation of systems like this were very heuristics based and brittle, and subject to easy evasion by attackers who knew what was being used to detect particular threats. Given that Grapl is graph based and you're able to join all of these complex events together to get a good view of things, what are some of the useful questions to be asking in order to discover events, and what are some ways to surface them for any sort of known compromises that a system might be subject to?
[00:31:18] Unknown:
Yeah. So I think there's a lot of attacker behaviors that, for multiple reasons, are just more painful to express in other systems. And I think the two main reasons that I've seen are that the query languages and the underlying database make it really hard to combine multiple events together to make one, you know, larger contextualized event. So you're forced to just work with what's there, which is typically a lot more brittle. You're just lacking information. And that, coupled with the false positive issue that these other systems have, really pushes defenders into building worse detections. So you get detections that overfocus on properties that the attacker can control, and then basically try to filter out as much existing data as they can to try to hone in on the attacker.
But what this means is that the attacker has tons of room to just work within what you've already, you know, whitelisted out of your detection and that sort of thing. That's a really serious problem, and one I've seen just about every company using these systems run into quite a lot. Grapl takes a very, very different approach. So, of course, the graph based approach just means that we can much more easily understand the system. If I wanna join together a process, all of its children, its parent process, its grandparent process, you know, its binary file, those sorts of things, that's extremely efficient. So one example that I really like is the dropper behavior. Droppers are a malware technique where the initial payload that you download is very small. It's very benign looking, a really simple program, basically to make it harder to analyze for, like, an antivirus or something.
And the dropper reaches out to the attacker's command and control service, downloads the payload, and then executes that payload. Right? So we have multiple different events here. We've got the dropper execution, a network connection, file creation for the payload, and then the payload execution. Four different events, which could easily be across four different source types. A traditional SIEM, a traditional system like Elasticsearch or Splunk, will have an extremely difficult time expressing that efficiently. With Grapl, that's trivial. It's a couple of lines of Python.
Just very natural to work with something like that. And it'll execute really fast, so we'll catch it immediately. And there's no properties there for an attacker to really take advantage of. It's just the structure of what they're doing. It's not that we care about the process name or, you know, what these processes are or where that file is. Those are details that we don't wanna tie ourselves to, especially because you could have a fleet of Linux and macOS and Windows machines, and that structural pattern will apply to all of those in the same exact way. So we don't have to write three different searches and worry about the peculiarities of each operating system.
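To make that structural idea concrete: the sketch below is not Grapl's actual analyzer API; the event shapes and field names (`pid`, `ppid`, `image`, `path`) are invented for illustration. It shows how the dropper pattern Colin describes, connect out, write a payload, execute that payload, can be matched purely on structure, with no process names or file paths for an attacker to evade.

```python
# Hypothetical sketch of structural dropper detection. We match the shape
# of the behavior (network connect -> file create -> execute that file),
# never any name or path the attacker controls.

def find_droppers(events):
    """events: list of dicts, each with an 'event' key plus type-specific fields."""
    connected = {e["pid"] for e in events if e["event"] == "net_connect"}
    wrote = {(e["pid"], e["path"]) for e in events if e["event"] == "file_create"}
    spawned = [(e["ppid"], e["image"], e["pid"])
               for e in events if e["event"] == "process_create"]

    droppers = []
    for ppid, image, child in spawned:
        # Structural join: the parent connected out, wrote this exact
        # binary to disk, and then executed it as a child process.
        if ppid in connected and (ppid, image) in wrote:
            droppers.append((ppid, child))
    return droppers

events = [
    {"event": "net_connect", "pid": 100},
    {"event": "file_create", "pid": 100, "path": "/tmp/payload"},
    {"event": "process_create", "ppid": 100, "image": "/tmp/payload", "pid": 101},
    {"event": "process_create", "ppid": 1, "image": "/bin/bash", "pid": 102},
]
print(find_droppers(events))  # -> [(100, 101)]
```

The same shape applies unchanged whether the events came from Linux, macOS, or Windows telemetry, which is the point he makes about not writing three different searches.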
And then the other side of that is that Grapl just doesn't have false positives in the traditional sense. We don't have to rule out specific types of droppers or anything like that. What we do is we assign a risk score. We say dropper behavior is very risky, but maybe some other behavior, like a unique parent-child process execution pair, gets a really low risk score. Attackers will do that, right? They'll execute processes in ways that we don't expect, but that also kind of just happens day to day. We don't wanna just ignore it. We wanna figure out anything an attacker can do; that's what we should be expressing. But we'll lower the risk score, and then Grapl will combine those graphs together, see if they overlap, and raise that risk score up over time, to make sure that the incident responder, when they get in the next day, can look at that and see: okay, this is at the top of my list. Whereas all of that other stuff that was just anomalies, you know, totally benign, sinks to the bottom of the list. So Grapl really addresses two of those core issues that I see holding defenders back from building the right signatures.
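The risk-scoring model described above, weak signals that only surface when they overlap on the same entity, reduces to a simple accumulation; the signal names and scores in this sketch are made up for illustration:

```python
from collections import defaultdict

# Each analyzer emits (entity, score) pairs. Individually weak signals
# accumulate on the entity they share, so overlapping anomalies rise to
# the top of the responder's queue instead of paging anyone immediately.
signals = [
    ("host-7/pid-100", 75),  # dropper-like structure: very risky
    ("host-7/pid-100", 10),  # unusual parent/child pair: mildly risky
    ("host-3/pid-55", 10),   # the same weak signal, alone, elsewhere
]

risk = defaultdict(int)
for entity, score in signals:
    risk[entity] += score

ranked = sorted(risk.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # -> [('host-7/pid-100', 85), ('host-3/pid-55', 10)]
```

The benign one-off anomaly keeps its low score and sinks to the bottom of the list, while the entity where signals overlap floats to the top.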
[00:35:47] Unknown:
And in terms of somebody who's getting started with Grapl initially, do you have a prebuilt set of rules that somebody can get started with for being able to detect a common set of attacks? And how much security knowledge and understanding is necessary for somebody to be able to use Grapl effectively to identify potential security issues?
[00:36:10] Unknown:
So we do have an open source set of initial analyzers, and we actually have another set that we've been working on, with a couple dozen more that we'll be adding and open sourcing in the near future. So you should be able to just get started right away if you wanted to deploy it, and have pretty decent coverage over a number of attacker techniques. In terms of what knowledge is required, of course, security knowledge is certainly going to help. You'll know what's weird. You'll know what's not normal. You'll know what attackers do, and that can really inform your decision making about how to prioritize the work that you're gonna do and which signatures to build.
But if you know some Python and you've got the data going through the system, it's very easy to work with. You can run queries against the graph. You could export that data into a Pandas DataFrame, start working with it in an almost SQL-like way with Pandas, and just do basic statistical analysis. Right? I mean, you could easily say: show me which processes have executed that have a rare process name, maybe ones that fall into the bottom quartile of process name executions. And these are things that you could Google, and Stack Overflow will come up and give you the Python snippet in two seconds to do something like that. And that's really what I think more defenders should be looking for. I actually think they overfocus on attackers instead of really thinking about what's going on in their networks. And that's something that, with Grapl, anyone can really do. They don't have to be a security expert. They just have to work with the data. It's all there right in front of you. You can just sift through it and figure out what's going on.
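The rare-process-name query he mentions is indeed only a few lines. Colin suggests Pandas; sketched here with just the standard library so it runs anywhere, and with invented event data and field names:

```python
from collections import Counter

# Toy process-execution events; in practice these would come from your
# exported graph data rather than a hardcoded list.
process_events = [
    {"name": "sshd"}, {"name": "sshd"}, {"name": "sshd"},
    {"name": "cron"}, {"name": "cron"},
    {"name": "xmrig"},  # seen exactly once: worth a look
]

counts = Counter(e["name"] for e in process_events)
# Flag process names whose execution count falls in the bottom quartile.
threshold = sorted(counts.values())[len(counts) // 4]
rare = [name for name, n in counts.items() if n <= threshold]
print(rare)  # -> ['xmrig']
```

With Pandas the same analysis is a `value_counts()` call followed by a quantile filter; either way, it's basic frequency analysis rather than deep security expertise.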
[00:38:02] Unknown:
And it seems like that's also a prime candidate for being able to use some of these prebuilt libraries for doing anomaly detection as well, where you can have a base set of behaviors on a system that's pretty heavily locked down or isn't exposed to public use, just to get a general idea of what is the baseline of processes that I'm expecting to run when I have this system deployed. And then maybe in your production environment, where it's more likely to be exposed to potential compromise, you can run it to see: okay, what are the anomalies that are different from my QA baseline that I can dig deeper into? Do I need to blacklist these? Do I need to set these particular sets of processes to be disallowed on these systems, or remove potential targets from being accessible based on these attack vectors?
[00:38:50] Unknown:
Yeah. Absolutely. So I think that's actually one of my favorite things that I think more security people should do, in the same vein of not thinking so much about attackers but instead thinking about systems: you know, what are the policies and expected behaviors in your production environment? Right? You expect maybe certain ports to be open. The first thing you should be doing isn't searching for a specific attacker behavior. It should be validating that that policy is in place, and ensuring that if that policy changes, you can see that and react to it. So anomaly detection and baselining: when you don't have to worry about false positives in the same way, all of a sudden those techniques become much more viable.
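In its simplest form, that policy-first approach, assert what should be true and alert on drift, is just a set comparison. The baselines below are invented for illustration:

```python
# Minimal baseline check: compare what a locked-down QA environment
# established as normal against what production is actually running.
qa_baseline = {"sshd", "nginx", "cron", "systemd"}
production = {"sshd", "nginx", "cron", "systemd", "nc"}

drift = production - qa_baseline
if drift:
    # In the risk-scoring model discussed earlier, drift like this would
    # raise an entity's risk score rather than page anyone directly.
    print(f"unexpected processes: {sorted(drift)}")  # -> ['nc']
```

The same pattern applies to open ports, listening services, or outbound destinations: encode the policy once, then watch for deviations.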
Certainly, traditional alerting is pretty expensive. It takes up a lot of your time. You know, do you page on this one? Do you just email? Does it go to Jira? Is it high priority? Whatever. There's not a lot to make those decisions with. So I actually think that starting with just understanding your systems, understanding maybe aberrant behaviors, anomalous behaviors, is much more important. And then, when you start to layer those attacker behaviors on top of all of that, you end up with this really high fidelity detection system, where by the time you come in to actually look at an attack, which maybe has raised that risk score way up, you've got all of these other anomalies that are joined together and connected to it. The investigation is, you know, practically done for you.
It's an incredible thing. I've actually done that multiple times, where just using that context, and then having a single attack signature, having that actual attack signature crop up in my risks, I could see everything. I mean, virtually every action the attacker took that was relevant and interesting was already part of that graph that I was looking at. So I see that as just a huge capability.
[00:41:02] Unknown:
And in terms of uses of the Grapl system, what are some of the most interesting or unexpected or innovative ways that you've seen it employed, or any particularly notable attacks that you've been able to detect with it that were maybe novel or especially sophisticated, that you were intrigued by?
[00:41:21] Unknown:
Yeah. I think my favorite one was detecting SSH hijacking. So there's multiple types of SSH hijacking. This is where an attacker is able to leverage the local SSH agent to sign their SSH connections, so they can move into, say, your production environment if you don't have two factor authentication. If they're in your production environment already, they can call back into your system's SSH agent to sign new requests and start moving around laterally. And the key to this attack that makes it really difficult to detect is often just the fact that it takes place across multiple systems.
So you've got, you know, an SSH connection that's initiated on one system, which itself might actually involve, first, a connection to the local SSH agent, and then you've actually got the execution on another system. And you could have chains of these executions. Right? The attacker can continuously forward the agent across your production environment if you don't have the right isolation in place. This is just a devastating attack if an attacker is able to pull it off. I believe it was matrix.org that was attacked and fully compromised, and one of the root causes was the attacker being able to abuse SSH forwarding like this. And so someone was able to use Grapl, and in maybe a week, they were able to build up a suite of detections that could catch agent forwarding on the client side and on the server side, using different techniques like watching the connections and watching the interprocess communication from processes to the SSH agent, which actually involved multiple joins, not necessarily across nodes, but across the source types that formed those nodes.
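Joining those hops across machines is again a structural join. As a toy illustration only, with invented event fields, and assuming at most one outgoing hop per host for simplicity, following a chain of forwarded sessions might look like:

```python
def ssh_chain(start_host, sessions):
    """sessions: list of {'src': host, 'dst': host} SSH connection events.
    Follow forwarded sessions from start_host as far as they go."""
    hops = {s["src"]: s["dst"] for s in sessions}
    chain, host = [start_host], start_host
    while host in hops:
        host = hops[host]
        if host in chain:  # guard against cycles in the session graph
            break
        chain.append(host)
    return chain

sessions = [
    {"src": "laptop", "dst": "bastion"},
    {"src": "bastion", "dst": "web-1"},
    {"src": "web-1", "dst": "db-1"},
]
print(ssh_chain("laptop", sessions))  # -> ['laptop', 'bastion', 'web-1', 'db-1']
```

A real detection would also join in the agent-socket IPC events Colin mentions, since the telltale sign is a session on one host being signed by an agent on another; the chain-walking shape stays the same.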
I think, in total, it was something like eight or nine detections, and it just totally destroyed that attack in an environment that was actually vulnerable to it, if an attacker had pulled it off; the policies weren't in place to prevent it, for historic reasons. Being able to detect that in so many ways, no matter what the attacker did, was really cool. I was really impressed by that. That was actually early on, when I had just started the company, and it was just extremely motivating to see Grapl leveraged in such a powerful way.
[00:43:54] Unknown:
And in your experience of building the product and building the business around it, what are some of the most interesting or unexpected or challenging lessons that you've learned in that process?
[00:44:05] Unknown:
Gosh. Yeah. So many, really. I've never done anything like this before in any capacity. I mean, even just starting Grapl as a side project: all of my side projects had always just been, you know, I'll get it, like, 70% done, and then I'll move on to the next thing. And I sat down with Grapl, and I said: this is the project. You're just gonna work on this. And that was a completely new thing. Taking a project solo from idea to proof of concept to deployable to scalable, production ready, was so challenging and such a great learning experience in so many ways. I'm sure so many engineers can empathize with the idea that that last 10% of the project is always so much harder than the first 90%.
And with a long tail, large project like Grapl that does so many interesting things, you know, shifting technology choices along the way, learning about graph databases or even graph algorithms, and that sort of thing was just really interesting and challenging. And then, you know, we've hired six people now. So the team's grown quite a lot in the last couple of months. Still pretty small, but it's been very interesting to actually run a company, to not just be a developer. You know, I don't get to just sit down and write code every day. I'm doing a lot more support for my team, I'm figuring out how taxes work and, you know, filing in different states, and working with customers.
These are incredibly valuable skills, I think. I've been really happy to kind of be thrown into it in this way and be forced to learn all of this, but it's a huge learning curve, for sure. It's just so many things thrown at you at once, but I'm very lucky that I've managed to build an awesome team. We have some people with really unique combinations of skills, a lot of great engineering and security talent here. So that's made it a lot easier for me, because now I'm at the point where I can lean on my team a lot, and they can kind of take it and run, with me just sort of supporting them along the way.
[00:46:21] Unknown:
For somebody who is considering Grapl, what are the cases where it's the wrong choice?
[00:46:27] Unknown:
Yeah. So I think that Grapl can ask for different things from your security team. The traditional security operations center model, where you have tiers of analysts, isn't really what Grapl is built for. That's not to say it couldn't fit into an environment like that at all. But if your security team feels like they're fine using the tools that they already know, they're not looking to learn Python, they're not really programmers, and you don't have that investment from the team into getting the skills necessary to use a system like Grapl, then I think it would be the wrong tool. The team would probably struggle with it. It would be different enough in how they're thinking about things that they would be more effective with a tool that fits into their mindset. But, really, Grapl just asks that you think about things a little differently. Right? Instead of events, you think about graphs and behaviors.
Instead of maybe a SQL-like query syntax or a very specialized domain specific language, you learn the basics of Python. You can call a couple of methods, maybe create a class, write for loops, that sort of thing. So if that barrier is not something that you're prepared to get over, Grapl would probably not be the right tool.
[00:47:53] Unknown:
And as you continue to work on Grapl and continue to explore the space of security attacks and being able to remediate some of those, what do you have planned for the future of the product and the business?
[00:48:06] Unknown:
Yeah. Tons. We are really ambitious, but, you know, some of the obvious ones, I think, are just getting more data sources and more plugins built for Grapl, so we can address more use cases. I would love to see more endpoint instrumentation, like you'd mentioned, audit, which we have a very early proof of concept for, and osquery, and also more services like G Suite and GitHub or Office 365, which are often overlooked as entry points for attacks but can be really important capabilities if an attacker can get control over those. So just expanding the use cases there, and improving our query DSL built on top of Python to express more and more attacks even more effectively. That's always top of mind for us. We never want Grapl to be the limiting factor in the attacks that you're expressing.
And as I mentioned, also getting a managed service up and running, I think, would be great. It would mean people can bring their data to us, and we can do the work of managing Grapl. We can help understand what's going on in our customers' environments more directly, build signatures for them, work with them and collaborate, and just get a better idea of how people are actually using Grapl. And I think that would be really huge for us, just in terms of understanding how we wanna continue to build it. But, you know, we're very early. We're small. We're growing. And we're also still figuring out exactly what everybody's gonna want from it. So I think we've got a great foundation. We've got great capabilities built into it.
But I think it's gonna really grow to be so much more than it is today.
[00:49:53] Unknown:
Are there any other aspects of the Grapl project, or your experience building a business around it, or the overall space of security detection and response, that we didn't discuss that you'd like to cover before we close out the show?
[00:50:04] Unknown:
You know, I think really what a lot of Grapl gets right is that it's really about shifting how we think about approaching this problem. You know, security, especially detection and response, has been approached in much the same way for the last 10 to 20 years: this event based way, this, you know, regex-on-fields way. And Grapl really just was an experiment to say: what if we just threw out the book? What if we said we wanna solve these problems and really rethink every aspect from bottom to top? And I think we've built something really awesome just by taking that approach. And I think, actually, a lot of technologies have done that. They've just thrown out the book of what is typical and what most people are doing.
Docker comes to mind, for example. Leveraging containers has been massive, and that's a very different world from even 10 or 15 years ago, with Vagrant and, you know, all of these other VM based approaches. So I just hope that we see more and more technologies cropping up that aren't constrained by the way things are already done.
[00:51:16] Unknown:
For anybody who wants to get in touch with you or follow along with the work that you're doing or get involved with the Grapl project, I'll have you add your preferred contact information to the show notes. And with that, I'll move us into the picks. And this week, I'm going to choose the Artemis Fowl series of books, and the movie that they made from the first book. They're just a lot of fun, books that I've been reading with my kids and wife, with a lot of interesting story lines and some interesting blending of fantasy and modern technology.
It's an interesting adventure, so definitely worth checking out if you're looking to stay entertained. And so, with that, I'll pass it to you, Colin. Do you have any picks this week?
[00:51:52] Unknown:
I had mentioned the PyO3 library. I've been checking it out, melding Rust and Python together. I think it's one of the most impressive technical projects that I've seen, and really enabling for starting to merge capabilities across programming languages. So if you're interested in those two languages, or in how you could improve the performance of your Python code, I would highly recommend checking out PyO3.
[00:52:19] Unknown:
Well, thank you very much for taking the time today to join me and discuss the work that you've been doing with Grapl. It's definitely a very interesting project, helping to identify security issues in our systems so that we can resolve them more effectively. So definitely a very useful way to spend some time, and I appreciate all the effort you've put into that, and I hope you enjoy the rest of your day.
[00:52:45] Unknown:
Thank you so much, Tobias. I really enjoyed getting to talk with you today.
[00:52:50] Unknown:
Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast, at dataengineeringpodcast.com for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Episode Overview
Interview with Colin O'Brien: Introduction to Grapl
Colin's Journey with Python
Technology Choices at Rapid7
How Grapl Works
Motivation Behind Grapl
Target Audience and Workflow
System Architecture and Design Evolution
Supported Data Sources and Plugins
Portability Across Cloud Providers
Benefits and Challenges of Using Python and Rust
Community Engagement and Contribution
Open Source Governance and Business Model
Detecting Security Compromises
Getting Started with Grapl
Anomaly Detection and Baseline Analysis
Interesting Use Cases and Notable Attacks
Lessons Learned in Building Grapl
When Grapl is the Wrong Choice
Future Plans for Grapl
Final Thoughts and Closing Remarks