Summary
The growth of analytics has accelerated the use of SQL as a first-class language. It has also grown the amount of collaboration involved in writing and maintaining SQL queries. With collaboration comes the inevitable variation in how queries are written, both structurally and stylistically, which can lead to a significant amount of wasted time and energy during code review and employee onboarding. Alan Cruickshank was feeling the pain of this wasted effort first-hand, which led him down the path of creating SQLFluff as a linter and formatter to enforce consistency and find bugs in the SQL code that he and his team were working with. In this episode he shares the story of how SQLFluff evolved from a simple hackathon project to an open source linter that is used across a range of companies and fosters a growing community of users and contributors. He explains how it has grown to support multiple dialects of SQL, as well as integrating with projects like dbt to handle templated queries. This is a great conversation about the long detours that are sometimes necessary to reach your original destination and the powerful impact that good tooling can have on team productivity.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to pythonpodcast.com/census today to get a free 14-day trial.
- Your host as usual is Tobias Macey and today I’m interviewing Alan Cruickshank about SQLFluff, a dialect-flexible and configurable SQL linter
Interview
- Introductions
- How did you get introduced to Python?
- Can you describe what SQLFluff is and the story behind it?
- SQL is one of the oldest programming languages that is still in regular use. Why do you think that there are so few linters for it?
- Who are the target users of SQLFluff and how do those personas influence the design and user experience of the project?
- What are some of the characteristics of SQL and how it is used that contribute to readability/comprehension challenges?
- What are some of the additional difficulties that are introduced by templating in the queries?
- How is SQLFluff implemented?
- How have the goals and design of the project changed since you first began working on it?
- How do you handle support of varying SQL dialects without undue maintenance burdens?
- What are some of the stylistic elements and strategies for making SQL code more maintainable?
- What are some strategies for making queries self-documenting?
- What are some signs that you should document it anyway?
- What are some of the kinds of bugs that you are able to identify with SQLFluff?
- What are some of the resources/references that you relied on for identifying useful linting rules?
- What are some methods for measuring code quality in SQL?
- What are the most interesting, innovative, or unexpected ways that you have seen SQLFluff used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on SQLFluff?
- When is SQLFluff the wrong choice?
- What do you have planned for the future of SQLFluff?
Keep In Touch
- alanmcruickshank on GitHub
- Website
Picks
- Tobias
- Alan
- Lost Connections: Uncovering the Real Causes of Depression – and the Unexpected Solutions by Johann Hari (affiliate link)
- The Wim Hof Method by Wim Hof
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- SQLFluff
- Tails.com
- Hypothesis
- Project Euler
- Flake8
- Black
- dbt
- Snowflake
- BigQuery
- SQL Window Functions
- ANSI SQL
- PostgreSQL
- MS SQL Server
- Oracle DB
- Airflow
- SQL Subquery
- Common Table Expression (CTE)
- The Rise Of The Data Engineer blog post
- The Downfall Of The Data Engineer blog post
- Object-Relational Mapper (ORM)
- Tableau
- Fishtown Analytics SQL Styleguide
- Mozilla SQL Styleguide
- The Zen of Python
- dbt Packages
- yapf
- Set Theory
- Flake8 SQL Plugin
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling, powered by the battle-tested Linode platform, including simple pricing, node balancers, 40-gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey, and today I'm interviewing Alan Cruickshank about SQLFluff, a dialect-flexible and configurable SQL linter. So, Alan, can you start by introducing yourself?
[00:01:06] Unknown:
Hi, I'm Alan. I'm head of data at tails.com, in the UK. We are a tailor-made online subscription pet food business. People come to us, tell us all about their dog and themselves, and we make a blend of food specifically for that dog and ship it to their door. As I said to the maintainer of Hypothesis at a Python conference a while back, we make dog food using Python, and he has a dog, so that was quite nice. I guess I'm also the maintainer, and the person who started SQLFluff about two years ago. Yeah, it's definitely funny the number of things that Python has come to power over the years. Dog food was not one of them I expected, but it leads to some more interesting puns around dogfooding. So I think you could say that you dogfood your work on SQLFluff. Well, yes. Although there was a surprisingly long period where that wasn't true, where I was maintaining a project that we weren't actually using internally. That did feel somewhat awkward for about six months. But, no, we do now use it, and use it quite extensively internally.
[00:02:02] Unknown:
And would you say that you eat your own dog food?
[00:02:05] Unknown:
Yes. Well, we make wet food and dry food for dogs, and let's just say the wet food is more palatable. It just looks like a nice meat stew. From the dry food perspective, it smells great. Some taste good for humans, and some, yeah, let's just say the taste is formulated more for a dog's palate than for a human palate.
[00:02:26] Unknown:
And do you remember how you first got introduced to Python?
[00:02:29] Unknown:
I think so. My path to Python was probably a slightly odd one, in that my first experience with software in general was with C. I started playing around with C in school, and I had a couple of projects there. You know, my background's in engineering, but not software engineering. I'm a mechanical design and manufacturing engineer by background. As part of that course, I was introduced to quite a lot of C programming. I think I came to Python quite late, almost entirely as a scripting language to automate tasks. I think my first experience with Python was actually to automate cleanup of file permissions on a shared drive in one of my first jobs.
And from there, I realized that it was a lot easier than C, and it built out from there. I don't think I would have been the Python engineer I am today without projects like Project Euler as a great place to learn. That's where I first got going with Python, but I think not really in anger until I joined my current role. Joining as the first analytical person in a small start-up, there's not a lot of engineering resource around, and the engineering resource that a small organization has tends to be spent on building the product that it's selling rather than on the analytical side of things. So I relatively quickly had to be at least infrastructure self-sufficient, and that started a ball rolling that ended up with the team of data engineers that we have now. It started with automating and making sure that we've got the infrastructure and tooling we need to do the analytics we need to do, to answer the questions we need to answer.
[00:04:03] Unknown:
Needing to be able to be self-sufficient and build the tools and infrastructure that you need for your analytics, I'm sure, was a natural progression towards creating the SQLFluff package. So I'm wondering if you could just describe a bit about what it is that you've built there and some of the story behind how it came about.
[00:04:17] Unknown:
I'll start with the story, and that explains what it is along the way. So as I was getting into Python, one of the things that some of the people who guided me along the way did was point me towards linters, like Flake8, which I understand has been on the podcast recently, and latterly tools like Black. And they gave a certain level of quality to the kind of software we were putting together, particularly as we had a fledgling team of analysts and data scientists, most of whom didn't come from a software engineering background. Anything we could do to increase the robustness, reliability, and readability of the Python code we had was always a plus, because none of us came from a very fluent software engineering background.
Like any team of analysts, we have a chunk of SQL that we write, and I think even doubly so now that we are very heavy users of dbt, and Snowflake as a data warehouse, we have quite a large SQL code base. And what I realized pretty quickly is that there wasn't an equivalent linter for SQL. And we've got a lot of SQL. And at the time, two or three years ago now, it was not tidy. Some of the team would write it in one way, and it'd be legible by their definition of legible. Some of the rest of the team would write it in another way, which was legible, but by their definition of legible. And some of the team, I'm not gonna call people out live on a podcast, but some of them just, I don't think, had any particular regard for legibility as a whole. It was very much functional in the bad way. Functional as in it works, but it wasn't set up to communicate to other people. And in the early days, when you've got a very small team and people are working in a somewhat siloed way, that's kind of okay.
But I think as we've grown as a team, but also matured as a function, the code we write is not just there to do a job. It is there to communicate to other people what this is for, whether that's another team member or a future version of that person. And if it's unreadable, or at least very difficult to read, that's a drag on the efficiency of the team going forward. So I found that a SQL linter didn't exist. One of the things that we've done at tails.com since relatively early on, actually, is internal hack days, the same way that a lot of businesses have.
And the prototype for SQLFluff came out of one of those hack days, I think in late 2018 if I remember rightly, where I naively thought, you know, how hard can this be? Right? I'm sure I could put together something in a day. And I came out of that day with something that, you know, checked newlines and whitespace and some of the basic stuff. But getting a little deeper, I realized quite rapidly that we couldn't lint SQL without parsing SQL. And that was a whole new challenge. And I think if people look back at the commit history or changelog of SQLFluff as a project, the periods where development has been slow, or progress has been slow, are where the project to get a parser to work stumbled or didn't move.
Personally, if I remember, there was a period of about six months where I was collaborating with one of the earlier collaborators on the project on a particularly hairy bit of, I think, bracket parsing. And I think I just burned out. I just ditched the project for four or five months because I felt like I'd gone in too deep. That's how it came about. What that means we've ended up with is: we have a linter, but really behind that, we have a dialect-nonspecific SQL parser, one that is based around a core dialect of base SQL, but that can be configured for all of the different intricacies of different SQL dialects. Whether that is functionality which doesn't exist in the base dialect, so, you know, if you're in Postgres, adding window functions, or whether it's just as simple as what characters are used for commenting or quoting. If you're comparing MySQL or Postgres or BigQuery, you'll find that different kinds of quoted strings have different meanings. And you can configure that without rewriting an entire dialect, when the base bits of SQL are the same. There's a parser there which understands it, but which will then also generate a parse tree which is non-dialect-specific, and that means we can apply non-dialect-specific linting rules to dialect-specific SQL. Whether that is the kind of things you would expect within a Python context, so whitespace around operators, commas, indentation (well, actually, we'll come back to indentation in a second), capitalization, the basics, through to some of the more SQL-specific things. Unlike Python, SQL doesn't care about whitespace. It's not indentation-sensitive.
And that means that people do some ungodly things with indentation in SQL. And that also includes placement of commas. The dbt guys brought this up when they were on: there's a running joke in the analytics community around trailing and leading commas. I'm sure everyone has strong views. I'm sure in the comments for this, there'll be people waving the flag for one or the other. So that's where we've ended up. Now, the early days of applying that to our database meant, naturally, that there were a lot of files that had some relatively simple errors that were extremely tedious and time-consuming to fix.
And that was the initial impetus behind the fixing code that we've got: just, you know, a hundred files of SQL, all with slightly dubious whitespace, and not wanting to pick through them all. That was the beginnings of the fixing code we've got. And the extent to which SQLFluff is able to automatically fix linting errors is actually much greater than what I'd initially envisaged for the project. I thought it would be a nice bonus feature for some simple errors where it would fix them. Actually, I don't know the number off the top of my head, but more than half, and I wanna say something like two thirds, of all of our linting rules have automated fixing built into them. The density is much higher on the very simple, common rules that come up. And so, within reason, SQLFluff is both a linter but also an auto-formatter, as a labor-saving device for a team.
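As a concrete illustration of the lint-then-fix pattern described above, here is a toy rule in Python. The rule and its behavior are invented for the example; real SQLFluff rules operate on a parse tree rather than raw strings:

```python
import re

# Toy "keyword capitalization" rule: report lowercase keywords and
# offer an automated fix, in the spirit of a linter that doubles as
# an auto-formatter. (Illustrative only, not SQLFluff's internals.)
KEYWORDS = {"select", "from", "where", "group by", "order by"}

def lint_keyword_case(sql: str):
    """Return (violations, fixed_sql), upper-casing known keywords."""
    violations = []
    fixed = sql
    for kw in KEYWORDS:
        pattern = re.compile(rf"\b{kw}\b", re.IGNORECASE)
        for match in pattern.finditer(sql):
            if match.group(0) != kw.upper():
                violations.append(
                    (match.start(), f"Keyword {match.group(0)!r} should be upper case.")
                )
        fixed = pattern.sub(kw.upper(), fixed)
    return violations, fixed
```

For example, `lint_keyword_case("select a from t")` reports two violations and returns the fixed string `"SELECT a FROM t"`.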
[00:10:51] Unknown:
Definitely interesting how it has come to kind of embody the Flake8 and Black use cases for people writing SQL. And it's interesting that you pointed out the trailing versus leading commas debate. I was actually going to ask what your preference is.
[00:11:06] Unknown:
I'm thoroughly on team trailing commas. But even within the comparatively small analytics community at tails.com, there are a selection of people with strong views either way. And it brings up an interesting point, because we joke around people having strong views either way, but I think the thing that SQLFluff, and I guess tools like this in general, enables teams to do is to have productive discussions about this stuff. Like, it really doesn't matter. People are gonna hate me for this. It doesn't matter whether you have trailing commas or whether you have leading commas. What matters is that you all agree, and you all do the same thing.
And we can talk later about whether "all of you" implies your team or your organization or the industry as a whole. At a micro level, as long as you all agree, it doesn't matter. Previously, you'd do a code review with someone, and code review for SQL is always bad in the first place, but part of it is because this stuff is really painful. Like, nobody wants to put together a huge SQL file, and then their first round of feedback in code review is, hey, can you sort out your indentation? It feels really gross for the person getting the feedback. It feels awful for the person giving the feedback, because you're like, I need to give this feedback, because we all understand that style guides are useful.
And in fact, I've realized this quite a bit recently, as an aside on why style guides are useful: the code is not just there to function. It is there as a way of communicating with other people. And if it's not legible and isn't consistent, then it is slow to read. And if it's slow to read, it's slow to understand, and then mistakes happen, and your team is slow. So that's why style guides are useful, and I think everybody broadly kinda gets that. But if your style guide is being applied by a human, the process of a human giving feedback on a style guide is just unpleasant. It's just really not very nice. And inconsistent.
Well, exactly. It's inconsistent not just because humans are humans, though that is true, but also because people are trying to walk this line between how do I give feedback about the style guide, and this code being up to scratch with it, and also give substantive feedback. And occasionally someone will see a really big substantive issue they wanna get to, which is about the logic or the intent of the SQL someone's written, and therefore says that first and doesn't talk about the style guide stuff. Or maybe they talk about the style guide stuff first, because they otherwise say they can't read it well enough to review it, or they do some mixture of the two, and either extreme is bad. And so I guess the intent of what I hope SQLFluff ends up being is that we can automate that whole style guide process, and get to a point where we can have actual productive code review on SQL, in a way that is already consistent, so it's easy to review, and it's easy for whoever it's presented to in the future to review, and they can do that effectively.
And the code base as a whole, for people with a lot of SQL, can be an asset that you're building on, not just a mess.
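The comma debate is also a good example of something a fixer can settle mechanically once a team has agreed. Here is a naive, line-based sketch of converting leading commas to trailing ones (illustrative only; a real formatter would work on a parse tree, not raw lines):

```python
def leading_to_trailing_commas(sql: str) -> str:
    """Naively move a line-leading comma to the end of the previous line."""
    out = []
    for line in sql.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(",") and out:
            out[-1] += ","                              # comma joins the previous line
            indent = line[: len(line) - len(stripped)]
            out.append(indent + stripped[1:].lstrip())  # keep the original indentation
        else:
            out.append(line)
    return "\n".join(out)
```

Run on a leading-comma query like `select\n    a\n    , b\n    , c\nfrom t`, it produces the trailing-comma equivalent.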
[00:14:06] Unknown:
In terms of SQL itself, it's one of the oldest programming languages, and some people might take issue at calling it a programming language, but it is one of the oldest languages that is still in regular use, where, you know, there are things like COBOL and RPG that still have their use case, but they're not widespread. Whereas SQL has gone through different levels of popularity, but it's definitely on the upswing, particularly with the rise of analytics engineering. And I'm wondering what your thoughts are on why there are so few linters available for it. Actually, I did a bit of a search in preparation for this show, and when I just searched for SQL linter, I basically came up with two results, and one of them was yours.
[00:14:47] Unknown:
Yeah, I think there are a few around. I did do a little bit of research around what the state of the art was, and I think what I found was that the tools that did already exist were focused very much on validity of SQL, effectively pre-parsing: is this valid SQL, and therefore will it work? In an effort to catch parsing errors. And some of what they were doing was around readability, but it was very much validity checks rather than formatting checks. In some ways, some people had taken a stab at the SQL linting problem before, but they'd taken one bite of the problem, and we've taken this different tack at the problem. The question as to why it hasn't existed is, I think, an interesting one. I have many views, some of which I think are probably valid and some aren't.
There are two guiding drivers to this. One is the kinds of people and the kinds of organizations that have large SQL code bases and so have the need to do this. And the other set of drivers, I think, comes out of the characteristics of SQL as a language. Let's start with the people side of this. Certainly up until recently, and still probably true now, most people that write significant amounts of SQL probably don't call themselves software engineers. And so while, if you are a Python developer, you probably have the tools to build a Python linter that lints Python, if you are a SQL developer, you can't write a linter in SQL. Although I've seen some people do some ungodly things with SQL, and maybe you can, but I wouldn't recommend trying. Like, it's not the kind of thing where you can write a SQL interpreter in SQL. That's not really within the wheelhouse.
So it necessarily means, I think, that to get off the ground, you need to have a group of people who both are sufficiently fluent in SQL and write enough SQL to care about the problem, and who also have the software engineering skills to get a project off the ground. And I think it's only with the comparatively recent convergence of analytics engineering and data science, where we've started to get more significant groups of people who write significant amounts of SQL and are fluent enough in a programming language to solve it, that the skills have been in the right place for this thing to happen.
Otherwise, there are no people who are intrinsically motivated to do it. So that's the people side of it. SQL as a whole is not very standardized. I mean, I realize ANSI SQL, which is one of the earliest standards of it, does exist (ANSI stands for the American National Standards Institute, so it does say standard in the name), but SQL is almost notoriously nonstandard, for a selection of reasons, I think. One of them is because there isn't really a body that governs it. There's a specification outlined in an academic paper that dates back a long way, I wanna say to the early seventies, when it was originally drafted.
And then it has been used by a selection of different bodies, each of whom has tailored it to what they need. And so, rather than, I guess, contributing upstream in the spirit of some centralised body who governs what SQL is, all of them have just kind of done their own thing with it. Now, there have been major players in that ecosystem that I think some smaller players have copied. I think MySQL and Postgres have been the most notable examples in the open source space, but some of the bigger players, and I guess potentially the more historic players here, have been private enterprises. So Microsoft with SQL Server, Oracle with a load of their offerings, and, I guess, further back you've got IBM and everyone else as well. So each of those organizations has taken their own bite at SQL, done something slightly different with it, and I don't think they have been incentivized to converge on a consistent standard, because most organizations tend to have just one of these tools, and you end up with people who have their careers in using one of these tools and so don't jump around as much.
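One way to picture how a single linter can absorb those low-level dialect differences is a small configuration table, as in this hypothetical sketch (the dialect names are real, but the structure and function are invented for illustration and are not SQLFluff's actual internals):

```python
# Hypothetical dialect configuration: the core grammar is shared,
# and only low-level details such as comment markers and identifier
# quoting vary per dialect.
DIALECTS = {
    "ansi":     {"line_comment": "--", "identifier_quote": '"'},
    "mysql":    {"line_comment": "--", "identifier_quote": "`"},
    "bigquery": {"line_comment": "--", "identifier_quote": "`"},
}

def strip_line_comments(sql: str, dialect: str = "ansi") -> str:
    """Remove line comments using the dialect's configured marker.

    Naive sketch: ignores markers appearing inside quoted strings.
    """
    marker = DIALECTS[dialect]["line_comment"]
    out = []
    for line in sql.splitlines():
        idx = line.find(marker)
        out.append(line if idx == -1 else line[:idx].rstrip())
    return "\n".join(out)
```

The same stripping logic serves every dialect; only the configuration dictionary changes, which is the spirit of configuring a dialect rather than rewriting one.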
The other thing that makes this a little different is the way that people tend to use SQL, and this is a more recent development, around templating. I think the first place that I came across this was with Airflow, but even more so with the dbt community, it's quite common to template your SQL, where the SQL files that are stored aren't themselves valid SQL; they have to be preprocessed before they become valid SQL. In some ways, this was one of the toughest pieces to address in putting together a linter, but it's also one of the things I think has been most important for SQLFluff as a project, and, I guess, one of the things that has built some of the momentum around it: that we have an approach to this which works. The challenge is to make something that is valid SQL. In a very simple case, you know, if people are doing just templating to insert variable names or table names, then you could kinda guess. You could do some kind of heuristic to guess what the result would be if you replaced the tag with something else. But if you're templating or doing loops in larger chunks of SQL, it's much harder to guess what the valid SQL out at the end would be. So the initial approach we took to this was to allow people to configure variables themselves. Effectively, if you're templating using Jinja, which I realize, particularly for the Airflow community, is mostly what people are doing, if you're putting variables in there, give SQLFluff some configuration which allows it to insert some at least comparable variables, and then it will be able to render your SQL and then lint it as normal.
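That render-then-lint approach can be sketched in a few lines. This stdlib regex version merely stands in for the real Jinja templater that SQLFluff uses; the tag syntax mimics Jinja, and the example query and variable names are made up:

```python
import re

def render_template(sql: str, variables: dict) -> str:
    """Substitute configured values for {{ name }} tags, so the
    templated file becomes plain SQL that a linter can parse."""
    def replace(match):
        return str(variables[match.group(1)])
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", replace, sql)
```

For instance, `render_template("SELECT * FROM {{ schema }}.users", {"schema": "analytics"})` yields plain SQL that can then be linted as normal.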
That's worked for people who are using large SQL files in an Airflow context. What makes that doubly difficult with the dbt community is that the extent to which people are templating the SQL is, I think it's safe to say, an order of magnitude more complicated. People are templating much more than just, you know, a field name or a variable name here or there. What we introduced into SQLFluff, I wanna say six months ago now or so, is to actually tie into the dbt compilation engine in the background of SQLFluff, so that it will use dbt's native compiler to compile your SQL, which means that if it's valid for dbt, it's valid for us, and we will lint it appropriately. And I was skeptical at the beginning, because from a performance aspect, it adds a big overhead to integrate with that. And, obviously, we are integrating with dbt, which is also a tool under very active development.
At the same time, the response we've had from the community, around how many people are using it and how many people have found it really helpful, particularly to get coverage of whatever dbt package people are using, and the dbt package ecosystem, has been huge. If I think about the kinds of comments we get on the SQLFluff Slack thread, a lot of the discussion is around how people are using this with dbt. If you think about who is likely to have large SQL code bases and care about linting, it's unsurprising that the analytics teams that are using dbt are also looking at ways to make sure that that SQL code base is legible and usable
[00:22:10] Unknown:
so that it can be an asset they feel comfortable investing in. Yeah, I definitely think that the rise of analytics engineering, and the confluence of that with, you know, the SQL data warehouses, and the fact that a large number of the source systems that are populating these data warehouses are SQL variants in their own right. To your point, a lot of people might have invested in a given database ecosystem, you know, five, ten, twenty years ago, whereas now people have to be polyglot in terms of the databases that they use and the dialects that they're able to work with. So definitely an interesting confluence of events that have led to the need for linting capabilities for SQL as a language.
And to your point too about the readability aspects of SQL, it's definitely very easy to write SQL that does its job, but is absolutely incomprehensible, where it's completely impossible to follow what the intent of the query is. Yes. And I'm wondering if you can just talk through, in terms of your own experience and feedback from the community, some of the characteristics of SQL as a language and the various dialect variants, and some of the issues that contribute to readability and comprehension problems across a team and across large SQL code bases?
[00:23:25] Unknown:
I think, thankfully, most of the readability and comprehension issues are dialect-nonspecific. When we started out, I was worried it was gonna be a much bigger deal than it ended up being. A challenge with the dialect side of it is that a lot of the differences are very low-level, like quoting, or, you know, the kinds of keywords that are allowed in select statements and what kinds of clauses there are. But the higher-level concepts are actually surprisingly similar. When it comes to readability, there are a few issues that people from the Python community will be very familiar with, and I guess we touched on some of them already: indentation, whitespace.
Long lines are the obvious ones, and they are definitely a thing for SQL as well. Going a little deeper, we get into the structure, and how people are not just writing the SQL, but structuring their queries. An example of this, and one that I think we've had a lot of discussions around, is subqueries, and whether to use subqueries or common table expressions. This is effectively, for anyone who's not familiar with SQL: do you define a reusable function or method and then call it later, or do you have a method that defines a method within it? Now, if you are from a Python background, the second option sounds crazy. Why would you do the second option when you could do the first one? In the SQL community, the support for the second is much more widespread than the support for the first.
And if you have a select statement that joins on something that has a select statement within it, but then has a select statement within that, and a select statement within that, it becomes very, very difficult to wrap your head around what is actually going on. And so some of the more detailed linting rules that we started to put in more recently are around the degree to which things are nested. The other thing that I think is potentially more unique to SQL than to Python is around naming things and aliasing. I was having a conversation with someone about this in the last couple of days, actually: it has become relatively common practice in the SQL community to alias heavily. You have a table name which is relatively descriptive, as you would with a method name or a variable name in Python, but then you alias it to a very, very short alias and then use that when referring to it in all of your other columns.
Now, that definitely optimizes for space. It means you have to do less typing. If you alias everything to a single letter table name, that's less typing. However, when you're writing SQL that someone else is going to read, that's actually quite tricky. And if I look back at some of the queries that I wrote a while ago, there are lots of one letter and two letter table aliases, just because I didn't wanna type the extra characters. As hacky SQL to answer an ad hoc problem, fine. As something that we're gonna build on as a reusable piece of SQL in the future, actually, not great.
And, sure, it's gonna add some extra characters to actually type out the table name. But if your table name is so long that it's hard to read, maybe you should make the table name shorter rather than aliasing heavily. We've got a couple of linting rules now around the degree to which aliasing should be explicit and, actually, whether you should be aliasing at all if you don't need to. But those are some of the more interesting ones, I think, for this community.
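To make the subquery-versus-CTE discussion above concrete, here is a small sketch using Python's built-in sqlite3, with made-up table and column names. The two queries return identical results, but the CTE names the reusable piece up front instead of burying it in the FROM clause:

```python
import sqlite3

# Hypothetical table and data, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'alice', 10.0), (2, 'bob', 25.0), (3, 'alice', 5.0);
""")

# Nested-subquery style: the reusable logic is buried inside the FROM clause.
nested = """
    SELECT customer, total
    FROM (SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer)
    WHERE total > 12
"""

# CTE style: the same logic is named up front, then referenced like a table.
cte = """
    WITH customer_totals AS (
        SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer
    )
    SELECT customer, total FROM customer_totals WHERE total > 12
"""

assert sorted(conn.execute(nested).fetchall()) == sorted(conn.execute(cte).fetchall())
print(sorted(conn.execute(cte).fetchall()))  # [('alice', 15.0), ('bob', 25.0)]
```

The results are the same either way; the difference is purely how much mental stack depth the reader needs to follow the query.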
[00:27:01] Unknown:
Going back to your point about some of the templating aspects, and what you mentioned as far as the one-off hacky SQL script that I'm not going to reuse, it kind of goes back to the earlier conversation about why a linting tool hadn't really come about earlier. Most of the time when people were writing SQL, it was a sort of personal endeavor. There might be a database administrator team, but largely it was going to be a solo effort of, I write this pile of SQL scripts, I know how to run them, and I'm the only person who ever has to look at them. Whereas now it's more of a software artifact that other people are building on top of. And SQL has historically had a very big issue with composability and reusability, where rather than taking a fragment of SQL and then reusing it somewhere else, which is what Jinja templating gives us, it was, well, I'm gonna look at this query because it gives me most of what I need, but now I have to retype the whole thing to get the exact structure that I'm trying for.
And also, because of the expense of data storage and database operations, it was unlikely that you would use something like intermediate tables, where you could use fragments of SQL to compose that overall operation together, because you would want to do it entirely wholesale and let the database optimize those operations. It's definitely interesting how the ways that we're using SQL and the styles of SQL have evolved as we use it as more of a reusable artifact, as opposed to something that was just one-off and single purpose.
[00:28:36] Unknown:
Oh, I completely agree. There are a couple of blog posts from Maxime Beauchemin, which I suppose are probably a year or two old now, around the rise and fall of the data engineer. And there's a point he makes in there which I think speaks to that, almost like a coming of age for the analytics community. Right? We have, in the case of SQL, literally decades of writing throwaway SQL and not building upon it, and that has driven a load of the behaviors you just mentioned. Whereas in the last 5 years or so, some of the practices and tooling and expectations that have been commonplace in the software engineering community for years are slowly starting to seep into the analytics community, and in a great way. I think this is an excellent transition, and long overdue.
But this move is around reusability, comprehensibility, building software which you can build on and have as a robust piece of reusable logic. And things like testing. We're talking here about linting, but linting is the little brother to testing, and SQL has historically been pretty badly tested as well. And if it's tested, it's tested against data that isn't real. So I think SQLFluff, for us, was built out of us, the community at tails.com, starting to put in place some of the practices around analytics engineering, which I think are heavily inspired by the dbt community. I don't think it's an understatement to say that without the dbt community, SQLFluff would not exist, because we were beneficiaries of that renaissance of approaches to how we use SQL.
We wouldn't have the same large SQL code base, and we wouldn't have been exposed to those ideas. And I think SQLFluff is part of this wider movement to professionalize analytics engineering as a discipline and make it something that is much more standardized across the industry.
[00:30:32] Unknown:
Yeah. And I think another factor in why linting of SQL code has taken so long to come about is that, for the period of time where software engineers were working heavily with databases and might have wanted to write such a thing, the majority of the time, rather than invest wholeheartedly in SQL, because of some of those composability issues, they would instead either template the SQL in their programming language or rely entirely on something like an ORM to abstract the SQL away: not actually dealing with SQL as a language, but dealing with it natively in whatever programming language they're using, at the expense of the impedance mismatch of the object relational mapper.
[00:31:16] Unknown:
Yeah. And if you're writing transactional software, I think it sidesteps the problem in a really neat way, and in a way that's not a bad thing at all. But in the data engineering community, we're not trying to get one record from a to b with minimum latency. We're trying to get 10,000,000 records from a to b all in one piece, without breaking all of the infrastructure and costing all the money at the same time. And that difference in intent has meant that it's become much more important to write SQL directly, because the performance hit of any ORM you're running on top of it is too great.
[00:31:57] Unknown:
And so now digging into SQLFluff itself, can you talk through some of the ways that it's implemented and the architecture of it as a piece of software, and some of the ways that the goals and design of the project have changed and evolved since you first began working on it?
[00:32:53] Unknown:
The biggest change was the realization that we needed a SQL parser. That happened relatively early on. I mean, I guess the download stats for different SQLFluff versions are public; you can get the PyPI download stats. There are still people using the very early versions of SQLFluff, which don't parse the SQL, because they're extremely performant if you don't care about everything else a parser enables. Now, that number is going down. So that's the first biggest change. The second element that came into this was the templating piece: making sure that we could template SQL, and integrating that in a way that works, led to some quite big architectural changes.
So in and around that, the root of most of the functionality of SQLFluff is in this parser. And the way that we implemented that in a way that it could be configurable was to define what we call a segment as a chunk of code. Segments are used both for parsing and for the eventual parse tree that we've got. Architecturally, the way that it works is that you start with a root segment. In our case, we call it a file segment. It has instructions on how to subdivide itself, and then each of the things that it's subdivided into has instructions on how to subdivide itself.
So a file segment might say, I'm expecting a delimited list of statement segments, and so it will look for the delimiters, take the bits between the delimiters, and say, here's a statement. Cool. A statement segment knows whether to say, yeah, that looks like a statement, I'll take it, or whether to say, no, that's not a statement, go away. But once it accepts it, it will accept it and then subdivide itself into smaller pieces. What that means architecturally is that if there are bits of your SQL that don't make sense, we can still parse the rest, because we're effectively splitting your file up into pieces, and each one then splits up again, and again, and again. For example, if there's a stray keyword, or you try to alias something twice, or you spell AS and put two S's on it, the rest of your query will parse, and we'll be able to lint the rest of it. And you'll just get a parsing error for that section where it says, I was looking for a select clause, and this isn't a select clause, this is something else, so go and fix it. If people browse the src folder of the codebase, most of the logic is around implementing that and being able to generate a syntax tree for the SQL. And one of the discussions we've had a couple of times is whether we needed our own parser or whether we could use an off-the-shelf parsing library.
And the two driving factors that led to us effectively building our own parsing library under the hood were, one, that we were able to give up on individual pieces while still successfully parsing the other bits of the file. The other one was explicit parsing of whitespace. Most parsing libraries will first split your code up into tokens, bin the whitespace, and then let you nicely and safely parse the remainder. That's not very useful if you're building a linter, because you want to keep the whitespace: you want to be able to check whether it's appropriate whitespace. And so we needed a parser which would keep a record of that whitespace and know where it came from and what happened to it. Those are the drivers that meant we did that. Latterly, having done that, it made it much easier to implement some of the templating stuff we did later on, because we already had our own parser.
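As an illustration of the two design drivers described here, the sketch below is a toy, not SQLFluff's actual implementation: it tokenizes without discarding whitespace, isolates a chunk it cannot parse, and still parses the statements around it. Because every character survives tokenization, rejoining the tree reproduces the file exactly, which is what makes whitespace linting possible:

```python
import re

def tokenize(sql):
    # Keep *every* character: whitespace and delimiters become tokens too.
    return [t for t in re.split(r"(\s+|;|,)", sql) if t != ""]

def parse_statement(tokens):
    words = [t.upper() for t in tokens if t.strip()]
    if words[:1] == ["SELECT"]:
        return ("select_statement", tokens)
    return ("unparsable", tokens)   # isolate the bad chunk, keep going

def parse_file(sql):
    tree, current = [], []
    for tok in tokenize(sql):
        if tok == ";":
            tree.append(parse_statement(current))
            tree.append(("delimiter", [tok]))
            current = []
        else:
            current.append(tok)
    if any(t.strip() for t in current):
        tree.append(parse_statement(current))
    return tree

tree = parse_file("SELECT a FROM t;  SELEKT b FROM u;\nSELECT c FROM v")
print([kind for kind, _ in tree])
# Whitespace round-trips exactly: re-joining the tokens reproduces the file.
assert "".join(t for _, toks in tree for t in toks) == \
       "SELECT a FROM t;  SELEKT b FROM u;\nSELECT c FROM v"
```

The misspelled `SELEKT` statement becomes a single "unparsable" node, while the statements before and after it parse normally, which mirrors the behavior described above.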
More recently, there are a couple of architectural changes which I think are interesting. One was around how we store the location of things. In some senses, there are three positions that matter: where was it in the raw file, where was it in the templated file, and where was it in the fixed file after we applied any fixes. We keep track of all three of those positions, partly so that if there's a linting rule about whether two things are supposed to be on the same line, we know whether we can apply it, but also to report back to the user. Because if we apply a lot of fixes to your file and say, hey, you are missing a comma on line 37, and they look at line 37 and there is no line 37, it's not very helpful.
And so for a long time, because of the way we'd supported templating, we would report it in the templated file. If all of your templating didn't change the number of lines, that's great. If it did, if you're doing looping, you would get reports of errors on lines that didn't exist in your SQL file. One of the biggest changes we made was to actually map source files to templated files so that we can reference positions. That isn't supported natively by Jinja. My life would be a lot easier if it was. If the Jinja guys are listening, I'd love to be able to reference positions in source files and templated files, but I suspect it's not gonna be supported by Jinja anytime soon based on their implementation. So we have got a few heuristics in the background to try and match up bits of code to work out where they came from so that we can report back.
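One way to picture the source-to-templated mapping problem is with a toy heuristic. This is just an illustration using difflib from the Python standard library, not SQLFluff's actual algorithm: align the source template and the rendered output line by line, then clamp each rendered line back to the nearest source line so errors can be reported against the file the user actually wrote:

```python
import difflib

# A Jinja-style source template and its rendered output. The loop makes the
# rendered file longer, so rendered line numbers mean nothing to the user.
source = """SELECT
{% for c in cols %}  {{ c }},
{% endfor %}  1
FROM t"""

rendered = """SELECT
  a,
  b,
  c,
  1
FROM t"""

def line_map(src, out):
    """Map each rendered line number back to a source line number (1-indexed)."""
    src_lines, out_lines = src.splitlines(), out.splitlines()
    sm = difflib.SequenceMatcher(None, src_lines, out_lines, autojunk=False)
    mapping = {}
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        for j in range(j1, j2):
            # Rendered lines with no exact twin get clamped into the source span.
            i = min(i1 + (j - j1), max(i1, i2 - 1))
            mapping[j + 1] = i + 1
    return mapping

m = line_map(source, rendered)
print(m)
assert m[1] == 1 and m[6] == 4                     # unchanged lines map exactly
assert all(m[j] in (2, 3) for j in (2, 3, 4, 5))   # loop output maps into the loop body
```

Real templated SQL needs much subtler matching than whole-line alignment, but the shape of the problem, and why plain Jinja gives you no help with it, is the same.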
And that's been really useful for some of the ways that SQLFluff is being used. SQLFluff can be used as a command line tool. That's how it was initially designed to be used, but there are a few other projects that are starting to use it as well. The first was that one of the maintainers of SQLFluff made an online version of it. So you can go to SQLFluff Online and lint your code in a web browser if you like. I can't remember the URL, but there's a link to it on the GitHub page if you want to find it. Secondly, there's also a VS Code plugin. So if you want to set up syntax highlighting and linting within your editor, there is a plugin which uses SQLFluff in the background to do that. We've also had some interesting feature requests on GitHub from people who are starting to use SQLFluff as a parser to get other things out of it. Given we have a SQL parser in the background, a good example of this, which we're using internally: we use Tableau as a BI tool, and so it has some SQL queries in the background.
We extract all of the queries that drive our main dashboards and send them through SQLFluff to extract the table names, and use that for lineage tracking so we can see which tables are powering our dashboards. That doesn't use any of the linting features of SQLFluff, but it allows us to look at a query and extract which tables it is referencing. If anybody listening is interested in using SQLFluff for that, in the git repository there's a folder called examples, and there's some example code in there about how to use the Python API of SQLFluff to extract things like table names and table references.
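For readers curious what table-reference extraction looks like in spirit, here is a deliberately naive stand-in, a couple of regexes over hypothetical table names, for what SQLFluff's real parser does far more robustly via the Python API shown in its examples folder:

```python
import re

def table_references(sql):
    """Naively collect table names appearing after FROM or JOIN keywords.
    A real parser handles subqueries, CTE names, quoting, etc.; this does not."""
    # Strip line comments first so commented-out tables aren't reported.
    sql = re.sub(r"--[^\n]*", "", sql)
    pattern = r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)"
    return sorted(set(re.findall(pattern, sql, flags=re.IGNORECASE)))

query = """
SELECT o.customer, SUM(o.amount)
FROM analytics.orders AS o
JOIN analytics.customers AS c ON c.id = o.customer_id
-- FROM old_orders  (commented out, should not appear)
GROUP BY o.customer
"""
print(table_references(query))  # ['analytics.customers', 'analytics.orders']
```

The lineage-tracking idea described above is exactly this, done properly: parse each dashboard query, walk the tree for table references, and join the results up into a dependency graph.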
There's been a long running discussion about whether we should do type inference and lint around that. I don't think it's gonna happen anytime soon. It's just really complicated. So those are some of the interesting ways that people are starting to use it, effectively as a parser for multiple different SQL dialects.
[00:39:46] Unknown:
Yeah. I think the type inference question is interesting, but would be very complicated without actually establishing a connection to the database to be able to query it for its metadata or integrating with some other metadata management solution to be able to pull that information through.
[00:40:01] Unknown:
Yes. Now that being said, for the way that we use the dbt engine as a compiler in the background, it is already making a connection to a database to do some metadata inference. So it's not inconceivable, but you're totally right. It adds a whole layer of complexity.
[00:40:18] Unknown:
I think it's a good companion library and additional effort that's worth exploring, but not necessarily inside SQLFluff itself, because then you're starting to balloon out beyond its original intent, and it starts to become all of the things, in which case nobody wants to use it because it does too many things. Yeah.
[00:40:37] Unknown:
Yeah. Definitely.
[00:40:38] Unknown:
In terms of the use of SQLFluff, one of the roles of a linter is to preemptively identify bugs before they make it through to production. And I'm wondering, what are some of the kinds of bugs that you're able to identify with SQLFluff, given the nature of SQL and the ways that it's being used, and what are some potential logical errors that are beyond the capabilities of SQLFluff to identify?
[00:41:06] Unknown:
The easy answer to this is exactly what people expect: just parsing issues. Right? People having written SQL that isn't valid is the easiest kind of thing that it catches, which is the obvious answer. However, the other kinds of bugs that SQLFluff has helped us catch have not been the ones I expected. They are surprisingly subtle, where it actually highlights gaps in people's understanding of how SQL actually works. So one of the examples that actually solved a bug for us was the qualification of the union keyword. For anyone who doesn't know SQL, you can take a select statement, which is gonna get some data. You can then say union and then give it another select statement, and it will take the results of them and union the results.
However, on most database backends, union is actually an abbreviation for union distinct, and you can explicitly say union all to say don't filter for distinct. Now, if you don't know that that's the case and you're just thinking union, you'd be like, oh, cool, I'll get all of my results. But if what you actually wanted was a non-distinct union, you need to actually tell it to do that. So one of the linting rules we have in place is that unions have to be qualified. You can't just say union. You have to say union all or union distinct, to be explicit. I mean, this is in the Zen of Python: explicit is better than implicit. And this is an example of one where, actually, of the people that I've spoken to, not everyone is aware that union does that. They just think, I want a union statement, right? It's just gonna do what you think a union would do.
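The union behavior described here is easy to demonstrate with sqlite3 from the Python standard library (SQLite, like most backends, treats a bare union as deduplicating):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Bare UNION is short for UNION DISTINCT on most backends: duplicates vanish.
dedup = conn.execute("SELECT 1 AS n UNION SELECT 1 UNION SELECT 2").fetchall()

# UNION ALL keeps every row, which is often what the author actually wanted.
keep = conn.execute("SELECT 1 AS n UNION ALL SELECT 1 UNION ALL SELECT 2").fetchall()

print(sorted(dedup))  # [(1,), (2,)]
print(sorted(keep))   # [(1,), (1,), (2,)]
```

If the duplicated rows carried revenue figures rather than the literal 1, the bare union would silently drop real data, which is exactly the class of bug the qualification rule catches.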
The other one that is very similar in nature is the use of the distinct keyword, where if you write select distinct, open brackets, a, close brackets, from blah, it seems obvious that you're gonna find the distinct values of a. You can also say distinct, with brackets around a, and then have b, c, d, e after it. And it looks like you're just doing distinct of a, but, actually, the distinct applies to the whole statement regardless of what your brackets do. Distinct isn't a function. It's just a keyword that you can put after your select, even though how you've written it makes it look like you're only intending the distinct to apply to one column.
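This distinct gotcha can also be demonstrated with sqlite3: the parentheses look like a function call on one column, but both rows survive because the deduplication applies to the whole row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (a INTEGER, b TEXT);
    INSERT INTO t VALUES (1, 'x'), (1, 'y');
""")

# The brackets suggest "distinct values of a only", but DISTINCT is a keyword
# that modifies the whole SELECT, not a function applied to one column.
rows = conn.execute("SELECT DISTINCT(a), b FROM t").fetchall()
print(sorted(rows))  # [(1, 'x'), (1, 'y')] -- both rows survive
assert len(rows) == 2
```

Anyone expecting one row per distinct value of `a` gets two rows back, which is precisely the misunderstanding described.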
And linting that out, we've linted out a couple of places where people thought it was only applying to one column. So, like, that's not actually how SQL works. And where SQLFluff is trying to make things explicit rather than implicit, we are rooting out places where people assumed the code they wrote did one thing but actually does
[00:43:33] Unknown:
something else. And in resolving their own linting errors, people have addressed where those things are happening. That points us toward the kinds of linting rules that you have embedded in SQLFluff, and I'm curious what you have used as resources and reference points for identifying useful rules, where they might be problematic, and the types of stylistic elements that are worth calling out versus things that nobody cares about.
[00:44:00] Unknown:
That's a great question. There are two published style guides we've been heavily inspired by. One of them is the Fishtown Analytics style guide, from the team who brought you dbt. They have their internal SQL style guide published on their GitHub organization, and we're heavily inspired by that because it's part of the same community that we've drawn a lot from. The other is another publicly published SQL style guide, which is quite detailed, and we've been inspired by that one as well. The other place that some of the intent has come from is much more principle based. SQLFluff is a Python library, and all of the maintainers are Python developers.
And so we've been inspired by tools like flake8 and Black and the Zen of Python. I feel like I've said it at least twice in this podcast already that explicit is better than implicit. Some of the same principles that flake8 has applied to Python linting, we have applied to SQL, whether or not that was in any of the preexisting style guides. If I look at the roadmap of rules that we'd like to bring in but haven't yet, I think, looking forward, many more of them are inspired by principles than by matching a rule from a published style guide.
Now, a comparison with Black is quite interesting. SQLFluff is opinionated. We have defaults about how we think your SQL should be formatted, but it is also still configurable. I think there is probably a space within the SQL community for something more opinionated than SQLFluff, or maybe an opinionated flavor of SQLFluff that doesn't let you customize. But I don't think the SQL community as a whole is yet at a level of maturity where that would be accepted. You've got so many organizations that have their own style guides, all different, that, at least personally, I feel it's much more important that those organizations lint their SQL according to some consistent set of rules than that they lint it according to any industry-wide consistent set.
However, I do think that is a future that, if it isn't inevitable, is at least desirable: that we, as an industry and as a profession, start to converge on a consistent set of linting rules for how good SQL is written. And I think that will open up a world for a much more opinionated view, more like what Black has done for Python, in the SQL world. That has really interesting implications for the wider SQL community. Right? The open source efforts within Python have been great to some extent because there is a relatively consistent way of writing Python. Sharing and collaboration on SQL code has never really been a thing.
I mean, the closest we're getting to it at the moment is with dbt packages, but SQL in some ways is quite often the reflection of your business logic, and so lots of organizations won't want to share it. But, you know, if you've worked out how to model the data coming out of a common platform that everyone else is using, that seems like something that should be eminently shareable. If you come across SQL code that's written in a way that doesn't fit with your organization's style guide, or is difficult to read, then you're less likely to adopt it. And I think if we as a community can start settling on a consistent set of rules, that opens the door to much wider collaboration across the whole industry.
[00:47:35] Unknown:
Yeah. Even within Python, Black has some areas of contention. It has become more widely adopted, but there are definitely still holdouts that say, you know, I will format my Python the way that I want it to be formatted. And so there are tools such as YAPF, which is a formatter and will apply consistent formatting but is very configurable, almost too configurable. I guess in that regard, SQLFluff is currently more akin to YAPF. But I do agree that just giving up control over what the visual structure looks like, as long as the logical structure is appropriate, requires a lot of maturity on the part of the developers who are using it. It is also a very beneficial choice, because it just eliminates
[00:48:19] Unknown:
that from being an area of consideration that you have to even think about as you're writing your code. And I think you're totally right about where SQLFluff is at right now. Right? We're not intending to be dogmatic about this and not allow people to format their SQL the way they want. In some ways, and this might just be because our documentation is patchy in some regards, there are a lot more ways to configure SQLFluff than a lot of the community use. It's a lot more configurable than several people realize. However, I think the intent of what Black have tried to achieve, where you shouldn't have to think about formatting.
Right? I think that's quite a powerful statement and, in some ways, a very valuable ideal to strive for. Effort that you put into having views about how you format your code differently from other people is effort that you're not spending elsewhere. And, ultimately, and this is sounding weird coming from a maintainer of a linter, formatting doesn't matter. I would love to be in a world where formatting was transparent, where people didn't see the formatting. They saw the logic. I'd much rather people who look at SQL code that I write see the intent and the logic that is embodied within it rather than whether the commas are at the beginning of the line or at the end of the line. Because if they're seeing the commas, they're not seeing the logic, and that's what I want to communicate. That's what I want them to see.
[00:49:41] Unknown:
That brings me back to a question that I didn't ask yet: in certain aspects of programming, the intent is that the code is self-documenting, that the structure and the sequencing of the code imparts the intent. But there are still points where you do have to fall back to comments and documentation. And I'm wondering what your thoughts are on how that plays out in the SQL community and in the SQL code that you work with. Where you try to convey the purpose of the query in the structure and the content of the query, what are the points where you hit an edge case and say, I have to add a comment here in order to make it clear what I'm trying to do, because the SQL itself isn't able to convey that information?
[00:50:26] Unknown:
It's an excellent question, and one that I can't claim to answer fully. The easy cases are some of the ones we've talked about already: things like aliasing, variable names, table names, naming stuff. And that's common for any programming language. Naming stuff is hard, and SQL is no different. Putting time and effort into what you call things, and making those names useful, makes a huge difference. Some of the biggest changes in readability I've seen in our code base have just been around naming. Beyond that, there is a gray area here depending on what kind of SQL dialect you have available to you. I'm assuming most people who have large SQL code bases are either gonna be in one of the more modern data warehouses, particularly if you're listening to this podcast, or they'll be in one of the big open source Postgres, MySQL kind of databases.
And in most of them, there are multiple ways to achieve the same thing. Some of those ways have performance implications, and some have readability implications. In most cases, and I might be wrong here, I think the way that is most explanatory of what you're trying to do also happens to be the most performant way. Now, it may not be the way that comes to mind first. If I think about some of the code reviews that we do at tails.com, we'll often be like, ah, okay, I can see what you're trying to achieve with this SQL, but have you considered doing it like this? Which is more concise, more readable, and more performant.
And it may just be a way of writing SQL that someone has not come across before. That's not always true, though. There are sometimes more performant things you can do that would actually make it harder to understand. And the intent of SQL, and I'm not gonna get the terminology right here, is that it's in the category of programming languages which is most abstracted: it's completely nonprocedural, effectively a formalization of set theory, and so it shouldn't be tied to the implementation at all. However, most SQL query planners aren't that smart, and the way that you write SQL will be used as a hint by most interpreters as to how the database engine in the background should execute what you've asked it to do. So even if two things are logically equivalent, they may have different performance implications. I think there are still a lot of cases where commenting is necessary in SQL.
And if I think about the kind of SQL we work with on a day to day basis, I think it would be relatively unlikely for any nontrivial SQL query to not have any comments in it. At the very least, it probably has a little comment at the top, almost like a little Python docstring, saying what this file is for. And almost certainly, any nontrivial calculation will have a comment next to it explaining the intent of the calculation if it's not immediately obvious from the code.
[00:53:19] Unknown:
Yeah. And to your point about the structure of the SQL having performance implications where the most clear way to represent it is not necessarily the most performant, what comes to mind is, you know, a case that I ran across where there was a Postgres query that was using a window function which acts as a kind of optimization barrier in terms of the way that the Postgres engine analyzes and executes the query. And so it was running more slowly than it had to because of the fact that it was structured more logically and intuitively.
[00:53:53] Unknown:
Yeah. Those examples do come up. Right? They exist. What I have noticed a little bit is that, as humans, we often use concise as a heuristic for performant, that a short query is necessarily more performant and more readable than a long query. Over time, I've trusted that heuristic less and less: optimizing for conciseness is not always a good aim in SQL, both for performance and certainly for readability. If I think about the rules that we've introduced in SQLFluff that have forced me to change my original style of writing SQL the most, they've been about spacing, you know, adding new lines, breaking things out into common table expressions, adding more space to make it more legible and more clear. In the past, I would have said that was unnecessary because it makes the query longer, but, actually, it does make it more readable and has no performance impact.
Or things like aliasing, or some of the ways people group. So if you've got group by 1, 2, 3, 4, 5, 6, 7, 8, 9 because, you know, you just want the granularity of that view to be the first nine columns. But, actually, that's not really the granularity. The granularity is the first two columns, and you should actually have some kind of simple aggregation on the others. That's an example of where, yeah, it's very concise to say group by 1, 2, 3, 4, 5, 6, 7, 8, 9, but it's actually not very performant, and it's also not very legible.
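The group-by granularity point can be sketched with sqlite3 and a hypothetical sales table. The column names here are invented for illustration, and the transcript's group by 1, 2, 3 shorthand, which refers to columns by position, is written out as names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, product TEXT, rep TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('north', 'kibble', 'ann', 10), ('north', 'kibble', 'bob', 20),
        ('south', 'treats', 'ann', 5);
""")

# Grouping by every selected column "works", but hides the real granularity:
# one row per rep, even though reps are incidental to the view.
wide = conn.execute("""
    SELECT region, product, rep, SUM(amount)
    FROM sales GROUP BY region, product, rep
""").fetchall()

# The true granularity is (region, product); everything else gets aggregated.
narrow = conn.execute("""
    SELECT region, product, COUNT(DISTINCT rep), SUM(amount)
    FROM sales GROUP BY region, product
""").fetchall()

print(len(wide))       # 3 rows: one per rep
print(sorted(narrow))  # [('north', 'kibble', 2, 30.0), ('south', 'treats', 1, 5.0)]
```

The second query states what the view actually means: one row per region and product, with the rep dimension deliberately summarized rather than accidentally included in the grain.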
[00:55:20] Unknown:
And in terms of your experience of using SQL fluff and interacting with the community and seeing of the ways that it's being used, what are some of the most interesting or innovative or unexpected ways that you've seen the project used?
[00:55:33] Unknown:
I think the most interesting use cases have been the ones where people are starting to use SQLFluff as a library, where they're starting to use the parsing engine of SQLFluff to achieve other goals, whether that's the table name extraction we talked about a little bit earlier or something else. There have been a couple of feature requests on the project which have come from users who were evidently using it as a parser for something. I'm not entirely sure what they're using it as a parser for, but they're asking questions that would only come from someone who is using it as a parsing engine. And that's fine. Right? It explicitly is a big chunk of what the SQLFluff codebase is for, and people find it useful as a parser, specifically as a dialect-agnostic parser. Well, it will parse multiple dialects provided you tell it what dialect to expect; it doesn't do auto detection yet.
I think if you were trying to write a database backend and use the SQLFluff parser as your main query evaluator, it would be quite slow, if nothing else because it's written in Python. But as a way of introspecting SQL code and then doing interesting things with it other than linting, those are some of the most interesting use cases I've seen, and that's the sphere where I expect exciting things will happen with it in future as well. And in terms of your own experience of building the project
[00:56:57] Unknown:
and growing its use and using it for your own work, but also growing the community around it, what are some of the most interesting or challenging lessons that you've learned in the process?
[00:57:06] Unknown:
One of the reasons that I initially wanted to start an open source project in the first place, not just it being a SQL linter, but just a project in general, was actually to improve my Python, because in my day to day role, and increasingly so these days, I don't write a lot of Python, and I think it's an important skill for people working in analytics and data. I wanted a way of keeping the knife sharp, so to speak, which feels weird now as the maintainer of a relatively well used Python open source project when I don't actually consider myself a very good Python developer. I think I've learned a lot about how to structure Python projects and about how to write good, legible, performant Python as part of going through that process, and I suspect the things I've learned are not unique to me, but I've learned a lot about that. Some of the unexpected and, I guess, challenging lessons have been around the community side, and there is at least now some quite good literature out there around how to be a good maintainer and how to be a good project runner.
There's a lot of stuff to think about, and no one tells you when the right time to do it is. When is the right time to have a, you know, place for the community to chat? When is the right time to have a code of conduct? When is the right time to, you know, put in place some basic governance about how we do release cycles, and when do you get rid of that? I don't think anything from my software development background prepared me for that at all. I think I'm lucky in this regard that my day job is running a team, and maintaining an open source project is also about running a team: making sure that there is a culture that you role model, and that there are boundaries which are clear enough to give people focus but wide enough to allow people to be creative within them. I think those are the same lessons that you would learn running any team, but translating them to an open source project has felt very different. Like, while the theory has felt familiar, the practice of it has felt very different, and the hardest bit of that, I think, has been timing: when to let things run and when to apply structure. And the thing that has surprised me on that, it may be interesting to see if any of the other maintainers are listening, because I don't know if I've actually talked about this, but I've noticed of late, I think we've got a core maintainer group of about 5 or 6 people, and particularly over the last couple of months, everyone's actually settled into a slightly different role. Each person plays a valuable part in that community, and that's while no one has probably said, hey, my role is x, y, and z, and someone else said, yeah, my role is a, b, and c.
I think there's an appreciation, at least within the maintainer group, of what everyone's role is within that, and of the extent to which their contributions to the project can be useful and valuable and something that everyone else appreciates and welcomes. As someone starting a project like this, I don't think you can force that on people. All you can do is create an environment, create a culture, that people want to be part of, an environment where people feel able to forge their own path, and then let them do it, which is somewhat scary, but, actually, it does work. It's probably been one of the most satisfying parts of running the project: just seeing the community grow and seeing people take ownership of it themselves has been pretty exciting. On top of that, one of the other maintainers wrote a great blog post about SQLFluff, which was on Towards Data Science the other day, and our GitHub stars graph basically just goes straight up.
I think we doubled the number of stars that we had within a week. I actually have the graph open here: it went from about 600 stars less than a month ago to 1,400 stars as of today. So that was quite exciting, and all it took was a little bit of publicity and a great blog post that someone wrote. It'll be interesting to see what the graph looks like once this goes live too. Yeah. No pressure. Right? I'll have to start labeling the graph: SQLFluff on Podcast.__init__, SQLFluff on Towards Data Science.
[01:01:19] Unknown:
Just like your metrics when you do releases: mark it, this is when this release went live. Yeah. Right. This is when system load started increasing.
[01:01:30] Unknown:
Yeah. Yeah. Definitely.
[01:01:32] Unknown:
And so for people who are considering SQLFluff and they're working with a SQL code base, what are the cases where SQLFluff is the wrong choice and they might be better suited with some other set of tooling or a different approach? If you're in a primarily
[01:01:45] Unknown:
transactional world, either because you're using an ORM or because most of your SQL is effectively strings within Python code. Partly SQLFluff is the wrong choice there because it won't be able to extract the strings from your Python code, but also because there are better options. I think Flake8 has a SQL plugin which will lint any strings that look like SQL, and I think some of the other SQL linting projects are actually slightly better suited to what I'd call transactional SQL. If you were looking for a parsing engine and speed and performance are very important to you, the parsing engine of SQLFluff is not mature enough for that use case, but it's also just not designed to be lightning fast. Right? It's designed to be flexible, not to be really quick. And so if you were trying to find a fast parsing engine, SQLFluff's parsing engine would be the wrong choice for that. Where it definitely is the right choice, I think, is probably the other end of the spectrum. If you've got a large SQL code base, with lots of files, each of which might have quite significant amounts of logic within them, SQLFluff is a great choice, and this is going to be worth it. Particularly if it's not templated, SQLFluff is actually relatively performant.
Even if it is templated, it's still relatively performant. And if you're using dbt, I'd highly recommend people use the dbt templater within SQLFluff; it makes your life a lot easier. It does add a performance hit, because we have to use the dbt engine to compile the code, but it does make your life a lot easier. I think the zone in the middle is if you have kind of medium sized SQL code bases, or particularly strange ones. There are a couple of use cases I've seen where people try to use SQLFluff to lint machine generated SQL, which can either be very poorly laid out in the first place or come in very large files, so thousands and thousands of lines within a SQL file.
And while we've done a lot of work on performance within SQLFluff, and I think there's some work we're doing right now to continue to improve performance, it will still take a significant amount of time to lint very large and very complicated files. If you've got a couple of hundred lines, you know, 10 to 20 CTEs, which from an analytical point of view might be a relatively meaty query, that's fine. It's only if you're at the very, very, like, gross end, either SQL queries that people have collaborated on over months to make some huge monstrosity, or machine generated SQL, that it might not be the right choice initially, certainly for the machine generated SQL use case. And for the human written, large, curated SQL file, maybe you should split it up into smaller queries, and then SQLFluff would be a great choice.
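As a concrete companion to the dbt templater advice above: SQLFluff is configured through an ini-style `.sqlfluff` file, and its documentation describes a `[sqlfluff:templater:dbt]` section for pointing at a dbt project. A minimal sketch follows; the Snowflake dialect and the paths are placeholder assumptions, not values from this conversation:

```ini
[sqlfluff]
templater = dbt
# Dialect is an assumption here; set it to whatever your warehouse speaks.
dialect = snowflake

[sqlfluff:templater:dbt]
# Paths are assumptions; point these at your own dbt project and profiles.
project_dir = ./
profiles_dir = ~/.dbt
```

With this in place, `sqlfluff lint` compiles models through the dbt engine before linting, which is the performance hit (and the convenience) described above.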
[01:04:34] Unknown:
So don't use it to lint my database dump that's 100 gigabytes. Yeah. Don't do that.
[01:04:39] Unknown:
I mean, in theory, it should work. Like any linting tool, I think it's worth keeping in mind what the intent of a linting tool is. A linting tool is much less about its technical effect, and much more about the behaviors it instills in your team. And so it works best in an environment where you have a team of people collaborating on a SQL code base, which therefore needs to be consistent so people can collaborate effectively on it. If you don't have a lot of collaboration on your SQL code base, consider what function the linter is playing. I'm not saying it's bad, but where it will have the biggest impact is where it influences a team's behaviors and their approach to writing SQL, rather than just the SQL code itself.
[01:05:17] Unknown:
And as you continue to work on and use SQLFluff, what are some of the things that you have planned for the near to medium term, or any particular areas of help that you're looking for?
[01:05:26] Unknown:
One of the big things we're trying to do right now, actually, the code I have open on one of my other screens, is threading. The work of linting large code bases of SQL is eminently parallelizable, and particularly given the way that SQLFluff is set up, it should be possible to lint in parallel to improve performance. There's a prototype already live as a pre-release if people want to try it. If you're not using the dbt templater, it's great. If you are using the dbt templater, it's really slow, and that's one of the things I'm working on at the moment, so that will come out in a release relatively soon. Performance is one of the focus areas at the moment, and I think for the foreseeable future it will continue to be a focus area for the project.
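The "eminently parallelizable" point above can be sketched generically with the standard library. The `check_sql` function here is a made-up stand-in for a real lint step, not SQLFluff's API; the only claim being illustrated is that each file or statement can be checked independently and therefore concurrently:

```python
from concurrent.futures import ThreadPoolExecutor

def check_sql(name, sql):
    """Stand-in lint step: count lowercase 'select' keywords.

    A real linter would parse and evaluate rules; the point here is
    only that each source can be checked independently of the others.
    """
    return name, sql.count("select")

def lint_many(sources):
    """Check many named SQL sources concurrently.

    `sources` maps file names to SQL text. Because no check depends
    on any other, the work fans out trivially across a pool.
    """
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda item: check_sql(*item), sources.items())
        return dict(results)

sources = {
    "a.sql": "select 1;\nSELECT 2;",
    "b.sql": "SELECT 3;",
}
print(lint_many(sources))  # {'a.sql': 1, 'b.sql': 0}
```

Real-world complications, such as a templater that needs an expensive shared compilation step (the dbt case described above), are exactly what makes parallelizing an actual linter harder than this sketch suggests.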
Thinking outside of that, we get a steady stream of syntax requests. Now, for dialects of SQL that are very well used with SQLFluff, so I'm thinking particularly Snowflake and BigQuery, where I know we have quite big user bases, we've got relatively good coverage across the whole of those dialects. For less common dialects, the coverage is less good. So we get a steady stream of requests, and, actually, that's a great way for people to help the project: just report those gaps when you find them. Also, working with the parser is a little bit of a learning jump, but it's relatively easy to get started, at least for small contributions. So it's a great way for people to contribute to the project, and we're always looking for help with that.
The other thing that we get asked about, though these requests have slowed down a little bit because I think we've now got reasonably good coverage, is additional rules: things to look for and things to lint. There's a reasonable set already, and also a good set of things that we'd like to implement but haven't yet. That's a great way of getting involved. The last way, I think, that people can help is to get involved in the community, whether that's just being active on our Slack threads or helping new users. We have a steady stream of people who will come in and post in the help channel: hey, I'm getting started,
either using it in a CI pipeline or as a commit hook, and they're just looking for help implementing it. That's a great way to help the community, and particularly if you're already using SQLFluff, it's a really nice way to help other people and spread the word. And on the subject of spreading the word, I mean, I'm on a podcast talking about it now. Write articles about it. Tell your friends. Help us make the docs better. I'm conscious that this is the same advice that I think a lot of open source project maintainers will give. Like, you know, you don't have to be in there changing the fundamentals of how the project works to be a really, really valuable contributor, and I would echo that. Be in the community, help new users, spread the word, raise issues when you find them, with enough context to help other people debug them, and help improve the documentation.
Because those are the things that can always be better and sometimes get forgotten. So those are really great ways to help the community.
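For the CI-pipeline and commit-hook usage mentioned above, one common wiring is pre-commit; SQLFluff's documentation describes hooks along these lines. The `rev` below is a deliberate placeholder, not a real tag from this conversation:

```yaml
repos:
  - repo: https://github.com/sqlfluff/sqlfluff
    rev: ""  # placeholder: pin to a released SQLFluff tag
    hooks:
      - id: sqlfluff-lint
      # - id: sqlfluff-fix  # optionally auto-fix staged SQL on commit
```

With this in a repo's `.pre-commit-config.yaml`, staged SQL files get linted before each commit, which is exactly the behavior-shaping use of a linter discussed earlier.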
[01:08:14] Unknown:
Are there any other aspects of your work on the SQLFluff project or the overall problem space that it's addressing that we didn't discuss yet that you'd like to cover before we close out the show? We kind of covered it, but I think I'll just plug it again. We talked about linting and what that means for
[01:08:30] Unknown:
legibility and comprehensibility within an organization, but I think there's a shift going on. The data analytics engineering, or data engineering, community is maturing to be able to use and reuse SQL more broadly. Now, we talked about what that means for collaboration and sourcing of SQL code, but I think there's an aspect of it which applies to people and their skills, and what that means for training, onboarding, and development. I think there is a world that's not too far away where, when someone joins an organization and learns SQL for the first time from an organization that's using a tool like SQLFluff, they won't even be aware that there is another way of formatting SQL code. They will join the community not seeing layout. They will see logic.
And I think their own productivity, their ability to work effectively with SQL, and their ability to write SQL which other people can use will all benefit from that. The implications of that for the industry, I believe, are quite big, both in the ability for people to move from one organization to another and still be productive from day one without needing to relearn how to write SQL in that organization's way, and in what that means for assessments, you know, during a recruiting process. How do you assess someone's SQL? If, at the moment, the barrier to that is that they can't read a piece of SQL because they're used to a different style guide... You don't want to test someone's familiarity with a different style guide during an assessment process. You want to find out whether they can actually write meaningful SQL, and I think it opens the doors to a lot better collaboration across the industry if we all write consistent SQL.
That would just make my life a whole lot easier, even outside my role as a maintainer of SQLFluff, you know, as a data leader trying to, well, in my case, change the world of pet food for good through data, and I think the same goes for anyone else trying to achieve anything else. But linters are useless unless they change behaviors, and data analytics teams are useless unless they are changing the behaviors of the people that they work with. So I hope, in some small way, SQLFluff can make other teams more effective in changing what people do. It's a small piece of the puzzle of everyone trying to make the world a slightly better place.
[01:10:40] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. This week, I'm going to choose a show that I watched through recently and am waiting for the next chunk of episodes to come out: The Nevers. It's just a very interesting show that has sort of a combination of Victorian steampunk and science fiction woven together, so it's been fun to watch. I definitely recommend it if you're looking for something to watch, although I will say it is not really kid friendly, so just the disclaimer there. And with that, I'll pass it to you, Alan. Do you have any picks this week? So I just got back from a weekend surfing in Cornwall in the UK.
[01:11:22] Unknown:
It's May, but it's still cold out there. And some of what I've been thinking about is the feeling of zen you get coming back from some time in nature, particularly in the cold. And so there are 2 books that I would recommend. One is called Lost Connections, about how mental health is linked to people's connections to family, friends, and nature, among other things. The other, which is slightly more off the wall, and people may have already come across it, is The Wim Hof Method, by the Dutch "Iceman" who advocates cold exposure as a way of keeping yourself healthy physically, but also mentally.
And so I think, for anyone who is interested in the subject of mental health, I would thoroughly recommend at least reading one of those two;
[01:12:05] Unknown:
the implications of nature and cold exposure on mental health are what I've been delving into over the last few days. Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing on SQLFluff. It's definitely a very interesting project and one that I'm excited to see grow in terms of its usage and capabilities. So I appreciate all of the time and effort that you and the team have put into it, and I hope you enjoy the rest of your day. Thanks a lot. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Alan Cruickshank's Background
The Genesis of SQLFluff
SQLFluff's Functionality and Use Cases
SQL Readability and Comprehension
Implementation and Architecture of SQLFluff
Linting Rules and Style Guides
Challenges and Lessons Learned
When Not to Use SQLFluff
Future Plans and Community Involvement
Final Thoughts and Closing