👋 Hello! I’m Robert, CPO of Hyperquery and former data scientist + analyst. Welcome to Win With Data, where we talk weekly about maximizing the impact of data. As always, find me on LinkedIn or Twitter — I’m always happy to chat. 🙂
And one more quick announcement: we recently launched our Hyperquery 100 program, giving you access to a lifetime single license Team membership for $100. We’ll be closing this soon, so sign up today before space runs out! ❤️
SQL hardly fails, of course, particularly for analytics. Last week, I shared my perspective that analytics work should be done in SQL by default and Python only when necessary. But the ensuing discussions surfaced a common misconception that I want to address: we can’t sweepingly dismiss SQL or Python for every use case.
In this post, I’m going to talk through SQL and Python one more time, help you understand why these languages are both, at once, great and terrible, and hopefully, leave you with a clearer understanding of when and why you might want to choose one or the other.
How SQL fails, and Python wins
In general, SQL’s failings can be summed up quite simply as follows: you can’t adhere to SOLID principles. And as a result, any code you write will hit a spaghetti ceiling. Python, on the other hand, is a full-fledged programming language, and as such, offers rich expressiveness, deep abstractions, type-safety, etc. etc. making SOLID a plug-and-play endeavor.
In particular, here are some key shortcomings of SQL, particularly as it pertains to data work:
💩 Multiple responsibility: everything is entangled together. It’s quite difficult to adhere to the single-responsibility principle because all the steps in SQL are mashed into a single, verbose command. This makes it really hard to build anything composably. Python, on the other hand, makes it easy to adhere to the single-responsibility principle.
👩👩👧👧 No abstraction: there’s a single layer of abstraction. You can’t have classes, can’t re-use code, etc. etc. This makes it extremely hard to DRY or construct reusable elements. Well, you can use vendor-specific UDFs, but…
🔐 Vendor lock: … then you enter the world of vendor-lock. The core of SQL is somewhat constant across different warehouse vendors, but because SQL is so inherently tightly coupled to storage, all the advanced functionality heavily diverges. If you become an expert at writing UDFs in Snowflake, your skills will take some time to transfer if you move to a company that uses BigQuery.
Overall, this means that, for more production-grade data projects (e.g. pipelines, for instance, or building data applications of any sort), Python is more easily extensible, scalable, and maintainable than SQL will be. Production-grade entities built in Python will be more robust. ORMs should be strongly considered when possible. And even for some highly iterative domains like machine learning, you certainly should still work in Python, as there’s a higher ceiling on production-grade hardness (as opposed to deploying with this monstrosity).
How SQL wins, and Python fails
But as I’ve said before, what Python gains in expressiveness and feature completeness, it loses in analytics compatibility. Features always add friction. And for SQL, the natural guardrails are perfectly suited for analytics.
😍 Accessibility: anyone can read SQL. It’s taught in business schools. Reach matters with analytics work — the more decisions you impact, the greater your added business value — and the transparency of SQL makes this much easier.
🤖 Reproducibility: work in SQL is more easily reproducible, particularly if you’re comparing it to work done in Jupyter notebooks. But even ambiguous state of Jupyter cells aside, SQL is environment-less/standalone, so you don’t have to deal with horrendous edge case failures as APIs change (like, seriously, has anyone ever successfully re-run someone else’s Jupyter notebook?). Moreover, the absence of dependencies means that every SQL query is a standalone, self-contained data application, which would otherwise require an enormous amount of overhead to accomplish with Python.
🎸 Speed: SQL is generally faster for getting data than Python. It’s a bare-metal interface against the warehouse, natively built for speed. Yes, there might be some caching and subsequent optimization you can do locally with Python, but if you’re running analyses directly against the data, SQL is going to be faster in most cases.
Overall, SQL has limitations, of course, but for analytics, they’re features, not bugs. More people can read SQL, it’s easier to share and reproduce, and it’s generally faster to use. All of these mean that SQL is excellent for analytics, where reducing friction to usage and sharing are key to maximizing impact.
Hopefully this gives you a better sense of how to choose the appropriate language for different kinds of data work. Python certainly excels in data engineering, engineering, and machine learning applications, while SQL excelsin analytics. Of course, the lines tend to get blurred, but I hope I’ve laid out the tradeoffs clearly enough that you won’t throw either baby out with the bathwater.
And a final, obligatory sales pitch: if you’re looking for an analytics tool that lets you stay SQL-first, but gives you easy access to Python when you need it, check out Hyperquery! ❤️
Thanks for reading Win With Data! Subscribe for free to receive new posts and support my work.
Hyperbolic; maybe BigQuery ML is alright — I haven’t spent much time using it.