👋 Hello! I’m Robert, CPO of Hyperquery and former data scientist + analyst. Welcome to Win With Data, where we talk weekly about maximizing the impact of data. As always, find me on LinkedIn or Twitter — I’m always happy to chat. 🙂
I recently met with a C-suite executive at a Fortune 500 company. He told me point blank: “I’ve never seen an analysis done in Python that was even remotely as valuable as our SQL-only ones.” And as a long-time Python fanboy, I recoiled. I’ve expressed before that analytics should happen in SQL where possible, but never had I considered dropping Python (or R) altogether. We’ve made substantial investments in building our own Python capabilities at Hyperquery, and so it’s difficult for me to objectively dismiss Python. But if I’m truly honest with myself, would diving into a SQL-only mindset have been better for my work as a data scientist / analyst?
My thesis: Python is not unnecessary, but it’s absolutely overused. I still believe Python has its place — there are certain projects that are quite difficult to complete without the power of a full-fledged programming language. But there’s an argument to be made that you might get more impact out of your analytics work without it.
The Python trap
My disclaimer, if you’re looking for the bias: I love Python. It’s often touted as the “second best language at everything”. But reducing it in this way downplays its greatest strength. Python is a DX (developer experience)-first language — the concepts are intuitive, it’s more fault-tolerant than, say, javascript1, and the grammar and syntax make it read like pseudo-code. Complexity is hidden, but almost always available. It’s no surprise that those of us who use it tend to prefer it over SQL. When all you have is a hammer? Nah. Hitting things with the Python hammer just feels great.
If you’re manipulating local data, you can slice and dice it faster with Python than in any other language. Meaning if you’re deep in some over-the-shoulder Minority Report coding, Python is certainly the fastest tool. Having worked this way for years, I’ll admit it’s certainly a rush. I loved impressing others with my ability to do insane data manipulations, build predictive models on the fly, blaze through matplotlib syntax and generate elegant visualizations, etc. etc. Moreover, associating yourself to the Python camp is a quick way to immediately gain some base level of trust. Python. Ooh. Aah.
That said, unfortunately, the “second best language” perception is true — Python’s prioritization of DX inherently means it won’t be the best tool for any job. And analytics is no exception. Yes, technical iteration can be quite fast (and certain things are easier to build). But what you gain in iterative simplicity, you lose in other foundational areas crucial for analytics.
The problem? Well, I found two:
Error-prone. We all know that sinking feeling — when you later find out that your on-the-fly work is wrong. I’ve used Python heavily for the last 10 years, and it’s hard to admit, but there’s a skill ceiling for analytics baked into the nature of the tooling. I could iterate fast, certainly, but my work was always highly error-prone: I made a small error in a particular pandas syntax, I accidentally ran one cell before another, I swapped arguments or didn’t fill in my nulls. Python was made for speed, but its flexibility made it far too easy to make mistakes.2
Overall, it took longer to share. Yes, you could get to an answer faster, but you’d then lose the time you saved (and then some) in making the work reproducible, cleaning up the cruft, checking syntax, and fixing any subversive state changes.
So I started to turn more and more to SQL.
Why SQL is king
I doubled down on SQL during my time at Airbnb. Obviously, I continued building Python libraries and working in Python when I absolutely needed it, but for analytics I began exclusively using SQL. And a few things happened:
I quickly regained my speed, with much greater confidence.
Except for with pivots. Pivots in SQL are hell.I spent less time post-processing work to get it shareable.
It’s unnoticeable if you just don’t save your work in a shareable state, but if you do and want to preserve some modicum of structure around your own work, writing in SQL makes it a lot easier to save work in a usable state than in Python notebooks.More people used my work.
We had a knowledge-sharing mechanism for Jupyter notebooks at Airbnb called knowledge-repo, but sharing raw SQL or raw SQL in a Google doc led to greater usage from both technical and less technical folks alike.
It was easy to brush off these effects as ancillary, circumstantial, but I started to notice my most successful, impactful colleagues followed this pattern as well. Yes, Python-first folks had some clout, but the most technically proficient analysts worked almost exclusively in SQL.
And I’d argue these aren’t just subjective quirks. These are consequences that are fundamental to how SQL differs from Python. By way for comparison, consider the super popular programming language Rust. If you’re not familiar with Rust (or the hype around it), Rust is, in short, amazing because it’s both fast and safe — and before Rust, conventional wisdom suggested it was not possible to get one without sacrificing the other. The creators of Rust accomplished this by introducing opinionated rules around how memory is released as scope changes. This small design choice added friction to doing sloppy things with memory, and as a result, we ended up with a language that didn’t just enable you to do the right things, but pushed you to do things right by default.
SQL is not so different. Its design choices encourage better behavior:
Idempotent, practically speaking. Execution instructions (yes yes, not the data, but you can only go so far) are always idempotent and thus, work is always reproducible.
Comprehensible. The grammar is optimally readable, leading to greater involvement from the folks that you’re doing analytics for. Analytics is deeply collaborative, unlike other technical-first domains like research, machine learning, or engineering, and so comprehensibility is king.
Setup-less. It’s tied to the warehouse, not to your environment. Meaning you don’t have to handle environments, and you can rest assured your query will generally return the same results for you as it will for other folks.
Unlike Python, you don’t need to work to ensure that what you’ve done can be reproduced, understood, or used by anyone else. You get universal usability for free.
Why SQL fails, and the future of analytics
Still, it wouldn’t be fair to kill Python without pointing out that there have always been a few serious shortcomings to bare-metal SQL:
Lack of visualization capabilities.
Inability to DRY.
That said, the world is moving to one in which I’m hopeful these will be addressed shortly. On the visualization front, iterative notebooking tools like Hyperquery generally have rich visualization capabilities (or can jump to Python if you prefer the flexibility of code). With Vega-lite quickly becoming the standard for modern BI tools, you don’t have to worry about getting too accustomed to the interfacial oddities of different GUI visualization editors.
On the DRY end of things, we again are starting to have options. Within most SQL-based vendors, Jinja is becoming more common, allowing users to define macros and functions. For more permanent fixtures, warehouse vendors generally allow you to define re-usable UDFs.
So for now, I’m going to evangelize the SQL-by-default camp, at least until someone topples the decades of momentum that have dug us into this hole3.
==
, anyone?
Not to mention: this sort of reactive data work is not great for analytics as a domain.
Yes, it’s a hole. I’m touting SQL, but honestly, the current state of SQL isn’t optimal — vendor-locked extensions, no standardization, etc. It’s extraordinarily painful, but the benefits still stand.
i love your blog but your font choice is off. Perhaps consider a new one. Thanks for sharing your experiences, regardless.
I enjoyed this read so much!
Thanks man