Hello! I'm Robert, CPO of Hyperquery and former data scientist + analyst. Welcome to Win With Data, where we talk weekly about maximizing the impact of data. As always, find me on LinkedIn or Twitter; I'm always happy to chat.
I recently met with a C-suite executive at a Fortune 500 company. He told me point blank: "I've never seen an analysis done in Python that was even remotely as valuable as our SQL-only ones." And as a long-time Python fanboy, I recoiled. I've expressed before that analytics should happen in SQL where possible, but never had I considered dropping Python (or R) altogether. We've made substantial investments in building our own Python capabilities at Hyperquery, and so it's difficult for me to objectively dismiss Python. But if I'm truly honest with myself, would diving into a SQL-only mindset have been better for my work as a data scientist / analyst?
My thesis: Python is not unnecessary, but it's absolutely overused. I still believe Python has its place: there are certain projects that are quite difficult to complete without the power of a full-fledged programming language. But there's an argument to be made that you might get more impact out of your analytics work without it.
The Python trap
My disclaimer, if you're looking for the bias: I love Python. It's often touted as the "second-best language at everything". But reducing it in this way downplays its greatest strength. Python is a DX (developer experience)-first language: the concepts are intuitive, it's more fault-tolerant than, say, javascript[1], and the grammar and syntax make it read like pseudo-code. Complexity is hidden, but almost always available. It's no surprise that those of us who use it tend to prefer it over SQL. When all you have is a hammer? Nah. Hitting things with the Python hammer just feels great.
If you're manipulating local data, you can slice and dice it faster in Python than in any other language. Meaning if you're deep in some over-the-shoulder Minority Report coding, Python is certainly the fastest tool. Having worked this way for years, I'll admit it's certainly a rush. I loved impressing others with my ability to do insane data manipulations, build predictive models on the fly, blaze through matplotlib syntax to generate elegant visualizations, and so on. Moreover, associating yourself with the Python camp is a quick way to immediately gain some base level of trust. Python. Ooh. Aah.
That said, unfortunately, the "second-best language" perception is true: Python's prioritization of DX inherently means it won't be the best tool for any particular job. And analytics is no exception. Yes, technical iteration can be quite fast (and certain things are easier to build). But what you gain in iterative simplicity, you lose in other foundational areas crucial for analytics.
The problem? Well, I found two:
Error-prone. We all know that sinking feeling: finding out later that your on-the-fly work was wrong. I've used Python heavily for the last 10 years, and it's hard to admit, but there's a skill ceiling for analytics baked into the nature of the tooling. I could iterate fast, certainly, but my work was always highly error-prone: I'd botch some bit of pandas syntax, accidentally run one cell before another, swap arguments, or forget to fill in my nulls. Python was made for speed, but its flexibility made it far too easy to make mistakes.[2] (There's a sketch of the kind of silent failure I mean just below.)
Overall, it took longer to share. Yes, you could get to an answer faster, but you'd then lose the time you saved (and then some) in making the work reproducible, cleaning up the cruft, checking syntax, and fixing any insidious state changes.
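To make this concrete, here's a minimal sketch (toy data, nothing from a real project) of the kind of silent failure I'm talking about: pandas will happily drop null group keys and fan rows out on a non-unique join, with no warning either way.

```python
# Two classic silent pandas failures, on hypothetical toy data.
import pandas as pd

orders = pd.DataFrame({
    "region": ["US", "EU", None, "US"],   # one region is missing
    "revenue": [100, 200, 300, 400],      # true total: 1000
})

# Pitfall 1: groupby drops the null key by default, so 300 of revenue
# quietly vanishes from the rollup unless you remember dropna=False.
by_region = orders.groupby("region")["revenue"].sum()
print(by_region.sum())   # 700, not 1000

# Pitfall 2: joining on a non-unique key duplicates rows, inflating any
# downstream aggregate (and the null-region row is silently dropped too).
managers = pd.DataFrame({"region": ["US", "US", "EU"], "manager": ["a", "b", "c"]})
joined = orders.merge(managers, on="region")   # each US order now appears twice
print(joined["revenue"].sum())                 # 1200, not 1000
```

Neither of these raises an error; you only find out when someone asks why the numbers moved.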
So I started to turn more and more to SQL.
Why SQL is king
I doubled down on SQL during my time at Airbnb. Obviously, I continued building Python libraries and working in Python when I absolutely needed it, but for analytics I began exclusively using SQL. And a few things happened:
I quickly regained my speed, with much greater confidence. (Except for pivots. Pivots in SQL are hell; there's a sketch of what I mean at the end of this list.)
I spent less time post-processing work to get it shareable. It's easy to miss if you just don't save your work in a shareable state, but if you do, and want to preserve some modicum of structure around your own work, SQL makes it far easier to leave things in a usable state than Python notebooks do.
More people used my work. We had a knowledge-sharing mechanism for Jupyter notebooks at Airbnb called knowledge-repo, but sharing raw SQL, or raw SQL in a Google doc, led to greater usage from technical and less technical folks alike.
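On the pivot point, here's a minimal sketch of the asymmetry (toy data, with duckdb standing in for a warehouse; it queries the local DataFrame by name). The pandas pivot is a one-liner, while the SQL version needs a hand-written CASE WHEN per output column.

```python
# Pivoting quarterly revenue: pandas one-liner vs. SQL conditional aggregation.
import pandas as pd
import duckdb

sales = pd.DataFrame({
    "region":  ["US", "US", "EU", "EU"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100, 150, 80, 120],
})

# pandas: one line.
print(sales.pivot_table(index="region", columns="quarter", values="revenue"))

# SQL: every pivoted column is spelled out by hand.
print(duckdb.sql("""
    SELECT
        region,
        SUM(CASE WHEN quarter = 'Q1' THEN revenue END) AS q1_revenue,
        SUM(CASE WHEN quarter = 'Q2' THEN revenue END) AS q2_revenue
    FROM sales
    GROUP BY region
    ORDER BY region
""").df())
```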
It was easy to brush off these effects as ancillary and circumstantial, but I started to notice that my most successful, impactful colleagues followed this pattern as well. Yes, Python-first folks had some clout, but the most technically proficient analysts worked almost exclusively in SQL.
And I'd argue these aren't just subjective quirks. These are consequences that are fundamental to how SQL differs from Python. By way of comparison, consider the super popular programming language Rust. If you're not familiar with Rust (or the hype around it), Rust is, in short, amazing because it's both fast and safe, and before Rust, conventional wisdom suggested it was not possible to get one without sacrificing the other. The creators of Rust accomplished this by introducing opinionated rules around how memory is released as scope changes. This small design choice added friction to doing sloppy things with memory, and as a result, we ended up with a language that didn't just enable you to do the right things, but pushed you to do things right by default.
SQL is not so different. Its design choices encourage better behavior:
Idempotent, practically speaking. The execution instructions (yes, yes, the underlying data can change, but you can only go so far) are always idempotent, and thus work is always reproducible.
Comprehensible. The grammar reads nearly like plain English, which leads to greater involvement from the folks you're doing analytics for. Analytics is deeply collaborative, unlike other technical-first domains like research, machine learning, or engineering, and so comprehensibility is king.
Setup-less. It's tied to the warehouse, not to your environment. There are no environments to wrangle, and you can rest assured your query will generally return the same results for you as it will for other folks.
Unlike with Python, you don't have to do extra work to ensure that what you've done can be reproduced, understood, or used by anyone else. You get universal usability for free.
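If you want to feel the difference, here's a minimal sketch (toy data, duckdb playing the role of the warehouse) contrasting the two modes: the notebook-style version accumulates hidden state every time you re-run a cell, while the SQL version carries its whole derivation with it and recomputes from source every time.

```python
# Hidden notebook state vs. a self-contained SQL statement, on toy data.
import pandas as pd
import duckdb

events = pd.DataFrame({"user_id": ["a", "b", "c"], "clicks": [1, 2, 3]})

# "Cell 1" of a notebook: an innocent in-place tweak.
events["clicks"] = events["clicks"] * 2

# Re-running that cell by accident doubles the values again -- the answer
# now depends on how many times the cell happened to execute.
events["clicks"] = events["clicks"] * 2
print(events["clicks"].tolist())   # [4, 8, 12]

# The SQL version always recomputes from the source table, so running it
# once or ten times gives the same result.
source = pd.DataFrame({"user_id": ["a", "b", "c"], "clicks": [1, 2, 3]})
print(duckdb.sql("SELECT user_id, clicks * 2 AS clicks FROM source").df())
```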
Why SQL fails, and the future of analytics
Still, it wouldn't be fair to kill Python without pointing out that there have always been a few serious shortcomings to bare-metal SQL:
Lack of visualization capabilities.
Inability to DRY.
That said, the world is moving toward one in which I'm hopeful both will be addressed shortly. On the visualization front, iterative notebooking tools like Hyperquery generally have rich visualization capabilities (or let you jump to Python if you prefer the flexibility of code). And with Vega-Lite quickly becoming the standard for modern BI tools, you don't have to worry about getting too accustomed to the interface oddities of any one GUI visualization editor.
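For a rough sketch of what that standardization buys you, here's Altair (the Python wrapper that emits Vega-Lite specs) on toy data; the chart is just a portable JSON spec underneath, so it isn't married to any one tool's GUI.

```python
# A bar chart as a Vega-Lite spec, built with Altair on toy data.
import pandas as pd
import altair as alt

revenue = pd.DataFrame({
    "region": ["US", "EU", "APAC"],
    "revenue": [120, 90, 60],
})

chart = (
    alt.Chart(revenue)
    .mark_bar()
    .encode(x="region:N", y="revenue:Q")
)

print(chart.to_json())   # the underlying Vega-Lite spec, usable by any compatible tool
```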
On the DRY end of things, we're again starting to have options. Jinja templating is becoming more common across SQL-based tools, letting users define macros and functions. For more permanent fixtures, warehouse vendors generally let you define reusable UDFs.
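For a flavor of what that looks like in practice, here's a minimal sketch (hypothetical table and column names) that renders a reusable filter into a query with the jinja2 library, the same templating idea dbt uses for SQL.

```python
# DRY-ing up SQL with Jinja: define a filter once, reuse it across queries.
from jinja2 import Template

# One definition of "active user", instead of copy-pasting it everywhere.
active_filter = "last_seen_at >= CURRENT_DATE - INTERVAL 30 DAY"

daily_actives = Template("""
    SELECT dt, COUNT(*) AS active_users
    FROM user_activity        -- hypothetical table
    WHERE {{ active_filter }}
    GROUP BY dt
""").render(active_filter=active_filter)

print(daily_actives)   # a plain SQL string, ready to send to the warehouse
```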
So for now, I'm going to evangelize the SQL-by-default camp, at least until someone topples the decades of momentum that have dug us into this hole.[3]
[1] ==, anyone?
[2] Not to mention: this sort of reactive data work is not great for analytics as a domain.
[3] Yes, it's a hole. I'm touting SQL, but honestly, the current state of SQL isn't optimal: vendor-locked extensions, no standardization, etc. It's extraordinarily painful, but the benefits still stand.