Does the Modern Data Stack work at scale?
I don't know, but I sure hope so.
👋 Hello! I’m Robert, CPO of Hyperquery and former data scientist + analyst. Welcome to Win With Data, where we talk weekly about maximizing the impact of data. As always, find me on LinkedIn or Twitter — I’m always happy to chat. And if you enjoyed this post, I’d appreciate a follow/like/share. 🙂
Over the last 5 years, the Modern Data Stack transformed analytics. Thanks to dbt, we gained the ability to construct data models and pipelines with plain SQL. With cloud data warehouses, we dramatically reduced the risk of taking down our infrastructure with poorly optimized queries. With ETL and reverse ETL tools, we gained the ability to get data and ship it off without engineering support, enabling us to accomplish in minutes what would otherwise devour days of developer time. In short, the Modern Data Stack multiplied our capabilities tenfold. Where we were once only masters of the quick data pull, we suddenly became masters of the data itself — getting it, codifying it, activating it, and of course, yes, answering quick questions, but with more confidence and greater transparency. The Modern Data Stack gave us superpowers.
That said, regardless of how transformative this has been for ICs, I’ve been wondering lately how well the MDS scales. The majority of the Fortune 1000, after all, are still not on the MDS. And even more worrying: their toolchains often look dramatically different. While I once argued that enterprise stacks were just antiquated, I find myself wondering more and more whether the needs are simply different. While you’ll never hear talk of data lakes (Delta Lake, EMR, Trino, etc.) in our little MDS bubble, they and their related ecosystems proliferate at enterprise scale, and in many instances have maintained their hold against Snowflake/BigQuery. Orchestration is just as likely to take place in dbt and Prefect/Dagster/Airflow as it is to be bundled into Alteryx or Domo or Databricks. And Tableau, which is all but absent in early-stage startups, still reigns mighty in the Salesforces of the world. While there’s a lot to unpack here, let’s explore the curious relationship that the Modern Data Stack seems to have with scale. For now, I’ll anchor my musings around the production layer — ETL, storage, and transformation.[1]
Cold, hard truth about the Modern Data Stack.
Here’s the cold, hard truth about the Modern Data Stack: it’s expensive at scale. The most viable business model in this space is to charge a markup on compute resources while abstracting away complexity, making products appealing to set up and maintain. As a result, tools are astoundingly cost-efficient when usage is low, but the markup erodes that cost efficiency as usage grows. Snowflake’s markup over raw EC2 instances is so immense that entire industries have sprung up around optimizing costs. In the ETL world, Fivetran has a reputation for being even worse. And even our beloved dbt recently shifted to usage-based pricing as well.
Below a certain scale, of course, the pricing makes sense for a customer. When you’re a small company, paying a markup to avoid the hassle of managing infrastructure is a no-brainer — you’d need to burn a laughable number of Snowflake credits to balance out the cost of a full-time employee managing a self-hosted data warehouse. But at some point, the economics stop making sense.
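To make that intuition concrete, here’s a minimal break-even sketch. The credit price and fully loaded salary below are illustrative assumptions, not quoted figures; plug in your own numbers.

```python
# Back-of-the-envelope: how many Snowflake credits per year would you need
# to burn before self-hosting (i.e., hiring an engineer to run the warehouse)
# breaks even? All figures are illustrative assumptions.

CREDIT_PRICE_USD = 3.00    # assumed blended price per Snowflake credit
FTE_COST_USD = 200_000     # assumed fully loaded annual cost of one engineer

breakeven_credits = FTE_COST_USD / CREDIT_PRICE_USD

print(f"Break-even: ~{breakeven_credits:,.0f} credits/year "
      f"(~{breakeven_credits / 365:,.0f} credits/day)")
```

Under these assumptions, you’d need to be burning on the order of tens of thousands of credits a year before a dedicated hire pencils out — which small companies simply aren’t.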
It seems the data tooling industry has already figured this out. There seem to be three paths to reducing costs:
Direct cost reduction. E.g. Snowflake cost management tooling. If the MDS maintains its stranglehold on mindshare over the next few years, expect this supporting industry to grow.
Open source. You can always host something yourself. Replace Snowflake or BigQuery with Trino, ClickHouse, or Redshift, perhaps. Fivetran → Airbyte, Meltano, Singer. dbt Cloud → dbt Core. You replace the SaaS cost with human cost.
More bundled tooling. E.g. data lakes (which, by removing the need for a large number of ETL connections, bundle in some of the ETL cost), Databricks, Alteryx, Domo, cloud vendor tools.
The MDS offers a better developer experience.
That said, while the Modern Data Stack is ultimately more expensive, there’s more to life than cost. The primary benefit you get with the Modern Data Stack is a better user/developer experience.[2] Putting aside its astronomical cost, data folks tend to love Snowflake. And for all the flak we’ve given it of late, dbt is still the darling of our era.
And for good reason — a profitable company dedicated to building the best data warehousing/transformation/etc experience is inevitably going to build a better data warehousing/transformation/etc experience than either a fully open-source tool or a bundled product[3], where focus is diffuse. Design by committee doesn’t produce the best tools, and neither does bundling. If you need great UX, vendors are generally the way to go, and the unbundled nature of the Modern Data Stack ensures that each piece of the stack is maximally usable.
UX doesn’t matter, but UX is all that matters.
Unfortunately, here’s the rub: developer experience/UX doesn’t stack up well against cost. CDOs buy with different objectives than the practitioners who use the tools. The most palpable KPI at the executive level is cost reduction — subjective measures take a backseat. And even beyond cost, CDOs have a long list of priorities ahead of UX: security, data activation, governance, SSOT.
That said, even if UX never explicitly enters the discussion, that hardly means it isn’t present. UX kicks off a flywheel that subverts all of that — some flavor of enterprise mimesis. Snowflake offers a strong model here: by garnering product love, they’ve won community devotion to the point of establishing themselves as a standard, and purchasing dynamics change once something becomes the de facto correct solution. You no longer need hard counterfactual proof of value — you just need to believe that the purported efficiency boost is worth the cost, and some back-of-the-napkin calculations should be enough. 10% more efficient analysts. More ML, sprinkled all over. And when your competitors are using a tool and you’re not, the risk of missing out on leverage starts to trump all other arguments.
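That back-of-the-napkin calculation is simple enough to write down. The headcount, salary, tool cost, and efficiency gain below are all illustrative assumptions; the point is only that the arithmetic is easy to make come out favorably once you believe the efficiency claim.

```python
# A sketch of the napkin math behind "10% more efficient analysts":
# if a tool reclaims some fraction of analyst time, is that reclaimed
# time worth the price tag? All inputs are illustrative assumptions.

def tool_roi(num_analysts, avg_salary, efficiency_gain, annual_tool_cost):
    """Return (annual value of reclaimed analyst time, net annual benefit)."""
    reclaimed_value = num_analysts * avg_salary * efficiency_gain
    return reclaimed_value, reclaimed_value - annual_tool_cost

value, net = tool_roi(num_analysts=20, avg_salary=150_000,
                      efficiency_gain=0.10, annual_tool_cost=120_000)
print(f"Reclaimed time worth ~${value:,.0f}/yr; net ~${net:,.0f}/yr")
```

Notice what’s missing: any counterfactual. The efficiency gain is taken on faith, which is exactly why community devotion and de facto standards do so much of the selling.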
And this is the flywheel that the Modern Data Stack is leveraging to win: with great UX comes customer love, and customer love pumps up community buzz and, consequently, standard creation. And once some companies start to crack, competitive forces drive others to fall in line, simply to remove one variable stochastically influencing their position in the market. And were the MDS to win, this would be why.
Still, horizontally bundled vendors might realize that UX is how they ultimately compete, and UX plus lower cost will always defeat UX alone. Given that bundling already confers a UX boost, it’s not so far-fetched to think the right company might ultimately overtake the MDS (Databricks seems to be doing quite well here). I can picture a world where some such vendor starts to focus agonizingly on UX but, by packaging the whole thing into something that optimally benefits from economies of scale, keeps costs and complexity low at scale, while still establishing a new standard of usage for smaller companies.
The MDS transformed our expectations around how workflows should work, but it still feels unclear whether the SaaS companies that have sprung up around the components of the Modern Data Stack will survive. Will they establish dramatically more ergonomic standards that protect them from the advances of their bundled competition? Or will the true cash cows of the business — data companies serving the biggest pain points of large enterprises — take up the UX mantle and re-bundle according to the MDS vision, capitalizing on the inefficiencies of the Modern Data Vendors as-is? Or something else entirely?
All that said, and complaints about the Modern Data Stack aside, there’s something refreshing that the Modern Data Stack gave us. At times, I thought it was predicated on an open, inclusive paradigm. At others, I thought it was the segmentation itself. But I’m coming to realize the best thing the Modern Data Stack has done for us is this: it’s taught us that we don’t have to tolerate poorly designed tools. Having to learn new DSLs, internalize dozens of new concepts, and read miles of documentation shouldn’t be the norm — we should have tools that augment us, not the tooling equivalents of Dwarf Fortress.[4] And regardless of what toolchain investor capital crowns king next, that’s something to be grateful for.
[1] And so, all subsequent references to the Modern Data Stack will be comments around this production layer. The consumption side is still a mess, anyway.
[2] And, if self-hosting is your counterfactual, stability. But I’ll ignore that line of reasoning for now.
[3] I’m talking fully open-source tools. Notice that, of the open-source tools available, the most usable ones are venture-backed companies, and they’ll likely scale up and split their product into a second-class, community-maintained version and an enterprise-licensed version with all the bells and whistles.
[4] If you don’t get this reference, you are not the kind of person to get nerd-sniped. Be grateful for that.