I love corporate obsessions.
They follow a predictable pattern:
There is a corporate offsite. These events often carry names so absurd that only the employees’ cowardice (also known as “compliance”) prevents a wave of collective ridicule. If I ever write a novel about one of those offsites (which is unlikely), I’ll call it Camp KPI.
An external speaker is invited — for a fee that could’ve funded the raises many employees thought they’d earned.
The CEO and other “leaders” listen to the speaker in awe. The in-house expert also listens — less amused, though.
A large chunk of the investment budget is promptly rerouted to the “new” idea, usually at the expense of less glamorous but actually necessary initiatives.
At the business review six months later, the CEO pretends never to have ordered such a splurge. This is standard — unless the investment turns out to be spectacularly successful. In that case, the speaker’s name is swiftly forgotten, and the idea becomes something the CEO “has been saying for at least a decade.” Their bonus swells to ten times the speaker’s fee, while the underpaid employees who kept things afloat quietly hope there won’t be another “idea” anytime soon.
When the obsession goes economy-wide, its cycle tends to be longer — but its essence remains the same. Think: the dot-com bubble, corporate boards quoting Greta Thunberg, or more recent waves still crashing over us — AI as messiah, the digital-twin craze, and of course... “data is the new oil.”
…Ever so slippery
Of course, data matters. So no, I’m not suggesting companies abandon efforts to understand their internal dynamics or external environment using quantitative methods.
What I am saying is that we may not see returns so spectacular as to justify the enormous investments poured into corporate data capabilities. In fact, at least in the early stages, these data-driven strategies may sometimes lead to decisions that turn out to be poorly judged.
According to some studies (this one, for example), corporate investment in big data is set to continue its rapid growth, reaching USD 400–550 billion by 2028. That’s roughly the size of Austria’s or Sweden’s entire economy — or about as much NVIDIA stock as changes hands over a run of particularly frenzied trading sessions. Pick your metric.
The compound annual growth rate (CAGR) for corporate data investment is projected at 10–15% across various sources. But take a look at the chart below:
Over the next decade, the volume of data generated globally is projected to grow by 28% annually.
At first glance, this makes corporate investment in big data look like a bargain — capturing value from just a small slice of what’s being produced. But look again: if data grows at 28% a year while the associated investment grows at only about half that pace, then we’re implicitly assuming that the productivity of our data tools (the value extracted per dollar invested) keeps compounding at roughly the gap between those two rates, enough to more than triple over the decade. Is that plausible?
Perhaps.
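For the record, here is the back-of-envelope arithmetic behind that claim: a minimal sketch in Python, assuming 28% annual data growth and a 14% investment CAGR (roughly the middle of the 10–15% range).

```python
# Back-of-envelope check: if global data grows ~28% a year while spending on
# data capabilities grows ~14% a year, how fast must the value extracted per
# dollar grow just to keep pace with the data? (Illustrative figures only.)
data_growth = 0.28
spend_growth = 0.14

required_gain = (1 + data_growth) / (1 + spend_growth) - 1
print(f"required productivity gain per dollar: {required_gain:.1%} per year")
print(f"compounded over a decade: roughly x{(1 + required_gain) ** 10:.1f}")
```

Roughly 12% a year, compounding into a bit more than a threefold gain over the decade.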
The devil is in the data
But even if we grant that optimistic assumption — that the productivity of our data tools keeps compounding fast enough to close the gap — and even if we confine ourselves to “just” 2,142 zettabytes of data, there are still reasons for concern.
Predictive AI performs best when prediction is, frankly, easy: when the underlying data-generating process is stable, with only occasional outliers that algorithms can handle without much trouble. But its performance deteriorates — often dramatically — in the face of a well-known challenge in machine learning: concept drift.
Economists like me have long had a different name for it: “regime shift.” Though admittedly, “concept drift” sounds cooler — less like central banking, more like cyberpunk.
Unlike large, isolated peaks and troughs (which we could neatly handle with dummy variables in econometric models), regime shifts are subtler — and far more damaging. Subtle because the early data doesn’t look dramatically different from pre-shift trends, so the change often goes unnoticed until it's too late. Damaging because they render most conventional predictive techniques unreliable, forcing us to adopt alternatives — like Markov-Switching models or Time-Varying Parameter models — whose predictive track records range from modest to totally bogus.
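For readers who want to see what those alternatives look like, here is a minimal sketch using statsmodels’ Markov-switching regression on simulated data with a shift in the mean; the series and its parameters are invented purely for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated series: a quiet regime for 200 periods, then a subtle upward
# shift in the mean that a fixed-parameter model would simply average away
y = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(1.5, 1.0, 200)])

# Two-regime Markov-switching model with a regime-dependent intercept
model = sm.tsa.MarkovRegression(y, k_regimes=2, trend="c")
result = model.fit()

print(result.summary())           # regime-specific means and transition probabilities
print(result.expected_durations)  # how long each regime tends to persist
```

The point is not that such a model predicts well (the caveat above stands), only that it at least makes the shift visible instead of averaging it away.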
The same holds for machine learning. Concept drift refers to a change in the statistical relationship between inputs (features) and outputs (the target variable). After such a shift, the "concept" the model learned during training no longer applies — and predictions become biased, often severely so.
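A toy illustration, with entirely synthetic data: a classifier trained before the drift still sees features that look perfectly familiar, while the relationship it learned has quietly changed underneath it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, weights):
    """One 'concept': a particular mapping from features to target."""
    X = rng.normal(size=(n, 2))
    y = (X @ weights + rng.normal(scale=0.3, size=n) > 0).astype(int)
    return X, y

# Before the drift: the target depends mostly on the first feature
X_old, y_old = make_data(5000, np.array([1.0, 0.1]))
model = LogisticRegression().fit(X_old, y_old)

# After the drift: the feature distribution is unchanged, but the
# relationship has shifted toward the second feature
X_new, y_new = make_data(5000, np.array([0.1, 1.0]))

print("accuracy on the old concept:", round(model.score(X_old, y_old), 3))
print("accuracy on the new concept:", round(model.score(X_new, y_new), 3))
```

In this toy setup, accuracy on the new data collapses to little better than a coin toss, even though nothing about the inputs themselves would have looked suspicious.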
Concept drift is especially harmful because the consequences of even small errors can grow exponentially over time — driven by feedback loops (where model outputs influence future inputs) and automation chains (where early misclassifications trigger flawed actions, corrupt downstream data, and degrade model performance with each retraining cycle). In such systems, even minor, undetected drift can quietly accumulate into systemic failure.
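Here is a toy simulation of that dynamic, under assumptions invented for illustration: each retraining cycle labels new data with the model’s own predictions while the true relationship drifts a little further, so the error is never corrected and quietly compounds.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

def sample_batch(n, boundary):
    """True concept: positive class whenever x1 + x2 exceeds `boundary`."""
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] + X[:, 1] > boundary).astype(int)
    return X, y

# Cycle 0: the model is trained once on properly labelled data
X0, y0 = sample_batch(2000, boundary=0.0)
model = LogisticRegression().fit(X0, y0)

# Automation chain: from now on, new training data is labelled by the
# model itself (its outputs feed the next training set), while the true
# boundary drifts a little further every cycle.
for cycle in range(1, 6):
    boundary = 0.15 * cycle
    X_new, _ = sample_batch(2000, boundary)
    pseudo_labels = model.predict(X_new)          # feedback loop
    model = LogisticRegression().fit(X_new, pseudo_labels)

    X_test, y_test = sample_batch(5000, boundary)
    print(f"cycle {cycle}: accuracy against reality = {model.score(X_test, y_test):.3f}")
```

Nothing looks alarming at any single step; the damage only shows up when someone compares the pipeline’s output against reality, which fully automated chains rarely pause to do.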
And when applied to sensitive domains like healthcare, such failure doesn’t just mean public embarrassment for the scientists involved; the stakes are far higher (see here for what concept drift means when AI is tasked with saving lives).
True, we have countermeasures. Concept drift is a well-known issue, and researchers have developed models and algorithms to detect and correct it before it spirals. Some have been tested successfully in specific use cases. But when it comes to broader, non-specialized applications, we still have — ironically — little reliable data on how these countermeasures perform.
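To give a flavour of what such countermeasures look like, here is a deliberately naive drift monitor of my own devising (real detectors such as DDM, ADWIN or the Page-Hinkley test are statistically far more careful): it watches a deployed model’s error stream and flags when a recent window departs too much from an early reference window.

```python
import numpy as np

def detect_drift(errors, ref_size=200, window=100, tolerance=0.10):
    """Flag drift when the recent error rate exceeds the early
    (reference) error rate by more than `tolerance`."""
    reference_rate = np.mean(errors[:ref_size])
    for t in range(ref_size + window, len(errors) + 1):
        recent_rate = np.mean(errors[t - window:t])
        if recent_rate - reference_rate > tolerance:
            return t, reference_rate, recent_rate
    return None

rng = np.random.default_rng(1)
# Monitored signal: per-prediction errors of some deployed model.
# They hover around 10% for 500 steps, then the concept drifts and
# the error rate creeps up to roughly 25%.
errors = np.concatenate([rng.binomial(1, 0.10, 500), rng.binomial(1, 0.25, 500)])

hit = detect_drift(errors)
if hit:
    t, ref, rec = hit
    print(f"drift flagged at step {t}: error rate {ref:.2f} -> {rec:.2f}")
else:
    print("no drift flagged")
```

Production-grade detectors are far more sophisticated, but the logic is the same: keep checking predictions against reality.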
More Data, Even More Drifty
Are these methods really ready to perform their magic as the data landscape explodes in the coming years? And more crucially: are they equipped for a world where the frequency of concept drift is set to soar?
Because we are, undeniably, living in an increasingly noisy world. And concept drift is, at its core, a function of noise — political, economic, geopolitical, and social instability. On all fronts, we have plenty.
At the risk of provoking my friend Marco Annunziata — who has strong and understandable reservations about this particular measure of uncertainty — see the chart below.
So we’re not just dealing with an increasing volume of data that anti–concept drift models must process — we’re also seeing a rise in the probability that any given additional data point is part of a concept drift.
What’s more, concept drift detection methods become far less effective when the drift is artificially induced — for example, during a cyberattack, through a form of adversarial attack on AI known as data poisoning. In these cases, the drift doesn’t reflect a real change in the world but a simulated one. When correction algorithms respond to it, they fall into a trap: the AI model adapts to a fabricated pattern, and begins interpreting new, real-world data through a distorted lens — one that assumes a statistical relationship between variables that never actually existed. The result is a significant increase in standard errors and model instability.
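A toy sketch of the mechanism, entirely synthetic and with numbers chosen for illustration rather than realism: inject a small fraction of deliberately mislabeled, high-leverage points into the training set and see what it does to performance on honest test data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
true_w = np.array([1.0, -1.0, 0.5, 0.0, 0.0])

# Clean world: a simple, perfectly learnable relationship
X = rng.normal(size=(10000, 5))
y = (X @ true_w > 0).astype(int)
X_test = rng.normal(size=(5000, 5))
y_test = (X_test @ true_w > 0).astype(int)

clean_model = LogisticRegression(max_iter=1000).fit(X, y)

# Poisoning: 1% of the training set, placed far from the bulk of the data
# and labelled against the true rule, simulating a 'drift' that never happened
n_poison = int(0.01 * len(X))
X_poison = rng.normal(loc=4.0, size=(n_poison, 5))
y_poison = (X_poison @ true_w <= 0).astype(int)

dirty_model = LogisticRegression(max_iter=1000).fit(
    np.vstack([X, X_poison]), np.concatenate([y, y_poison])
)

print(f"test accuracy, clean training set:    {clean_model.score(X_test, y_test):.3f}")
print(f"test accuracy, poisoned training set: {dirty_model.score(X_test, y_test):.3f}")
```

The fractions needed to move a toy model like this are far larger than the 0.001% quoted below, but the direction of travel is the same.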
According to this study, poisoning just 0.001% of training tokens with medical misinformation is enough to increase error rates in AI medical models by more than 7%. If that’s not alarming enough, consider this: the same study shows that such damage doesn’t require a 24/7 assault by an army of hackers.
“Replacing just one million of 100 billion training tokens (0.001%) with vaccine misinformation [generated for just US$5.00] led to a 4.8% increase in harmful content. A similar attack against the 70-billion parameter LLaMA 2 LLM, trained on 2 trillion tokens, would require […] under US$100.00 to generate. The net cost of poisoned data would remain well under US$1,000.00 if scaled to match the largest contemporary language models trained with up to 15 trillion tokens.”
More broadly, it's an illusion to think that ‘garbage in’ data only originates from professional cybercriminals. Through social media and its echo chambers, each of us has the potential to generate fake data — often scandalous, often engineered for likes — and then watch as confirmation bias amplifies it, sometimes all the way to the level of national security. Each one of us. All 8.2 billion.
Be patient, my friend, be patient
In conclusion, concept drift is serious enough to deserve to be properly understood at every level. From my conversations with corporate leaders, it rarely is. It is more dangerous than ‘regime shifts’ because the latter are at least flagged by economists who (are supposed to) understand them. But concept drift — a challenge that affects AI broadly — can cause widespread harm, often hitting a much larger population, including corporate actors who may not even be aware of the issue. Worse still, some may invest heavily in black-box algorithms that claim to ‘handle it,’ mistakenly believing this absolves them of the responsibility to understand the problem themselves.
As I said, my intention is not to devalue the importance of corporations understanding both their internal dynamics and external environment through data. As an economist, I welcome any effort to look beyond the day-to-day and to inform corporate strategy with meaningful insights drawn from the broader world.
Precisely because I want to see more intelligence applied to the management of corporations, I’m concerned that these efforts may be abandoned as hastily as they were embraced. Corporate leaders may not see an immediate return on investment; indeed, they are likely to observe a temporary decline in performance before any improvements emerge. But data is no magic wand, and if the generous budgets allocated to it are driven by that illusion, then a more gradual approach — grounded in a deeper understanding of the models we deploy — might prove far more sustainable.
A final point on this, and on productivity. Many lament the lag between groundbreaking innovation and observable productivity gains. This may be a meaningful issue at the macroeconomic or policy level. But for corporations, what really matters is the return on an additional dollar invested in innovation: the marginal productivity of that extra capital. By this measure, the data shower awaiting corporate leaders is likely to be even colder.
It will get warmer, eventually.
Dear Luca, consider me provoked, even antagonized -- not just by your quoting the (in my unenlightened view) totally bogus uncertainty index, but also by your last paragraph!
Sadly, I have to admit this is an excellent post making a very important point, so rather than fighting it I will add just one comment:
The initial wave of "data is the new oil" was driven by justified enthusiasm at finally having better tools for discovering the "signal" hidden in the data and not visible to the naked eye.
The current AI craze ignores GenAI’s inability to see the “noise” and misinformation hidden in plain sight in the data (even when they are obvious to the naked eye of common sense). Which of course comes from these AIs not having a model of reality or any... what do we usually call it?... intelligence.
We humans, sadly, are often prone to seeing intelligence where there is none and missing it where it's abundant. In retrospect, the Turing test seems naively misguided.
But I'm in no hurry, I will be patient.