Ask HN: What Happened to Big Data?

Hacker News - Thu Jun 23 13:12

16 points by night-rider 47 minutes ago | 28 comments
There was a buzzword Big Data which regularly popped up in tech news. I haven’t seen the word used much lately. What advances are being made on it?

I imagine with various privacy scandals it fell out of favour since your data should be /your/ data only.

And many have talked about data being the ‘new oil’ when really it should be reframed as radioactive waste.

What happened to using this term to hype up your brand: ‘We use Big Data to infer information about how to improve and go forward’?

Was it just a hyped up buzzword?

Like a lot of fads in IT, Big Data sounded like "if you have a lot of data you can monetize it", so companies threw 7+ figures at the technology and then realised that you can have more data than you know what to do with, and that they couldn't really monetize it (obviously some did and still do). Even at a simple level, working at a data collection company, it is very clear that lots of people want to collect as much as possible, only to do precisely nothing useful with the results.

Then Machine Learning comes along, and these same people think you can just feed the beast your big data and it will be clever enough to tell you what you want to know. Then the same companies will realise that they still don't have the skills to work out what to tell the ML algorithm to do.

It became ubiquitous.

Now it is in many places. Enterprises use it each moment.

A laptop hard disk is now capable of holding databases with tens of millions of rows.

Traditional "Data Science" and modern Deep Learning rely entirely on it. Millions of datapoints are used to create models every day.

A sensor on a human wrist collects and stores thousands of data points each day.

So do refrigerators, cars, and your washing machine, with the ubiquity of IoT.

Giant tech cos use billions of rows each day to show users products, or sell their attention as products.

Big Data became ubiquitous. And it became so common that nobody calls it that anymore.

Tools like BigQuery, Dask, and even Pandas and SQL can handle hundreds of thousands to hundreds of millions of rows (or other structures) with ordinary programming, commands, and so on.

Compliance and web3 arrived.

A lot of companies didn't even need to go the Hadoop route. CSVs, Jupyter notebooks and SQL databases are very powerful tools for most companies.
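To make the "CSV plus SQL database" point concrete, here is a minimal sketch using only Python's standard library. The `sales` table, its columns, and the sample rows are hypothetical and stand in for whatever export a company actually has; the point is only that a plain file and an embedded SQL engine already cover a typical group-by report.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a real CSV export.
CSV_TEXT = """region,amount
north,120.50
south,80.00
north,42.25
east,15.00
"""


def load_csv(conn, text):
    """Create a small table and load the CSV rows into it."""
    rows = csv.DictReader(io.StringIO(text))
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales (region, amount) VALUES (?, ?)",
        ((r["region"], float(r["amount"])) for r in rows),
    )
    conn.commit()


def totals_by_region(conn):
    """Return (region, total amount) pairs, sorted by region."""
    cur = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")  # in-memory database, nothing to install
load_csv(conn, CSV_TEXT)
print(totals_by_region(conn))
```

Pointing `sqlite3.connect` at a file path instead of `":memory:"` keeps the data on disk, and the same query can be run from a notebook cell.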

It's just considered "data" these days. We just look at the Vs of the data and adjust based on those. High velocity? Do X. High volume? Accommodate Y. High variety? ... The other side of things is the underlying data quality often had tons of issues, so there's been a lot of focus on the data observability (which isn't sexy at all).

Still tons of folks out there using Hadoop (ew), Snowflake, etc. New technologies coming out include things like Trino, Apache Iceberg, etc. So it's there ... it's just that no one cares about the moniker, only about getting things done.

Big Data has always been a marketing paradigm. We've always had lots of data; we just didn't process it for business intelligence before.

"The advances in computing have made it easier to accomplish tasks that were completely unnecessary before"

It's simply become the norm. Companies store and analyze lots of data all the time. It's no longer special but simply table stakes. Look at the valuations of Snowflake and Databricks.

I disagree. Big data came associated with a new swath of algorithms. Too big to handle? Use new algorithms, maybe not 100% accurate, but they can handle the load. And streaming data as opposed to static data.

There are a lot of approaches like Change Data Capture (CDC) or HyperLogLog, but the norm? Far from it.
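For readers who haven't met HyperLogLog, here is a rough sketch of the idea in Python: count distinct items in a stream approximately, using a small fixed array of registers instead of storing every item. This is an illustrative toy under stated assumptions (the class name, the choice of SHA-1 as the hash, and the default of 2**12 registers are mine), not the implementation used by any particular database.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct counter using 2**p small registers."""

    def __init__(self, p: int = 12) -> None:
        self.p = p
        self.m = 1 << p                      # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: str) -> None:
        # Take 64 bits of a SHA-1 digest as the hash (an assumption of this sketch).
        h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
        idx = h >> (64 - self.p)                   # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> float:
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")  # duplicates do not change the estimate
print(round(hll.count()))
```

With 4096 registers the expected relative error is roughly 1.04 / sqrt(4096), or about 1.6%, which is the "not 100% accurate but can handle the load" trade-off: memory stays fixed no matter how many items pass through.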

I think the marketing BS fell out of fashion when every database designer became a data scientist, but that's another issue.

Those algorithms and improvements in large data processing got bundled away into a platform/infra layer that a developer or user interfaces with, unaware of what's going on in the background to produce the results they want.

In addition to the skeptical comments, I think infrastructure and best practice also caught up such that what used to be big data is not so big anymore.

Storing data on S3 or using BigQuery removes a lot of the challenges as opposed to doing this stuff in the data centre. You then also have services such as EMR, Databricks and Snowflake to acquire the tooling and platforms as an IaaS/SaaS. The actual work then moves up the stack.

Businesses are doing more with data than ever before and the volumes are growing. I just think the challenge moved on from managing large datasets as a result of new tooling, infrastructure and practices.

A cursory search of DDG News & Google News reveals "big data" is still a widely used buzzword in headlines. I don't think it went anywhere.

Data may be the new oil, which I don't agree with, but it looked about like this

You can call data the new oil when someone invades a country to secure a data center.

I don't think anything fell out of favor, and things are a long way from data being "your data only", although you have been given some rights in that regard.

Nothing happened to it. Big data always represented pushing the boundaries of what could be done when dealing with large amounts of data. After a while the technology matured to the point where working with large datasets just became something you did. There was a lot of hype to it and many organizations unnecessarily went along for the ride. It's also a balance between current technology and the economics of compute, storage, networking, etc. As the balance changes, what you do and how you do it also change.

> You can call data the new oil when someone invades a country to secure a data center

State-sponsored hacking for purposes of data exfiltration has been going on for years.

Well, with the introduction of Kubernetes as a platform and other cloud solutions, most "big data" just became "data".

It's amazing to see that nowadays the persistent volume claim used for logging is, on average, much bigger than the average dedicated machine was about 10 years ago.

These days I hear a lot of "AIML"

It doesn't make much sense to me as I've never seen anyone use anything you'd find in an AI book that you wouldn't also find in a Machine learning book

For big data, I think that the terminology waned but data engineers internalized the desire to scale everything they make to handle big data. So data engineering teams are still using things like Spark (or Databricks) even if their datasets aren't big enough to need that.

Big Data was about the giddy excitement of being able to run some fancy predictive model on a large amount of data and get some sort of incredible benefit. Now most organizations have tried it. The few that actually benefit now take it for granted. The rest have moved on, although they still have a team of data engineers babysitting a legacy Hadoop cluster.

Now people throw everything in “data lakes”. It’s already so complicated to handle the ingestion that they don’t even want to try to do anything with that much data.

Everyone who has data is still doing it, the buzzword just went out of fashion. Now it’s data science, analytics, ML eng. What truly ended is “big data” meaning “we’ll come take your logs and magically transform your business.”

It was just a hyped up buzzword, and new ones have been substituted now. "Machine learning" had a broader appeal, since many places didn't have that much data; the ones that did have "big" data largely didn't find much signal in the noise even with AI/ML tools.

Two things:

1. It morphed into ML as the dirty secret to most ML projects is that they're predominantly about data. Put another way you can't derive a model from nothing.

2. You mentioned privacy scandals, but things like CCPA + GDPR legitimately did make larger corporations pause and ask "Do we actually need this information?" where prior to that everyone was a hoarder "just in case"

- hyped buzzword

- catch-all excuse to record everything forever without having an idea how to use that data

- actually hard problems

Here's a rule of thumb: anything with the name "big" before it is bad.

Big oil. Big bad. Big lie. Big brother. Big apple. Etc...