Ask HN: All of you working with Big Data, what is your Data?
117 points by sysk on Jan 24, 2015 | 94 comments
Big data is a trending topic these days and I'd like to get my hands dirty, both out of curiosity and to make myself more relevant in the job market. That being said, I'm not sure which data sets are both interesting to play with and easily accessible. My question is:

For those of you already working with big data, what kind of data do you work with?




If you want to work in a "big data"-type role as a developer, I wouldn't worry about finding huge data sets. There's a dearth of candidates, especially ones who actually have hands-on experience, and having deep knowledge of (and a little experience with) a broad range of tools will make you a pretty good candidate:

Fire up a VM with a single-node install on it [1] and just grab any old CSVs. Load them into HDFS, query them with Hive, query them with Impala (Drill, Spark SQL, etc.). Rinse and repeat for any size of syslog data, then JSON data. Write a MapReduce job to transform the files in some way. Move on to some Spark exercises [2]. Read up on Kafka, understand how it works and think about ways to get exactly-once message delivery. Hook Kafka up to HDFS, or HBase, or a complex event processing pipeline. You'll probably need to know about serialization formats too, so study up on Avro, protobuf and Parquet (or ORCFile, as long as you understand columnar storage).
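
If you want something concrete to start with, here's a minimal PySpark sketch of that first exercise - load a CSV out of HDFS and query it with Spark SQL. The path, column names and single-node setup are assumptions for illustration, not anything specific:

    from pyspark.sql import SparkSession

    # Assumes a single-node Spark install alongside the Hadoop VM;
    # the HDFS path and column names below are made up.
    spark = SparkSession.builder.appName("csv-exercise").getOrCreate()

    # Load any old CSV from HDFS into a DataFrame
    df = spark.read.csv("hdfs:///user/me/events.csv", header=True, inferSchema=True)

    # Register it as a temp view and query it with Spark SQL, the same
    # way you'd point Hive or Impala at the files
    df.createOrReplaceTempView("events")
    spark.sql("""
        SELECT user_id, COUNT(*) AS n
        FROM events
        GROUP BY user_id
        ORDER BY n DESC
        LIMIT 10
    """).show()

    spark.stop()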

If you can talk intelligently about the whole grab bag of stuff these teams use, that'll get you in the door. Understanding RDBMSes, data warehousing concepts, and ETL is a big plus for people doing infrastructure work. If you're focused on analytics you can get away with less of the above, but knowing some of it, plus stats and BI tools (or D3 if you want to roll your own visualization) is a plus.

[1] http://www.cloudera.com/content/cloudera/en/downloads/quicks... [2] http://ampcamp.berkeley.edu/5/


Is that really all that "big data" positions need? In past experience (Google), there were all sorts of problems that working with actual huge data sets introduced that weren't handled by the frameworks available (not even MapReduce, which by most accounts is significantly more advanced than Hadoop). Things like:

1. With a big data set, there is no easy way to verify the correctness of your algorithms. The data is too big to hand inspect, and so assuming your code is syntactically well-formed and doesn't crash, you will get an answer. Is your answer correct? Well, you don't actually know, and any number of logic errors might throw it off without causing a detectable programming error.

2. Big data is messy. There will be some records in your data set that are formatted differently than you expect, or contain data that means something semantically different than you expect. Best case, your Hadoop job crashes 4 hours in. Worst case, it silently succeeds, and you have no idea that your results were polluted by spurious records you didn't know existed.

3. Big data will expose basically every code path and combination of code paths in your analysis program, so it all better be bulletproof. Learn how to write code correctly the first time, or you're going to be spending a lot of time waiting for the script to run and then fixing crashes several hours in.

4. Big data contains outliers. Oftentimes, the outliers will dominate your results, and so if you don't have a way of filtering them out or making your algorithm less sensitive to them, you will get garbage as your final answer.

There are techniques to deal with these, but they are techniques built into your workflow as a data scientist, not into the tools that are available. One thing that always amazed me at Google was how much time the data scientists on staff spent not writing code. Writing your MapReduces takes perhaps 5-10% of your day; most of the rest of it is mundane stuff like staring at data and compiling golden sets.
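
To make points 2 and 4 concrete, here's a rough sketch (plain Python, with a made-up record format and file name) of the kind of defensive parsing and outlier trimming that ends up baked into that workflow rather than into the tools:

    import json
    import math

    def parse_record(line):
        """Return a (user_id, latency_ms) pair, or None if the record is malformed."""
        try:
            rec = json.loads(line)
            latency = float(rec["latency_ms"])
            if not math.isfinite(latency) or latency < 0:
                return None           # semantically bogus value
            return rec["user_id"], latency
        except (ValueError, KeyError, TypeError):
            return None               # malformed record: count it, don't crash

    def robust_mean(values, lower_pct=0.01, upper_pct=0.99):
        """Trimmed mean so a handful of outliers can't dominate the answer."""
        vals = sorted(values)
        trimmed = vals[int(len(vals) * lower_pct):int(len(vals) * upper_pct)] or vals
        return sum(trimmed) / len(trimmed)

    good, bad = [], 0
    with open("events.jsonl") as f:        # hypothetical input file
        for line in f:
            parsed = parse_record(line)
            if parsed is None:
                bad += 1
            else:
                good.append(parsed[1])

    # Always report how much you threw away; a high ratio means your
    # assumptions about the data are wrong, not that the data is "clean".
    print(f"dropped {bad} of {bad + len(good)} records")
    print(f"trimmed mean latency: {robust_mean(good):.2f} ms")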


Definitely - playing with any data is the best way to learn the tools.

Just for giggles I built a tool that extracts all Oracle permissions, sums that up into relationship information using Pig (this schema owner reads from here, writes to there, etc.) and used first R/ggplot2 and then later Gephi to plot the results.

None of the data sets could be called big data by any stretch, and I could have done the processing more quickly with Perl or Python, or even a mix of shell commands. But that wasn't the point. It was to expand on the one day of training I'd had and help cement the ideas, and frankly it was to have fun.

Find something that you're passionate about or just plain sounds like fun and then use the tool you want to learn to solve your problem.


For someone with a typical web dev background (comfortable with databases like Oracle or MySQL, but knowing nothing about tools like Hadoop etc.), could you recommend a course/book to start learning big data with? Also, with so much to learn, how does one go about deciding what field within big data to specialize in?


>how does one go about deciding what field within big data to specialize in?

In my experience your job generally dictates what you specialize in. I ended up being more data engineer than scientist since my job had a lot of tricky data warehousing problems.


There's a lot of stuff under the "Big Data" umbrella: I focused on Hadoop below because that's my focus right now. I'm sure I'm missing some roles here, but the specialities I can think of are:

- getting data out of production systems and transforming it (infrastructure or ETL)
- analytical querying and reporting
- system administration
- machine learning

There's also the wide world of NoSQL data stores, which people lump in with big data, but which require vastly different skills.

The Hadoop VM I linked to above is good for working through exercises for all of the above.

As a starting point, this book[1] walks through the motivation behind Hadoop, and then gets a little into internals and use cases. It's out of date, but you can work through it and get into the right frame of mind, understand HDFS, etc. It's a good starting point.

AMP Camp (that I linked to above) is an introduction to Spark for people with a little Hadoop experience. Spark is getting a lot of attention, you could run into it in a number of roles.

If you're going to be planning the whole pipeline, or doing any sort of infrastructure role, I recommend Hadoop Application Architectures[2] for more modern tools and design patterns. This blog post[3] is a pretty good overview of distributed logs, which are essential for horizontal scale. Understanding Kafka and ZooKeeper is really useful for infrastructure roles, maybe less so for admins.

If you're planning to be in the reporting layer, having a deep understanding of SQL and data warehousing is useful. This book[4] is old hat, but I would say it's expected knowledge for anyone planning a warehouse, and it's interesting to understand best practices. Most places will also expect knowledge of Tableau or a similar BI tool, but that's tougher to learn on your own since licenses are brutal. Visualization with D3 is nice to have in this space, especially if you're coming from a web background - Scott Murray's tutorials [5] are a good starting place.

It's harder to point to resources for sysadmins - if you weren't a sysadmin before, you need to understand a lot of other concepts before you worry about Hadoop stuff. ML is similar - you need to understand the principles and be able to work on a single node. There's lots of good resources out there about getting started in data science.

1. http://shop.oreilly.com/product/0636920021773.do

2. http://shop.oreilly.com/product/0636920033196.do

3. http://engineering.linkedin.com/distributed-systems/log-what...

4. http://ca.wiley.com/WileyCDA/WileyTitle/productCd-0471200247...

5. http://alignedleft.com/tutorials


Wow! I loved the way you explained it so clearly. Is it possible I could contact you off the site to get further guidance?


alanctgardner@gmail.com, feel free :)


>If you can talk intelligently about the whole grab bag of stuff these teams use, that'll get you in the door. Understanding RDBMSes, data warehousing concepts, and ETL is a big plus for people doing infrastructure work.

This is sadly true, for now. I don't think folks here disagree with the "true" part. Let me explain why it is "sadly" and "for now".

The biggest issue with big data is most of it sits unused. In many organizations, HDFS ends up being an alternative to NetApp storage servers, storing terabytes of data with the hopes of them being useful one day.

In fact, if you already get to that stage of using HDFS as a storage server, you must have a decent ETL team that can put data into HDFS with a menacing combination of ad hoc scripts and a workflow that looks like a cobweb produced by a deranged spider. For now, knowing the ins and outs of various semi-functional open source components and the tenacity, patience and skill to deal with the gnarliest of ETL tasks get you a high-paying data engineering job.

But, in the long term, there will be a big change.

1. Tools are getting better: many data practitioners are realizing there are huge gaps between different data infrastructure components, and they are trying to fill these gaps. There is a lot of attention given to query execution engines (Presto, Impala, Spark, etc.) but I find data collection/workflow management tools are just as critical (if not higher leverage) right now. Tools like Fluentd (log collector) [1] and Luigi (workflow engine) [2] are OSS projects in this direction.
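
For a flavour of the workflow-engine side, here's a minimal Luigi sketch - two toy tasks with made-up paths, but roughly the skeleton a real pipeline has:

    import luigi

    class ExtractLogs(luigi.Task):
        date = luigi.DateParameter()

        def output(self):
            return luigi.LocalTarget(self.date.strftime("raw/%Y-%m-%d.log"))

        def run(self):
            with self.output().open("w") as f:
                f.write("pretend this came from Fluentd or an app server\n")

    class CountLines(luigi.Task):
        date = luigi.DateParameter()

        def requires(self):
            return ExtractLogs(date=self.date)   # dependency graph, not a cobweb

        def output(self):
            return luigi.LocalTarget(self.date.strftime("counts/%Y-%m-%d.txt"))

        def run(self):
            with self.input().open() as f, self.output().open("w") as out:
                out.write(str(sum(1 for _ in f)))

    if __name__ == "__main__":
        # e.g. python flow.py CountLines --date 2015-01-24 --local-scheduler
        luigi.run()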

2. Data-related cloud services are becoming really, really good: huge kudos to services like AWS, GCP, Heroku (through Addons). They are quickly building a great ecosystem of data processing/analysis/database components that frankly work better than most self-administered OSS counterparts. (Disclaimer: my perception might be colored here since I work for a data processing/collaboration SaaS myself [3])

So, back to the question. I think aspiring data engineers have two distinct career paths:

1. Becoming an expert in a particular data engineering component: this would be building a query execution engine, designing a distributed stream processing system, etc. (It would be awesome if you decide to release it as open source.)

2. Becoming an expert on quickly and effectively deploying cloud services to get the job done: this is the skill most desired among data engineers at startups.

What not to become is one of these OSS DIY bigots: not good enough to build truly differentiating technology, but adamant about building and running their own <up and coming OSS technology>. These folks will be wiped out in the next decade or so.

[1] https://www.fluentd.org [2] http://luigi.readthedocs.org [3] http://www.treasuredata.com


> The biggest issue with big data is most of it sits unused

This is really variable. If you're at a place where they jumped on the bandwagon, then yes. There are also lots of companies (and not just Google/FB/LinkedIn) that build mission critical reporting and ML infrastructure on Hadoop. These companies appreciate the value of workflow coordination, and they wouldn't move ahead without (at least) Oozie/Azkaban in place to give some visibility into their workflow.

> But, in the long term, there will be a big change.

I think more types of work will become commoditized. If you just want log processing, there are lots of on-premises and cloud options. Splunk has been doing this forever. Ostensibly with good-enough BI software you could just focus on ingest, and everything else is drag and drop. On a long enough time frame, hand-rolling pipelines will become obsolete. This is like a 10+ year timeline for any player to get significant market share. In the meantime, people have to actually get stuff done, and their skills will be transferable because they understand distributed systems, ETL, warehousing, and a lot of other stuff that hasn't really changed in a decade.

> Becoming an expert in a particular data engineering component

Are you advocating that nobody writes Spark Streaming jobs, because they should rewrite Spark instead? Don't learn to work with Impala, learn to rewrite Impala? I disagree, the tools are only getting better, and it's going to take more and more work to replace the entrenched players. Working on top of solid tools will make you far more productive than engaging in NIH and making your own SQL engine.

> Becoming an expert on quickly and effectively deploying cloud services to get the job done

Like Redshift, EMR and Amazon Data Pipeline? They're hardly turn-key solutions. Amazon's Kinesis is just Kafka with paid throughput - you can absolutely re-use your skills in the cloud, without having to cave and get locked in to a single vendor serving one specific use-case.

> What not to become is one of these OSS DIY bigots: not good enough to build truly differentiating technology, but adamant about building and running their own <up and coming OSS technology>

So in your mind you either pick a vendor to handle all your data for you, or you're an "OSS DIY bigot"? Something like owning your entire user analytics pipeline isn't mission critical for a startup, and it's stupid to build it yourself?

> These folks will be wiped out in the next decade or so.

Even though Oracle is amazing and great, lots of people still use Postgres, MySQL, etc. There's always going to be a continuum from "we should buy this turnkey thing" to "we started by rolling our own SQL query engine". You need to be able to identify when each is appropriate, not shoehorn in a one-size-fits-all solution.


The twitter social graph (follow connections between people) is my data source, I extract it from the API and cache it in a database.

The mariadb table storing this information currently takes a bit more than 500GB, it has about 4 billion rows (based on the statistics, I don't run SELECT count(*) on it anymore).

I usually don't use the term "big data" because the buzzword is so popular that it doesn't mean anything anymore.


I'm looking into doing something similar as well. May I ask how long you have been collecting the data, and how you decide which data to collect?

I have been collecting data from the twitter API for a few days as well. I wanted to get an idea of average tweeting pattern, but without access to the firehose API, I got a feeling that the sample I have isn't very "neutral" as I have been mostly pulling from the popular and local tweets endpoints.

Any advice on how I should approach this kind of data collection?


It all depends on your use case.

If user streams are not relevant to you, you may use the `filter` endpoint instead of `sample`, and focus on keywords describing a relevant niche for your analysis.

In case you want to limit yourself to tweets geo-located in a certain location, you have to be aware that Twitter's bounding box filter is buggy: it will return tweets geo-located outside of the desired area, and you are not guaranteed to get all of the tweets inside it.
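
As a rough illustration, here's a sketch using a pre-4.0 version of tweepy (credentials and the bounding box are placeholders) that re-checks coordinates client-side for exactly that reason:

    import tweepy

    # Bounding box, roughly Paris: SW lon, SW lat, NE lon, NE lat (placeholder values)
    BBOX = [2.22, 48.81, 2.47, 48.91]

    def inside(lon, lat):
        return BBOX[0] <= lon <= BBOX[2] and BBOX[1] <= lat <= BBOX[3]

    class GeoListener(tweepy.StreamListener):
        def on_status(self, status):
            # Twitter's locations filter is generous, so re-check the point
            # coordinates before keeping the tweet.
            if status.coordinates:
                lon, lat = status.coordinates["coordinates"]
                if inside(lon, lat):
                    print(status.id, status.text[:80])

    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
    tweepy.Stream(auth, GeoListener()).filter(locations=BBOX)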


Won't SELECT count(*) be super slow? Isn't SELECT Count(some_primary_key) a better idea?


It should be the same. count(*) only needs to return the number of rows (regardless of their values), so it can use only indexes, while count(column) must count only non-null values. But since the primary key is non-null, it should end up taking the same time.
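
If you want to check this on your own engine, a quick-and-dirty timing sketch (DB-API style; the driver, credentials and table name are placeholders):

    import time
    import mysql.connector   # or any DB-API driver for your engine

    conn = mysql.connector.connect(user="me", password="...", database="ticks")
    cur = conn.cursor()

    # Time both forms against the same table and compare
    for query in ("SELECT COUNT(*) FROM trades",
                  "SELECT COUNT(id) FROM trades"):
        start = time.perf_counter()
        cur.execute(query)
        (n,) = cur.fetchone()
        print(f"{query}: {n} rows in {time.perf_counter() - start:.2f}s")

    cur.close()
    conn.close()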


Wow, just tested this against a table with some pretty wide columns and ~7M rows. They take exactly the same time! I would have thought COUNT(*) would be like SELECT *, but I guess the query planner is smart enough to know what to do.

Thanks!


Day job: Web site user sessions and offline retail sales data.

Side project: Poll responses on http://www.correlated.org


Here's some stuff I have done in the past year. I work for a small company, but run a personal computing cluster of 167 servers that I pay for out of my own pocket. I really enjoy loading "big" datasets into them and working on improving algorithms or gaining insight into the data.

I (try and) network around London and offer my services for free to people who have interesting problems.

- Very high-resolution fMRI data. A single scan can be 10-20GB

- Infringing URLs for a piracy company, 4 billion rows

- DNA sequence and protein data, lots of variation in sizes, from a few hundred MB of string data to hundreds of GB

- Raw radio data for a military skunkworks project (tens of GB/min)

I would really like to find an investor who could take me off my full time job, I have 3 quite large projects I would like to build, one I have almost finished.


Do you actually have 167 servers running in your home or do you rent them on e.g. EC2 when you need them? If it's the former case, is it because it makes financial sense (I'd be surprised) or is it for the experience/fun?


I have 3 blade servers in my house; these act as command and control machines. The 167 are rented on EC2.


What do you pay for such a cluster, if I may ask? I assume it's not on EC2? Have you thought of adding a web-crawling service on top of that?


It is on EC2. I pay about £2500 a month - but the monthly cost is lowered because a lot of the machines are on reserved instances - with a big chunk paid up front (sometimes by the people I work with) ... I usually work for free because I'm not really interested in financial gain but more interested in interesting problems ... but I gotta pay the bills sometimes, and the people I work with are usually more than happy to chip in if I need it :)


If you're looking for data sets to play with, check out Kaggle [0]. Companies post data sets there along with questions they want answered, and people compete to find the best way to answer them.

[0] www.kaggle.com


Interesting idea. I'm going to start posting my company's workload online in the form of competitions, letting people work for the possibility of being compensated at sub-market rates.


I work at Prezi. We have about a petabyte of data. It's usage data coming from the product and the website. Clicks in the editor and such. Then we have a data warehouse with cleaned and accurate datasets, which is much less. We are on AWS, we use S3, EMR for Hadoop, Pig, Redshift for SQL, Chartio, etc. We have our own hourly ETL written in Go which we will open-source this year.

I recently talked at Strata, here's the Prezi:

https://prezi.com/d1889jmlziks/strata-2014/


Retail transaction/loyalty, network traffic, financial, and health data.

To be clear, 'big data' is poorly defined, and I mostly do not work with terabyte+ data sets, but rather with high-dimensional data in moderate volume. Data is only 'big' relative to the algorithms you try to use on it.


You can actually think up an interesting application and generate your own data. For example, we were developing a product for processing network events in real time. There were 6-10K events per second, and we were creating alerts for several different scenarios. For testing purposes, we actually wrote a program to simulate those events, at 20K events per second. It was generating fake (but realistic) data in the right format.

Application idea off the top of my head: generate turnstile data for different subway stations (enter/exit, time) and write an application to show the density of those stations over time. You can create a scenario where a certain station is denser than others, and this could be your test. And this application could be your proof of concept.
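
A minimal generator along those lines might look like this (station names, weights and volumes are made up; the skew is deliberate so there's a pattern to find later):

    import csv
    import random
    from datetime import datetime, timedelta

    STATIONS = ["Union Sq", "Times Sq", "Astor Pl", "Bedford Av"]
    # Make some stations deliberately busier so the density analysis has something to show
    WEIGHTS = [5, 8, 2, 1]

    start = datetime(2015, 1, 1)
    with open("turnstile_events.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["station", "direction", "timestamp"])
        for _ in range(1_000_000):
            station = random.choices(STATIONS, weights=WEIGHTS)[0]
            direction = random.choice(["enter", "exit"])
            # Spread events randomly over one month
            ts = start + timedelta(seconds=random.randint(0, 30 * 24 * 3600))
            writer.writerow([station, direction, ts.isoformat()])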


Treasure Data, 400k records per second. For us it's less about the data we manage and more about how easy we make it for customers to store and query it.

Data comes from IoT devices ranging from wearables to cars to frickin' windmills, plus analytics from various websites and mobile games.


I am the data modeler for an organization which lends to small businesses. In my experience "big data" is all in the eye of the beholder, and it's not all about how many gigabytes of data you work with, how wide, or how long it is. The challenges are the same: how to use the data in relevant ways to further organizational goals. In my case the data isn't particularly long in terms of number of rows, but it is exceptionally wide in terms of potential variables. It's enough data that I have to spend a reasonable amount of time thinking about the most efficient way to model (statistically) and data mine. The issues are similar to other data-oriented jobs I've had: how to determine which variables are relevant, clean and transform the data... and ultimately how to turn a big pile of data into a model which effectively predicts likelihood of charge-off if the loan were to be approved. Scintillating stuff, but obscenely difficult. Of course, it's harder too because I'm the only modeler and am fairly inexperienced. My last experience building predictive models was a couple of classes in college... which was also my last experience using R (which I prefer to SAS).

To answer your implied question, I'd recommend picking up ANY size real world data and playing with it. Build statistical models (predictive or otherwise), apply supervised and unsupervised machine learning methods to it, but above all develop a foundation of experience working with real world data. In class in college we used "canned" data sets which were already cleaned, validated, organized, and so forth. This made it unrealistically easy to model. In the real world just working with the data effectively is a hard won skill. So from the get go you need to learn how to explore data, visualize it, interpret plots and statistics, clean/transform/normalize it, formulate a question your data can answer, and apply the relevant methods in pursuit of the answers you seek. Once you have the fundamentals down the size of the data is immaterial--only requiring you to put additional thought into what you can computationally achieve (for instance, how to determine which of 150 candidate variables are statistically relevant).
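
For instance, once the data is cleaned and explored, the modelling step itself can be a fairly small amount of code. A hedged sketch in Python/scikit-learn (the file and column names are hypothetical, and the commenter works in R, so this only shows the shape of the workflow):

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Hypothetical file and columns; the point is the workflow, not the model
    loans = pd.read_csv("loans.csv")
    loans = loans.dropna(subset=["charged_off"])              # clean the target
    features = loans[["annual_revenue", "years_in_business", "debt_to_income"]]
    features = features.fillna(features.median())             # crude imputation
    target = loans["charged_off"]

    X_train, X_test, y_train, y_test = train_test_split(
        features, target, test_size=0.3, random_state=42)

    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)

    print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))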


OT, but I've never liked the term 'big data' precisely because it's so ill-defined. Most people I speak with on this think they have 'big data'. Anything they can't comprehend is "big data". Anything that makes their Excel 97 crash is "big data". It's a pervasive enough term that people have heard it, and use it wrongly.

A colleague of mine is at a company that's advertising for someone with "big data" experience. Collectively, for more than 10 years in business, they have maybe 100GB of data. They just do not know how to organize the data sanely in a relational database, and actively refuse to consider normal data structures.


There is a definition for "very large database" (VLDB) on Wikipedia, the precursor to the term "big data", although it is somewhat outdated.


Financial data (tick to EOD), network traffic data (TCP packet level sends / receives) and farm data (sensor + farm ERP data)

All of them are basically time series with some master data, none of them is more than a few dozen GB

So in any case, I think time series data is worth a look.


You can get vast amounts of stock option data from http://www.orats.com/.

I don't think of the options stream as big data. It is more of a "fast" data problem, where at the open you need to handle 100,000 ticks per second, doing an implied volatility calculation on each tick. It is pretty demanding when, for example, you consider that when the price of IBM moves by one penny, you get 400 options quotes right then. You end up with a billion quotes in a typical day, if memory serves correctly.
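
For reference, the per-tick calculation itself is just a root-find against Black-Scholes. A small sketch with SciPy (the quote, strike and rate are made-up numbers):

    from math import exp, log, sqrt
    from scipy.optimize import brentq
    from scipy.stats import norm

    def bs_call(S, K, T, r, sigma):
        """Black-Scholes price of a European call."""
        d1 = (log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
        d2 = d1 - sigma * sqrt(T)
        return S * norm.cdf(d1) - K * exp(-r * T) * norm.cdf(d2)

    def implied_vol(price, S, K, T, r):
        """Root-find the volatility that reproduces the observed quote."""
        return brentq(lambda sigma: bs_call(S, K, T, r, sigma) - price, 1e-6, 5.0)

    # e.g. a 160-strike call on a $161.50 stock, 30 days out, quoted at $3.20
    print(implied_vol(3.20, 161.50, 160.0, 30 / 365, 0.01))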

I didn't think of this as Big Data, somehow. It seems to me that big data is more complex than that. You most often just process the data, say for backtesting, in a strict series, from the start to the end, or for an interval. I think of Big Data having more structure.


I run backtesting for options trading as a hobby, and storing EOD tick data, querying it and extracting it is a pain. I currently download my source data from a retail historical data provider, then store it in MongoDB on AWS. What would you recommend tech-stack-wise to do backtesting on time series data?


I also first used MongoDB but I switched to Cassandra and I find it much better.

MongoDB was bad for this case because it devours memory and it lacks a primary key (so saving a tick that already exists creates a second tick).

Cassandra is better (it has a PK, although it's a bit weird because of its distributed-first attitude) but in the long run I think Postgres would be even better (because it's space efficient).

Apart from that I use some Java libraries and clojure/incanter and program the rest around it myself.


QuantConnect.com has tick data for every stock in the US universe back to 1998 -- it's roughly 4TB of data.


They don't have options data, but you might check out QuantConnect or Quantopian if you're interested in stocks and FX.


Have you tried quantopian.com ?


Governmental health records and survey data. A lot of the really big stuff we use requires academic licenses, but there's still a lot of publicly accessible data.

For the U.S. try

- CDC's National Center for Health Statistics: http://www.cdc.gov/nchs/

- CDC WONDER: http://wonder.cdc.gov/

- NIH's Unified Medical Language System: http://www.nlm.nih.gov/research/umls/

And for global try the Global Health Data Exchange: http://ghdx.healthdata.org


There are organizations that collect big data at various locations and share it among themselves. Have a look: http://webscience.org/web-observatory/


I think the most interesting datasets are within reach but require curation yourself. For example there are extremely powerful scraping libraries in just about every popular language today, not to mention APIs such as Twitter's.
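
For example, a bare-bones scraping sketch with requests and BeautifulSoup (the URL and CSS selectors are placeholders; check the site's terms and robots.txt before curating anything):

    import requests
    from bs4 import BeautifulSoup

    # Placeholder URL; swap in whatever site you're curating data from
    resp = requests.get("https://example.com/listings?page=1", timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")

    rows = []
    for item in soup.select("div.listing"):          # hypothetical markup
        rows.append({
            "title": item.select_one("h2").get_text(strip=True),
            "price": item.select_one("span.price").get_text(strip=True),
        })
    print(len(rows), "records scraped")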

If you're looking for a cool dataset to play with, I think it is more productive to ask yourself what questions you want to answer and then find/curate the data vs. find a dataset and then ask "what questions can I answer?". The former approach will also keep motivation high if you're driven by curiosity.


> I think it is more productive to ask yourself what questions you want to answer

I second that. An old remark, as I recall due to Richard Bellman, is "We often find that a good question is more important than a good answer." Bellman was the leading proponent of dynamic programming, i.e., usually a case of optimal control, either deterministic or stochastic (the system gets random exogenous inputs while we are trying to control it). He was into a lot of pure and applied mathematics, engineering, medicine, etc. Bright guy. As I recall, his Ph.D. was in stability of solutions of initial value problems for ordinary differential equations, from Princeton.


Human and machine-generated structured event streams, via Snowplow (https://github.com/snowplow/snowplow).

The largest open-access event stream archive I know about is from GitHub, I think it's about 100GB: https://www.githubarchive.org/
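
As a quick example of working with it, this sketch pulls one hour of the archive and counts event types (the URL pattern follows githubarchive.org's documentation at the time; treat it as an assumption):

    import gzip
    import io
    import json
    from collections import Counter

    import requests

    # One hour of the GitHub event archive (assumed URL pattern)
    url = "http://data.githubarchive.org/2015-01-01-15.json.gz"
    raw = requests.get(url, timeout=60).content

    counts = Counter()
    with gzip.open(io.BytesIO(raw), "rt") as f:
        for line in f:                      # one JSON event per line
            counts[json.loads(line)["type"]] += 1

    print(counts.most_common(5))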


Data collected from devices and it is large, but not big. Around 40-60TB and very repetitive data. Find some open set of data that interests you and just do something to get familiar with the tools.

I think most data sets could be handled via RDBMS and Big Data is just another choice. The more interesting thing to me is what you accomplish and if a new tech can get you there faster or cheaper, etc.


Job: Data Science Consultant, Governments and NGOs

For me it's primarily population data. It's not exactly 'big' data in the raw form, but what makes it bigger are the variations produced during analysis, the applications of predictive models, and the new metadata values extracted from it.

The data grows exponentially, faster than we're collecting it, because of all the analysis.


Working with information about attacks all the way down the kill chain. Everything from IDS sigs, English descriptions, attribution, IP/host reputation.

AlienVault is hiring security researchers.

edit: we have some limited data sets that we make public, in case you're interested, hence the name 'Open Threat Exchange'.


I work on the relationships between hashtags and between hashtags and influencers: http://hashtagify.me

For this analysis, we collect the data from Twitter's public API


Working on reducing Big Data to help network security engineers investigate threats faster and respond more accurately.

The data, currently, is mainly from various Network Security Monitoring appliances & SIEMs.


I work at Localytics. We have analytics data from billions of mobile and web users, including specific user actions, usage in general, and user profiles. It really is a fascinating dataset.


Are you really supposed to look at your customers' data?


I never said I looked at it—we work with it, meaning we develop products which allow customers to operate on their own data.


Have worked on a few Big Data projects:
- Sensor data from haul trucks to predict their failures and optimize their routes in the mines
- Telematics data for insurance companies


Mostly logfiles and other machine-generated data (of which 99% can be thrown away, but that's what "big data" does for me: filter out what's important).


Banking and brokerage portal access data.

Utilizing Splunk as an analytics and alerting platform to correlate real-time financial activity events with multiple threat intelligence feeds.


IMDb and Box Office Mojo data. Thinking of moving to MongoDB.


In search engine companies, we sort out cookies every day.


I'm in devops but support the Hadoop cluster. We are an adtech company that has close to a 2 petabyte cluster that is around 76% full.


Perhaps the biggest of big data problems - imagery (photos & video). We build algorithms to extract value from imagery.


Currently sentiment analysis on business reviews (Yelp, Google, Citysearch, Facebook, OpenTable, TripAdvisor).


Genetic sequence data, mostly from cancer.

(Tools are terrible, data sizes up to hundreds of gigabytes per patient)


Hotel reservations, prices, and numerous market indicators, for thousands of hotels.

PriceMatch is hiring in Paris!


GPS and transport data.


Same. >1TB/hour of vehicle and sensor data, including video, GPS and radar.


I work with genomics data. The data is more complex than big.


Log data of browsing history. 500k requests per sec.


How big is each event? How much indexing? How much infrastructure?


Working with ~1PB of remote sensing data and station-derived data. Never used the term 'Big Data' in any work context.


Clickstream logs to build recommendation systems.


Genomics!


I think I would consider anything 100TB and up to be big data. There is no big data that is "easily accessible"; that's why it's "big": it requires extremely powerful hardware and advanced techniques to work on. Otherwise it's just "data".

NOTE: there are people in the world who would laugh at my definition and say that big data starts at 1PB.


In practice it's more a marketing term, and how big is big depends on what the nature of the data is and what you're doing with it.

If it fits in RAM on your laptop, it isn't big data.

If you can't process/handle it in a reasonable time on a single machine and your methods need to explicitly worry about how to scale to handle the data volumes it probably is "Big Data".

Problems that are embarrassingly parallel need far more data before I'd consider them big (I'd be in the >10PB camp), whereas for relational data I'd say >1TB.


I've worked on 50TB relational databases, and I don't consider myself to be a "big data" guy.


I've worked on relational databases of similar size. There are two challenges. The first is that maintaining the relational model at that scale is quite tricky; tradeoffs need to be made. The second is that the systems-level management of that large a deployment requires a bit more than standard configuration management.

These days Amazon has Multi-AZ RDS, which should handle the second item.


The problem with databases that are 50TB or more is that you soon run into limits with the relational model. I have been reading up on different modeling techniques for converting relational models into Cassandra's column-family stores.


The issues are a bit more fundamental.

You can't practically fit 50TB on one machine and have reasonable performance, that means multiple machines with the data spread across them.

There are then two potential issues: 1) you're doing 1-to-1 joins across tables in a query, and network latency may be an issue at high query rates; 2) you're doing 1-to-many or many-to-many joins across tables in a query, and the resulting combinatorial explosion of data is too much to handle.

You want to have your inner loops/joins as deep down in the stack as possible. If you can structure things so all the heavy lifting stays inside one rack/machine/NUMA node/processor/core, you'll be able to scale a good bit further.

Designing things not to require joins, denormalising and putting it in a column store like Cassandra is also a good approach.
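
A sketch of what that denormalisation looks like with the DataStax Python driver (the keyspace, table and columns are hypothetical, and the keyspace is assumed to already exist):

    import uuid
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("shop")

    # Instead of joining orders to users at read time, write a table
    # partitioned by user, so "all orders for this user" is a single-partition
    # read on one replica set rather than a scatter-gather join.
    session.execute("""
        CREATE TABLE IF NOT EXISTS orders_by_user (
            user_id   uuid,
            order_ts  timeuuid,
            item      text,
            amount    decimal,
            PRIMARY KEY (user_id, order_ts)
        ) WITH CLUSTERING ORDER BY (order_ts DESC)
    """)

    some_user_id = uuid.uuid4()   # placeholder; normally comes from your app
    rows = session.execute(
        "SELECT item, amount FROM orders_by_user WHERE user_id = %s LIMIT 20",
        [some_user_id])
    for row in rows:
        print(row.item, row.amount)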


No, you run into limits with some relational engines. I work in a Teradata shop and we handle relational models of this size just fine.


I think one feature is needing disk parallelism because you have a workload that demands table scans, and maintaining indexes is impractical due to dynamism in the data.

Another is not having pockets deep enough to solve it with intellectual property, either in the form of a parallel proprietary RDBMS (expensive) or the resources to implement clever stuff.

Big data as a technology is about being dumb as a brick, cheap as chips, brute force.


Actually you can normally get away without resorting to "big data" techniques at 1PB scale; it's probably around 10PB that you're really forced to start thinking about things like Hadoop-style MapReduce/rack locality.


Probably depends on what you're doing with the data?


Right, I'm thinking of the "embarrassingly parallel" sort of problems, where you can shard the data (for example by city, country, company or some other obvious classifier) and give one shard to one server.

If you were trying to do anything that's O(n log n) or O(n^2) (think graph processing) then you'll run into trouble at much smaller scales.


Count me in as someone who says the word better start with at least "peta"

It's not just the size though; IMO it implies a certain dimensionality and/or lack of structure. At work we're sitting on several petabytes and I don't view it as Big Data because it's actually pretty simple. We share many of the same problems as Big Data, but not all.


Those numbers make me think of "store all the things" rather than useful statistical data.

1PB is arguably enough data to store genetic variation across all human beings.


1PB / 100GB per genome = 10,000 genome sequences. And that's just raw data from one platform. If you're interested in e.g. splicing diversity you would want to do long-read RNA sequencing. Leaving room for intermediate results (alignments, assemblies), you would only have room for a thousand people.


Thanks for picking up on this :) As I said, it's arguable.

I was working on the principle that the effective population size of humans is 10,000.

(And your genome is oversized, no? 3 billion base-pairs is less than 1 Gigabyte)


I'm referring to guys I know who have worked or presently work at CERN.


In addition to CERN, the LSST is expected to generate 15 terabytes per day.


What I really meant by "easily accessible" was "publicly available".


>NOTE: there are people in the world who would laugh at my definition and say that big data starts at 1Pb.

I commend them for having a larger penis ^H^H^H^H^H^H data stack than you.

I thought big data was less about the actual size of the data store and more about where it comes from (typically passive collection from user activity) and how it's accessed (through some kind of large map-reduce style framework) and used (to inform product decisions or learn more about human behavior)?


From the technology standpoint, where it comes from and how it's used doesn't really make a difference - if it can be processed on a single very beefy machine when done properly, then the appropriate/efficient way to work with this data is by avoiding big data techniques.

If it cannot, then you pay the price of all the complexity and overheads of big data processing techniques so that you can get your processing done.

It's correlated with data size, but not so strictly - you can get, for example, NLP processing problems where you need a painful pipeline split over a huge cluster for a single GB of input data, and you can have problems where the best way to process a petabyte dataset is just to stick with a single powerful machine to get the performance benefits of locality and low latency, and avoid managing splits/failed nodes/whatever.

So, in the first problem you would need to use Big Data techniques and the second problem you don't, it's not related to big data and the recommendations on how best to do that won't help people who need to do big data processing.


So some other definition of "big" than umm "big" then?


Yes, absolutely. When people talk about big data, more often than not it's a measure of complexity and difficulty, not size.


Yeah, for everyone but physicists it's really "big enough" data: it's a big enough data set that you've started recording things you didn't even try to record.

An excellent example was on HN the other day, using the NYC taxi data to determine which drivers are observant Muslims. It's not something anyone set out to record, but the data set has gotten so large that if you turn it sideways and shake, random facts like that fall out.


Do you think that when people make 1000+ table relational databases it's because a) it's fun, b) they're stupid, or c) it's modelling something that is inherently complex?

Big data is neither big nor particularly complex.


I have seen several news articles using it that way. I think that's going to become an alternate definition soon.



