Data scientists shouldn't be writing anything to the Data Lake. Data Lakes store raw datasets (sort of like how event-streaming databases store raw events). In academic terms, they store primary-source data.
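
To make "raw" concrete, here's a minimal sketch of landing data in the lake, assuming an S3-backed lake (the bucket name and key layout are made up for illustration). The payload is stored byte-for-byte as it arrived, with no cleaning or reshaping:

  import datetime
  import boto3

  s3 = boto3.client("s3")

  def land_raw_payload(source: str, payload: bytes) -> None:
      # Write a scraped or purchased payload to the lake exactly as received.
      # Hypothetical bucket and key scheme; the point is: no transformation.
      now = datetime.datetime.now(datetime.timezone.utc)
      key = f"raw/{source}/{now:%Y/%m/%d/%H%M%S}.json"
      s3.put_object(Bucket="example-data-lake", Key=key, Body=payload)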

Once data has been through some transformations at the hands of a Data Scientist, it's now a secondary source—a report, usually—and exists in a form better suited to living in a Data Warehouse.
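
For contrast, a sketch of the secondary-source case, assuming pandas and a SQLAlchemy engine (the warehouse URL and table name are hypothetical). The output is cleaned and aggregated, so it goes to the Data Warehouse rather than back into the lake:

  import pandas as pd
  import sqlalchemy

  # Hypothetical warehouse connection; any SQLAlchemy-supported backend works.
  warehouse = sqlalchemy.create_engine("postgresql://warehouse.example/analytics")

  # Stand-in for raw data read out of the lake.
  raw_events = pd.DataFrame({
      "user_id": [1, 1, 2, None],
      "purchase_amount": [10.0, 5.0, 7.5, 3.0],
  })

  # The transformation is what makes this a secondary source: rows are
  # dropped, values are aggregated, and the result encodes an opinion.
  report = (
      raw_events
      .dropna(subset=["user_id"])
      .groupby("user_id", as_index=False)["purchase_amount"]
      .sum()
  )

  report.to_sql("purchase_totals", warehouse, if_exists="replace", index=False)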

Data Lakes need a priesthood to guard their interface, the way DBAs guard DBMSes. The difference is that DBAs have to guard against misarchitected read workloads, while the manager of a Data Lake doesn't need to worry about that; they only need to worry about people putting the wrong things (= secondary-source data) into the Data Lake in the first place.

In most Data Lakes I've seen, there are specific teams with write privileges, where "putting $foo in the Data Lake" is their whole job: researchers who write scrapers, data teams that buy datasets from partners and dump them in, etc. Nobody else in the company needs to write to the Data Lake, because nobody else has raw data; if your data already lives in a company RDBMS, you don't move it from there into the Data Lake to process it; you write your query to pull data from both.
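
Here's what "pull data from both" can look like, sketched with DuckDB (the connection string, bucket, and table names are all made up). The engine reads Parquet out of the lake in place and joins it against the RDBMS; nothing gets copied into the lake:

  import duckdb

  con = duckdb.connect()
  con.execute("INSTALL httpfs; LOAD httpfs;")      # read Parquet straight from S3
  con.execute("INSTALL postgres; LOAD postgres;")  # scan the RDBMS directly
  con.execute("ATTACH 'host=rdbms.example dbname=appdb' AS app (TYPE postgres);")

  result = con.execute("""
      SELECT u.email, count(*) AS clicks
      FROM read_parquet('s3://example-data-lake/raw/clickstream/*.parquet') c
      JOIN app.public.users u ON u.id = c.user_id
      GROUP BY u.email
  """).fetchdf()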

An analogy: there is a city by a lake. The city has water treatment plants which turn lakewater into drinking water and pump it into the city water system. Let's say you want to do an analysis of the lake water, but you need the water more dilute (i.e. with fewer impurities) than the lake water itself is. What would you do: pump the city water supply into the lake until the whole lake is properly dilute? Or just take some lake water in a cup and pour some water from your tap into the cup, and repeat?



