
That's "literally" not it.

The piece discusses the neurobiological background, social effects, shows examples, and closes with a recommendation on what to try if you are a victim of doomscrolling.

In fact, this writing does have the feel of blog posts from 10-15 years ago (of which you could say they were too short compared to books written 100-150 years ago on the same topic). I wish some people read this today instead of spending those 8 minutes mindlessly scrolling TikTok.


    > I wish some people read this today instead of spending those 8 minutes mindlessly scrolling TikTok
No trolling here: I fail to understand the difference between scrolling TikTok and scrolling shitty cable TV in 2000. It's not that different to me.


With cable TV, you might click around while finding something to watch, but when you found something to watch, you'd watch it, for multiple minutes, until commercial break. Then, you'd either sit through the commercials, or press JUMP to go back to your "backup channel", with something else you'd want to watch while you estimated the duration of the commercial block on the first channel.

The point is, you were being exposed to novel content at a much slower rate than you are when you scroll through your short-video social media feeds. It already wasn't a great situation to begin with, as far as attention spans are concerned, but this new age of short video social media apps has undoubtedly exacerbated the issue.


I do see some.

TV shows are longer. And I don't mean double the length, but literally 100x longer. There is a chance you accidentally find a channel with a program that really triggers your curiosity, and then you stop scrolling. On TikTok this is not possible; after a couple of seconds the mildly interesting content is gone. Good luck finding it again, or even remembering what you found interesting the next morning, sitting on the toilet, after swearing you'd "check out this topic later".

Another aspect is the quality of the content, which, chances are, is higher. Not all of the programmes, obviously, but, again, you might find a documentary, an art piece (aka a movie), etc. On TikTok/Instagram/YT/Facebook reels anyone can post any rubbish, without any curation, and it's up to some algorithm to decide if it's shown to you.

Not to mention the regimen you create, the ritual if you like, of sitting down in front of the TV at the time your beloved programme is aired.

I used to think a lot about the "it's the same as TV / newspapers / books used to be" argument, and I definitely believe it's not. Especially since I've dated both the "book reader" and the "insta story addict" type of person and tried to have conversations with them about various topics. Day and night.

In fact, I started going back to TV. Not the smartphone-controlled / streaming kind, but using the remote, and scrolling through the handful of channels I added to favorites. I do sometimes stop at a random program (yesterday I learned that there are venomous beetles in Switzerland!). Sometimes I don't, and I switch the TV off and just stare out the window.

TV doomscrolling is bad (news channels being the worst offenders). Still, it's a bit less bad than "social media" (scrolling addiction). I kinda think the suggested meta-observation (10-10-10) is worth trying.


Scrolling shitty cable is much less fun and therefore less addictive. It’s also not always available.

But both are terrible ways to spend your time and bad for your brain.


If you curate your feed, TikTok offers plenty of useful content, way beyond what you could find on 2000s cable TV, or anywhere else even today.

Piggybacking to comment on:

https://news.ycombinator.com/item?id=40559854

How can tech people misunderstand social media ML algorithms so badly... gee, I wonder what could cause that.


That doesn't refute the parent's point. It's still a barrage of different ideas, opinions, or jokes that's hitting your brain in very short intervals. You don't even get to deeply interact with one of the concepts before you're being shown the next one. Even though you might feel you only watch quality content, that's only the dopamine addiction speaking. If you don't believe me, try reading a book for an hour straight, without touching your phone or doing something else. Most people who use social media regularly will struggle to do this.


> Even though you might feel you only watch quality content, that’s only the dopamine addiction speaking.

Did you learn what is happening inside my mind, in fact, by reading a book, or via some more esoteric approach?


Attention has become the currency of our society, and being able to captivate it means you have power. So it's possible someone will do a TikTok video about this article, pretty much like the woman who did a video about how she found out that the Instagram comment section looks completely different from a woman's account than from a man's account.


They seem to be pivoting on quality, at least in Switzerland, as well. I rarely go to Aldi, but whenever I do, the products with labels unknown to me feel like better quality, tastier. I'm always surprised.

The "optimized" workflow is a plus. Most (countryside) Aldi stores also have the very same layout; I myself finish my grocery shopping quicker in Aldi.


I am outcompeted. I'll probably never climb further up the "corporate ladder". My comp is "enough" (actually, more than enough) for me. I don't care about the labels (senior, partner, manager, lead, etc.).

I find fun in the engineering work I do, while taking loooooooong workation travels to somewhere warmer during cold and dark European winters. I have almost zero work stress, no reporting pressure up- or downstream, and obviously no time wasted on a daily commute.

I still maintain rigid work hours. Well, not rigid compared to an office 9-5, but compared to the "freedom" I could have: I strictly _try_ to timebox my work hours. 90% of days it works. This is important so that 1) I ensure work-life balance and 2) I suppress the inexplicable guilt of basically beach-holidaying while most of my colleagues are sitting in the office.

If promotions etc. are important to you, and the company isn't fully remote-only, then I don't think you have good chances as a remote "anywhere in the world" worker, though.


Exactly my thoughts when I first saw this new thing tonight (except for the "trustworthiness of the Google brand" part... that's long gone in my opinion).

Interested in the answers here; hopefully some anonymous Googlers will share some insights.


When I was hiring data scientists for a previous job, my favorite tricky question was "what stack/architecture would you build?", with the somewhat detailed requirement of "6 TiB of data" in sight. I was careful not to require overly complicated sums; I simply said it's MAX 6 TiB.

I patiently listened to all the BigQuery/Hadoop habla-blabla, even asked questions about the financials (hardware/software/license BOM), and many of them came up with astonishing sums of tens of thousands of dollars yearly.

The winner, of course, was the guy who understood that 6 TiB is what the six of us in the room could store on our smartphones, or on a $199 enterprise HDD (or three of them for redundancy), and that it could be loaded (multiple times over) into memory as CSV so you could simply run awk scripts on it.
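
To make that concrete, a rough sketch of that kind of one-pass aggregation (file and column names are made up; plain Python standing in for awk):

    import csv
    from collections import defaultdict

    # Stream the CSV once; memory stays bounded by the number of distinct keys,
    # not by the size of the file. File and column names are hypothetical.
    totals = defaultdict(float)
    with open("events.csv", newline="") as f:
        for row in csv.DictReader(f):
            totals[row["customer_id"]] += float(row["amount"])

    top10 = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:10]
    print(top10)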

I am prone to the same fallacy: when I learn how to use a hammer, everything looks like a nail. Yet, not understanding the scale of "real" big data was a no-go in my eyes when hiring.


One thing that may have an impact on the answers: you are hiring them, so I assume they are going through a technical interview. So they expect that you want to check their understanding of the technical stack.

I would not conclude that they over-engineer everything they do from such an answer, but rather just that they got tricked in this very artificial situation where you are in a dominant position and ask trick questions.

I was recently in a technical interview with an interviewer roughly my age and my experience, and I messed up. That's the game, I get it. But the interviewer got judgemental towards my (admittedly bad) answers. I am absolutely certain that were the roles inverted, I could choose a topic I know better than him and get him in a similarly bad position. But in this case, he was in the dominant position and he chose to make me feel bad.

My point, I guess, is this: when you are the interviewer, be extra careful not to abuse your dominant position, because it is probably counter-productive for your company (and it is just not nice for the human being in front of you).


From the point of view of the interviewee, it's impossible to guess if they expect you to answer "no need for big data" or if they expect you to answer "the company is aiming for exponential growth so disregard the 6TB limit and architect for scalability"


It doesn't matter. The answer should be "It depends, what are the circumstances - do we expect high growth in the future? Is it gonna stay around 6TB? How and by whom will it be used and what for?"

Or, if you can guess what the interviewer is aiming for, state the assumption and go from there "If we assume it's gonna stay at <10TB for the next couple of years or even longer, then..."

Then the interviewer can interrupt and change the assumptions to his needs.


The interview question made it clear it was a maximum of 6 TiB. With this, one can assume it's not growing, yet still think the interviewer either doesn't know this isn't big data or wants to test your knowledge of the tooling anyway.


From what the comment tells, the interview question said that the maximum IS 6TiB. There was no further information given, so I assume it didn't make any assumptions about how it might change.

Even if it said "it will stay at 6TiB" I would probably prefer a senior candidate to briefly question it, such as: "It is surprising that we know it will stay at 6TiB, and if this were a real project I'd try to at least sanity-check that this requirement is really correct, but for now I'll assume this is a given..."

At least if, as the interviewer, I had told them to treat the challenge like a real request/project and not to take the given numbers as beyond question.


You shouldn’t guess what they expect, you should say what you think is right, and why. Do you want to work at a company where you would fail an interview due to making a correct technical assessment? And even if the guess is right, as an interviewer I would be more impressed by an applicant that will give justified reasons for a different answer than what I expected.


>Do you want to work at a company where you would fail an interview due to making a correct technical assessment?

How much do they pay? How long has it been since my last proper meal? How long until my rent is due?


FWIW, it's an extra 2.5 seconds to say "You don't need big data for this, but if you insist, ..." and then give me the Hadoop answer.


That's great, but that's really just a detail about you and your personal situation.

E.g., if a HN'er takes this as advice they're just as likely to be gated by some other interviewer who interprets hedging as a smell.

I believe the posters above are essentially saying: you, the interviewer, can take the 2.5 seconds to ask the follow up, "... and if we're not immediately optimizing for scalability?" Then take that data into account when doing your assessment instead of attempting to optimize based on a single gate.

Edit: clarification


This is the crux of it. Another interviewer would've marked "run it on a local machine with a big SSD" as: this fool doesn't know enough about distributed systems and just runs toy projects on one machine.


That is what I think interviewers think when I don’t immediately bring up kubernetes and sqs in an architecture interview


The interview goes both ways.

The few times I've ignored red flags from a company in an interview I've been in a world of pain afterwards.


Depends on the shop? For some kinds of tasks, jumping to Kubernetes right away would be a minus during the interview.


Recommending kubernetes immediately is an insta-rejection for me, if I was an interviewer


Exactly!!! This is an interviewer issue. You should be guiding them. Get the answers for each scenario out of both sets of interviewees and see if they are agile enough to roll with it. Doing a one-shot "quiz" is not that revealing...


So give both answers. It would happen all the time in college, or in a thesis defense: one disagrees with the examiner and finds oneself in a position of "is this guy trying to trick me?"

Give both answers, and explain why the obvious "hard" answer is wrong


> E.g., if a HN'er takes this as advice they're just as likely to be gated by some other interviewer who interprets hedging as a smell.

If people in high stakes environments interpret hedging as a smell - run from that company as fast as you can.

Hedging is a natural adult reasoning process. Do you really want to work with someone who doesn't understand that?


Agreed. I don't really understand the mindset of someone who would consider this sort of hedging a smell. A candidate being able to take vague requirements and clearly state different solutions for different scenarios is an excellent candidate. That would be considered a positive signal for myself and pretty much all the interviewers I've worked with.


I once killed the deployment of a big data team at a large bank when, during an interview, I laid out in excruciating detail exactly what they'd have to deal with.

Last I heard, they'd promoted one Unix guy on the inside to babysit a bunch of cron jobs on the biggest server they could find.


Sure, but as you said yourself: it's a trick question. How often does the employee have to answer trick questions without having any time to think in the actual job?

As an interviewer, why not ask: "how would you do it in a setup that doesn't have much data and doesn't need to scale, and then how would you do it if it had a ton of data and a big need to scale?" There is no trick here; do you feel you lose information about the interviewee?


Trick questions (although not known as such at the time) are the basis of most of the work we do, no? The XY problem is a thing for a reason, and I cannot count the number of times my teams and I have ratholed on something complex only to realize we were solving the wrong problem, i.e. a trick question.

As a sibling puts it though, it's a matter of level. Senior/staff and above? Yeah, that's mostly what you do. Lower than that, then you should be able to mostly trust those upper folks to have seen through the trick.


> are the basis of most of the work we do?

I don't know about you, but in my work, I always have more than 3 seconds to find a solution. I can slowly think about the problem, sleep on it, read about it, try stuff, think about it while running, etc. I usually do at least some of those for new problems.

Then of course there is a bunch of stuff that is not challenging and for which I can start coding right away.

In an interview, those trick questions will just show you who already has experience with the problem you mentioned and who doesn't. It doesn't say at all (IMO) how good the interviewee is at tackling challenging problems. The question then is: do you want to hire someone who is good at solving challenging problems, or someone who already knows how to solve the one problem you are hiring them for?


If the interviewer expects you to answer entire design question in 3 seconds, that interview is pretty broken. Those questions should take longish time (minutes to tens of minutes), and should let candidate showcase their thought process.


I meant that the interviewer expects you to start answering after 3 seconds. Of course you can elaborate over (tens of) minutes. But that's very far from actual work, where you have time to think before you start solving a problem.

You may say "yeah but you just have to think out loud, that's what the interviewer wants". But again that's not how I work. If the interviewer wants to see me design a system, they should watch me read documentation for hours, then think about it while running, and read again, draw a quick thing, etc.


Being able to ask qualifying questions like that, or presenting options with different caveats clearly spelled out, is part of the job description IMO, at least for senior roles.


Depends on the level you're hiring for. At a certain point, the candidate needs to be able to identify the right tool for the job, including when that tool is not the usual big data tools but a simple script.


Because the interview is supposed to ask the same kinds of questions as the real job, and in the real job there are rarely big hints like the ones you are describing.

On the other hand, "hey I have 6TiB data, please prepare to analyze it, feel free to ask any questions for clarification but I may not know the answers" is much more representative of a real-life task.


Once had a coworker write a long proposal to rewrite some big old application from Python to Go. I threw in a single comment: why don't we use the existing code as a separate executable?

Turns out he was laid off and my suggestion was used.

(Okay, I'm being silly, the layoff was a coincidence)


Is this like interviewing for a chef position for a fancy restaurant and when asked how to perfectly cook a steak, you preface it with “well you can either go to McDonald’s and get a burger, or…”

It may not be reasonable to suggest that in a role that traditionally uses big data tools


I see it more like "it's 11pm and a family member suddenly wants to eat a steak at home, what would you do?"

The person who says "I'm going to drive back to the restaurant and take my professional equipment home to cook the steak" is probably offering the wrong answer.

I'm obviously not a professional cook, but presumably the ability to improvise with whatever tools you currently have is a desirable skill.


Hmm, I would say that the equivalent of your 11pm question is more something like "your sister wants to back up her holiday pictures in the cloud, how do you design it?". The person who says "I'll ask her for 10 million to build a data center" is probably offering the wrong answer :-).


I'm not sure if you are referencing it intentionally or not, but some chefs (Gordon Ramsay, for one) will ask an interviewee to make some scrambled eggs; something not super niche or specialized, but enough to see what their technique is.

It is a sort of "interview hack" example that went around a while ago, used to emphasize the idea of a simple, unspecialized skill test. I guess upcoming chefs probably practice egg scrambling nowadays, ruining the value of the test. But maybe they could ask them to cook a bit of steak now.


This is sort of the chef equivalent of fizzbuzz or "reverse a binary tree" - there are no gimmicks, it's something "everyone" should know how to do, nothing fancy, just the basics of "can you iterate over data structures and write for loops competently" - or in this case "can you not under/over cook the eggs, and can you deliver them in the style you say you're going to".


The egg omelet is a classic for testing French chefs. It is much less about the ingredients (quality, etc), and much more about pure skill to cook it perfectly.

What would be the equivalent for a technical interview? Perhaps: Implement a generic linked list or dynamic array.


The fancy cluster is probably slower for most tasks than one big machine storing everything in RAM. It's not like a fast food burger.


Idk, in this instance I feel pretty strongly that cloud, and solutions with unnecessary overhead, are the fast food. The article proposes not eating it all the time.


I think it's more like: how would you prepare and cook the best five-course gala dinner for only $10? That requires true skill.


It's almost a law that "all technical discussions devolve into interview mind games"; this industry has a serious interview/hiring problem.


Engineering for scalability here is the single server solution that you throw away later when scale is needed. The price is so small (in this case) for the simple solution that you should basically always start with it.


It's great if the interviewer actually takes time to sort out the questions you have, because questions that seem simple to the asker carry a lot of assumptions they've made.

I had an interview question: "design an app store". I tried asking: OK, an app store has a ton of components, which part of the app store are you asking about exactly? The response I got was "Have you ever used an app store? Design an app store." Umm, ok.


Red flag - walk away. The interview did its job.


But the interview is for a data science position. Why play games with "Ha ha, 6 TiB can fit on my PC, we actually don't need you"?


I would have had that uncertainty that you are describing when I was a junior dev.

But now as a senior, I have the same questions and answers regardless if I’m being interviewed or not.


You don't need to guess, you can just ask the interviewer. (Shocker, I know.)


Architecting for scalability doesn't mean disregarding resource costs, though - rather, the opposite. In fact, resource costs are even more important at higher scale, because more resources cost more.


> just that they got tricked in this very artificial situation where you are in a dominant position and ask trick questions

DING DING DING!

In about 30% of all the interviews I've had in my career, the person doing the technical interview was using it as a means to stroke their ego. I got the impression they don't get the power they want at work, and this was a brief reprieve where they are "the expert" and get to give a thumbs up or thumbs down from their kingly intellectual facade.

TBF, I personally cut them some slack because they may actually be right about a lot and not given the authority to do what they know is right.

Still, if we're subjected to AI resume filtering, then how about we replace the technical interview process with AI too and eliminate the BS ego trips?


I want my boss to be straight with me, and I need those below me to call me out if I'm talking bullshit.

Tell me this isn't big data, and then, if you must, tell me about hadoop (or whatever big data is)


https://x.com/garybernhardt/status/600783770925420546 (Gary Bernhardt of WAT fame):

> Consulting service: you bring your big data problems to me, I say "your data set fits in RAM", you pay me $10,000 for saving you $500,000.

This is from 2015...


I wonder if it's fair to revise this to "your data set fits on NVMe drives" these days. It's astonishing how fast drives have become and how much storage you can get now.


Based on a very brief search: Samsung's fastest NVME drives [0] could maybe keep up with the slowest DDR2 [1]. DDR5 is several orders of magnitude faster than both [2]. Maybe in a decade you can hit 2008 speeds, but I wouldn't consider updating the phrase before then (and probably not after, either).

[0] https://www.tomshardware.com/reviews/samsung-980-m2-nvme-ssd...

[1] https://www.tomshardware.com/reviews/ram-speed-tests,1807-3....

[2] https://en.wikipedia.org/wiki/DDR5_SDRAM


I think the point is that if it fits on a single drive, you can still get away with a much simpler solution (like a traditional SQL database) than any kind of "big data" stack.


I always heard it as "if the database index fits in the RAM of a single machine, it's not big data". The reason being that this makes random access fast. You always know where a piece of data is.

Once the index is too big to have in one place, thing get more complicated.


The statement was "fits on", not "matches the speed of".


980 is an M.2 drive, PCIe 3.0 x4, 3 years old, up to 3500MB/s sequential read.

You want something like PM1735: PCIe 4.0 x8, up to 8000 MB/s sequential read.

And while DDR5 is surely faster the question is what the data access patterns are there.

In almost all cases (ie mix of random access, occasional sequential reads) just reading from the NVMe drive would be faster than loading to RAM and reading from there. In some cases you would spend more time processing the data than reading it.

PS: all these RAM bandwidth rates hold for sequential access; as you go random-access, the bandwidth drops.

https://semiconductor.samsung.com/ssd/enterprise-ssd/pm1733-...


Several gigabytes per second, plus RAM caching, is probably enough though. Latency can be very important, but there exist some very low latency enterprise flash drives.


Your data access patterns are fast enough on NVMe. You owe me $10,000 for saving you $250,000 (in RAM).

The value of our data skills is getting eroded!


You can always check available ram: https://yourdatafitsinram.net/



Plenty of people get offended if you tell them that their data isn't really "big data". A few years ago I had a discussion with one of my directors about a system IT had built for us, with Hadoop, API gateways, multiple developers and hundreds of thousands in yearly cost. I told him that at our scale (now and in any foreseeable future) I could easily run the whole thing on a USB drive attached to his laptop and a few Python scripts. He looked really annoyed and I was never involved with this project again.

I think it’s part of the BS cycle that’s prevalent in companies. You can’t admit that you are doing something simple.


In most non-tech companies, it comes down to the motive of the manager and in most cases it is expansion of reporting line and grabbing as much budget as possible. Using "simple" solutions runs counter to this central motivation.


- the manager wants expansion

- the developers want to get experience in a fancy stack to build up their resume

Everyone benefits from the collective hallucination


This is also true of tech companies. Witness how the "GenAI" hammer is being used right now at MS, Google, Meta, etc.


Assuming that it's not me being an ass (and saying "I could do it with a few Python scripts off a USB drive" would be being an ass), if my interviewer is offended because I offered the practical solution, I am not working there, especially if they're from the team I'm going to work on.

It’s like a reverse signal.


Resume Driven Development is real. Yes, you can optimize the solution to just a USB drive. But that wouldn't check the boxes of people looking at your resume.


That's the tech sector in a nutshell. Very few innovations actually matter to non-tech companies. Most companies could survive on Windows 98 software.


The flipside of this is that people at some places are fully aware of this and get very suspicious of consultants offering to handle their 'big data' challenges with loads of highly proprietary stuff.

Data organisation can be more of an issue, but the general issue with this is often a lack of internal discipline on the data owners to carefully manage their data. But that isn't nearly as attractive for management. Putting 'brought in new cloud vendor / technology' looks better than 'improved data organisation' on a CV even if the new vendor was a waste of money.


I can appreciate the vertical scaling solution, but to be honest, this is the wrong solution for almost all use cases - consumers of the data don't want awk, and even if they did, spooling over 6 TB for every kind of query without partitioning or column storage is gonna be slow on a single CPU - always.

I've generally liked BigQuery for this type of stuff - the console interface is good enough for ad-hoc stuff, you can connect a plethora of other tooling to it (Metabase, Tableau, etc). And if partitioned correctly, it shouldn't be too expensive - add in rollup tables if that becomes a problem.


He's hiring data scientists, not building a service, though. This might realistically be a one-off analysis of those 6 TB. At which point you are happy your data scientist has returned statistical information instead of spending another week making sure the pipeline works if someone puts a Greek character in a field.


Even if I'm doing a one-off, depending on the task it can be easier/faster/more reliable to load 6 TiB into a BigQuery table than to wait hours for some task to complete while fiddling with parallelism and memory management.

It's a couple hundred bucks a month and $36 to query the entire dataset; after partitioning, that's not terrible.


A 6 TB hard drive and Pandas will cost you a couple hundred bucks as a one-time purchase, and then last you for years (and several other data analysis jobs). It also doesn't require that you be connected to the Internet, doesn't require that you trust 3rd-party services, and is often faster (even in execution time) than spooling up BigQuery.

You can always save an intermediate data set partitioned and massaged into whatever format makes subsequent queries easy, but that's usually application-dependent, and so you want that control over how you actually store your intermediate results.
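
For instance, a minimal sketch of that streaming-plus-intermediate-files idea (file names, column names and chunk size are hypothetical; to_parquet needs pyarrow or fastparquet installed):

    import os
    import pandas as pd

    os.makedirs("intermediate", exist_ok=True)

    # Stream the big CSV in chunks and persist a query-friendly intermediate copy.
    for i, chunk in enumerate(pd.read_csv("raw_events.csv", chunksize=5_000_000)):
        chunk.to_parquet(f"intermediate/part-{i:05d}.parquet")

    # Later analyses read back only the columns they actually need.
    subset = pd.read_parquet("intermediate", columns=["customer_id", "amount"])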


I wouldn't make a purchase of either without knowing a bit more about the lifecycle and requirements.

If you only needed this once, the BQ approach requires very little setup and many places already have a billing account. If this is recurring then you need to figure out what the ownership plan of the hard drive is (what's it connected to, who updates this computer, what happens when it goes down, etc.).


This brings up a good point: why is the data scientist being asked architecture questions anyway? This seems more like the desired answer for a posting like "hiring for a scrappy ML engineer / sysadmin".


And here we see this strange thing that data science people do: forgetting that 6 TB is small change for any SQL server worth its salt.

Just dump it into Oracle, Postgres, MSSQL, or MySQL and be amazed by the kind of things you can do with 30-year-old data analysis technology on a modern computer.


You wouldn't have been a "winner" per the OP, though. The real answer is loading it onto their phones, not into SQL Server or whatever.


To be honest, the OP is kind of making the same mistake in assuming that the only real alternatives that exist as valuable tools are "new data science products" and old-school scripting.

The lengths people go to to avoid recognizing how much the people who created the SQL language and the relational database engines we now take for granted actually knew what they were doing are a bit of a mystery to me.

The right answer to any query that can be defined in SQL is pretty much always an SQL engine, even if it's just SQLite running on a laptop. But somehow people seem to keep coming up with reasons not to use SQL.
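
To illustrate, a minimal sketch with SQLite from Python (table, file, and column names are made up):

    import csv
    import sqlite3

    con = sqlite3.connect("events.db")
    con.execute("CREATE TABLE IF NOT EXISTS events (customer_id TEXT, amount REAL)")

    # Bulk-load the CSV once, then let the engine do the heavy lifting in SQL.
    with open("events.csv", newline="") as f:
        rows = ((r["customer_id"], float(r["amount"])) for r in csv.DictReader(f))
        con.executemany("INSERT INTO events VALUES (?, ?)", rows)
    con.commit()

    print(con.execute(
        "SELECT customer_id, SUM(amount) AS total FROM events "
        "GROUP BY customer_id ORDER BY total DESC LIMIT 10").fetchall())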


I agree with this. BigQuery or AWS s3/Athena.

You shouldn't have to set up a cluster for data jobs these days.

And it kind of points out the reason for going with a data scientist with the toolset he has in mind instead of optimizing for a commandline/embedded programmer.

The tools will evolve in the direction of the data scientist, while the embedded approach is a dead end in lots of ways.

You may have outsmarted some of your candidates, but you would have hired a person not suited for the job long term.


It is actually pretty easy to do the same type of processing you would do on a cluster with AWS Batch.


Possibly, but it seems like overkill for the type of analysis that the OP expected the interviewee to do with awk.

SQL should be fine for that.

Actually, I have a feeling that the awk solution will struggle if there are many unique keys.

For example, if that dataset has a million customers and they want to extract the top 10, then there is an intermediate map stage that will be storage- or memory-consuming.

It is like matrix multiplication. Calculating the dot product is trivial, but when the matrix has n:m dimensions and n and m start to grow, it becomes more and more resource-heavy. And then the laptop will not be able to handle it.

(in the example, m is the number of rows, and n is the number of unique customers. The dot product is just a sum over one dimension, while the group by customer id is the tricky part)


I agree completely for this scale. I did want to point out that it's fairly easy these days to do the kinds of things one would do on a cluster, which I learned just a few months ago myself :)


quick addition: there are modules (eg cloudknot) for Python that make it possible to run a Python callable that launches an AWS Batch environment and job with a single method, which you could do anywhere that runs Python.


Once you understand that 6 TB fits on a hard drive, you can just as well put it in a run-of-the-mill Postgres instance, which Metabase will reference just as easily. Hell, Metabase is fine with even a CSV file...


I worked in a large company that had a remote desktop instance with 256 GB of RAM running a Postgres instance that analysts would log in to to do analysis. I used to think it was a joke of a setup for such a large company.

I later moved to a company with a fairly sophisticated setup with Databricks. While Databricks offered some QoL improvements, it didn't magically make all my queries run quickly, and it didn't allow me anything that I couldn't have done on the remote desktop setup.


You can scale vertically with much better tech than awk.

Enter DuckDB, with columnar vectorized execution and full SQL support. :-)

Disclaimer: I work with the author at MotherDuck and we make a data warehouse powered by DuckDB.
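
A minimal sketch of what that looks like from Python (file and column names are hypothetical):

    import duckdb

    # DuckDB scans the CSV files in place with a parallel, vectorized engine.
    print(duckdb.sql("""
        SELECT customer_id, SUM(amount) AS total
        FROM read_csv_auto('events_*.csv')
        GROUP BY customer_id
        ORDER BY total DESC
        LIMIT 10
    """).df())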


A moderately powerful desktop processor has memory bandwidth of over 50TB/s so yeah it'll take a couple of minutes sure.


Running awk on an in-memory CSV will come nowhere even close to the memory bandwidth your machine is capable of.


The slow part of using awk is waiting for the disk platter to spin under the magnetic head.

And most laptops have 4 CPU cores these days, and a multiprocess operating system, so you don't have to wait for random access on a spinning platter to find every bit in order; you can simply have multiple awk commands running in parallel.

Awk is most certainly a better user interface than whatever custom BrandQL you have to use in a textarea in a browser served from localhost:randomport


I haven't been using spinning disks for perf critical tasks for a looong time... but if I recall correctly, using multiple processes to access the data is usually counter-productive since the disk has to keep repositioning its read heads to serve the different processes reading from different positions.

Ideally if the data is laid out optimally on the spinning disk, a single process reading the data would result in a mostly-sequential read with much less time wasted on read head repositioning seeks.

In the odd case where the HDD throughput is greater than what single-threaded CPU processing can keep up with, for whatever reason (e.g. you're using a slow language and complicated processing logic?), you can use one optimized process to just read the raw data, and distribute the CPU processing to some other worker pool.
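
A rough sketch of that split, assuming the per-record work (not the I/O) is the bottleneck (file name, batch size and the crunch() work are made up):

    import multiprocessing as mp
    from itertools import islice

    def crunch(lines):
        # Placeholder for whatever per-record CPU work is actually expensive.
        return sum(len(line.split(",")) for line in lines)

    if __name__ == "__main__":
        with open("events.csv") as f, mp.Pool() as pool:
            # The main process does one mostly-sequential read and hands
            # batches of lines to the worker pool for the CPU-heavy part.
            batches = iter(lambda: list(islice(f, 100_000)), [])
            print(sum(pool.imap_unordered(crunch, batches)))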


> The slow part of using awk is waiting for the disk to spin over the magnetic head.

If we're talking about 6 TB of data:

- You can upgrade to 8 TB of storage on a 16-inch MacBook Pro for $2,200, and the lowest spec has 12 CPU cores. With up to 400 GB/s of memory bandwidth, it's truly a case of "your big data problem easily fits on my laptop".

- Contemporary motherboards have 4 to 5 M.2 slots, so you could today build a 12 TB RAID 5 setup of 4 TB Samsung 990 PRO NVMe drives for ~ 4 x $326 = $1,304. Probably in a year or two there will be 8 TB NVMe's readily available.

Flash memory is cheap in 2024!


You can go further.

There are relatively cheap adapter boards which let you stick 4 M.2 drives in a single PCIe x16 slot; you can usually configure a x16 slot to be bifurcated (quadfurcated) as 4 x (x4).

To pick a motherboard at quasi-random:

Tyan HX S8050. Two M.2 on the motherboard.

20 M.2 drives in quadfurcated adapter cards in the 5 PCIe x16 slots

And you can connect another 6 NVMe x4 devices to the MCIO ports.

You might also be able to hook up another 2 to the SFF-8643 connectors.

This gives you a grand total of 28-30 x4 NVME devices on one not particularly exotic motherboard, using most of the 128 regular PCIe lanes available from the CPU socket.


A high-end regular desktop is around 100 GB/s to DRAM, 750 GB/s to L3, 1.5 TB/s to L2, 4 TB/s to L1. 50 TB/s would require about 1000 channels of RAM.


.parquet files are completely underrated, many people still do not know about the format!

.parquet preserves data types (unlike CSV)

They are 10x smaller than CSV. So 600GB instead of 6TB.

They are 50x faster to read than CSV

They are an "open standard" from Apache Foundation

Of course, you can't peek inside them as easily as you can a CSV. But, the tradeoffs are worth it!

Please promote the use of .parquet files! Make .parquet files available for download everywhere .csv is available!
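
A tiny example of the round trip with pandas (file and column names are hypothetical; to_parquet needs pyarrow or fastparquet installed):

    import pandas as pd

    # Types (dates, ints, floats) survive the round trip, unlike with CSV.
    df = pd.read_csv("measurements.csv", parse_dates=["timestamp"])
    df.to_parquet("measurements.parquet")

    # Later: read back only the columns you need, without re-parsing text.
    subset = pd.read_parquet("measurements.parquet", columns=["timestamp", "value"])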


Parquet is underdesigned. Some parts of it do not scale well.

I believe that Parquet files have rather monolithic metadata at the end, and it has a 4 GB max size limit. Take 600 columns (it is realistic, believe me), and we are at slightly less than 7.2 million row groups. Give each row group 8K rows and we are limited to 60 billion rows total. It is not much.

The flatness of the file metadata requires external data structures to handle it more or less well. You cannot just mmap it and be good. This external data structure will most probably take as much memory as the file metadata, or even more. So, 4 GB+ of your RAM will be, well, used slightly inefficiently.

(block-run-mapped log structured merge tree in one file can be as compact as parquet file and allow for very efficient memory mapped operations without additional data structures)

Thus, while parqet is a step, I am not sure it is a step in definitely right direction. Some aspects of it are good, some are not that good.


Parquet is not a database, it's a storage format that allows efficient column reads so you can get just the data you need without having to parse and read the whole file.

Most tools can run queries across parquet files.

Like everything, it has its strengths and weaknesses, but in most cases, it has better trade-offs over CSV if you have more than a few thousand rows.


> Parquet is not a database.

This is not emphasized often enough. Parquet is useless for anything that requires writing back computed results as in data used by signal processing applications.


> 7.2 millions row groups

Why would you need 7.2 mil row groups?

Row group size when stored in HDFS is usually equal to the HDFS block size by default, which is 128MB

7.2 mil * 128MB ~ 1PB

You have a single parquet file 1PB in size?


Parquet is not HDFS. It is a static format, not a B-tree in disguise like HDFS.

You can have compressed Parquet columns with 8192 entries being a couple of tens of bytes in size. 600 columns in a row group is then 12K bytes or so, leading us to a 100 GB file, not a petabyte. Four orders of magnitude of difference between your assessment and mine.


Some critiques of Parquet by Andy Pavlo:

https://www.vldb.org/pvldb/vol17/p148-zeng.pdf


Thanks, very insightful.

"Dictionary Encoding is effective across data types (even for floating-point values) because most real-world data have low NDV ratios. Future formats should continue to apply the technique aggressively, as in Parquet."

So this is not a critique, but an assessment. And Parquet has some interesting design decisions I did not know about.

So, let me thank you again. ;)


What format would you recommend instead?


I do not know a good one.

A former colleague of mine is now working on a memory-mapped log-structured merge tree implementation, and it could be a good alternative. LSM provides elasticity (one can store as much data as one needs); it is static, thus it can be compressed as well as Parquet-stored data; and memory mapping and implicit indexing of the data do not require additional data structures.

Something like LevelDB and/or RocksDB can provide most of that, especially when used in covering index [1] mode.

[1] https://www.sqlite.org/queryplanner.html#_covering_indexes


Nobody is forcing you to use a single Parquet file.


Of course.

But nobody tells me that I can hit a hard limit and then I need a second Parquet file and should have some code for that.

The situation looks to me as if my "Favorite DB server" supported, say, only 1.9 billion records per table, and if I hit that limit I need a second instance of my "Favorite DB server" just for that unfortunate table. And it is not documented anywhere.


> They are 50x faster to read than CSV

I actually benchmarked this, and the DuckDB CSV reader is faster than the Parquet reader.


I would love to see the benchmarks. That is not my experience, except in the rare case of a linear read (in which CSV is much easier to parse).

CSV underperforms in almost every other domain, like joins, aggregations, filters. Parquet lets you do that lazily without reading the entire Parquet dataset into memory.


> That is not my experience, except in the rare case of a linear read (in which CSV is much easier to parse).

Yes, I think duckdb only reads CSV, then projects necessary data into internal format (which is probably more efficient than parquet, again based on my benchmarks), and does all ops (joins, aggregations) on that format.


Yes, it does that, assuming you read in the entire CSV, which works for CSVs that fit in memory.

With Parquet you almost never read in the entire dataset and it's fast on all the projections, joins, etc. while living on disk.


> which works for CSVs that fit in memory.

What? Why would the CSV be required to fit in memory in this case? I tested CSVs that are far larger than memory, and it works just fine.


The entire csv doesn't have to fit in memory, but the entire csv has to pass through memory at some point during the processing.

The parquet file has metadata that allows duckdb to only read the parts that are actually used, reducing total amount of data read from disk/network.


> The parquet file has metadata that allows duckdb to only read the parts that are actually used, reducing total amount of data read from disk/network.

This makes sense, and it's what I hoped to have. But in reality it looks like parsing CSV strings works faster than the bloated and overengineered Parquet format with its libs.


>But in reality looks like parsing CSV string works faster than bloated and overengineered parquet format with libs.

Anecdotally having worked with large CSVs and large on-disk Parquet datasets, my experience is the opposite of yours. My DuckDB queries operate directly on Parquet on disk and never load the entire dataset, and is always much faster than the equivalent operation on CSV files.

I think your experience might be due to -- what it sounds like -- parsing the entire CSV into memory first (CREATE TABLE) and then processing after. That is not an apples-to-apples comparison because we usually don't do this with Parquet -- there's no CREATE TABLE step. At most there's a CREATE VIEW, which is lazy.

I've seen your comments bashing Parquet in DuckDB multiple times, and I think you might be doing something wrong.


> I think your experience might be due to -- what it sounds like -- parsing the entire CSV into memory first (CREATE TABLE) and then processing after. That is not an apples-to-apples

The original discussion was about the CSV vs Parquet "reader" part, so this is exactly apples-to-apples testing, easy to benchmark, and I stand my ground. What you are doing downstream is another question, which is not possible to discuss because no code for your logic is available.

> I've seen your comments bashing Parquet in DuckDB multiple times, and I think you might be doing something wrong.

like running one command from DuckDB doc.

Also, I am not "bashing", I just state that CSV reader is faster.


For how many rows?


10B


Agreed. The abstractions on top of Parquet are still quite immature, though, and lots of software assumes that if you use Parquet, you also use Hive, Spark and the like.

Take Apache Iceberg, for example. It is essentially a specification for how to store Parquet files for efficient use and exploration of data, but the only implementation... depends on Apache Spark!


> They are 10x smaller than CSV. So 600GB instead of 6TB.

how? lossless compression? under what scenario?

vague headlines like this just beg more questions


Likely this assumes that parquet has internal compression applied, and CSV is uncompressed.


You kind of can peek into parquet files with a tiny command line utility: https://github.com/manojkarthick/pqrs


Why is .parquet better than protobuf?


Parquet is columnar storage, which is much faster for querying. And typically for protobuf you deserialize each row, which has a performance cost - you need to deserialize the whole message, and can't get just the field you want.

So, if you want to query a giant collection of protobufs, you end up reading and deserializing every record. For Parquet, you get much closer to only reading what you need.
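
A small illustration with pyarrow (file and column names are hypothetical):

    import pyarrow.parquet as pq

    # Only the pages for the listed columns are read and decoded; there is no
    # need to deserialize every full record, as there would be with protobuf rows.
    table = pq.read_table("events.parquet", columns=["user_id", "amount"])
    print(table.num_rows, table.column_names)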


Thank you.


Parquet ~= Dremel, for those who are up on their Google stack.

Dremel was pretty revolutionary when it came out in 2006 - you could run ad-hoc analyses in seconds that previously would've taken a couple days of coding & execution time. Parquet is awesome for the same reasons.


Please promote the use of .parquet files!

  apt-cache search parquet
  <nada>
Maybe later


Parquet is a file format, not a piece of software. 'apt install csv' doesn't make any sense either.


If you want to shine with snide remarks, you should at least understand the point being made:

    $ apt-cache search csv | wc -l
    225
    $ apt-cache search parquet | wc -l
    0


There is no support for parquet in Debian, by contrast

  apt-cache search csv | wc -l
  259


apt search would return tons of libparquet-java/c/python packages if it was popular.


It's more like "sudo pip install pandas" and then Pandas comes with Parquet support.


Pandas cannot read parquet files itself, it uses 3rd party "engines" for that purpose and those are not available in Debian


Ah yes, that's true though a typical Anaconda installation will have them automatically installed. "sudo pip install pyarrow" or "sudo pip install fastparquet" then.


Third consecutive time in 86 days that you mention .parquet files. I am out of my element here, but it's a bit weird


Sometimes when people discover or extensively use something they are eager to share in contexts they think are relevant. There is an issue when those contexts become too broad.

3 times across 3 months is hardly astroturfing for big parquet territory.


FWIW I am the same. I tend to recommend BigQuery and AWS/Athena in various posts. Many times paired with Parquet.

But it is because it makes a lot of things much simpler, and that a lot of people have not realized that. Tooling is moving fast in this space, it is not 2004 anymore.

His arguments are still valid and 86 days is a pretty long time.


I've downloaded many csv files that were mal-formatted (extra commas or tabs etc.), or had dates in non-standard formats. Parquet format probably would not have had these issues!


No need to be so suspicious when it's an open standard not even linked to a startup?


Blows my mind. I am a backend programmer and a semi-decent sysadmin and I would have immediately told you: "make a ZFS or BCacheFS pool with 20-30% redundancy bits and just go wild with CLI programs, I know dozens that work on CSV and XML, what's the problem?".

And I am not a specialized data scientist. But with time I am wondering if such a thing even exists... being a good backender / sysadmin and knowing a lot of CLI tools has always seemed to do the job for me just fine (though granted I never actually managed a data lake, so I am likely over-simplifying it).


To be fair on candidates, CLI programs create technical debt the moment they're written.

A good answer that strikes a balance between size of data, latency and frequency requirements is a candidate who is able to show that they can choose the right tool that the next person will be comfortable with.


True on the premise, yep, though I'm not sure how using CLI programs like LEGO blocks creates a tech debt?


I remember replacing a CLI program built like Lego blocks. It was 90-100 LEGO blocks, written over the course of decades, in Cobol, Fortran, C, Java, Bash, and Perl, and the Legos "connected" via environment variables. Nobody wanted to touch it lest they break it. Sometimes it's possible to do things too smartly. Apache Spark runs locally (and via CLI).


No no, I didn't mean that at all. I meant a script using well-known CLI programs.

Obviously organically grown Frankenstein programs are a huge liability, I think every reasonable techie agrees on that.


Well your little CLI-query is suddenly in production and then... it easily escalates.


I already said I never managed a data lake and simply got stuff done when it was needed, but if you need to criticize then by all means, go wild.


True but it's typically less debt than anything involving a gui, pricetag, or separate server.


Configuring debugged, optimized software, with a shell script is orders of magnitude cheaper than developing novel software.


> But with time I am wondering if such a thing even exists

Check out "data science at the command line":

https://jeroenjanssens.com/dsatcl/


> just go wild with CLI programs, I know dozens that work on CSV and XML

...or put it into SQLite for extra blazing fastness! No kidding.


That's included in CLI tools. Also duckdb and clickhouse-local are amazing.


clickhouse-local has been astonishingly fast for operating on many GB of local CSVs.

I had a heck of a time running the server locally before I discovered the CLI.


I need to learn more about the latter for some log processing...


Log files aren’t data. That’s your first problem. But that’s the only thing that most people have that generates more bytes than can fit on screen in a single spreadsheet.


Of course they are. They just aren't always structured nicely.


Everything is data if you are brave enough.


> make a ZFS or BCacheFS pool with 20-30% redundancy bits and just go wild with CLI programs

Lol. Data management is about safety, auditability, access control, knowledge sharing and a whole bunch of other stuff. I would've immediately shown you the door as someone whom I cannot trust with data.


> Lol. Data management is about safety, auditability, access control, knowledge sharing and a whole bunch of other stuff. I would've immediately shown you the door as someone whom I cannot trust with data.

No need to act smug and superior, especially since nothing about OP's plan here actually precludes having all the nice things you mentioned, or even having them inside $your_favorite_enterprise_environment.

You risk coming across as a person who feels threatened by simple solutions, perhaps someone who wants to spend $500k in vendor subscriptions every year for simple and/or imaginary problems... exactly the type of thing TFA talks about.

But I'll ask the question... why do you think safety, auditability, access control, and knowledge sharing are incompatible with CLI tools and a specific choice of file system? What's your preferred alternative? Are you sticking with that alternative regardless of how often the workload runs, how often it changes, and whether the data fits in memory or requires a cluster?


> No need to act smug and superior

I responded with the same tone that the GP responded with: "blows my mind" (that people can be so stupid).


Another comment mentions this classic meme:

> Consulting service: you bring your big data problems to me, I say "your data set fits in RAM", you pay me $10,000 for saving you $500,000.

A lot of industry work really does fall into this category, and it's not controversial to say that going the wrong way on this thing is mind-blowing. More than not being controversial, it's not confrontational, because his comment was essentially re: the industry, whereas your comment is directed at a person.

Drive by sniping where it's obvious you don't even care to debate the tech itself might get you a few "sick burn, bro" back-slaps from certain crowds, or the FUD approach might get traction with some in management, but overall it's not worth it. You don't sound smart or even professional, just nervous and afraid of every approach that you're not already intimately familiar with.


I repurposed the parent comment:

"not understanding the scale of 'real' big data was a no-go in my eyes when hiring", "real winner", etc.

But yeah, you are right. I shouldn't have directed it at the commenter. I was miffed at interviewers who use "tricky questions" and expect people to read their minds and come up with their preconceived solution.


The classic putting words in people's mouths technique it is then. The good old straw man.

If you really must know: I said "blows my mind [that people don't try simpler and proven solutions FIRST]".

I don't know what you have to gain by coming here and pretending to be in my head. Now here's another thing that blows my mind.


> that people don't try simpler and proven solutions FIRST

Well, why don't people do that, according to you?

It's not "mind-blowing" to me because you can never guess what angle the interviewer is coming from. Especially when they use words like "data stack".


> you can never guess what angle the interviewer is coming from

Why would you guess in that situation though?

It's an interview; there's at least 1 person talking to you — you should talk to them, ask them questions, share your thoughts. If talking to them is a red flag, then chances are high that you wouldn't want to work there anyway.


I don't know why, and that is why I said it's mind-blowing. To me, trying stuff that can work on most laptops naturally comes to mind as the first viable solution.

As for interviews, sure, they have all sorts of traps. It really depends on the format and the role. Since I already disclaimed that I am not an actual data scientist, just a seasoned dev who can make some magic happen without a dedicated data team (if/when the need arises), I wouldn't even be in a data scientist interview in the first place. ¯\_(ツ)_/¯


That's fair. My comment wasn't directed at you. I was trying to be smart and write an inverse of the original comment, where I as an interviewer was looking for a proper "data stack" and the interviewee responded with a bespoke solution.

"not understanding the scale of "real" big data was a no-go in my eyes when hiring."


Sure, okay, I get it. My point was more like "Have you tried this obvious thing first that a lot of devs can do for you without too much hassle?". If I were to try for a dedicated data scientist position then I'd have done homework.


Abstractly, "safety, auditability, access control, knowledge sharing" are about people reading and writing files: simplifying away complicated management systems improves security. The operating system should be good enough.


What about his answer prevents any of that? As stated the question didn't require any of what you outline here. ZFS will probably do a better job of protecting your data than almost any other filesystem out there so it's not a bad foundation to start with if you want to protect data.

Your entire post reeks of "I'm smarter than you" smugness while at the same time revealing no useful information or approaches. Near as I can tell no one should trust you with any data.


> Your entire post reeks of "I'm smarter than you"

unlike "blows my mind" ?

> As stated the question didn't require any of what you outline here.

Right. The OP mentioned it was a "tricky question". What makes it tricky is that all those attributes are implicitly assumed. I wouldn't interview at Google and tell them my "stack" is "load it on your laptop". I would never say that in an interview even if I think that's the right "stack".


"blows my mind" is similar in tone yes. But I wasn't replying to the OP. Further the OP actually goes into some detail about how he would approach the problem. You do not.

You are assuming you know what the OP meant by tricky question. And your assumption contradicts the rest of the OP's post regarding what he considered good answers to the question and why.


Honest question: was "blows my mind" so offensive? I thought it was quite obvious I meant "it blows my mind that people don't try the simpler stuff first, especially keeping in mind that it works for a much bigger share of cases than cloud providers would have you believe"?

I guess it wasn't, but even so, it's legitimately baffling how people manage to project so much negativity onto three words that are a slightly tongue-in-cheek casual comment on the state of affairs in an area whose value is not always clear (in my observation, a dedicated data team only starts to pay off once you have 20+ data sources; I've been in teams of only 3-4 devs and we still managed to have 15-ish data dashboards for the executives without too much cursing).

An anecdote, surely, but what isn't?


I generally don't find that sort of thing offensive when combined with useful alternative approaches like your post provided. However the phrase does come with a connotation that you are surprised by a lack of knowledge or skill in others. That can be taken as smug or elitist by someone in the wrong frame of mind.


Thank you, that's helpful.


I already qualified my statement quite well by stating my background but if it makes you feel better then sure, show me the door. :)

I was never a data scientist, just a guy who helped whenever it was necessary.


> I already qualified my statement quite well by stating my background

No. You qualified it with "blows my mind". Why would it 'blow your mind' if you don't have any data background?


He didn't say he didn't have any data background. He's clearly worked with data on several occasions as needed.


Are you trolling? Did you miss the part where I said I worked with data but wouldn't say I'm a professional data scientist?

This negative cherry picking does not do your image any favors.


Edit: for above comment.

My comment wasn't directed at the parent. I was trying to be clever and write an inverse of the original comment: the opposite scenario, where I as the interviewer was looking for a proper 'data stack' and the interviewee responded with a bespoke solution.

"not understanding the scale of "real" big data was a no-go in my eyes when hiring."

I was trying to point out that you can never know where the interviewer is coming from. Unless I know the interviewer personally, I would bias towards playing it safe and go with the 'enterprisey stack'.


this is how you know when someone takes themself too seriously

buddy, you're just rolling off buzzwords and lording it over other people


Buddy, you suffer from NIH syndrome and are upset that no one wants your 'hacks'.


I think I've written about it here before, but I imported ≈1 TB of logs into DuckDB (which compressed it enough to fit in my laptop's RAM) and was done with my analysis before the data science team had even ingested everything into their Spark cluster.

(On the other hand, I wouldn't really want the average business analyst walking around with all our customer data on their laptops all the time. And by the time you have a proper ACL system with audit logs and some nice way to share analyses that updates in real time as new data is ingested, the Big Data Solution™ probably has a lower TCO...)
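For context, a rough sketch of that kind of one-laptop DuckDB workflow, assuming hypothetical gzipped CSV log files under logs/ and made-up column names (an illustration, not the actual analysis):

    import duckdb

    con = duckdb.connect()  # in-memory database
    # DuckDB reads the compressed CSVs directly and stores them in its compressed columnar format.
    con.sql("CREATE TABLE logs AS SELECT * FROM read_csv_auto('logs/*.csv.gz')")

    # Typical ad-hoc aggregation once everything is loaded.
    con.sql("""
        SELECT status, count(*) AS hits
        FROM logs
        GROUP BY status
        ORDER BY hits DESC
    """).show()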


> And by the time you have ... the Big Data Solution™ probably has a lower TCO...

I doubt it. The common Big Data Solutions manage to have a very high TCO, where the least relevant share is spent on hardware and software. Most of their cost comes from reliability engineering and UI issues (because managing that "proper ACL" that doesn't fit your business is a hell of a problem that nobody will get right).


> ...managing that "proper ACL" that doesn't fit your business is a hell of a problem that nobody will get right...

I'm not sure there is a way to get this right unless there is a programmatic integration into the org chart, and the ability to describe and parse, in a declarative language, the organizational rules of who has access to what, when, under what auth, etc. Otherwise it has been, for me, an exercise in watching massive amounts of toil spent manually interpreting between the source of truth of the org chart and all the other applications, mediated by many manual approval policies and procedures. And at every client I've posed this to, I've always been denied that programmatic access for integration.

A lot of sites try to avoid this by designing ACLs around certain activity or data domains, because those are more stable than organizations, but this breaks down when you get to the fine-grained levels of the ACLs, so we get capped benefits from this approach.

I'd love to hear how others solve this in large (10K+ staff) organizations that frequently change around teams.


You probably didn't do joins on your dataset, for example, because DuckDB OOMs on joins that don't fit in memory.


The funny thing is that is exactly the place I want to work at. I've only found one company so far, and the owner sold during the pandemic. So far my experience is that the number of companies/people that want what you describe is incredibly low.

I wrote a comment on here the other day that some place I was trying to do work for was spending $11k USD a month on a BigQuery DB that had 375MB of source data. My advice was basically that you need to hire a data scientist who knows what they are doing. They were not interested and would rather just band-aid the situation with a "cheap" employee, despite the fact that their GCP bill could pay for a skilled employee.

As I've seen it for the last year job hunting most places don't want good people. They want replaceable people.


Problem is possibly that most people with that sort of hands-on intuition for data don't see themselves as data scientists and wouldn't apply for such a position.

It's a specialist role, and most people with the skills you seek are generalists.


Yeah it’s not really what you should be hiring a data scientist to do. I’m of the opinion that if you don’t have a data engineer, you probably don’t need a data scientist. And not knowing who you need for a job causes a lot of confusion in interviews.


> requirements of "6 TiB of data"

How could anyone answer this without knowing how the data is to be used (query patterns, concurrent readers, writes/updates, latency, etc)?

Awk may be right for some scenarios, but without specifics it can't be a correct answer.


Those are very appropriate follow up questions I think. If someone tasks you to deal with 6 TiB of data, it is very appropriate to ask enough questions until you can provide a good solution, far better than to assume the questions are unknowable and blindly architect for all use cases.


> The winner of course was the guy who understood that 6TiB is what 6 of us in the room could store on our smart phones, or a $199 enterprise HDD (or three of them for redundancy), and it could be loaded (multiple times) to memory as CSV and simply run awk scripts on it.

If it's not a very write heavy workload but you'd still want to be able to look things up, wouldn't something like SQLite be a good choice, up to 281 TB: https://www.sqlite.org/limits.html

It even has basic JSON support, if you're up against some freeform JSON and not all of your data neatly fits into a schema: https://sqlite.org/json1.html

A step up from that would be PostgreSQL running in a container: giving you support for all sorts of workloads, more advanced extensions for pretty much anything you might ever want to do, from geospatial data with PostGIS, to something like pgvector, timescaledb, etc., while still having a plethora of drivers, still not making you drown in complexity, and having no issues with a few dozen/hundred TB of data.

Either of those would be something that most people on the market know, neither will make anyone want to pull their hair out and they'll give you the benefit of both quick data writes/retrieval, as well as querying. Not that everything needs or can even work with a relational database, but it's still an okay tool to reach for past trivial file storage needs. Plus, you have to build a bit less of whatever functionality you might need around the data you store, in addition to there even being nice options for transparent compression.
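To make the SQLite suggestion concrete, here is a minimal sketch; the table and field names are made up, and it assumes a SQLite build that ships the json1 functions (modern builds do):

    import json
    import sqlite3

    con = sqlite3.connect("events.db")
    con.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, payload TEXT)")
    con.execute(
        "INSERT INTO events (payload) VALUES (?)",
        (json.dumps({"user": "alice", "action": "login"}),),
    )
    con.commit()

    # json_extract lets you query inside the freeform JSON without a fixed schema.
    for user, count in con.execute(
        "SELECT json_extract(payload, '$.user') AS user, count(*) "
        "FROM events GROUP BY user"
    ):
        print(user, count)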


Now you have to consider the cost it takes for your whole team to learn how to use AWK instead of SQL. Then you do these TCO calculations and revert back to the BigQuery solution.


Not necessarily. I always try to write to disk first, usually in a rotating compressed format if possible. Then, based on something like a queue, cron, or inotify, other tasks occur, such as processing and database logging. You still end up at the same place, and this approach works really well with tools like jq when the raw data is in jsonl format.

The only time this becomes an issue is when the data needs to be processed as close to real-time as possible. In those instances, I still tend to log the raw data to disk in another thread.
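A bare-bones sketch of that pattern, with hypothetical paths and a placeholder processing step; rotation/compression and the cron/queue/inotify trigger are assumed to happen outside this snippet:

    import gzip
    import json

    def append_raw(record: dict, path: str = "raw/events.jsonl") -> None:
        # One JSON object per line: cheap to write, easy to inspect with jq later.
        with open(path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    def process_batch(path: str) -> None:
        # Run later by cron/inotify/a queue worker on the rotated, compressed file.
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                ...  # insert into a database, update aggregates, etc.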


For someone who is comfortable with SQL, we are talking minutes to hours to figure out awk well enough to see how it's used, or to use it.


I have been using SQL for decades, and I am not comfortable with awk, nor do I intend to become so. There are better tools.


It is not only about whether people can figure out awk. It is also about how supportable the solution is. SQL provides many features specifically to support complex querying and is much more accessible to most people - you can't reasonably expect your business analysts to do complex analysis using awk.

Not only that, it provides a useful separation from the storage format so you can use it to query a flat file exposed as table using Apache Drill or a file on s3 exposed by Athena or data in an actual table stored in a database and so on. The flexibility is terrific.


With the exception of regexes - which any programmer or data analyst ought to develop some familiarity with anyway - you can describe the entirety of AWK on a few sheets of paper. It's a versatile, performant, and enduring data-handling tool that is already installed on all your servers. You would be hard-pressed to find a better investment in technical training.


No, if you want SQL, you install PostgreSQL on the single machine.

Why would you use BigQuery just to get SQL?


sqlite cli


About $20/month for ChatGPT or a similar copilot, which really they should reach for independently anyhow.


And since the data scientist cannot verify the very complex AWK output that should be 100% compatible with his SQL query, he relies on the GPT output for business-critical analysis.


Only if your testing frameworks are inadequate. But I believe you could be missing or mistaken on how code generation successfully integrates into a developer's and data scientist's workflow.

Why not take a few days to get familiar with AWK, a skill which will last a lifetime? Like SQL, it really isn't so bad.


It is easier to write complex queries in SQL instead of AWK. I know both AWK and SQL, and I find SQL much easier for complex data analysis, including JOINS, subqueries, window functions, etc. Of course, your mileage may vary, but I think most data scientists will be much more comfortable with SQL.


Many people have noted how when using LLMs for things like this, the person’s ultimate knowledge of the topic is less than it would’ve otherwise been.

This effect then forces the person to be reliant on the LLM for answering all questions, and they’ll be less capable of figuring out more complex issues in the topic.

$20/mth is a siren’s call to introduce such a dependency to critical systems.


Even if a 6 terabyte CSV file does fit in RAM, the only thing you should do with it is convert it to another format (even if that's just the in-memory representation of some program). CSV stops working well at billions of records. There is no way to find an arbitrary record because records are lines and lines are not fixed-size. You can sort it one way and use binary search to find something in it in semi-reasonable time but re-sorting it a different way will take hours. You also can't insert into it while preserving the sort without rewriting half the file on average. You don't need Hadoop for 6 TB but, assuming this is live data that changes and needs regular analysis, you do need something that actually works at that size.
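As an illustration of that "convert it to another format" point, a one-off conversion to a columnar format can be a short script. This sketch uses DuckDB (mentioned elsewhere in the thread) and a hypothetical file name; it streams the data rather than loading the whole CSV at once:

    import duckdb

    # Streams the CSV through DuckDB and writes a compressed Parquet file,
    # which then supports fast columnar scans instead of line-by-line reads.
    duckdb.sql("""
        COPY (SELECT * FROM read_csv_auto('big.csv'))
        TO 'big.parquet' (FORMAT PARQUET)
    """)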


Yeah, but it may very well not be data that needs random access or live insertion. A lot of data is basically just one big chunk of time-series that just needs some number crunching run over the whole lot every half a year.


It's astonishing how shit the cloud is compared to boring-ass pedestrian technology.

For example, just logging stuff into a large text file is so much easier, more performant, and more searchable than using AWS CloudWatch, which was presumably written by some of the smartest programmers who ever lived.

On another note, I was once asked to create a big-data-ish object DB, and I, knowing nothing about the domain, did a bit of benchmarking and decided to just use zstd-compressed JSON streams with a separate index in an SQL table. I'm sure any professional would recoil at it in horror, but it could do literally gigabytes/sec of retrieval or deserialization on consumer-grade hardware.
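Something in the spirit of that design could look like the sketch below (not the commenter's actual code): zstd-compressed JSON blobs appended to a flat file, with byte offsets kept in SQLite. It assumes the third-party zstandard package, and all names and the schema are made up.

    import json
    import sqlite3
    import zstandard as zstd

    idx = sqlite3.connect("index.db")
    idx.execute("CREATE TABLE IF NOT EXISTS objects (key TEXT PRIMARY KEY, offset INTEGER, length INTEGER)")
    cctx, dctx = zstd.ZstdCompressor(), zstd.ZstdDecompressor()

    def put(key: str, obj: dict, path: str = "objects.zst") -> None:
        blob = cctx.compress(json.dumps(obj).encode())
        with open(path, "ab") as f:
            offset = f.tell()   # remember where this blob starts
            f.write(blob)
        idx.execute("INSERT OR REPLACE INTO objects VALUES (?, ?, ?)", (key, offset, len(blob)))
        idx.commit()

    def get(key: str, path: str = "objects.zst") -> dict:
        offset, length = idx.execute(
            "SELECT offset, length FROM objects WHERE key = ?", (key,)).fetchone()
        with open(path, "rb") as f:
            f.seek(offset)      # jump straight to the blob via the SQL index
            return json.loads(dctx.decompress(f.read(length)))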


Wait, how would you split 6 TiB across 6 phones, how would you handle the queries? How long will the data live, do you need to handle schema changes, and how? And what is the cost of a machine with 15 or 20 TiB of RAM (you said it fits in memory multiple times, right?) - isn’t the drive cost irrelevant here? How many requests per second did you specify? Isn’t that possibly way more important than data size? Awk on 6 TiB, even in memory, isn’t very fast. You might need some indexing, which suddenly pushes your memory requirement above 6 TiB, no? Do you need migrations or backups or redundancy? Those could increase your data size by multiples. I’d expect a question that specified a small data size to be asking me to estimate the real data size, which could easily be 100 TiB or more.


How would six terabytes fit into memory?

It seems like it would get a lot of swap thrashing if you had multiple processes operating on disorganized data.

I'm not really a data scientist and I've never worked on data that size so I'm probably wrong.


>How would six terabytes fit into memory?

What device do you have in mind? I've seen places use 2TB RAM servers, and that was years ago, and it isn't even that expensive (can get those for about $5K or so).

Currently HP allows "up to 48 DIMM slots which support up to 6 TB for 2933 MT/s DDR4 HPE SmartMemory".

Close enough to fit the OS, the userland, and 6 TiB of data with some light compression.

>It seems like it would get a lot of swap thrashing if you had multiple processes operating on disorganized data.

Why would you have "disorganized data"? Or "multiple processes" for that matter? The OP mentions processing the data with something as simple as awk scripts.


“How would six terabytes fit into memory?”

A better question would be:

Why would anyone stream 6 terabytes of data over the internet?

In 2010 the answer was: because we can’t fit that much data in a single computer, and we can’t get accounting or security to approve a $10k purchase order to build a local cluster, so we need to pay Amazon the same amount every month to give our ever expanding DevOps team something to do with all their billable hours.

That may not be the case anymore, but our devops team is bigger than ever, and they still need something to do with their time.


Well yeah, streaming to the cloud to work around budget issues is a whole 'nother convo haha.


I'm having flashbacks to some new outside-hire CEO making flim-flam about capex-vs-opex in order to justify sending business towards a contracting firm they happened to know.


Straight to jail


More likely straight to bonus and eventual golden parachute!


I mean, if you're doing data science, the data is not always organized, and of course you would want multi-processing.

1 TB of memory is like 5 grand from a quick Google search, and then you probably need specialized motherboards.


>I mean if you're doing data science the data is not always organized and of course you would want multi-processing

Not necessarily - I might not want it or need it. It's a few TB, it can be on a fast HD, on an even faster SSD, or even in memory. I can crunch them quite fast even with basic linear scripts/tools.

And organized could just mean some massaging or just having them in csv format.

This is already the same rushed notion about "needing this" and "must have that" that the OP describes people jumping to, which leads them to suggest huge setups, distributed processing, and multi-machine infrastructure for use cases and data sizes that could fit on a single server with redundancy and be done with it.

DHH has often written about this for their Basecamp needs (scaling vertically where others scale horizontally, which has worked for them for most of their operation), and there's also this classic post: https://adamdrake.com/command-line-tools-can-be-235x-faster-...

>1 TB of memory is like 5 grand from a quick Google search then you probably need specialized motherboards.

Not that specialized; I've worked with server deployments (HP) with 1, 1.5, and 2TB of RAM (and > 100 cores), and it's trivial to get.

And 5 or even 30 grand would still be cheaper (and more effective and simpler) than the "big data" setups some of those candidates have in mind.


Yeah I agree about over engineering.

I'm just trying to understand the parent to my original comment.

How would running awk for analysis on 6TB of data work quickly and efficiently?

They say it would go into memory, but it's not clear to me how that would work, as you would still have paging and thrashing issues if the data didn't have often-used sections.

Am I overthinking it, and were they just referring to buying a big-ass RAM machine?


>How would running awk for analysis on 6TB of data work quickly and efficiently?

In that 6TB is not that huge of an amount.

That's their total dataset, and there's no "real time" requirement.

They can start a batch process, process the data, and be done with it.

Here's an example of someone using awk (read further down for the relevant section):

https://livefreeordichotomize.com/posts/2019-06-04-using-awk...

"I was now able to process a whole 5 terabyte batch in just a few hours."

>They say it would go into memory but its not clear to me how that would work as would still have paging and thrashing issues if the data didnt have often used sections of the data

There's no need to have paging and thrashing issues if you can fit all (or even most) of your data in memory. And you can always also split, process, and aggregate partial results.

>am I overthinking it and they were they just referring to buying a big ass Ram machine?

Yeah, they said one can buy a machine with several TB of memory.


There are machines that can fit that and more: https://yourdatafitsinram.net/

I'm not advocating that this is generally a good or bad idea, or even economical, but it's possible.


I'm trying to understand what the person I'm replying to had in mind when they said to fit six terabytes in memory and search it with awk.

Is this what they were referring to - just a big-ass RAM machine?


It would easily fit in RAM: https://yourdatafitsinram.net/


6 TB does not fit in memory. However, with a good storage engine and fast storage this easily fits within the parameters of workloads that have memory-like performance. The main caveat is that if you are letting the kernel swap that for you then you are going to have a bad day, it needs to be done in user space to get that performance which constrains your choices.


Per one of the links below, the IBM Power System E980 can be configured with up to 64TB of RAM.


I agree that keeping data local is great and should be the first option when possible. It works great on 10GB or even 100GB, but after that it starts to matter what you optimize for, because you start seeing execution bottlenecks.

To mitigate these bottlenecks you get fancy hardware (e.g. an Oracle appliance) or you scale out (and get TCO/performance gains from separating storage and compute - which is how Snowflake sold 3x cheaper compared to appliances when they came out).

I believe that Trino on HDFS would be able to finish faster than awk on 6 enterprise disks for 6TB data.

In conclusion I would say that we should keep data local if possible but 6TB is getting into the realm where Big Data tech starts to be useful if you do it a lot.


I wouldn't underestimate how much a modern machine with a bunch of RAM and SSDs can do vs HDFS. This post[1] is now 10 years old and has find + awk running an analysis in 12 seconds (at speed roughly equal to his hard drive) vs Hadoop taking 26 minutes. I've had similar experiences with much bigger datasets at work (think years of per-second manufacturing data across 10ks of sensors).

I get that that post is only on 3.5GB, but, consumer SSDs are now much faster at 7.5GB/s vs 270MB/s HDD back when the article was written. Even with only mildly optimised solutions, people are churning through the 1 billion rows (±12GB) challenge in seconds as well. And, if you have the data in memory (not impossible) your bottlenecks won't even be reading speed.

[1]: https://adamdrake.com/command-line-tools-can-be-235x-faster-...


> I agree that keeping data local is great and should be the first option when possible. It works great on 10GB or even 100GB, but after that starts to matter what you optimize for because you start seeing execution bottlenecks.

The point of the article is that 99.99% of businesses never pass even the 10 GB point, though.


I agree with the theme of the article. My reply was to parent comment which has a 6 TB working set.


I'm on some reddit tech forums and people will say "I need help storing a huge amount of data!" and people start offering replies for servers that store petabytes.

My question is always "How much data do you actually have?" Many times they reply with 500GB or 2TB. I tell them that isn't much data when you can get a 1TB micro SD card the size of a fingernail or a 24TB hard drive.

My feeling is that if you really need to store petabytes of data that you aren't going to ask how to do it on reddit. If you need to store petabytes you will have an IT team and substantial budget and vendors that can figure it out.


This feels representative of so many of our problems in tech, overengineering, over-"producting," over-proprietary-ing, etc.

Deep centralization at the expense of simplicity and true redundancy; like renting a laser cutter when you need a boxcutter, a pair of scissors, and the occasional toenail clipper.


I ask a similar question on screens. Almost no one gives a good answer. They describe elaborate architectures for data that fits in memory, handily.


I think that’s the way we were taught in college / grad school. If the premise of the class is relational databases, the professor says, for the purpose of this course, assume the data does not fit in memory. Additionally, assume that some normalization is necessary and a hard requirement.

The problem is that most students don't listen to the first part, "for the purpose of this course". The professor does not elaborate because that is beyond the scope of the course.


FWIW, if they were juniors, I would've continued the interview, directed them with further questions, and observed their flow of thinking to decide if they were good candidates to pursue further.

But no, this particular person had been working professionally for decades (in fact, he was much older than me).


Yeah. I don’t even bother asking juniors this. At that level I expect that training will be part of the job, so it’s not a useful screener.


I took a Hadoop class. We learned Hadoop and were told by the instructor we probably wouldn't need it, and we learned some other Java processing techniques (streams etc.).


People can always find excuses to boot candidates.

I would just back-track from a shipped product date, and try to guess who we needed to get there... given the scope of requirements.

Generally, process people from a commercially "institutionalized" role are useless for solving unknown challenges. They will leave something like an SAP, C#, or MatLab steaming pile right in the middle of the IT ecosystem.

One could check out Aerospike rather than try to write their own version (the dynamic scaling capabilities are very economical once set up right).

Best of luck, =3


It's really hard because I've failed interviews by pitching "ok we start with postgres, and when that starts to fall over we throw more hardware at it, then when that fails we throw read replicas in, then we IPO, then we can spend all our money and time doing distributed system stuff".

Whereas the "right answer" (I had a man on the inside) was to describe some wild, tall-and-wide, event-based distributed system, for some nominal request volume that was nowhere near the limits of Postgres. And they didn't even care if you solved the actual hard distributed-system problems that would arise, like distributed transactions etc.

Anyway, I said I failed the interview, but really they failed my filter, because if they want me to ignore pragmatism and blindly regurgitate a YouTube video on "system design" FAANG interview prep, then I don't want to work there anyway.


As a point of reference, I routinely do fast-twitch analytics on tens of TB on a single, fractional VM. Getting the data in is essentially wire speed. You won't do that on Spark or similar but in the analytics world people consistently underestimate what their hardware is capable of by something like two orders of magnitude.

That said, most open source tools have terrible performance and efficiency on large, fast hardware. This contributes to the intuition that you need to throw hardware at the problem even for relatively small problems.

In 2024, "big data" doesn't really start until you are in the petabyte range.


> "most open source tools have terrible performance and efficiency on large, fast hardware."

What do you use?


If you were hiring me for a data engineering role and asked me how to store and query 6 TiB, I'd say you don't need my skills, you've probably got a Postgres person already.


The problem with your question is that they are there to show off their knowledge. I failed a tech interview once; the question was to build a web page/back end/DB that allows people to order, let's say, widgets, and that will scale huge. I went the simpleton-answer route: all you need is Rails, a Redis cache, and an AWS-provisioned relational DB; solve the big problems later if you get there, sort of thing. Turns out they wanted to hear all about microservices and sharding.


It depends on what you want to do with the data. It can be easier to just stick nicely-compressed columnar Parquets in S3 (and run arbitrarily complex SQL on them using Athena or Presto) than to try to achieve the same with shell-scripting on CSVs.
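For example, kicking off an Athena query over Parquet files in S3 is only a few lines with boto3; the query, database, and bucket names here are placeholders, and the table is assumed to already be defined (e.g. in the Glue catalog):

    import boto3

    athena = boto3.client("athena")

    # Start an asynchronous query against the Parquet-backed table.
    resp = athena.start_query_execution(
        QueryString="SELECT status, count(*) AS hits FROM logs GROUP BY status",
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    print(resp["QueryExecutionId"])  # poll get_query_execution / get_query_results with this id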


How exactly is this solution easier than putting the very same Parquet files on a classic filesystem? Why does the easy solution require an Amazon subscription?


This is adjacent to "why would I need EC2 when I can serve from my laptop?"

In terms of maturity of solution and diversity of downstream applications you'll go much further with BigQuery/Athena (at comically low cost) for this amount of data than some cobbled together "local" solution.

I thoroughly agree with the author but the comments in this thread are an indication of people who haven't actually had to do meaningful ongoing work with modest amounts of data if they're suggesting just storing it as plain text on personal devices.

I'm not advocating for complicated or expensive solutions here; BigQuery and Athena are very low complexity compared to any of the Hadoop et al. tooling (yes, Athena is Trino, so it is in the family, but it is managed and dirt cheap).


You have 6 TiB of ram?


If my business depended on it? I can click a few buttons and have an 8TiB Supermicro server on my doorstep in a few days if I wanted to colo that. EC2 High Memory instances offer 3, 6, 9, 12, 18, and 24 TiB of memory in an instance if that's the kind of service you want. Azure Mv2 also does 2850 - 11400GiB.

So yes, if need be, I have 6 TiB of RAM.


You can have 8TB of RAM in a 2U box for under 100K. Grab a couple and it will save you millions a year compared to an over-engineered big data setup.


BigQuery and Snowflake are software. They come with a SQL engine, data governance, integration with your LDAP, and auditing. Loading data into Snowflake isn't over-engineering. What you described is over-engineering.

No business is passing 6tb data around on their laptops.


So is ClickHouse, your point being? Please point out what a server being able to have 8TB of RAM has to do with laptops.


I wonder how much this costs: https://www.ibm.com/products/power-e1080

And how that price would compare to the equivalent big data solution in the cloud.


1U box too.





I personally don't, but our computer cluster at work has around 50,000 CPU cores. I can request specific configurations through LSF, and there are at least 100 machines with over 4TB RAM, and that was 3 years ago. By now there are probably machines with more than that. Those machines are usually reserved for specific tasks that I don't do, but if I really needed one I could get approval.


You don’t need that much ram to use mmap(2)


To be fair, mmap doesn't put your data in RAM, it presents it as though it was in RAM and has the OS deal with whether or not it actually is.


Right, which is why you can mmap way more data than you have ram, and treat it as though you do have that much ram.

It’ll be slower, perhaps by a lot, but most “big data” stuff is already so god damned slow that mmap probably still beats it, while being immeasurably simpler and cheaper.
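A tiny illustration of the idea using Python's wrapper around mmap(2); the file name is hypothetical, and the point is just that the file never has to fit in RAM:

    import mmap

    with open("huge.csv", "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # The kernel pages data in and out on demand; we only hold a view.
            lines, pos = 0, mm.find(b"\n")
            while pos != -1:
                lines += 1
                pos = mm.find(b"\n", pos + 1)
            print(lines)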


Really depends on the shape of the data. mmap can be suboptimal in many cases.

For CSV it flat out doesn't matter what you do since the format is so inefficient and needs to be read start to finish, but something like parquet probably benefits from explicit read syscalls, since it's block based and highly structured, where you can predict the read patterns much better than the kernel can.


We are decomming our 5-year-old 4TB systems this year, and they could have been ordered with more.


The "(multiple times)" part probably means batching or streaming.

But yeah, they might have that much RAM. At a rather small company I was at we had a third of it in the virtualisation cluster. I routinely put customer databases in the hundreds of gigabytes into RAM to do bug triage and fixing.


Indeed, what I meant to say is that you can load it in multiple batches. However, now that I think about it, I did play around with servers with TiBs of memory :-)


If you look at the article, the data size is more commonly 10GB, which matches my experience. For these sizes, simple tools are definitely enough.


What kind of business just has a static set of 6TiB of data that people are loading on their laptops?

You tricked candidates with your nonsensical scenario. Hate smartass interviewers like this that are trying some gotcha to feel smug about themselves.

Most candidates don't feel comfortable telling people 'just load it on your laptops' even if they think that's sensible. They want to present a 'professional solution', especially when you tricked them with the word 'stack', which is how most of them probably perceived your trick question.

This comment is so infuriating to me. Why be assholes to each other when world is already full of them.


Well put. Whoever asked this question is undoubtedly a nightmare to work with. Your data is the engine that drives your business and its margin improvements, so why hamstring yourself with a 'clever' cost saving but ultimately unwieldy solution that makes it harder to draw insight (or build models/pipelines) from?

Penny wise and pound foolish, plus a dash of NIH syndrome. When you're the only company doing something a particular way (and you're not Amazon-scale), you're probably not as clever as you think.


I disagree with your take. Your surly rejoinder aside, the parent commenter identifies an area where senior-level knowledge and process can appropriately assess a problem. Not every job interview is a satisfying checklist of prior experience or training; rather, it is about assessing how well that skillset will fit the needed domain.

In my view, it's an appropriate question.


What did you gather as the 'needed domain' from that comment? The 'needed domain' is often implicit; it's not a blank slate. Candidates assume all sorts of 'needed domain' even before the interview starts. If I am interviewing at a bank, I wouldn't suggest 'load it on your laptops' as my 'stack'.

The OP even mentioned that it is his favorite 'tricky question'. It would definitely trick me, because they used the word 'stack', which has a specific meaning in the industry. There are even websites dedicated to 'stacks': https://stackshare.io/instacart/instacart


> What kind of business just has a static set of 6TiB data that people are loading on their laptops.

Most businesses have static sets of data that people load on their PCs. (Why do you assume laptops?)

The only weird part of that question is that 6TiB is so big it's not realistic.


Maybe you don't realize, but a well-done interview isn't a test, it's a conversation. You can absolutely ask clarifying questions to help shape your answer (this is something I always remind people of in interviews, personally, before and after giving the question).

You don't get extra credit points for mind-reading, if anything you get more esteem for requirements-gathering, which would lead you towards either a professional solution or a laptop solution: whichever fits the business needs.

It might be a totally unreasonable question if it's provided context-free as a form on a screen, but it is a perfectly reasonable conversation-starter in an interview.


Big data companies or those that work with lots of data.

The largest dataset I worked with was about 60TB

While that didn't fit in RAM, most people would just load the sample data into the cluster, even when I told them it would be faster to load 5% locally and work off that.


I am a big fan of these simplistic solutions. In my own area, it was incredibly frustrating, as what we needed was a database with a smaller subset of the most recent information from our main long-term storage database, for back-end users to do important one-off analysis with. This should've been fairly cheap, but of course the IT director/architect guy wanted to pad his resume and turn it all into a multi-million-dollar project with 100 bells and whistles that nobody wanted.


Reminds me an old story from Steve Yegge:

I gave him the "find the phone numbers in 50,000 html files" question, and he decided to write a huge program with an ad-hoc state machine. When I asked how long it would take to write the program, he said he'd have to hit N files, with M lines per file, so... I interrupted him: no, WRITE. How long to WRITE the program? Oh. He estimated it at 5 days of work. At this point I was 50% ready to throw him out.


There'd still have to be some further questions, right? I guess if you store it on the interview group's cellphones, you'll have to plan what to do if somebody leaves or the interview room is hit by a meteor; if you plan to store it in RAM on a server, you'll need some plan for power outages.


That makes total sense if you're archiving the data, but what happens when you want 10,000 people to have access to read/update the data concurrently? Then you start to need some fairly complex solutions.


This thread blew up a lot, and some unfriendly commenters made many assumptions about this innocent story.

You didn't, and indeed you have a point (missing specification of expected queries), so I expand it as a response here.

Among the MANY requirements I shared with the candidate, only one was the 6TiB. Another one was that it was going to serve as part of the backend of an internal banking knowledge base, with at most 100 requests a day (definitely not 10k people using it).

To all the upset data infrastructure wizards here: calm down. It was a banking startup, with an experimental project, and we needed the sober-thinking generalist who can deliver solutions to real *small scale* problems, not the winner of buzzword bingo.

HTH.


Thanks for the follow-up. I've always felt any question is good for an interview if it starts a conversation. Your thread did just that, so I'd consider it a success!


This load is well handled by a Postgres instance and 15-25k thrown at hardware.


I don't know anything, but when doing that I always end up next Thursday having the same thing with 4TB, and the week after with 17TB, at which point I regret picking a solution that fit so exactly.


I would've said "Pandas with Parquet files". If you're hiring a DS it's implied that you want to do some sort of aggregate or summary statistics, which is exactly what Pandas is good for, while awk + shell scripts would require a lot of clumsy number munging. And Parquet is an order of magnitude more storage efficient than CSV, and will let you query very quickly.
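A minimal sketch of what that answer looks like in practice, assuming a pyarrow-backed pandas install, hypothetical partitioned Parquet files, and made-up column names:

    import pandas as pd

    # Reads every Parquet file under the (hypothetical) partition directory.
    df = pd.read_parquet("events/year=2024/")

    summary = (
        df.groupby("customer_id")["amount"]
          .agg(["count", "sum", "mean"])        # aggregate / summary statistics
          .sort_values("sum", ascending=False)
    )
    print(summary.head(20))

For the multi-TiB case you would probably read only the columns you need (pd.read_parquet(..., columns=[...])) or work partition by partition, but the shape of the code stays the same.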


My smartphone cannot store 1TiB. <shrug>


Somewhat ironically, I'm pretty sure I failed a system design interview at a big tech company last year for not drastically overbuilding to solve a problem that probably had far less data (movie showtimes, and IIRC there are fewer than 50k screens in the US).


I have lived through the hype of Big Data; it was a time of HDFS + HTable, I guess, and Hadoop, etc.

One can't go wrong with DuckDB + SQLite + Open/Elasticsearch either, with 6 to 8, or even 10, TB of data.

[0]. https://duckdb.org/


This is a great test / question. More generally, it tests knowledge with basic linux tooling and mindset as well as experience level with data sizes. 6TiB really isn't that much data these days, depending on context and storage format, etc. of course.


It could be a great question if you clarify the goals. As it stands it’s “here’s a problem, but secretly I have hidden constraints in my head you must guess correctly”.

The OP's desired solution could probably have been elicited from some of those other candidates if they had been asked, "Here is the challenge, solve it in the most MacGyver way possible". Because if you change the second part, the correct answer changes.

“Here is a challenge, solve in the most accurate, verifiable way possible”

“Here is a challenge, solve in a way that enables collaboration”

“Here is a challenge, 6TiB but always changing”

^ These are data science questions much more than the question he was asking. The answer in this case is that you’re not actually looking for a data scientist.


In my context, 99% of the problem is the ETL; it has nothing to do with complex technology. I see people get stuck when they need to pull this data from different sources in different technologies and/or APIs.


6TB - Snowflake

Why?

That's the boring solution. If you don't have a use case, or don't know what kind of queries you would run, then opt for maximum flexibility with the minimal setup of a managed solution.

If cost is prohibitive in the long run, you can figure out a more tailored solution based on the revealed preferences.

Fiddling with CSVs is the DWH version of the legendary "Dropbox HN commenter".


I'm not even in data science, but I am a slight data hoarder. And heck even I'd just say throw that data on a drive and have a backup in the cloud and on a cold hard drive.


On the other hand if salaries are at 300k then 10k compared to that is not a huge cost. If a scalable tool can make you even 10 percent more effective it would be worth 30k.


I can't really think of a product with a requirement of at most 6TiB of data. If the data is as big as TiB-scale, most products have 100x that rather than just a few TiB.


And how many data scientists are familiar with using awk scripts? If you’re the only one then you’ll have failed at scaling the data science team.


"... or a $199 enterprise HDD"

External or internal? Any examples?

"... it could be loaded (multimple times) to memory"

All 6TiB at once, or loaded in chunks?


https://diskprices.com yields https://www.amazon.com/dp/B0C363Y5BQ, fwiw. (16TB for $129.99 at time of writing)


That 1st site is great. Many thanks.


>> "6 TiB of data"

is not a somewhat detailed requirement, as it depends quite a bit on the nature of the data.


Can you get a single machine with more than 6TiB of memory these days?

That's quite a bit..


Huh? How are you proposing loading a 6TB CSV into memory multiple times? And then processing it with awk, which generally streams one line at a time?

Obviously we can get boxes with multiple terabytes of RAM for $50-200/hr on demand, but nobody is doing that and then also using awk. They're loading the data into ClickHouse or DuckDB (at which point the RAM requirement is probably 64-128GB).

I feel like this is an anecdotal story that has mixed up sizes and tools for dramatic effect.


Awk doesn't load things into memory; it processes one line at a time, so memory usage is basically zero. That said, awk isn't that fast. I mean, you're looking at "query" times in the range of at least 30 minutes, if not more.

Awk is, IMO, a poor solution. I use awk all the time and I would never use it for something like this. Why not just use Postgres? It's a lot more powerful, easy to set up, and you get SQL, which is extremely powerful. Normally I might even go with SQLite, but for me 6TB is too much for SQLite.


> How are you proposing loading a 6TB CSV into memory multiple times? And then processing with awk, which generally streams one a line at a time.

Ramdisk would work.


Storing 6TB is easy.

Processing and querying it is trickier.


Would probably try https://github.com/pola-rs/polars and go from there lol
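In case it helps, a rough sketch of what "try Polars and go from there" might look like, using the lazy API so the full dataset never has to fit in memory; paths and column names are assumptions, and the exact streaming API varies a bit between Polars versions:

    import polars as pl

    result = (
        pl.scan_parquet("data/*.parquet")      # lazy: nothing is read yet
          .filter(pl.col("status") == "error")
          .group_by("service")
          .agg(pl.len().alias("errors"))
          .sort("errors", descending=True)
          .collect(streaming=True)             # process in chunks instead of all at once
    )
    print(result)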


Not code but I have encountered a doppelgänger software engineer.

He was an external contractor on a (sub-)project I de facto managed. I'd meet him twice a week, checking how the feature implementation was going and prioritizing upcoming work. He delivered exactly what I told him, in the way I would have done it. Even better, whenever he had more time until our next meeting, he would proactively implement some new stuff, which I would have asked for. Sometimes when I was speaking to him, just because speech is slower than thought and I hadn't expressed my thought yet, he'd "finish my sentence", point out exactly the corner cases I was about to, and also, eerily, get stuck at the same problems I would.

I loved working with him; I kept telling even my friends outside of work what a great companion I had found and how happy I was at work. The project was "finished" and we don't collaborate anymore. He is a great guy, and there were definitely times I had this doppelgänger feeling.


I converted 140+ CLion users to VS Code / vim / Emacs (mostly vscode) at our company.

It's not about the "text editor". It's about the LSP and the implementations of it. Like it or not, before LSP, code intelligence tooling was quite bad in "text editor" land (except for some tools like clangd or jedi). Today it's a bliss to use a performant, small-footprint editor and offload code intelligence. Although IDE-loving users swear that their IDE's built-in code intelligence is superior (which, in some cases, is objectively true).

Thanks to Microsoft, I guess. Thanks to the FOSS world for the implementations, too.


Was this before or after Nova? I’ve found CLion quite pleasant coming from Emacs - it’d been years since I wrote C++ and I’d never used CMake so I was finding myself spending too long feeding Emacs to get an environment I was happy with. It’s nowhere near as powerful as a modern Java or .NET IDE (I have indulged this new guilty pleasure and played with the whole JetBrains suite at this point) but I’ve found it plenty powerful. I still default to Emacs for ecosystems I know well though, and expect I’ll migrate back fully at some point.


We were among the first to try Nova (talking directly with the developers), and while it's slightly better than classic CLion, the double-size package and the still-missing remote index support didn't warrant rolling it out, unfortunately.

We created a clangd-based setup with remote indexes so that our developers can open their editor (vim / vscode) and within 10 seconds get full code intelligence on the 20M+ SLOC monorepo. This is unimaginable with CLion, where developers have to scope down their "view" of the monorepo to a handful of directories and still wait half an hour for the indexing to happen.


I'd love to know what product/monorepo has 20M+ SLOC if you'd be comfortable sharing that


Internal. 30+y of development with multiple languages, and I'm still a rookie when I look at the code and the progress of it, and part of the revenue (maybe even most of it) doesn't even come from this monorepo codebase.

My understanding is that 20M+ SLOC is not even considered a super large code base, though, when it comes to those "textbook" monorepo examples (urban legends?)


Gotcha, nicely done.


I read it before I became a daily reader of HN.

I always thought and still think it's a great book. It may lack some of the writer's instruments that would make it "a good book" in the classical sense, but to me nothing was missing at all.

If anyone here is neglecting this book because it's one of those "sci-fi books that HN recommends", please give it a try!


I was a huge Asimov fan as a kid. Re-reading some of my favorites, I noticed they lacked a lot of classical "good book" elements. But they are still great in their own way.


This is unnecessarily complicated. Just use `SHELL`

    SHELL ["/bin/sh", "-o", "nounset", "-c"]
    RUN printf "DEB=${DEBVER}\nCAP=${CAPVER}\n"


Follow-up question: Does the SHELL need to be set within each stage of a multistage build? Ideally, it could be set once to remove this footgun, but I’m guessing the footgun gets reloaded with each stage.


Yes.

In general, I think of multi-stage builds as if multiple docker files were concatenated; if ARG or SHELL doesn't exist in the theoretical split Dockerfile, then it wouldn't exist in the multistage build.

Docker(file) is a thing which brought in some quite neat new ideas, and a lot of these "footguns". I did read the docs, and I am now comfortable building even complicated "Dockerfile architectures" with proper linting, watching out for good caching, etc. I wish there were better alternatives; I saw images being built with Bazel, and thanks, but that is even more convoluted, with presumably a different set of footguns.

I am not saying that Dockerfiles are good. But I also think that knowing the tool you use differentiates you from being a rookie to an experienced professional.


> I am not saying that Dockerfiles are good. But I also think that knowing the tool you use differentiates you from being a rookie to an experienced professional.

I agree in general, though I'd also add that requiring somebody to know and avoid a large number of footguns is what differentiates a prototype from a production-ready tool. While Docker has a large number of features, the number of gotchas makes it feel like something that should only be used in hobby projects, never in production.

In the end, I ended up making several scripts to automate the generation and execution of dockerfiles for my common use cases. I trust myself to proceed cautiously around the beartraps once. I don't trust myself to sprint through a field of beartraps if there's a short deadline.


> Empty, selfish

This is sad.

I wish you the best for your non-empty, non-selfish life. I also hope that one day you open your mind and develop some sympathy for people who think differently, not to mention those who didn't choose this path, yet whom you put into the empty, selfish box.

Hope this is not your best self.

