In my personal experience, having interviewed dozens of candidates (data science...

overgard · on Feb 15, 2020

I feel the need to nitpick your example.

> does the candidate want to code it up or understands that they could do: cut | sort| head

Piping together commands is undoubtedly programming, just in ancient shell script. So in a sense you're really testing for bash expertise. Which is maybe really relevant to you but I wouldn't say you're really "avoiding" coding by knowing to use those commands together, you just decided to code it with obscure shell commands. I might fail your test by choosing to write that sort of thing in python, because I could write it much more reliably in 10 minutes than deal with all the incredibly weird things that can happen in bash. I mean the final product in bash might be slightly more elegant, but your terminal history is probably littered with "man cut" and various attempts at it.

But for your example, my answer might be: if you're querying this file once, you may be querying it again, it's probably just way simpler to load the thing into sqlite instead of trying to imitate sql with some janky unix commands.
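To make the sqlite route concrete, here is a minimal sketch using only Python's standard library. The sample data, the table name `data`, and the column names are all made up for illustration, and an in-memory database stands in for a real file.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for the file from the interview question.
SAMPLE = "name,score\nalice,3\nbob,9\ncarol,5\n"

def load_csv(conn, text):
    """Load CSV text into a table named `data` (name and columns are assumptions)."""
    rows = csv.reader(io.StringIO(text))
    next(rows)  # skip the header line
    conn.execute("CREATE TABLE data (name TEXT, score INTEGER)")
    conn.executemany("INSERT INTO data VALUES (?, ?)", rows)

def top_scores(conn, n):
    """Return the n highest-scoring rows, the SQL analogue of sort | head."""
    query = "SELECT name, score FROM data ORDER BY score DESC LIMIT ?"
    return conn.execute(query, (n,)).fetchall()

conn = sqlite3.connect(":memory:")
load_csv(conn, SAMPLE)
print(top_scores(conn, 2))
```

Once the data is loaded, a second or third question about the same file is just another query rather than another pipeline.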

> everyone with real work experience has a story about sorting.

I am 34 and an experienced coder and I literally have no stories about sorting, and I've never once wanted to sort a CSV file on the terminal.

glofish · on Feb 15, 2020

sqlite answer - excellent, that is exactly what I am looking for, things people did, potential solutions let me understand the candidate's real background - not the buzzwords

what is not well received is the judgmental tone, passing judgment about me for things you cannot possibly know, no need for that either, simple questions also irritate some, very important to weed those people out too,

I expect you would fail the test because of the attitude, even though if this were a real job interview you'd do your best to suppress it, it would come out

invalidOrTaken · on Feb 15, 2020

>what is not well received is the judgmental tone, passing judgment about me for things you cannot possibly know

But this is the irony---a job interview is a judgment. Why do you think feelings on this run so high?

ip26 · on Feb 15, 2020

> I expect you would fail the test because of the attitude

How is this relevant? Now you're just taking cheap shots.

glofish · on Feb 15, 2020

I don't get it, honestly.

I would not recommend a candidate, who, when asked if they could do this with cut|sort|head would reply something like:

heh, what a pathetic question, I bet your history is full of "man cut"

it is not the right answer, it is needlessly obnoxious and indicates a person that can barely bottle up their emotions and quickly gives in under pressure. Usually not a good match to any team - unless they also bring in something massively beneficial.

x3n0ph3n3 · on Feb 15, 2020

Culture fit. Is the candidate likely to reject/scoff at certain tasks because they think it's below them?

joshuamorton · on Feb 15, 2020

"it appears you're implicitly looking for bash knowledge, which is unfair".

"I wouldn't hire you with that attitude"

I'm getting more judgemental vibes from interviewer than interviewee.

blub · on Feb 15, 2020

I do believe overgard was just making a point and not expecting a job offer from you. I find these "I'd never hire you" comments so obnoxious... At least you could tell us where you work and your name so we could ask for another interviewer if we ever apply there :-)

satisfaction · on Feb 18, 2020

Whew. My first thought was to import the thing into a database and use the database engine to sort.

hogFeast · on Feb 15, 2020

Dear God, I hope you don't do interviews. Talking about the "attitude" of someone you don't know, taking criticism personally, thinking that you are weeding people out by trying to "irritate some", suggesting they are trying to suppress it, and (of course) that you will detect it. Every thread about hiring on HN has these weird passive-aggressive interviewers...interviewing is hard, it is really something that people should be trained to do (and some people really won't ever be able to do it).

solotronics · on Feb 15, 2020

I was with you until you called unix tools janky.. how dare you! semi joking :]

there is some simple elegance and power to these old C tools

SkyPuncher · on Feb 15, 2020

> I am 34 and an experienced coder and I literally have no stories about sorting, and I've never once wanted to sort a CSV file on the terminal.

I'm a bit younger, but have done this dozens and dozens of times.

----

A lot of one off processes are way easier to handle with a bunch of terminal commands and pipes.

Aeolun · on Feb 15, 2020

Only if you do them often enough that you don’t forget the flags each time.

scottlocklin · on Feb 15, 2020

Dude, do you even data science?

Not knowing Unix tools like cut and sort is a hard fail for a senior individual contributor in a data science role, as is using sqlite, which totally doesn't scale the way sort and cut do. Separates sheep from goats in data science land. You should really learn them if you're in the field and work with reasonably big data sets. Frankly you should learn them if you work with data at all, ever.

I've literally seen FAANGs recommendation engines powered with these tools running nightly on someone's desktop.

ansgri · on Feb 15, 2020

Well, they do work, but streaming processing in Python is not difficult and it’s much easier to have graceful failure processing and self-documenting code. Not to say of extending the logic if it would ever be needed.
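As a rough sketch of what that can look like, the function below does approximately what `cut | sort -n | head` would do, but counts and skips malformed rows. The function name, the tab separator, and the sample input are assumptions made for the example.

```python
import heapq
import io

def smallest_values(lines, column, n, sep="\t"):
    """Stream lines, pull one numeric column, and return the n smallest values.

    Roughly `cut -f COLUMN | sort -n | head -n N`, except that malformed
    rows are counted and skipped rather than passed through to the sort.
    """
    skipped = 0

    def values():
        nonlocal skipped
        for line in lines:
            fields = line.rstrip("\n").split(sep)
            try:
                yield float(fields[column])
            except (IndexError, ValueError):
                skipped += 1  # short row or non-numeric field

    # nsmallest consumes the generator while holding only n values at a time.
    result = heapq.nsmallest(n, values())
    return result, skipped

sample = io.StringIO("a\t3\nb\toops\nc\t1\nshort\nd\t2\n")
print(smallest_values(sample, column=1, n=2))
```

The input is read one line at a time, so the whole file never has to fit in memory.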

But maybe that’s more about separating sheep from shepherds.

Aeolun · on Feb 15, 2020

I’m not sure if I should be horrified or not. Both by the fact this happens, and by the fact that you seem proud of it.

Learning sort and cut literally takes all of 10 minutes, so if it makes you pass over an otherwise qualified candidate you have your priorities completely backwards.

scottlocklin · on Feb 15, 2020

Feel free to be horrified: a data scientist who doesn't understand where and why to use unix command line tools for data preparation and ETL is about as useful to me as one who doesn't understand the conditions where a t-test breaks down or what a ROC curve is.

Generally speaking, people like this have never actually dealt with large data sets, never dealt with issues involved with installing "unapproved software" on a machine (ridiculously common in The Real World), have probably never cleaned a dirty data set (what do you do when your giant csv is formatted in a way that Wes McKinney didn't think of?), and will in a senior role be a long term liability for a data science team that works on serious problems. Sure at one point I didn't know about them either: I wasn't a senior data scientist then. I submit that if you don't know about them and haven't actually used them, you aren't either.

blub · on Feb 15, 2020

I think that the people not being impressed by cut and sort are approaching this from the Linux end of things, where those tools are nothing special at all. I guess we kind of expected that the data science wizards would be using fancier tools.

scottlocklin · on Feb 16, 2020

Yeah, well, people who have enough self regard to think of themselves as "wizards" are super unlikely to be able to actually do the day to day grind of getting, cleaning and preparing data for feature generation, which is about 95% of the job.

Another good weeder for a person claiming to be senior: discuss how you would fix the performance of the default R naive Bayes implementation in e1071. It's numerically more or less correct, but written by deranged ape-men who don't understand how computers work (a problem in a lot of the R ecosystem; in the Python ecosystem, the problem is nobody has yet written algorithms for X, which ends up being a very similar problem: aka it's your job to code up sane algorithms).

whalabi · on Feb 15, 2020

So I think this is a perfect example of what happens in tech interviews.

OP is using knowledge of a specific technology as a heuristic for "has experience in role x"

But this always makes me wonder, couldn't you see that experience from a resume? If the candidate filled a data science role at somewhere reputable for 3 years, and you verify that they successfully filled that role, why rely on that heuristic?

As you say testing for the specific technology, when it can be learnt in 10 minutes, does not seem logical.

VRay · on Feb 17, 2020

Don't worry, he responded with this very data-driven explanation:

> Generally speaking, people like this have never actually dealt with large data sets

BlueTemplar · on Feb 15, 2020

Hmm, tell us more? Databases are generally known for their good performance (under locking conditions - this is the relevant factor I guess?), after all...

redis_mlc · on Feb 15, 2020

> Databases are generally known for their good performance

If the data is on a filesystem, then sed, grep and cut pipelines will likely be your fastest option (Yahoo! processed petabytes of logfiles for decades that way.)

If the data is already inside a database table and indexed well, that could be fast enough. But generally speaking, the ETL is often a bottleneck. And DBAs are $$$$ compared to "the UNIX way."

Source: DBA

mattmanser · on Feb 15, 2020

A column orientated file? What format? Excel? CSV? What is cut, sort or head? Are you only accepting Linux candidates then, a tiny percentage of users?

This sort of utter nonsense question, heavily loaded to your "standard" experience, which is anything but, is even worse than the questions cited in the article.

All you're doing is filtering for people who are in your tribe, who followed the same path as you and think like you, use the same tools as you and the same OS as you.

Pretend all you want, but you're filtering not by "experience" but by trying to find people in your tribe, which is naturally heavily weighted.

glofish · on Feb 15, 2020

Interesting how negative your reaction is. Also how far off target all that anger is.

I am not selecting for a tribe, I am selecting for a job. The questions are loaded, of course. Among the many duties, the jobs do require processing large files, sometimes with cut, Python or C. I want the candidate to use the most appropriate tool as needed. I'd rather not have people implement functionality that already exists in the 'comm' command.

Of course, I want the candidate to ask me what the column separator is. That's why the question is formulated that way.

The right answer will depend on the column separator. Proposing the UNIX cut if the file is CSV is not such a good answer, but for tab-separated files, it is just fine. If the file is CSV and they tell me about cut, my next question would be if that is a good universal solution for CSV files in general.

When someone knows about the pitfalls of using cut when parsing CSV, it shows me they have indeed had experience with that.

Do you see why this question is the best... the possibilities are endless, and the rabbit hole much deeper than it may seem

amznthrowaway5 · on Feb 15, 2020

> The right answer will depend on the column separator. Proposing the UNIX cut if the file is CSV is not such a good answer, but for tab-separated files, it is just fine. If the file is CSV and they tell me about cut, my next question would be if that is a good universal solution for CSV files in general.

TSV and CSV have the same limitations. A tab-separated file could still have tabs inside a field depending on the quoting convention. Either separator could be used with cut. I can't believe you are so confident in your partially truthful answers.

buzzkillington · on Feb 15, 2020

The only true Unix column format is ascii delimited text.

Oddly no one has heard of that, the only reason why I found out about it is because I had to read in punched tapes with 7 character ascii from an experiment done in the 80s during my undergrad.

https://en.wikipedia.org/wiki/Delimiter#ASCII_delimited_text
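For anyone who has not seen the format, a small sketch: fields are separated by the ASCII unit separator (0x1F) and records by the record separator (0x1E), so commas, tabs, quotes and newlines need no escaping. The helper names `dumps` and `loads` are made up for this example.

```python
US = "\x1f"  # ASCII unit separator: goes between fields
RS = "\x1e"  # ASCII record separator: goes between records

def dumps(records):
    """Serialize rows of strings as ASCII delimited text."""
    for row in records:
        for field in row:
            if US in field or RS in field:
                raise ValueError("field contains a separator character")
    return RS.join(US.join(row) for row in records)

def loads(text):
    """Parse ASCII delimited text back into rows of strings."""
    if not text:
        return []
    return [record.split(US) for record in text.split(RS)]

rows = [["id", "note"], ["1", 'commas, "quotes"\tand tabs\nare fine']]
assert loads(dumps(rows)) == rows
```

The catch is the one the sketch guards against: the data itself must never contain the two control characters.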

amznthrowaway5 · on Feb 15, 2020

Fascinating, thanks. In all of the large projects I've worked on, the CSV format variants and their inconsistencies have caused disasters. Who knew that Pandas and Spark handle CSV settings differently by default, and that Spark has a hard time with newlines in its CSV output? CSV's inconsistencies result in a lot of data corruption, doubly so in large teams.

Replacing CSV with flat JSON or parquet depending on the use case has been a good move for avoiding these issues. The risks of CSV are usually just too high.

BlueTemplar · on Feb 15, 2020

Quad-spaced facepalm.... How come these were dropped out of usage ?!?

efficax · on Feb 15, 2020

Seems like you're completely missing the point of the interview question which is to see how someone would approach a problem, investigate its requirements, propose a solution, examine its drawbacks and how they would take feedback on that solution and its possible advantages and disadvantages.

amznthrowaway5 · on Feb 15, 2020

I did not miss that. I was pointing out that glofish said he asked questions like these repeatedly to candidates, but he doesn't even understand the basics of the subject matter. That would not make for a good interview.

AtlasBarfed · on Feb 15, 2020

Bingo!

DethNinja · on Feb 15, 2020

As someone with background in Mechanical Engineering, software interview questions as above seem so wild and absurd to me. Why would you care about someone knowing the intricacies of CSV or bash? I would expect a good engineer to provide best possible solution to your problem within an hour of googling / research. I really don’t see the point of asking such specific questions on interviews as it has no correlation with finding a good engineer. I wish software field would move closer to interview process of other engineering disciplines but it seems to be getting wilder each year.

streb-lo · on Feb 15, 2020

Because it's used in the real world all the time?

The amount of times I've had to write my own sorting algorithm in my career: 0.

The amount of grep/sed/awk I've used? Countless.

Someone who is familiar with how powerful and flexible these tools are is likely to accomplish something that can benefit from them quicker than someone who isn't aware of them.

Also in my experience software devs that shy away from the command line because they don't like it rarely pan out.

vander_elst · on Feb 15, 2020

Come on man ` cut -d "," ` I like the question and how you think about the rabbit hole, but you need to sharpen your knife A LOT before being able to ask such questions, you need to be prepared for all the kinds of answers, which might be right even if you don't have a clue about what the candidate is talking about...

tom_ · on Feb 15, 2020

But that is exactly why you can't use cut in general for parsing CSVs - the usual CSV syntax includes quoted fields, in which commas don't separate the values. But cut only examines the input charwise:

    % echo 'a,"b,c",d' | cut -f 1-3 -d ","
    a,"b,c"

I don't think there's a good solution to this using standard tools, but I'm sure there are various CSV packages available. (Which I've never used! - I'm just familiar with this problem from seeing people try to work with CSV data in code using exactly the char-by-char approach taken by cut.)
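As one example of such a package, Python's standard `csv` module handles the quoted field from the snippet above. This is only a sketch of the difference between character-wise splitting and quote-aware parsing, not a recommendation of a particular tool.

```python
import csv
import io

line = 'a,"b,c",d\n'

# Character-wise splitting, which is effectively what cut -d "," does.
naive = line.rstrip("\n").split(",")

# The csv module follows the quoting rules, so the quoted comma stays in its field.
parsed = next(csv.reader(io.StringIO(line)))

print(naive)   # ['a', '"b', 'c"', 'd']
print(parsed)  # ['a', 'b,c', 'd']
```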

vsareto · on Feb 15, 2020

I think the interviewer-escape-hatch for that is to only have delimiter commas in the file (so cut will work in this situation), and withhold that information to see if the candidate asks about it. If they do ask, that's points in their favor.

mixmastamyk · on Feb 15, 2020

I had to read it a few times to figure out what “column-oriented” meant. I may not have been able to do it under pressure in an interview. If you’d said “ordered by column” I’d have understood much more quickly.

i.e. Be careful with your phrasing. That is a bias in itself.

benmmurphy · on Feb 15, 2020

I would assume column oriented would mean the data would be formatted something like:

    R1C1,R2C1,R3C1
    R1C2,R2C2,R3C2

So values that have the same column are clustered together. Whereas normally in a file values that have the same row are clustered together. This is what column-oriented usually means when you are referring to databases.

I guess this is not what the questioner meant because they referred to using cut | sort | head as a solution. Though, I don't understand why head would be at the end of either problem's solution so maybe I'm missing something. head could be a useful way of peeling out the column you want in the column-oriented problem.
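To illustrate the layout described above, a small sketch that transposes ordinary row-oriented records into the column-oriented form. The cell labels match the example; the function name is made up.

```python
rows = [
    ["R1C1", "R1C2"],
    ["R2C1", "R2C2"],
    ["R3C1", "R3C2"],
]

def to_column_oriented(rows):
    """Transpose row-major records so each output line holds one column."""
    return [list(column) for column in zip(*rows)]

columns = to_column_oriented(rows)
for column in columns:
    print(",".join(column))
# R1C1,R2C1,R3C1
# R1C2,R2C2,R3C2
```

In this form, picking a column means picking a line, which is why head rather than cut would be the natural tool.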

ptr · on Feb 15, 2020

Me too, I thought a “column oriented” file was a file where the data for column #1 comes before #2; ie, structure of arrays rather than array of structures. “cut” doesn’t work with that afaik. I’m not sure I’d ask for clarification here (as to me, this is what “column oriented” means), and probably fail the interview.

RobRivera · on Feb 15, 2020

I don't see anger or negativity. I see valid feedback for identifying bias with passion. To paint it in a negative light is to introduce bias.

AtlasBarfed · on Feb 15, 2020

Well the person was exposed as someone just validating themselves in interviews, so legitimate criticism would be painted as an attack.

All tech interviews start with a need to legitimize and reinforce the interviewers as successful and talented ....

Even if we're basically all terrible

andrewprock · on Feb 15, 2020

Can I just use csvcut?

ramraj07 · on Feb 15, 2020

He didn't ask to use cut, you could use whatever windows function you have used as well. Unless, of course, you've never had to unmangle a weird text file to get something out of it in your computing history. Then perhaps I don't want you at all?

OP asked a very open ended question, merely made a suggestion that it could also be done with some standard Unix programs (I grew up on windows and even I know about cut and awk because if you spend enough time anywhere in tech you will know these). Why it triggered you, not sure, but perhaps the question is doing its job after all.

ip26 · on Feb 15, 2020

> Then perhaps I don't want you at all?

I've watched a number of new grad hires pick up bash, vim, and version control from scratch in a month or two and go on to be very successful. For better or worse some good schools don't cover those sorts of ancillary skills, and not every good candidate will tinker with Linux as a hobby.

Igelau · on Feb 15, 2020

> Pretend all you want, but you're filtering not by "experience" but by trying to find people in your tribe, which is naturally heavily weighted in chauvinism, racism and elitism.

I didn't realize Linux Users counted as a race now :D

BlueTemplar · on Feb 15, 2020

We're a penguin-related species...

SkyPuncher · on Feb 15, 2020

`cut`, `sort`, and `head` are basic Unix commands. Even in a Windows world, they're easily available (either running Ubuntu terminal or cygwin).

This is like complaining that you can't read the basics of a new programming language.

rtikulit · on Feb 15, 2020

Around 1990 I had to do some interviewing and I used this kind of intentionally underspecified simple problem to weed people out. (It seemed radical at the time.) Did they realize the problem was underspecified? Did they try to elicit the underlying “why”? Did they collaborate with me to complete the spec? Were they able to restate the spec coherently? Could they articulate a workable implementation in some technology domain? Basic stuff. Interesting thing is that there were people who really didn't do very well, who we hired anyway, who went on to be reasonably productive and reliable developers. In retrospect I concluded the cognitive and social basis of their ability was incomprehensibly different from mine. But I'd still do those questions today.

arp242 · on Feb 15, 2020

Different people are good at different stuff. In my experience the best teams are those with a diverging skill set: Alice may be really good at asking these kinds of "why" questions, whereas Bob is really good at quickly implementing things from a spec, and Chris may be very good at databases, etc.

minimaxir · on Feb 15, 2020

As someone who has had bad data science interviews before getting my current data science job, the process is highly variable. I've had interviews where the interviewer is looking for a specific right answer, with the answer being a binary you-know-it-or-you-don't thing that can't be talked through or worked out in a dialogue with the interviewer.

An example was a whiteboard problem requiring the BETWEEN syntax for SQL window functions, which is very uncommon. After I asked for a hint, the interviewer replied "You don't know the BETWEEN syntax for window functions? Everyone knows that."
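For readers who have not met that syntax, a sketch of a window frame using `ROWS BETWEEN`, run through Python's `sqlite3` module. The table and column names are invented, this is not the actual interview problem, and it assumes a SQLite build with window function support (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 10), (2, 20), (3, 30), (4, 40)],
)

# Sum over a sliding frame: the current row plus the one before it.
query = """
    SELECT day,
           SUM(amount) OVER (
               ORDER BY day
               ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
           ) AS two_day_total
    FROM sales
    ORDER BY day
"""
result = conn.execute(query).fetchall()
print(result)
```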

hobs · on Feb 15, 2020

My favorite iteration of this is also when the interviewer has a suboptimal answer to this question, and expects you to parrot the wrong thing back to them.

I could tell I annoyed my interviewer when they told me I was wrong and I demurred, and politely asked them to look it up since there was some question about the facts. They did not look it up.

lostcolony · on Feb 15, 2020

I had that from a Microsoft interviewer. Thought my code was O(n^2) because he was a C guy...whereas I was writing it in Java (something I had checked would be okay with both the recruiter and the interviewer). Querying the length of a string is an O(1) operation, not O(n), so while you could make the case it's suboptimal (since a function call per loop instead of just a variable lookup), it's not quadratic in behavior. And when he asked what I would do if a number overflowed and I said "...let the exception bubble up because based on the function you asked me to write there is nothing I can do cleanly within it" it was pretty clear the interview was over.

Good times.

ansgri · on Feb 15, 2020

TIL even sqlite has window functions.

minimaxir · on Feb 15, 2020

MySQL and sqlite only got window functions very recently, years after the aforementioned interview.

On a take-home test around that time, the question specifically said to use PostgreSQL just because the answer required a window function.

bambo222 · on Feb 27, 2020

Why would you do cut|sort|head? You should instead just ask the k-sorted merge question about external sorting.

As a FAANG data scientist, I've never once wanted to use cut|sort|head nor have I wanted to work with CSVs. Everything is already sharded and encoded as a schema-enforced binary encoding like protobuf or thrift. The file is so large it's better to favor Apache Beam or equivalent to parallelize the aggregations of particular fields over very large amounts of data. But, hopefully you just use some SQL-like interface such as BigQuery that, when pointed to sharded files, can easily do aggregations for you with a SQL-like language (which kicks off distributed computing jobs under the hood and is not truly relational). Unless you're streaming data, then that's another question.

Testing unix commands is narrow minded IMO. If you want to test divide and conquer plus streaming, then just ask a flavor of that Leetcode question.

zemo · on Feb 16, 2020

the responses to this are breathtaking. I think I would decline to hire the majority of HN users because of their arrogance.

My partner is learning about data science now so I asked them if I could try this question on them in the context of a data science interview, first thing in the morning and without coffee. They looked at me and said "being asked data science interview questions by your spouse right after waking up is the worst thing in the world but I dunno I would load it into pandas and put it in a data frame". And like honestly that's not how I would do it (I would do awk | sort | head because I always forget the cut column options) but the whole point is that the answer prompts further discussion. Now I know to ask about python and pandas (the thing the candidate uses and knows), and not, I dunno, scala and cascading/scalding or whatever (the thing that I know or use). Good questions investigate what the candidate knows. Bad questions investigate whether or not the candidate knows the things you already know.

People on this site are way too concerned with "being correct".

thedance · on Feb 15, 2020

How do you sort an infinite stream?

ummonk · on Feb 15, 2020

You could have a balanced data structure and keep inserting into it.

nutjob2 · on Feb 15, 2020

Sorry, you failed the interview! Your algorithm doesn't terminate, it's no better than an empty infinite loop.

If a list is sorted, then you'd be able to return the largest value. Since that is impossible the correct answer is that it's impossible.

throwlaplace · on Feb 15, 2020

by using an N element max/min heap and evicting the max/min when a new element comes in that's less than the max/greater than the min (note this is also a leetcode problem https://leetcode.com/problems/k-closest-points-to-origin/)
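A minimal sketch of that idea for the "keep the N largest" case, using Python's `heapq` as a min-heap. The class name `TopN` is made up for the example.

```python
import heapq

class TopN:
    """Keep the n largest values seen so far from an unbounded stream."""

    def __init__(self, n):
        if n <= 0:
            raise ValueError("n must be positive")
        self.n = n
        self._heap = []  # min-heap; the root is the smallest value kept

    def add(self, value):
        if len(self._heap) < self.n:
            heapq.heappush(self._heap, value)
        elif value > self._heap[0]:
            # Evict the smallest kept value in favor of the new one.
            heapq.heapreplace(self._heap, value)

    def largest(self):
        return sorted(self._heap, reverse=True)

top = TopN(3)
for value in [5, 1, 9, 3, 7, 8, 2]:
    top.add(value)
print(top.largest())  # [9, 8, 7]
```

Memory stays at N elements however long the stream runs, and each new element costs at most O(log N).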

glofish · on Feb 15, 2020

nothing wrong with knowing the answer or where to look it up, I don't get what your point is

the purpose of the interview is to filter out people that do not know the answer or have pre-learned something and don't fully understand its applications. When you are in a dialog it is a very different dynamic,

people that would not ask a question because it is already posted on leetcode are the problem the OP complains about

downerending · on Feb 17, 2020

Since no one else got it, I'd start by allocating an infinite array to ingest the stream into.

J5892 · on Feb 15, 2020

O(∞)

jakemal · on Feb 15, 2020

It's actually O(n) where n is ∞

pb7 · on Feb 15, 2020

You know a O(n) sort algorithm?

durovo · on Feb 15, 2020

O(n) sort algorithms do exist (Counting sort)
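A short sketch of counting sort, for reference. It is not comparison-based, and it is linear only when the key range is bounded: the running time is O(n + k) for n values in the range 0..k. The `max_value` parameter is an assumption of this particular version.

```python
def counting_sort(values, max_value):
    """Sort integers in the range 0..max_value in O(n + k) time."""
    counts = [0] * (max_value + 1)
    for value in values:
        if not 0 <= value <= max_value:
            raise ValueError("value out of range: %r" % (value,))
        counts[value] += 1
    result = []
    for value, count in enumerate(counts):
        result.extend([value] * count)
    return result

print(counting_sort([3, 1, 4, 1, 5, 2], max_value=5))  # [1, 1, 2, 3, 4, 5]
```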

glofish · on Feb 15, 2020

the post was getting too long to put in all the details, but basically, what I was getting at, imagine the data comes in batches and you have to retain the N largest seen so far

minimaxir · on Feb 15, 2020

IMO, that's not an appropriate question for a data scientist. Maybe it's better for a data engineer or a SRE.

glofish · on Feb 15, 2020

Basically you are saying that you cannot think of a way of doing this, other than the most inefficient way --> hiring a new person

This answer would make you fail the interview :-)

nutjob2 · on Feb 15, 2020

The question is ambiguous and misleading or just simply nonsensical.

This sort of thing is sadly common in interviews, where the interviewer has some arbitrary answer in mind and expects you to read his mind, which is possible only some of the time.

By definition you can't sort an infinite list. You've conveniently turned the question, in your mind, into something like "how do you efficiently maintain an ordered list of incoming items?"

aprdm · on Feb 15, 2020

It's honestly pretty ridiculous seeing your tone in the comments and how you think that highly about your bad interview questions and your flawed conceptions about the solutions, including CSV and TSV confusion. This smiley makes it embarrassing even.

Hope I never pass your interview!

nhumrich · on Feb 15, 2020

Or pass the interview if it was a management role

throwlaplace · on Feb 15, 2020

lol but this is literally a leetcode style problem so what's your point?

https://leetcode.com/problems/merge-k-sorted-lists/

https://www.geeksforgeeks.org/external-sorting/

and it's literally called "merge sorted files" (page 175 of elements of programming interviews).
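For reference, a compact sketch of the external-sort shape those links describe: sort chunks that fit in memory, then do a k-way merge. In-memory lists stand in for the temporary files a real external sort would write, and the function names are made up.

```python
import heapq
import itertools

def sorted_chunks(stream, chunk_size):
    """Split a stream into sorted chunks that each fit in memory."""
    iterator = iter(stream)
    while True:
        chunk = sorted(itertools.islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk

def external_sort(stream, chunk_size):
    """Sort by merging sorted chunks; lists stand in for temporary files here."""
    chunks = list(sorted_chunks(stream, chunk_size))
    # heapq.merge performs the k-way merge lazily, one element at a time.
    return heapq.merge(*chunks)

print(list(external_sort([8, 3, 5, 1, 9, 2, 7], chunk_size=3)))
# [1, 2, 3, 5, 7, 8, 9]
```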

glofish · on Feb 15, 2020

the point is to see how people think,

it is not so simple to regurgitate pre-learned answers when you alter a problem one small attribute at a time, in each case a different answer becomes optimal, thus you can see if the person understands what needs to be done or not

throwlaplace · on Feb 15, 2020

>the point is to see how people think

i feel very strongly this is a disingenuous claim. clinical psychologists go to school for a long time to learn how to assess people's abilities to think. why should i believe that you, a random software engineer, have any competency whatsoever? in reality this is exactly the reason standardized exams (or standardized interviews such as leetcode style problems) exist - because the average person can't accurately make that call.

enjoylife · on Feb 15, 2020

Upon completion of the interview the problem would indeed look like the result of the leetcode problem. However the take away from the op is in how the question has a narrative. It enables a dialogue over time. Each sub problem provides n ways of exploring a solution. Each problem providing a sounding board for the candidate to pronounce their thoughts. Such questions are effective at providing more signal(read as talking out loud) from candidates.