How to Analyze Every Reddit Submission and Comment, in Seconds, for Free (minimaxir.com)
138 points by minimaxir on Oct 3, 2015 | 65 comments



Question: why the hell do people use CSVs? Use tabs, goddammit. Documents are littered with commas. It's trivial to parse a TSV with awk - not so a CSV. There is literally no reason to prefer CSV to TSV, and yet it seems the former is the default for many people. Why? As a guy who parses text files all day, I never ever want to see a CSV again. End rant.


Pretty much every programming language in existence has a utility for parsing CSVs into tabular data while handling commas sanely. (using awk for processing is not a typical pipeline)
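
For instance, a minimal sketch with Python's standard csv module, which copes with quoted commas out of the box (the sample row is illustrative):

  import csv
  import io

  # A quoted field containing commas parses as a single value.
  raw = 'name,address\n"Richard, Martin","23 NS, North Coast, NY"\n'
  for row in csv.reader(io.StringIO(raw)):
      print(row)
  # ['name', 'address']
  # ['Richard, Martin', '23 NS, North Coast, NY']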

More importantly, CSVs are handled natively by statistical programs such as Excel and R (you can change the expected delimiter while importing, but in my experience that leads to unexpected results)


Microsoft often fails with CSVs actually. I've found I can't copy/paste from SQL Server Management Studio into Excel if there's a mismatched double-quote in a cell - it fails silently and garbles your data.

That was embarrassing.

The CSV format is so messy that even big corporate implementations will have subtle bugs like that. Do not trust it.


Microsoft Excel has the worst CSV defaults. The separator character is even locale-specific! Many locales use a semicolon ";" instead of "," as the separator. Great stuff when creating web app exporters.


> using awk for processing is not a typical pipeline

except it is. (g)awk is more often part of the pipe(d)line than other programming language interpreters.


>More importantly, CSVs are handled natively by statistical programs such as Excel and R

In Excel or R? Because I've never encountered problems with TSV files in Excel.


R can also read both CSV and TSV just fine. Just use read.csv() or read.table() as you want. Same goes for Pandas.


Excel can handle TSVs if you name them .txt


Network effect: Database upload/export, spreadsheets, etc... I used to hate them too until I realized how ubiquitous and well supported they are.

comma vs. tab: It doesn't matter. In the end you have to use a character to separate fields, and if you are going to use a "normal" character, there's a chance it will appear in your text, so you have to either escape that character or quote your text, which CSV does in a fairly standard way. So comma vs. tab doesn't really matter.


> ... if you are going to use a "normal" character, ...

It's a shame that the ASCII file, group, record, and unit separators that have been around since forever did not catch on.


Mostly because they aren't user-inputtable. Yes, you can't realistically hand-edit a 3 GB .csv either, but CSVs don't usually start out that large.


I was going to add a sentence to my comment... They were easy to input via control-caret, etc., until various programs repurposed those combinations. I think a bigger problem is that there is no standard way to display them.

(BTW, if inputtability mattered at all, Unicode wouldn't exist. ;-)


I'm not sure what you mean by your last sentence. Sure, you might have to work a bit to be able to change your locale to Klingon and input that -- but for something for which Unicode is actually useful, like Japanese, there have been input solutions for a long time (predating Unicode, as you need a wide character set for such languages).

And Unicode is of course easy to use with "less exotic" languages like Norwegian that can't be represented in basic 7-bit ASCII -- via a simple keyboard map.


+1

I always found it best practice to quote text. Makes the debate a moot point.
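
With Python's csv module, for example, that's a single keyword argument (a quick sketch):

  import csv
  import sys

  # QUOTE_ALL wraps every field in double quotes on output.
  w = csv.writer(sys.stdout, quoting=csv.QUOTE_ALL)
  w.writerow(["Richard, Martin", "23 NS, North Coast, NY", 42])
  # "Richard, Martin","23 NS, North Coast, NY","42"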


I can think of at least one reason. Copy-pasting from a terminal fails to preserve tabs, while commas are fine.

Tabs might be somewhat less frequent than commas in data, but you're still gonna have just as bad a time when they do turn up. And don't forget quoting and escaping. Without proper parsing you can neither read CSV nor TSV safely. My go-to solution is pandas.read_table().
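
A sketch of that approach (file names are hypothetical); pandas applies the same quoting rules to both delimiters:

  import pandas as pd

  # read_table defaults to tab-delimited, read_csv to comma-delimited;
  # both handle quoted fields and embedded delimiters properly.
  tsv_data = pd.read_table("data.tsv")
  csv_data = pd.read_csv("data.csv")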


Yeah, this is the deciding factor for me, too. The point of delimited text files is to find some compromise between machine efficiency and human readability. CSVs are generally much less readable than TSVs, but have less ambiguity, too. With TSVs, it's hard to tell whether a gap is actually a tab or just multiple spaces.

And with developers (especially Python devs) setting their text editors to translate tabs into spaces, copy-pasting TSVs results in data corruption. So does simply opening a TSV in a text editor with default settings.


Good terminal applications such as Gnome Terminal preserve tabs.


> Use tabs, goddammit.

Use ASCII's `RECORD`, `GROUP` and `UNIT` separators, they were literally designed for it.


Which key do I press on the keyboard for those when I want to hand-edit docs for quick testing?


In terminal / vim:

    Record separator => C-v C-^
    Unit separator => C-v C-_
In Emacs I think you can use C-q in place of C-v


Why use a character like tab, which can also appear in common text, when there are control characters designed exactly for this purpose? ASCII 0x1C - 0x1F are there specifically for machine parsing of tabular data.
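
A minimal round-trip sketch in Python using the unit (0x1F) and record (0x1E) separators:

  UNIT = "\x1f"    # unit separator: between fields
  RECORD = "\x1e"  # record separator: between rows

  rows = [["Richard, Martin", "23 NS, North Coast, NY"],
          ["Jane", "42 Elm St"]]

  encoded = RECORD.join(UNIT.join(fields) for fields in rows)
  decoded = [rec.split(UNIT) for rec in encoded.split(RECORD)]
  assert decoded == rows  # no quoting or escaping needed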


If you use control characters, it is not plain text any more. It's hard to edit with a text editor.

Tabs have a similar problem: They look like spaces in a simple text editor like notepad.


This just seems like a failing of the text editor.

But, if you can't input the character...you can't input the character.


Because you immediately run into some madman who wants to separate the columns with four spaces instead.


The company I work at uses the pipe character ('|') everywhere. It's fairly painless, and you can edit the files using standard text editors. It's also pretty easy for clients to provide or consume pipes. CSVs are terrible - people always have trouble providing or ingesting them with the correct quoting: probably 50% of the time it's suggested, there is strong resistance to the idea of using CSV quoting. TSVs are okay - you can copy-paste from Excel into your terminal, so they've got that going for them.

There's been one time where a client provided a couple hundred gigabytes of data in which all the reasonable characters one might use as a delimiter appeared in the fields themselves, including the client's own separator character (the comma). So I made up some rules for guessing which commas were delimiters and which were part of a field value, and stored the result using an ASCII control character as the delimiter. The only other use of that character at my company is in jokes, like "We should ask the client to provide or ingest files separated using the ASCII control codes," which always gets a laugh.


"Documents are littered with commas."

It's trivial to change all those commas to something else before making the CSV file and importing into the database.

They can then be converted back to commas by processing the query result.

This type of conversion before storage is routinely done with other characters, such as newlines, quotes, etc. For example, look at the JSON for the 10mHNComments data dump.
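
A sketch of that round trip (the stand-in character is arbitrary, as long as it never appears in the data):

  STAND_IN = "\x1f"  # any character guaranteed absent from the text

  def before_storage(text):
      return text.replace(",", STAND_IN)

  def after_query(text):
      return text.replace(STAND_IN, ",")

  assert after_query(before_storage("23 NS, North Coast, NY")) == "23 NS, North Coast, NY"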

Personally, from a readability standpoint, tabs drive me nuts. I am glad TSV is not the default.


If you haven't tried it, check out the csvkit tools: https://csvkit.readthedocs.org/en/0.9.1/. `csvlook` and `csvcut` are awesome.

I'm also a huge fan of Ruby's `csv` package in the standard library, particularly http://ruby-doc.org/stdlib-1.9.2/libdoc/csv/rdoc/CSV.html#me....

All that only works for properly quoted CSV, though... for companies who can't generate quoted CSV files, pipes or tabs are the unfortunate way to go.


To get tabs instead of commas:

  #!/usr/bin/env python
  # Read CSV from stdin and write the same rows tab-separated to stdout.
  import csv
  import sys

  writer = csv.writer(sys.stdout, delimiter='\t')
  writer.writerows(csv.reader(sys.stdin))


CSV is a garbage format, but at least it's a standard, one with a lot of tool support, and one that can handle embedded commas in data if processed correctly. The idea that tab-separated data is a superior format is, frankly, absurd. I could understand a call for, say, JSON or even XML, but an ad hoc, standardless format whose solitary advantage is being incrementally easier to parse with awk? No thanks.


NO CONSISTENT DELINEATORS STOP SEND ESCAPE CHARACTERS FULL STOP


Use tildes. Tilde Separated Values are relatively robust and nobody uses tildes by themselves for anything I've ever seen. Most tools that let you set a delimiter will accept tildes also.


nobody uses tildes by themselves for anything I've ever seen

Abbreviated Unix paths? ~/.bashrc

Approximated values? ~30º


Why use TSV when you can use an sqlite3 file?


I can't open a sqlite3 file easily in Excel for easy data manipulation, visualization, and sharing.


Tab? You mean two spaces right? :)


I agree, but sadly programs like Excel tend to do better with commas.


awk -F","

Problem solved.


Not when the data is like this.

  "Richard, Martin", "23 NS, North Coast, NY"
Commas frequently occur inside text fields, and then awk fails.


awk -F'", "'

That's apostrophe, quote, comma, space, end quote, apostrophe.


Which still fails if the values contain escaped quotes:

    Name, Age, Address
    "James aka ""Jim"", ""licensed"" attorney", 42, "New York"
That's three values:

    James aka "Jim", "licensed" attorney
    42
    New York
And there are other possible irregularities: zero or N spaces after the comma separators; unquoted values when they're not needed; backslash-escaped special characters; escaping newlines.
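
For comparison, a real CSV parser gets that example right; a quick check with Python's csv module (skipinitialspace handles the space after each comma):

  import csv
  import io

  data = 'Name, Age, Address\n"James aka ""Jim"", ""licensed"" attorney", 42, "New York"\n'
  for row in csv.reader(io.StringIO(data), skipinitialspace=True):
      print(row)
  # ['Name', 'Age', 'Address']
  # ['James aka "Jim", "licensed" attorney', '42', 'New York']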


What if not all columns are quoted? A lot of the time only the columns that need quoting are quoted and the others are bare.


Nice solution.

Will not work where numeric fields are not quoted.

But nice solution, nevertheless.


To figure out what's the best time to post to reddit you want the number of successful submissions divided by the total number of submissions in that time period, not just the count of successful submissions. Otherwise you may just be measuring which time slot has the most submissions.
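
A rough sketch of that normalization in pandas (column names are hypothetical):

  import pandas as pd

  # One row per submission: the hour it was posted and whether it
  # reached the front page (hypothetical columns).
  df = pd.DataFrame({"hour": [0, 0, 1, 1, 1],
                     "front_page": [True, False, True, True, False]})

  # Success rate per hour, not just the raw count of successes.
  rate = df.groupby("hour")["front_page"].mean()
  print(rate)  # hour 0 -> 0.50, hour 1 -> about 0.67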


The normalization constraint is enforced by the limited size of the front page (30 entries).


But the front page can churn a lot more during the busy times.


Visualization of times between subreddits (u/fhoffa) https://i.imgur.com/iF2msED.png


Really cool analysis. It's especially fascinating to see how simple BigQuery makes it to parse and query the data, and how easily it can give you access to powerful information like "What's the best time to post this on reddit?"


Man, those charts are beautiful, and this is coming from someone who knows R graphics and ggplot pretty well. Please may I ask a) what platform (OS X / Windows / Linux?) and b) what are the axis and title fonts?


I have a separate post on how the ggplot2 charts are generally done: http://minimaxir.com/2015/02/ggplot-tutorial/

Platform is OSX, title font is Source Sans Pro, axis numeric fonts are Open Sans Condensed Bold (the same fonts used on the web page itself; synergy!)


You've managed to make even ggplot's already excellent default outputs look even better. Love the darker grey gridlines, the unified grey panel background which includes the headline, the alpha < 1 so the grid shows through, and especially the choice of axis fonts. Nice touch on the footnote too. This is what professional charts are supposed to look like.


This is great, thank you. One note, though: I've yet to try it, but according to this Stack Overflow post:

http://stackoverflow.com/questions/1898101/qplot-and-anti-al...

it should now be possible to get AA on both Windows and Linux via the Cairo library.

I'll be playing a bit with R notebooks for my stats intro class; if I find the time, I'll try to see whether a) I get AA out of the box, or b) I can get AA easily by setting some parameters.

[ed: And if I do, I suppose a PR to https://github.com/minimaxir/ggplot-tutorial will be in order]


Funny you should mention a subject which is close to my heart :-). http://stackoverflow.com/questions/6023179/anti-aliasing-in-...


Cairo is good, but Quartz was better in my experience.


To get back to the point of the article, is 130MM rows past the point where you should start reaching for "big data" tools? I recently did an ad-hoc analysis of a 50MM row, 3GB dataset and saw similar performance to the OP with a local postgres database. Is it worth using specialized analysis tools for exploring these types of datasets? I ended up writing a script to slice the data, but with `wc -l` taking 2 minutes on my SSD, it wasn't a great interactive experience. What do other people use for this stuff?


Slightly different domain, but I recently had to work on a system that retained email addresses that were known targets of aggressive spamming (450MM rows).

In a nutshell, the idea was to compare lists of emails and test to see how many accounts in the given list were also in the database. We'd typically get 10k item lists to test against the database.

Initially, I was creating a temporary table and doing a unique intersection on it. That took about 45 minutes on a modest machine. Next, I hacked up a bloom filter search (since all we cared about was whether or not the items in the sample were in the database); that ran in a little under a second for a set of 10k, but got a little unwieldy with sets approaching 1MM+.

We talked about it, and decided to just use Elasticsearch, and to just split the database into a bunch of buckets. Search time went up a small bit for small samples, but went waaay down for larger samples.

Also of note, going from spinning platters to SSD did make a pretty huge improvement in things when we were using a relational database. Now that it's all basically in memory, it doesn't matter so much.
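
A hand-rolled sketch of the bloom-filter idea (Python 3, purely illustrative, not the implementation described above):

  import hashlib

  SIZE = 8 * 1024 * 1024 * 8      # bits in the filter (8 MiB of memory)
  bits = bytearray(SIZE // 8)

  def positions(item, k=4):
      # Derive k bit positions from a single SHA-256 digest.
      digest = hashlib.sha256(item.encode("utf-8")).digest()
      return [int.from_bytes(digest[i * 4:(i + 1) * 4], "big") % SIZE
              for i in range(k)]

  def add(item):
      for pos in positions(item):
          bits[pos // 8] |= 1 << (pos % 8)

  def might_contain(item):
      # False positives are possible; false negatives are not.
      return all(bits[pos // 8] & (1 << (pos % 8)) for pos in positions(item))

  add("target@example.com")
  print(might_contain("target@example.com"))   # True
  print(might_contain("someone@example.com"))  # almost certainly False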


With "just" 3GB of data, did you try just working on a copy located on a tmpfs/ramdisk?


Wow, on my desktop with 16GB of cheap, 4 year old ram and a Samsung SSD I bought this year, I'm getting 30x faster io on the ramdisk. Clever solution.


Even old DDR-200 (aka PC-1600, 1600 MB/s)[1] is faster than the theoretical maximum of SATA III (600 MB/s). Wikipedia says SATA 3.2 can reach 1,969 MB/s, but the more recent DDR3-1600 is 12,800 MB/s and some DDR4 reach 21,300 MB/s [2] ... SSDs aren't even close.

DDR4 also theoretically supports 512 GB sticks [3]. Can't wait.

1) http://www.crucial.com/usa/en/support-memory-speeds-compatab...

2) http://www.crucial.com/usa/en/memory-ballistix-elite

3) https://en.wikipedia.org/wiki/DDR4_SDRAM


Note that part of the beauty of BigQuery is how it allows you to go deeper into any dimension without pre-planning (AKA defining partitions or indexes).

Case in point, many comments in this page point at the "best time to post" chart. But what if that time varies by subreddit?

The answer at https://www.reddit.com/r/bigquery/comments/3neghj/qotd_best_...


Slightly OT but BigQuery has another fascinating and useful public dataset: the HTTP Archive. https://www.igvita.com/2013/06/20/http-archive-bigquery-web-...


Great link. Just ran Ilya's first query--median time to first render--against September Pages:

"median": "2389", "seventy_fifth": "3543", "ninetieth": "5191"

How did Internet page renders get slower since 2013?

2.2s median, 3.3s for 75th percentile, and 4.7s for 90th percentile in [httparchive:runs.2013_06_01_pages]


Does anyone know what "R tricks" were used to generate the graph depicting the best time to post?

I quite like that visualization and would like to use it for something else.

"2.5 seconds, 2.39 GB processed, and a few R tricks results in this:" http://minimaxir.com/img/reddit-bigquery/reddit-bigquery-2.p...


He put a link to his R code in the blog post: https://github.com/minimaxir/reddit-bigquery/blob/master/red...


As noted, essentially I generate a mapping of DOY/Hour and merge them.
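
Roughly this idea, sketched here in pandas rather than R (the real code is in the linked script):

  import itertools
  import pandas as pd

  # Full (day-of-year, hour) grid so empty slots survive the merge as zeros.
  grid = pd.DataFrame(list(itertools.product(range(1, 366), range(24))),
                      columns=["doy", "hour"])

  # counts: rows only for slots that actually had successful submissions.
  counts = pd.DataFrame({"doy": [1, 1, 2], "hour": [9, 17, 12], "n": [3, 5, 2]})

  merged = grid.merge(counts, on=["doy", "hour"], how="left").fillna({"n": 0})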


Amazing how the 1000 GB can suffice. Kudos!



