It would have been cool to see a discussion about algorithmic solutions, rather than solutions based on tools like MySQL.
There are three basic approaches: sort and remove duplicates (the original bash script); insert every item into a set (e.g. a hash table) that keeps only one copy of each, and count its size; or use probabilistic structures like Count-Min Sketch or HyperLogLog. But the problem with the latter is that they're approximate, which doesn't sound ideal when billing customers.
The problem with both of the first two approaches is that they require all items to be stored in memory at the same time. As long as that's true, either the sort or hashtable approach will work fine. But once you run out of RAM on a single machine, it's going to slow way way down as it swaps to and from disk constantly.
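While everything does fit in memory, both exact approaches are only a few lines; here's a rough Python sketch (the file name is just a stand-in):

    # Set-based: one pass, one hash lookup per record.
    with open("records.txt") as f:
        unique_via_set = len(set(f))

    # Sort-based: the moral equivalent of a sort | uniq | wc -l pipeline.
    with open("records.txt") as f:
        records = sorted(f)
    unique_via_sort = sum(1 for i, r in enumerate(records)
                          if i == 0 or r != records[i - 1])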
To me, the natural solution is to just split the dataset alphabetically into, say, 10 or 100 equal-size jobs, and run these either sequentially or in parallel on 10 or 100 machines. So for example if the unique IDs are random strings of digits, then everything starting with 00 is in the first job, everything starting with 01 is in the second, up to 99. For each job, apply either the sort or the set approach; shouldn't matter much.
(edit) For example, here's a sequential sketch in Python; the second step is embarrassingly parallel.

    # Split the records into 100 bucket files by their two-digit prefix.
    buckets = {"%02d" % i: open("records%02d" % i, "w") for i in range(100)}
    for record in records_list:
        buckets[record[:2]].write(record + "\n")
    for f in buckets.values():
        f.close()

    # Count uniques per bucket; a duplicate always lands in the same bucket.
    total = 0
    for prefix in buckets:
        with open("records" + prefix) as f:
            total += len(set(f))    # the set ignores records already present
(second edit) I'm a theorist, not a practitioner, so I'm ignoring many practical issues about where to store and back up the data, etc.
So, yes, the article is an interesting war story, and I don't want to diminish the pain of the writer, who inherited a system he didn't particularly want to touch. But yes, the solution is basically "sharding". Personally I'd start with per-customer sharding and hope that (a) it holds for long enough that I'm much bigger before it's a problem, and (b) if I'm already sharding by customer, then when the day comes that one customer is too big for a single instance of whatever I'm using, the architecture won't get too offended if we shard a single customer across multiple instances.
(If I were doing both of those things, I'd certainly architect the former so the latter was possible at a later date. It's not that difficult, really; you've already got the "split problem -> process subproblems -> combine results" architecture in place, you just have to make sure not to accidentally grind the nature of the subproblems too deeply into the system, which is just boring ol' software engineering. It doesn't even cost you much, since you're not going to win that much by assuming the nature of the subproblems; it really would just be sloppy programming.)
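To make that concrete, here's a toy in-memory sketch where the sharding rule is just a parameter; the comma-separated record layout is made up for illustration:

    from typing import Callable, Dict, Iterable, Set

    def count_uniques(records: Iterable[str],
                      shard_key: Callable[[str], str]) -> int:
        # Split -> process subproblems -> combine; duplicates always land in
        # the same shard, so summing per-shard uniques stays exact.
        shards: Dict[str, Set[str]] = {}
        for r in records:
            shards.setdefault(shard_key(r), set()).add(r)
        return sum(len(s) for s in shards.values())

    def by_customer(record):
        return record.split(",")[0]

    def by_customer_and_prefix(record):
        customer, record_id = record.split(",")[:2]
        return customer + "/" + record_id[:2]

Switching from by_customer to by_customer_and_prefix later doesn't touch count_uniques at all, which is the whole point.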
The main challenge here was the lack of resources and the importance of getting it right, rather than the difficulty of the problem itself, which I don't think is anything the author doesn't already know and acknowledge.
I think approximate solutions would be fine as long as they're an underestimate. As long as the estimate is close enough, the company shouldn't be bothered by the lost revenue. As long as it never overestimates, the customer can't complain.
HyperLogLog is too inaccurate to be practical here. But if these files are smaller than a few billion lines, a Bloom filter could get the error rate down to around 0.0001% while still fitting in main memory on a single node.
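For what it's worth, a Bloom-filter counter is short enough to sketch; this is a rough, deliberately naive Python version (the sizing numbers and file name are placeholders, not a recommendation), and because false positives only ever cause genuinely new records to be skipped, it can only undercount:

    import hashlib

    class BloomFilter:
        def __init__(self, num_bits, num_hashes):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8 + 1)

        def _positions(self, item):
            # Derive k bit positions from salted SHA-256 digests (slow but simple).
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.num_bits

        def add_if_new(self, item):
            # True only if some bit was still unset, i.e. the item is definitely new.
            new = False
            for pos in self._positions(item):
                byte, bit = divmod(pos, 8)
                if not self.bits[byte] & (1 << bit):
                    self.bits[byte] |= 1 << bit
                    new = True
            return new

    bf = BloomFilter(num_bits=8 * 10**9, num_hashes=20)   # ~1 GB of bits
    with open("records.txt") as f:
        unique = sum(1 for line in f if bf.add_if_new(line))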
I wonder if the author might have thought it was an odd name because "wc" is also used in some English-speaking countries as an abbreviation for "water closet", a synonym for a flush toilet.
grep comes from the ed command g/re/p (ed itself is short for "editor", the original Unix text editor), which Globally searches for a Regular Expression and Prints it.
sed is Stream EDitor.
vi is VIsual editor, which is so named because it's like ed but shows you what you're editing.
Not sure where the name awk comes from.
All of these tools are closely related to the venerable ed(1) command.
also - not pointing any fingers - I suspect for the same reason hexadecimal (which programmers shortened to hex) was used over the proper Latin sexadecimal.
Is there a fast way to detect duplicates when you first generate the records? If so, you could just keep a continuously updated counter for each client, incrementing it every time you add a record and decrementing on duplicates to avoid double counting.
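Assuming duplicates really can be detected cheaply at write time (the in-memory set below just stands in for whatever fast lookup the pipeline actually has), the bookkeeping could look something like:

    from collections import defaultdict

    seen = defaultdict(set)        # client -> record IDs seen this billing period
    billable = defaultdict(int)    # client -> running count of unique records

    def record_sent(client, record_id):
        # Hypothetical hook called wherever a record is generated.
        billable[client] += 1
        if record_id in seen[client]:
            billable[client] -= 1  # duplicate: undo the increment
        else:
            seen[client].add(record_id)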
The analysis and proposed solutions all seem to ignore the one important aspect of this data: time.
Why would you wait until the end of the month to count the recorded activities on the first of the month? And, what happens to failed activities that are repeated on the first of the next month? Are customers double-billed? If I'm interpreting the scenario correctly, a large network outage on April 30th would see a large number of duplicate charges as they are re-applied in May.
Counting during the send process makes much more sense. Duplicates can be handled in-line, and even detected over month boundaries. If it's not "worth it" in terms of engineering time, then I'd be concerned about the leadership; the author indicates 'this was the only way we made money'.
Billing was by month. This was a business decision.
It makes sense, too. Billing requires a lot of man-hours to pull off: invoices, auditing, Excel spreadsheets, etc. It's a people problem, not an engineering problem.
I'm not sure of the disconnect here, but de-duplication is not trivial; doing it every day doesn't make it any easier than doing it every month.
Doing it for all time is completely infeasible.
There was no database of every piece of social media data sent out the door, which is what you would need to make sure an entry isn't recorded again. All we had were flat files in S3.
Big flat files. It took hours to download and merge them all.
Once we had a database (Cassandra) it was updated continually (by a kafka consumer) and we could query it in a few minutes.
Fair enough... but somehow your system knew of failed attempts and the need to retry, correct? I'm sure I'm trivializing a rather complicated workflow, but if there's a way to detect failed attempts, there should be a way to remove them from the billing. (perhaps a separate S3 file that can be used to reduce the number of charged attempts, or split the ones you have into 'success' and 'fail' versions?)
Oh, and Cassandra is great... used it for years. Just beware of over-sized results. I've seen runaway queries crash the instance(s) running the query; ugly stuff.
I’m not sure I follow your argument here. You switched to a system where the data was downloaded continuously. You could have, with the shell script, easily:
* downloaded the files daily,
* sorted and removed the intraday duplication in each new file,
* sorted and removed the intraweek duplication each week and on the first of the month, and
* given your customers a weekly estimate of their bill.
All of this would spread the work out and tell you when the system had glitched out weeks before the results were due.
I have no problem with 'boring' solutions; actually, I prefer them. Processes that are simple, self-explanatory and/or easy to follow are usually superior IMHO.
This person writes quite well. Few articles draw me in and keep me there until the end. His use of language and storytelling really flows, and it also made me reminisce about when I used to be passionate about programming myself. Great writing.
Right. I never felt like it 'dragged on and on' like some do. It threw the technical details in while still keeping you reading. I'm curious now about his book on learning Go. If it reads the same way, it would probably be worth a read.
Counting unique occurrences isn't easy if you're given data the way it was described in the article and you have a requirement for an exact count.
But there are things you can do to help if you are willing to redesign it a bit.
You could partition the data up front, as it is being written. Then you could send the partitions to be processed in parallel (and you could still do this with the shell command!)
Rather than first building up a huge pile of logs and then counting them, move the count to the location that generates the logs or the location that stores them. Easily done: no need for fancy algorithms or special-case solutions, just a simple set and count. Added bonus: the "task" is, by definition, done when the month ends, with no difficulties or special considerations needed there.
Why not count daily and incrementally? There's no reason to wait for the last day and then have to meet a deadline; do most of the processing in advance and only a small amount of work each day to update the count.
Sorry, what I'm getting at is that you can't do this problem incrementally: you can't calculate the count on day 1 and add it to the count on day 2, because the count for each day is intertwined with all the others.
But yes, I suppose if at the end of each day you deduped that day's records against the rest of the month, you could then add them all together.
Unfortunately day 30 would still be just as bad as doing the whole month, since removing duplicates is as expensive as counting uniques.
>removing duplicates is as expensive as counting uniques.
I don't think so. You do all kinds of things (unzip, sort, uniq, plus apparently 1-2 passes through the whole data). If you did it daily, you only need to dedup day 30 vs a sorted version of the rest of the month, which takes around a 30th of the time, and you don't need to unzip/sort/uniq 29/30ths worth of data. I don't know your exact structure but I don't see why most of the computation can't be done earlier.
Think of it this way: on day 30 with the script described in the post, for the first $X amount of time you're reading the beginning of the files and don't touch any day 30 data until later in those 16 hours. So there's definitely some kind of processing that could be done prior to day 30.
Deduping a file of size X/30 against a sorted file of size X should take time on the order of (X/30) * log X, since the sort lets each lookup be a binary search and you only need to do X/30 of the lookups.
This should be significantly smaller, by a factor of 10 at least and likely a factor of 30, than sorting X records. You probably have the same big-O, but you're doing it on less data.
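A minimal in-memory sketch of that day-by-day step (the names are made up; month_sorted is the sorted, deduped month-to-date list):

    import bisect
    from heapq import merge

    def daily_update(day_records, month_sorted):
        # month_sorted must already be sorted and duplicate-free.
        new = []
        for r in sorted(set(day_records)):       # intraday dedup first
            i = bisect.bisect_left(month_sorted, r)
            if i == len(month_sorted) or month_sorted[i] != r:
                new.append(r)                    # not seen earlier this month
        # A linear merge keeps the month-to-date list sorted for tomorrow.
        return len(new), list(merge(month_sorted, new))

Summing the daily return values over the month gives the same total as one big end-of-month dedup, but each day only pays for binary searches over its own records plus a linear merge.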
We actually have a tutorial based on it: https://github.com/pachyderm/pachyderm/blob/master/doc/examp...
(disclosure: I work at Pachyderm. https://pachyderm.io/)