It would have been cool to see a discussion about algorithmic solutions, rather than solutions based on tools like MySQL.
There are three basic approaches: sort and remove duplicates (the original bash script); insert every item into a set (e.g. a hash table) that keeps only one copy of each, and count its size; or use probabilistic structures like Count-Min Sketch or HyperLogLog. But the problem with the latter is that they're approximate, which doesn't sound ideal when billing customers.
The problem with both of the first two approaches is that they require all items to be stored in memory at the same time. As long as that's true, either the sort or hashtable approach will work fine. But once you run out of RAM on a single machine, it's going to slow way way down as it swaps to and from disk constantly.
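While everything does fit in memory, both exact approaches are only a few lines; here's a rough Python sketch (the file name is just a stand-in):

    # Set-based: one pass, one hash lookup per record.
    with open("records.txt") as f:
        unique_via_set = len(set(f))

    # Sort-based: the moral equivalent of a sort | uniq | wc -l pipeline.
    with open("records.txt") as f:
        records = sorted(f)
    unique_via_sort = sum(1 for i, r in enumerate(records)
                          if i == 0 or r != records[i - 1])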
To me, the natural solution is to just split the dataset alphabetically into, say, 10 or 100 equal-size jobs, and run these either sequentially or in parallel on 10 or 100 machines. So for example if the unique IDs are random strings of digits, then everything starting with 00 is in the first job, everything starting with 01 is in the second, up to 99. For each job, apply either the sort or the set approach; shouldn't matter much.
(edit) For example, here's a sequential sketch in Python; the second step is embarrassingly parallel.

    # Split the records into 100 bucket files by their two-digit prefix.
    buckets = {"%02d" % i: open("records%02d" % i, "w") for i in range(100)}
    for record in records_list:
        buckets[record[:2]].write(record + "\n")
    for f in buckets.values():
        f.close()

    # Count uniques per bucket; a duplicate always lands in the same bucket.
    total = 0
    for prefix in buckets:
        with open("records" + prefix) as f:
            total += len(set(f))    # the set ignores records already present
(second edit) I'm a theorist, not a practitioner, so I'm ignoring many practical issues about where to store and back up the data, etc.
So, yes, the article is an interesting war story, and I don't want to diminish the pain of the writer, who inherited a system he didn't particularly want to touch. But yes, the solution is basically "sharding". Personally I'd start with per-customer sharding and hope that (a) it holds for long enough that I'm much bigger before it's a problem, and (b) if I'm already sharding by customer, then when the day comes that one customer is too big for a single instance of whatever I'm using, the architecture won't get too offended if we shard a single customer across multiple instances.
(If I were doing both of those things, I'd certainly architect the former so the latter was possible at a later date. It's not that difficult, really; you've already got the "split problem -> process subproblems -> combine results" architecture in place, you just have to make sure not to accidentally grind the nature of the subproblems too deeply into the system, which is just boring ol' software engineering. It doesn't even cost you much, since you're not going to win that much by assuming the nature of the subproblems; it really would just be sloppy programming.)
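To make that concrete, here's a toy in-memory sketch where the sharding rule is just a parameter; the comma-separated record layout is made up for illustration:

    from typing import Callable, Dict, Iterable, Set

    def count_uniques(records: Iterable[str],
                      shard_key: Callable[[str], str]) -> int:
        # Split -> process subproblems -> combine; duplicates always land in
        # the same shard, so summing per-shard uniques stays exact.
        shards: Dict[str, Set[str]] = {}
        for r in records:
            shards.setdefault(shard_key(r), set()).add(r)
        return sum(len(s) for s in shards.values())

    def by_customer(record):
        return record.split(",")[0]

    def by_customer_and_prefix(record):
        customer, record_id = record.split(",")[:2]
        return customer + "/" + record_id[:2]

Switching from by_customer to by_customer_and_prefix later doesn't touch count_uniques at all, which is the whole point.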
The main challenge here was the lack of resources and the importance of getting it right, rather than the difficulty of the problem itself, which I don't think is anything the author doesn't already know and acknowledge.
I think approximate solutions would be fine as long as they're an underestimate. As long as the estimate is close enough, the company shouldn't be bothered by the lost revenue. As long as it never overestimates, the customer can't complain.
HyperLogLog is too inaccurate to be practical here. But if these files are smaller than a few billion lines, a Bloom filter could get the error rate down to around 0.0001% while still fitting in main memory on a single node.
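For what it's worth, a Bloom-filter counter is short enough to sketch; this is a rough, deliberately naive Python version (the sizing numbers and file name are placeholders, not a recommendation), and because false positives only ever cause genuinely new records to be skipped, it can only undercount:

    import hashlib

    class BloomFilter:
        def __init__(self, num_bits, num_hashes):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8 + 1)

        def _positions(self, item):
            # Derive k bit positions from salted SHA-256 digests (slow but simple).
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.num_bits

        def add_if_new(self, item):
            # True only if some bit was still unset, i.e. the item is definitely new.
            new = False
            for pos in self._positions(item):
                byte, bit = divmod(pos, 8)
                if not self.bits[byte] & (1 << bit):
                    self.bits[byte] |= 1 << bit
                    new = True
            return new

    bf = BloomFilter(num_bits=8 * 10**9, num_hashes=20)   # ~1 GB of bits
    with open("records.txt") as f:
        unique = sum(1 for line in f if bf.add_if_new(line))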
I wonder if the author might have thought it was an odd name because "wc" is also used in some English-speaking countries as an abbreviation for "water closet", a synonym for a flush toilet.
grep comes from the ed command g/re/p (ed itself is short for "editor", the original Unix text editor), which Globally searches for a Regular Expression and Prints it.
sed is Stream EDitor.
vi is VIsual editor, which is so named because it's like ed but shows you what you're editing.
Not sure where the name awk comes from.
All of these tools are closely related to the venerable ed(1) command.
also - not pointing any fingers - I suspect for the same reason hexadecimal (which programmers shortened to hex) was used over the proper Latin sexadecimal.
Is there a fast way to detect duplicates when you first generate the records? If so, you could just keep a continuously updated counter for each client, incrementing it every time you add a record and decrementing on duplicates to avoid double counting.
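Assuming duplicates really can be detected cheaply at write time (the in-memory set below just stands in for whatever fast lookup the pipeline actually has), the bookkeeping could look something like:

    from collections import defaultdict

    seen = defaultdict(set)        # client -> record IDs seen this billing period
    billable = defaultdict(int)    # client -> running count of unique records

    def record_sent(client, record_id):
        # Hypothetical hook called wherever a record is generated.
        billable[client] += 1
        if record_id in seen[client]:
            billable[client] -= 1  # duplicate: undo the increment
        else:
            seen[client].add(record_id)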
The analysis and proposed solutions all seem to ignore the one important aspect of this data: time.
Why would you wait until the end of the month to count the recorded activities on the first of the month? And, what happens to failed activities that are repeated on the first of the next month? Are customers double-billed? If I'm interpreting the scenario correctly, a large network outage on April 30th would see a large number of duplicate charges as they are re-applied in May.
Counting during the send process makes much more sense. Duplicates can be handled in-line, and even detected over month boundaries. If it's not "worth it" in terms of engineering time, then I'd be concerned about the leadership; the author indicates 'this was the only way we made money'.
Billing was by month. This was a business decision.
It makes sense, too. Billing requires a lot of man-hours to pull off: invoices, auditing, Excel spreadsheets, etc. It's a people problem, not an engineering problem.
I'm not sure of the disconnect here, but de-duplication is not trivial; doing it every day doesn't make it any easier than doing it every month.
Doing it for all time is completely infeasible.
There was no database of every piece of social media data sent out the door, which is what you would need to make sure an entry isn't recorded again. All we had were flat files in S3.
Big flat files. It took hours to download and merge them all.
Once we had a database (Cassandra) it was updated continually (by a kafka consumer) and we could query it in a few minutes.
Fair enough... but somehow your system knew of failed attempts and the need to retry, correct? I'm sure I'm trivializing a rather complicated workflow, but if there's a way to detect failed attempts, there should be a way to remove them from the billing. (perhaps a separate S3 file that can be used to reduce the number of charged attempts, or split the ones you have into 'success' and 'fail' versions?)
Oh, and Cassandra is great... used it for years. Just beware of over-sized results. I've seen runaway queries crash the instance(s) running the query; ugly stuff.
I’m not sure I follow your argument here. You switched to a system where the data was downloaded continuously. You could have, with the shell script, easily:
* downloaded the files daily,
* sorted and removed the intraday duplication in each new file,
* sorted and removed the intraweek duplication each week and on the first of the month, and
* given your customers a weekly estimate of their bill.
All of this would spread the work out and tell you when the system had glitched out weeks before the results were due.
I have no problem with 'boring' solutions; actually, I prefer them. Processes that are simple, self-explanatory and/or easy to follow are usually superior IMHO.
This person writes quite well. Few articles draw me in and keep me there until the end. His use of language and storytelling really flows, and it also made me reminisce about when I used to be passionate about programming myself. Great writing.
Right. I never felt like it 'dragged on and on' like some do. It threw the technical details in while still keeping you reading. I'm curious now about his book on learning Go. If it reads the same way, it would probably be worth a read.
Counting unique occurrences isn't easy if you're given data the way it was described in the article and you have a requirement for an exact count.
But there are things you can do to help if you are willing to redesign it a bit.
You could partition the data up front, as it is being written. Then you could send the partitions to be processed in parallel (and you could still do this with the shell command!)
Rather than first building up a huge pile of logs and then counting them, move the count to the location that generates the logs or the location that stores them. Easily done: no need for fancy algorithms or special-case solutions, just a simple set and count. Added bonus: the "task" is, by definition, done when the month ends, with no difficulties or special considerations needed there.
Why not count daily and incrementally? There's no reason to wait for the last day and then have to meet a deadline; do most of the processing in advance and only a small amount of work each day to update the count.
Sorry, what I'm getting at is that you can't do this problem incrementally: you can't calculate the count on day 1 and add it to the count on day 2, because the count for each day is intertwined with all the others.
But yes, I suppose if at the end of each day you deduped that day's records against the rest of the month, you could then add them all together.
Unfortunately day 30 would still be just as bad as doing the whole month, since removing duplicates is as expensive as counting uniques.
>removing duplicates is as expensive as counting uniques.
I don't think so. You do all kinds of things (unzip, sort, uniq, plus apparently 1-2 passes through the whole data). If you did it daily, you only need to dedup day 30 vs a sorted version of the rest of the month, which takes around a 30th of the time, and you don't need to unzip/sort/uniq 29/30ths worth of data. I don't know your exact structure but I don't see why most of the computation can't be done earlier.
Think of it this way: on day 30 with the script described in the post, for the first $X amount of time you're reading the beginning of the files and don't touch any day 30 data until later in those 16 hours. So there's definitely some kind of processing that could be done prior to day 30.
Deduping a file of size X/30 against a sorted file of size X should take time on the order of (X/30) * log X, since the sort lets each lookup be a binary search and you only need to do X/30 of the lookups.
This should be significantly smaller, by a factor of 10 at least and likely a factor of 30, than sorting X records. You probably have the same big-O, but you're doing it on less data.
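A minimal in-memory sketch of that day-by-day step (the names are made up; month_sorted is the sorted, deduped month-to-date list):

    import bisect
    from heapq import merge

    def daily_update(day_records, month_sorted):
        # month_sorted must already be sorted and duplicate-free.
        new = []
        for r in sorted(set(day_records)):       # intraday dedup first
            i = bisect.bisect_left(month_sorted, r)
            if i == len(month_sorted) or month_sorted[i] != r:
                new.append(r)                    # not seen earlier this month
        # A linear merge keeps the month-to-date list sorted for tomorrow.
        return len(new), list(merge(month_sorted, new))

Summing the daily return values over the month gives the same total as one big end-of-month dedup, but each day only pays for binary searches over its own records plus a linear merge.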
We actually have a tutorial based on it: https://github.com/pachyderm/pachyderm/blob/master/doc/examp...
(disclosure: I work at Pachyderm. https://pachyderm.io/)