Very interesting, these kinds of articles are (to me) some of the best content we get on here.
I'm no expert in any of the languages used except maybe C; at least I'm quick to form opinions about C code. In C, I'm not so sure there is a well-established notion of "idiomatic" code, since C is not as tightly bound to a single community as more modern languages. At least that's my feeling.
That said, I was shocked to see that the example code did heap allocations to store integers. Single integers, one per allocation, per "counting bucket". I would store the count in the data pointer without a second thought; I just "can't" make 4/8-byte heap allocations and still sleep at night. Especially after all the concern about processing the input in large blocks for performance! Casting an integer to/from a void pointer should be pretty safe according to my gut feel. If that turns out to be false, I would dump the libc's hash API altogether, since it makes me do heap allocations to store single integers.
I tried making the change locally, and the time taken (as measured with the OP's benchmarking program) dropped from 0.11 to 0.10 on my system (a Dell laptop running an Intel i5-7300). That's at least something.
Yeah, I would have liked to avoid it too. But I think you have to allocate with the hcreate/hsearch API (for the "simple" C version). When I roll my own hash table in optimized.c, I don't allocate for the integers.
No, you can just jam the integer into the "data" field in the structure and skip that allocation.
As I said, and others have commented, it requires some complicated castery to avoid UB, but it's totally doable (and at least GCC does the right thing, while warning, with no advanced casts).
Afaik the only integer type that's safe to cast to and from pointers is intptr_t. Not exactly the type people reach for when writing the first "simple" version. Another comment also points out that hcreate and friends (is that POSIX, or GNU-only?) use a global hash map. I can't see a lot of use cases for such a singleton data structure.
If you need to round-trip a pointer through an integer and back to the same pointer, you need to use intptr_t or uintptr_t. (Note that this is true for object pointers but not necessarily function pointers. There’s no integer type guaranteed to be big enough for a function pointer.)
If you want to round-trip an integer through a pointer and back to the same integer, you can use any integer type provided it is not bigger than intptr_t or uintptr_t.
It just means that you need to cast via intptr_t, i.e. double casts `(void*)(intptr_t)i` and `(int)(intptr_t)p`. GLib has macros for it if you feel that's useful.
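For example, here's a minimal sketch (untested, glibc hcreate/hsearch) of what "count in the data pointer" looks like. The count is round-tripped through intptr_t so the casts stay well defined; the word still has to be strdup'd, because hsearch stores the key pointer rather than a copy:

    #include <search.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static void count_word(const char *word) {
        ENTRY item = { .key = (char *)word, .data = NULL };
        ENTRY *found = hsearch(item, FIND);
        if (found) {
            intptr_t n = (intptr_t)found->data;    /* unpack the current count */
            found->data = (void *)(n + 1);         /* pack the incremented count */
        } else {
            item.key = strdup(word);               /* the table keeps this pointer */
            item.data = (void *)(intptr_t)1;
            if (!hsearch(item, ENTER)) { perror("hsearch"); exit(1); }
        }
    }

    int main(void) {
        if (!hcreate(65536)) { perror("hcreate"); return 1; }
        count_word("foo");
        count_word("foo");
        ENTRY *e = hsearch((ENTRY){ .key = "foo" }, FIND);
        printf("foo: %ld\n", (long)(intptr_t)e->data);   /* prints "foo: 2" */
        return 0;
    }

That keeps one allocation per distinct word (the strdup'd key) and none at all for the counts.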
I've been programming in C for decades and I'd never heard of hcreate. I'm sitting here wondering how the heck that ever got standardized, allowing a single hash table per process. What a bizarre feature.
I wonder if it's like other insanities such as SIGPIPE: it's kind of useful if you write IO tools in a few dozen/hundred lines of C. You don't need to carry around explicit context, or check if writes failed, or anything. Of course it sucks for anything else, and we still have to deal with SIGPIPE.
Someone on Lobsters suggested this too, and I don't think it would be particularly edifying. But maybe I'm wrong. Someone should go out and do that benchmark. Maybe we'll learn something. Here is what I wrote though:
Coming up with a benchmark is hard. Benchmarks like these are models. All models are wrong. But. Some are useful.
Follow your thinking to its logical conclusion. If you open up the parallel programming flood gates, now you need to do that for all of the programs. Now your effort to contribute a sample for each language goes up. Maybe enough that actually doing the benchmark isn't practical.
So let's play the tape. If someone told me, "sure, have fun, throw threads at this!" I'd probably say, "can I use memory maps?" And let's say, "yes, sure, go for it!" I say okay, here's my first crack at it (rough sketch after the list):
1. Hope stdin is a file and open it as a memory map.
2. Split the memory map slice into N chunks where N is the number of logical CPU cores on the system.
3. Run approximately the same algorithm devised in existing code submissions in N different threads, with one thread per chunk.
4. At the end, join the results and print.
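Something like this rough C sketch of steps 1-4 (count_words() is just a stand-in for whatever per-chunk algorithm you'd actually run; the real task would keep a per-thread frequency table and merge them in step 4, and error handling is minimal; build with -pthread):

    #include <ctype.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>

    #define NTHREADS 4

    struct job { const char *buf; size_t len; long count; };

    /* Stand-in for the real per-chunk work (a frequency table in practice). */
    static long count_words(const char *p, size_t n) {
        long words = 0; int in_word = 0;
        for (size_t i = 0; i < n; i++) {
            if (isspace((unsigned char)p[i])) in_word = 0;
            else if (!in_word) { in_word = 1; words++; }
        }
        return words;
    }

    static void *worker(void *arg) {
        struct job *j = arg;
        j->count = count_words(j->buf, j->len);
        return NULL;
    }

    int main(void) {
        struct stat st;
        if (fstat(0, &st) != 0 || !S_ISREG(st.st_mode)) return 1;  /* step 1: hope stdin is a file */
        size_t size = (size_t)st.st_size;
        const char *data = mmap(NULL, size, PROT_READ, MAP_PRIVATE, 0, 0);
        if (data == MAP_FAILED) return 1;

        struct job jobs[NTHREADS];
        pthread_t tids[NTHREADS];
        size_t start = 0;
        for (int i = 0; i < NTHREADS; i++) {                       /* step 2: split into N chunks */
            size_t end = (i == NTHREADS - 1) ? size : size * (size_t)(i + 1) / NTHREADS;
            if (end < start) end = start;
            while (end < size && !isspace((unsigned char)data[end]))
                end++;                                             /* don't split mid-word */
            jobs[i] = (struct job){ data + start, end - start, 0 };
            pthread_create(&tids[i], NULL, worker, &jobs[i]);      /* step 3: one thread per chunk */
            start = end;
        }
        long total = 0;
        for (int i = 0; i < NTHREADS; i++) {                       /* step 4: join and combine */
            pthread_join(tids[i], NULL);
            total += jobs[i].count;
        }
        printf("%ld\n", total);
        return 0;
    }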
If that ends up being the best strategy overall, then your model has become over-complicated, because performance now likely depends on step (3), which is exactly what the existing benchmark measures. Now, maybe some languages have a harder time passing data over thread boundaries than others, and maybe that impacts things. But that doesn't apply to, say, C, Rust, C++ or Go.
If that's not the best strategy, then maybe there is a cleverer approach that involves synchronizing on one shared data structure. I doubt it, but maybe. If so, your benchmark turns into one that is very different. It's no longer just about data transformation, but now you have to worry about synchronization trickiness. And now you're way outside of a standard goroutine pool that you might have used in Go.
Building a good benchmark is an art. Sometimes the constraints look dumb and arbitrary. But you have to sit down and really look at it from a bunch of different angles:
* Do the constraints approximate some real world conditions?
* Do the constraints inhibit writing code that looks like real world code?
* Do the constraints inhibit optimizing the crap out of the code?
* Do the constraints make the benchmark difficult to create in the first place?
* Do the constraints permit easy and clear measurement?
There are probably more things that I didn't think about.
Amusingly, I've done a multi-threaded version of the word counting program in order to test a shm kv store. I needed a benchmark that created a lot of cross-thread concurrent accesses to keys, and I found a blog about this test. My version has serious constraints though: you have to create a shared memory map with enough space to hold all of the keys beforehand, as it doesn't resize the shm kv map as it runs.
The speedup of the multi-threaded version over the single-threaded version is about linear. The "single-threaded" version actually uses 2 threads, one to read stdin and one to hash the keys; the 16-thread version uses one thread to read and 16 to hash.
Yes, you've basically gone over the details of why doing parallelized / concurrent performance comparisons between different programming languages is very difficult. I agree with most of what you have written. However, this doesn't change the fact that any comparison of performance between different languages that ignores relevant concurrency and / or parallelization features isn't very useful. This is especially true when the selling point for a few languages in this list is concurrency- and / or parallelization-related features.
No, what I'm saying is that it is useful. Because in a concurrent program, you'll almost certainly be using the single-threaded techniques developed here. And the fastest program will be the one that uses the fastest single-threaded techniques. So it makes sense to simplify the model and just look at the thing that is going to dominate the runtime of the program.
If a "dumb" thread pool is indeed what ends up being the best concurrent solution, then Rust, Go, C and C++ (and probably most of the other languages) are going to have almost no problems implementing it. It's just Not That Interesting.
Now, maybe I'm wrong. Maybe there is something clever here to exploit in a concurrent program, and that some languages let you do that easier than others. I doubt it. It would be interesting if it turned out to be the case, but it would be a different benchmark. It wouldn't just be about how well a language can deal with simple data transformation tasks, but also about how it deals with tricky concurrency problems. But this doesn't seem like the kind of problem that needs such things. It complicates the model.
Like honestly, just saying that any benchmark that ignores parallelism "isn't very useful" is just totally ridiculous.
> No, what I'm saying is that it is useful. Because in a concurrent program, you'll almost certainly be using the single-threaded techniques developed here. And the fastest program will be the one that uses the fastest single-threaded techniques.
You have a point. Maybe I am wrong in saying that it's near useless.
> If a "dumb" thread pool is indeed what ends up being the best concurrent solution, then Rust, Go, C and C++ (and probably most of the other languages) are going to have almost no problems implementing it. It's just Not That Interesting.
Your argument makes sense if every language used the same form of concurrency or parallelism. Unfortunately, that isn't the case. Wouldn't the type of concurrency or parallelism affect performance?
> Now, maybe I'm wrong. Maybe there is something clever here to exploit in a concurrent program, and that some languages let you do that easier than others.
but concurrency / parallelism IS much easier in some languages vs others (did I misunderstand what you wrote?).
> Wouldn't the type of concurrency or parallelism affect performance?
That's what I'm saying: almost certainly not, IF the best concurrent solution to this problem is to chunk up the work and farm it out to a bog-standard thread pool. Pretty much every language---even C---has tools that make this pretty easy to do.
And I don't see any reason why a dumb thread pool wouldn't be the best answer to this if parallelism were allowed. Like, the fact that Go has goroutines is just not going to help with anything for this kind of work-load. Goroutines help in more complex concurrent situations.
> but concurrency / parallelism IS much easier in some languages vs others (did I misunderstand what you wrote?).
When the extent of the parallelism is a dumb thread pool with no synchronization between them, then it's pretty easy to do that in just about any language.
> That's what I'm saying: almost certainly not, IF the best concurrent solution to this problem is to chunk up the work and farm it out to a bog-standard thread pool.
That's the thing. This is not the case for all programming languages, even for some of the ones included in this performance test. Python has a GIL. Distributing work via threads isn't going to do much unless it's something I/O intensive. Consequently, testing performance via a programming language's available methods to distribute work does matter. Special features like goroutines also matter.
> When the extent of the parallelism is a dumb thread pool with no synchronization between them, then it's pretty easy to do that in just about any language.
The specific implementation of work distribution still matters for performance, doesn't it? Different implementations yield varying levels of performance, right?
Python has a `multiprocessing` module that sidesteps the GIL.
You can question me all day. Go build the benchmark. Publish it. Have people scrutinize it. I have my prior based on building this sort of tooling for years and actually publishing benchmarks.
And you still continue to miss my point. Instead, we're in a hypothetical pissing match about code that doesn't exist. Either the parallelism problem is uninteresting, or your benchmark becomes about complex synchronization. The former complicates the benchmark for no good reason. The latter results in a different benchmark entirely.
> Python has a `multiprocessing` module that sidesteps the GIL.
Yes and distributing work with multiple processes is not the same as distributing it via threads. There is a difference in performance in different implementations and variations of concurrency and parallelism.
> we're in a hypothetical pissing match about code that doesn't exist. Either the parallelism problem is uninteresting, or your benchmark becomes about complex synchronization.
Yes, the code doesn't exist because it's difficult to do, and that's what makes it interesting.
> Yes and distributing work with multiple processes is not the same as distributing it via threads. There is a difference in performance in different implementations and variations of concurrency and parallelism.
I literally addressed that in my first comment: "Now, maybe some languages have a harder time passing data over thread boundaries than others and maybe that impacts things. But that doesn't apply to, say, C, Rust, C++ or Go."
> Yes, the code doesn't exist because it's difficult to do, and that's what makes it interesting.
Well, I mean, I haven't done it not because it's hard, but because it appears pointless to me. That's my prior. Show me I'm wrong. I'm happy to learn something new.
> Well, I mean, I haven't done it not because it's hard, but because it appears pointless to me.
That's just your opinion and you haven't convinced me otherwise. Likewise, I don't think I'm going to convince you either.
> I literally addressed that in my first comment
Yes, but that doesn't mean that it's not an important point. Even if we were just comparing C, Rust, C++ or Go, different work-distribution implementations have different performance characteristics. It's hard, but it's still useful to capture that data, because the real world doesn't run on a single thread or even a single core. If you're going to continue glossing over this, there's no point in continuing our discussion. We're just going in circles.
I didn't say it wasn't useful to capture the data. Indeed, I've been saying, "happy to see it done and happy to learn something if I'm wrong." Look at what I said above:
> Now, maybe I'm wrong. Maybe there is something clever here to exploit in a concurrent program, and that some languages let you do that easier than others. I doubt it. It would be interesting if it turned out to be the case, but it would be a different benchmark.
So the fact that you keep trying to niggle at the specifics of parallelism tells me that you're completely and totally missing my point. You're missing the forest for the trees.
Looking back at the convo, it started by you saying, "it would be nice to see parallelism considered." Fine. I responded by saying why that particular rabbit hole is probably not so useful, or would be measuring something totally different than the OP. To which you kind of flipped the script and said, "However, single thread benchmarks don't reflect the real world unless you're a student." It would have been a lot better to just say, "I disagree, but I don't have time to do the benchmark myself, so we'll have to leave it at that."
There's a lot of goalpost shifting going on. Like, are we arguing whether a parallel benchmark could be useful? Or are we arguing about whether a single threaded benchmark is useless? They are two different arguments. The former gets my response above, navel gazing at why that model probably isn't particularly useful. But the latter is just crazytown.
This is definitely the most interesting part of the article for me:
Incidentally, this problem set the scene for a wizard duel between two computer scientists several decades ago. In 1986, Jon Bentley asked Donald Knuth to show off “literate programming” with a solution to this problem, and he came up with an exquisite, ten-page Knuthian masterpiece. Then Doug McIlroy (the inventor of Unix pipelines) replied with a one-liner Unix shell version using tr, sort, and uniq.
> Knuth has shown us here how to program intelligibly, but not wisely. I buy the discipline. I do not buy the result. He has fashioned a sort of industrial-strength Fabergé egg—intricate, wonderfully worked, refined beyond all ordinary desires, a museum piece from the start.
Well, yes – absolutely.
If you read the initial column[1] pre-McIlroy's response, you'll see that Knuth is not presenting a word-counting solution, but instead the concept of literate programming and how intertwining code and prose can be better for the programmer. Knuth isn't trying to count words, he's trying to teach you a different way to develop, and he's doing it by showing an example that has been simplified so the average person can chew on it.
I'm not sure if the popular interpretation of "ivory tower Knuth versus tactical genius McIlroy" is a modern take or if that was how it was received when the column was published, but it feels very unfair to present it as such today.
I have seen it observed that while the tr/sort/uniq approach has practical value, if the question is how you would implement the features that tr/sort/uniq provide, simply invoking them is not much of an answer. I don't think Knuth was unaware that there were existing tools that could solve the problem; the whole point was to demonstrate his literate programming approach at some semi-reasonable scale on a problem everyone could recognize. It's kinda like how pretty much every interview-sized question has some sort of off-the-shelf solution somewhere, because anything that fits into an interview has been solved thousands of times over... but that's not the point of interviews.
In modern terms, it would be something like complaining that $LANGUAGE's web server implementation is a grotesque bloated mess (and therefore $LANGUAGE isn't very good) because in shell all I have to do is run "nginx". There is a true and useful sense in which that is true, but there is also a true and useful sense in which that is utterly missing the point.
Also interesting is that this problem is used as the example in the book "Exercises in Programming Style" by Cristina Videira Lopes. It shows this problem solved in 33 different programming styles, for example historical, function composition, data-centric etc.
Last summer I was looking for a program that would tell me which shell commands were last used in a specific directory. Because I had been reading "The Unix Programming Environment", it quickly occurred to me that the program I was looking for could easily be made from several existing commands. Only after I wrote the small script did I find that the solution already existed, except it was >1000 lines of code, had 20 GitHub contributors, and utilized a SQLite database.
Knuth-McIlroy comes up a lot. Previous discussion [1]. For this example I can make a very similar Nim program [2] run at almost exactly the same speed as `wc -w`, yet the optimized C program runs 1.2x faster, not 1.15x slower, than `wc -w` - a 1.4x discrepancy - bigger than many ratios in the table. So, people should be very cautious about conclusions from any of this.
Thanks for doing a nim version! I've been learning nim, and am very much a beginner, so I thought I'd tackle a simple version [0]. I'd appreciate any critiques... I don't know how idiomatic it is.
Honestly, the optimized version you created is kind of opaque; I don't want people to think that that's what "typical" nim looks like.
I'll do a pull request on the project with mine as the simple version (I'd have added your version as an optimized version, but I couldn't get it to compile).
On my machine, the simple-c version runs in 0.70 seconds and my simple Nim runs in 0.73 seconds (and they produce the same output, which is nice :-). The Nim executable is 95k.
Sure. Feel free. I agree it's probably a bit advanced, and it breaks a rule or two of the "constraints" some people seem passionate about, and it would need to handle punctuation to match the expected results.
Feedback-wise, yours isn't so bad, though I haven't tested it. There is definitely a simpler one that looks roughly like the Python and probably runs a bit slower than yours. That would probably be a better/more fair candidate for the "simple" category, but it probably still has ok-ish performance.
This may be more fair for the "simple" case in Nim.
    import tables, strutils
    var counts = initCountTable[string](16384)
    for line in stdin.lines:
      for word in line.toLowerASCII.split:
        counts.inc word
    counts.sort             # SortOrder.Ascending to reverse
    for word, count in counts:
      echo word, " ", count
I compiled with `nim c -d:danger --gc:orc --panics:on` with gcc-10.2 on Linux-5.11 with nim-devel and the file in /dev/shm. Runs in about 1.97x the time of "wc -w" for me (.454s vs .2309s).
If we apply "BS-scaling" to the article table that would be 2.27*.2 = 0.454 sec on the author's machine which would make it twice as fast as the "simple C". Yet, if I actually run the simple C, I get 0.651 seconds, so only 651/454=1.43x "simple C" speed. This mismatch is, again, bigger than Go vs. RustB (0.38/0.28 = 1.35x). My only point is that you cannot just apply BS scaling and these results may very well fail to generalize across environments.
For what it's worth, literally every time I re-do something written in Rust in Nim, the Nim is faster. But benchmarks are like opinions... everyone has them, and you should very much form your own, not delegate to "reputation". The best benchmark is actual application code. I make no positive general claims, but say this only because Rust people so often do. { EDIT: I think it is generally a big mistake to make many assumptions about prog.lang performance, especially if the assuming is on the "must be fast" side. }
Thanks! That's a much cleaner implementation, and I learned some cool nim. I got a note on GitHub with a similar implementation, so I went with theirs.
I think most of the ugliness in the C++ template case comes from the tool not resolving the standard types. Just replacing the std::string boilerplate reduces it by roughly half.
So that looks like a std::vector<pair<string,int> > constructor taking two map iterators as input and an internal string member taking char pointer based iterators as input.
I was surprised at the C++ outcome, so I started testing myself. I was amazed to find a 4~5x speedup when reading directly from a file (std::ifstream) instead of reading stdin. This was on a 10.5.7 MBP 2014.
[edit]
more like 9~10x actually...
    time { ./build/Release/countwords_ifstream; } | tail
    ...
    real    0m1.337s
    user    0m1.332s
    sys     0m0.022s

    time { ./build/Release/countwords_original < kjvbible_x10.txt; } | tail
    ...
    real    0m12.184s
    user    0m12.138s
    sys     0m0.044s
I dug into this once because reading from stdin on my Mac was so slow, but only when using std::cin. Using C APIs to read stdin was fast. Turns out libc++ on mac uses `getc` to read from stdin one character at a time. Madness.
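For anyone who wants to see the effect, it shows up even in a trivial byte counter; a rough sketch (not the libc++ code, just an illustration of per-character getc vs block reads; run with any argument to get the slow path):

    #include <stdio.h>

    int main(int argc, char **argv) {
        long long n = 0;
        if (argc > 1) {                      /* slow path: one libc call per byte */
            int c;
            while ((c = getc(stdin)) != EOF)
                n++;
        } else {                             /* fast path: one fread per 64 KB block */
            static char buf[64 * 1024];
            size_t got;
            while ((got = fread(buf, 1, sizeof buf, stdin)) > 0)
                n += (long long)got;
        }
        printf("%lld bytes\n", n);
        return 0;
    }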
Huh, well that makes a lot of sense. Three years ago when I was in University practicing for competitive coding we came across a limit in the performance (ms) we could get on even simple problems. The professor leading the practice poked away at his own solution to some problem for days because he couldn't get a better time than the top score online. Eventually he implemented his own versions of cin and cout. Needless to say he was pretty frustrated but there's just some problems that are hard to let go of without a solution.
Could be a lock on stdin that is acquired in a loop? Rust has such a lock and typically one should lock stdin at the beginning and read from within the critical section.
I've got to say I'm rather happy with Go's performance and verbosity of the unoptimized version in this micro-benchmark. For me it's just the right trade-off. Go was the right choice for my current project. I don't need it faster but wouldn't want my garbage-collected language to be substantially slower.
I suppose SBCL would be in a similar ballpark for this task, though, and would love to see an unoptimized SBCL version.
The nice thing about the Go GC is that it runs completely in parallel to your program, so there are no significant GC pauses any more (typically far below the millisecond range, rather microseconds).
Broadly, yes, Go's GC is quite fast and unobtrusive. They've done a good job tuning it for many purposes. On the post's programs though it's quite possible it doesn't come into play at all, as the compiler keeps things on the stack when possible.
But like all GCs, there are cases where it has issues. Anecdotally, I've seen multiple cases where "ballast"[1] reduced RPC response times by around 25%. A bit of GC tuning on some tests sped them up by over 10x (due to pathological thrashing - each test would go from just-below the GC threshold to just-over). Etc. Granted, it's usually on not-great code, but the point still stands - if you care about performance, GC cannot be ignored, regardless of how good its marketing is.
Ada with the Ravenscar profile enabled is a good choice for such scenarios. Together with a real-time operating system it allows you to satisfy hard real-time guarantees. As you certainly know, though, hard real-time constraints require hardware validation.
Yea, for a lot of requirements hard realtime isn't necessary, and at least some older Linux systems can be tuned to remove almost all jitter, giving very good 99.9th-percentile figures. I've even seen it done with zero-alloc Java.
And Perl's hash implementation can perform better than Awk's associative arrays. The `SCOWL-wl.txt` file used below was created using http://app.aspell.net/create. `words.txt` is from `/usr/share/dict/words`. Mawk is usually faster, but GNU Awk does better in this particular case.
    $ wc -l words.txt SCOWL-wl.txt
       99171 words.txt
      662349 SCOWL-wl.txt
      761520 total

    # finding common lines between two files
    # shorter file passed as the first argument here
    $ time gawk 'NR==FNR{a[$0]; next} $0 in a' words.txt SCOWL-wl.txt > t1
    real    0m0.376s

    $ time perl -ne 'if(!$#ARGV){$h{$_}=1; next}
                     print if exists $h{$_}' words.txt SCOWL-wl.txt > t2
    real    0m0.284s
Which runs in .39 seconds. mhd's solution runs in .42 seconds. For reference, the optimized C version takes .059 seconds (possibly because of clang?), wc -w takes .023s, and the optimized Python version takes .288s.
I know this is meant as a joke, but I just use this to once again note that Perl actually had a very common linter/pretty printer way before it became all the rage (perltidy/perlcritic). I also wish other languages had a book like Perl Best Practices, that goes beyond just syntactical issues.
To be fair, I once came across a sentiment that was pretty close to "readability is bad". Taken from both the (Bourne) shell and awk, Perl has lots of short variables that control input parsing and other behavior, like "$/". You can then `use English` to get alternative names for those, but they might not be as well known ("$/" would be "$INPUT_RECORD_SEPARATOR", or "$RS" as a tribute to awk). So the line noise might be more common and thus more understandable than the "readable" version.
I don't quite buy that for e.g. "$INPUT_RECORD_SEPARATOR", but there's an argument to be made for "$RS" or "$ARG" being worse than the Asterix swear words.
> I know this is meant as a joke, but I just use this to once again note that Perl actually had a very common linter/pretty printer way before it became all the rage (perltidy/perlcritic).
And all great perl hackers deliver readable code. But we still enjoy a round of perl golf once in a while. And honor demands that, if you respond in perl golf style to an interview question, you make it count ;)
Unfortunate, because for all the badness of Perl, it shines in exactly this case of "read a potentially-infinite stream of text from stdin, process it, and dump another stream to stdout".
Chaotic standard library with cute-but-difficult-to-search-for names like "chomp", messy and complicated rules around sigils and scalar() for arrays and tables, lots of magic convenience like "$_" being optional in loops, s/// being "destructive" by default, and the need for some ugly hacks around handling subroutine arguments.
Perl has plenty of "good parts", too. But it is not an easy language to learn and even good idiomatic Perl code can be difficult to read. I don't blame people for favoring Python and Ruby.
I often see examples comparing one-liner worthy code written in one language with beautiful, linted, critique-passing indented code written in other languages, and on and on goes the Perl bashing based on examples of the former.
My version? That's a one-liner; one shouldn't maintain those. It was mainly intended to show that a lot of the logic for simple cases is so common that perl (and some other languages) have those as command line switches.
That also makes it hard to win for total outsiders, though. This touches on core functionality, which often has quite a legacy behind it and thus its own vernacular. I can write `while (my $line = <STDIN>)` instead of `while(<>)`, but I can't get as easily rid of "chomp" being confusing to most people. A bit like car/cdr in Lisp.
But that happens even if you have plain-English functions and no special-cased syntax. Looking at the simple Python version, it took some effort for me to remember what `Counter.update(lst)` does, or that `most_common` without arguments returns all elements.
Honestly, when you don’t know perl it looks really unreadable, but if you spend some time learning it, it’s not that hard; it’s just that many of the operators are not googleable. The harder part, from experience, is maintaining perl scripts. A lot of perl scripts from the early 2000s are written in extremely hacky ways without modern code style sensibilities. In 2021 I don’t see any reason to use perl over python3 anymore: python3 is fast, widely used, easy to write, and easy to learn. Even the bioinformatics folks, some of the last perl holdouts, have been switching to python.
Since learning perl a couple years ago I personally only use it for one liners to replace sed/awk. If I do use perl in a script it’s usually in one liner form in a bash script to post process output from something like ripgrep.
> In 2021 I don’t see any reason to use perl over python3 anymore
Regex handling is still far nicer in perl than anywhere else, so for any script which is primarily about string parsing with regular expressions, perl is the right tool for the job.
You would get a NameError (or maybe an UnboundLocalError in some cases) if you use a nonexistent variable name in Python. At least, this has been true since 2013 when I started learning Python.
I haven't looked into it yet (whether the benchmarks are comparable) but it would be interesting to check the programs there against the ones in this post, and either update the benchmarks here or add new submissions to that question.
For a while the fastest program there was one in Rust, then out of curiosity I translated Knuth's program from 1986 into C++ and a simplified version based on it was even faster; currently a Rust translation of the idea (a custom trie using not too much memory) is the fastest one there. (In fact, the current fastest two Rust submissions there came after I linked to the question from HN in Feb 2020, in a couple of comments at https://news.ycombinator.com/item?id=22221592 and in the discussion on a "Donald Knuth was framed" article: https://news.ycombinator.com/item?id=22406070.)
Edit: I tried to do the comparison myself but one issue I ran into (apart from minor things like taking input from filename vs stdin, and printing top N words versus all, which are easily handled) is that the benchmark on the StackExchange question considers "words" to be made of alphabetic characters (a–z), and the trie solutions make use of that fact (a trie node has 26 children etc), while the benchmark from this post includes punctuation (e.g. it has "words" like "him," and "him." i.e. "him" followed by a comma or period are counted as separate words). I gave up at this point, but maybe someone with more energy could modify the benchmark here and try comparing them.
Here, they have code samples and multiple implementations of each task in each language.
It's a great resource for seeing, at a glance, how performant a language is at basic CPU- and memory-intensive tasks and how simple the related code is.
It still has the issue that some code is unrealistically optimized for the language. And also that many solutions use standard libraries written in other languages. It's worth looking at the source, therefore.
I am very sceptical of these benchmarks for the reasons you listed. The individual implementations are often too different to give a comparable score. And sometimes this is because of the rules. Like the requirement to use the language-provided hash tables, while C is free to implement an optimized hash table for the problem.
Just looking at the first lines of the linked code, I see the following:
    // Define a custom hash function to use instead of khash's default hash
    // function. This custom hash function uses a simpler bit shift and XOR which
    // results in several percent faster performance compared to when khash's
    // default hash function is used.
That is stacking the deck. Many languages don't allow you to specify custom hash functions, so you can't tune the hash function without writing a custom hash table. A fair benchmark would allow that in other languages too, if it is allowed in C.
The OP "Performance comparison: counting words in Python, Go, C++, C, Awk, Forth, Rust" does not allow external libraries BUT then makes an exception for C, and then makes a further exception: "In the optimized version we’ll roll our own hash table."
Wikipedia is your friend -- "A hash table uses a hash function to compute an index, also called a hash code …"
I was not talking about the article but the benchmark site linked. Not sure what you want to say with the wikipedia citation.
I have not made a comprehensive list, but the topic has come up when discussing one of the Go benchmarks. The Go language has a built-in hash table, called a map, and it does not allow specifying custom hash functions. Of course, rolling your own hash table implementation which does allow this is prohibited by the benchmark rules. Probably intentionally, as your attempts at distraction also hint.
Please share the widely agreed criteria to identify code we would all call "unrealistically optimized" ;-)
2 tasks — pidigits & regex-redux — explicitly allow use of standard libraries (GMP, PCRE2, RE2) written in other languages.
8 tasks do not.
(Those standard libraries were allowed because some language implementations provided arbitrary precision arithmetic or regex by wrapping those standard libraries.)
Of course, realistic use of some languages relies on calling standard libraries written in other languages.
Regarding speed of the C++ version: The fundamental mistake is to use std::unordered_map. The standard library map and set implementations (both ordered -> tree-based and unordered -> hash-based) are almost never the correct choice if you care about performance.
That's definitely a flaw in the language, but like with C you would just use a different implementation for this purpose.
Although std::unordered_map is definitely not nearly as fast as a hash map can be in C++, that's probably not the main problem here. I would bet on iostreams and/or tolower being bigger problems.
Note that tolower is called for every /character/. I don't know everything it does, but I know it's doing some semi-esoteric locale stuff, and in general it's doing a lot more than a naive ascii implementation would. I seem to recall it's even doing a dynamic_cast in there somewhere. For each character of input.
I was curious, so I ran callgrind on it. Here are the top 10 offenders (by Ir, 'instructions retired'). For clarity I have elided many details. Total Ir is 5,647,099,795, so the stuff below accounts for greater than 50% of all Ir.
To expand on this a bit, the C++ unordered map (hash table) uses basically a linked list to track the entries in each bucket (separate chaining for collisions) because the standard requires that operations that may resize the map don't invalidate iterators.
You pay for this with extra memory allocations and non-contiguous memory access.
In most low level benchmarks, other collision mechanisms (probing, etc) perform much better.
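For contrast, a toy open-addressing table in C (linear probing, fixed capacity, no resizing or deletion, made-up names) shows the flat layout: the probe sequence walks adjacent slots in one array instead of chasing list nodes.

    #include <stdio.h>
    #include <string.h>

    #define CAP 1024  /* must be a power of two; toy code assumes it never fills */

    struct slot { const char *key; int count; };
    static struct slot table[CAP];

    static unsigned long hash_str(const char *s) {
        unsigned long h = 5381;
        while (*s) h = h * 33 + (unsigned char)*s++;
        return h;
    }

    /* Find the slot for key, claiming an empty slot if it's absent.
     * Keys are stored by pointer, so they must outlive the table. */
    static struct slot *probe(const char *key) {
        unsigned long i = hash_str(key) & (CAP - 1);
        while (table[i].key && strcmp(table[i].key, key) != 0)
            i = (i + 1) & (CAP - 1);      /* linear probing: try the next slot */
        if (!table[i].key) table[i].key = key;
        return &table[i];
    }

    int main(void) {
        probe("the")->count++;
        probe("the")->count++;
        probe("word")->count++;
        printf("the=%d word=%d\n", probe("the")->count, probe("word")->count);
        return 0;
    }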
Would be interested to see a similar comparison for a problem where the Python solution is unable to offload a significant amount of the computation onto its standard-library.
Also, nice to see Forth included. I strongly suspect that if the Python solution were unable to leverage its rich and well optimised standard-library, it would be easily outpaced by Forth.
Also, the Forth solution used Gforth, which is far from the fastest Forth engine around. [0] It would likely have performed close to C if they had used a native-code-compiling Forth engine (SwiftForth, VFX Forth, or iForth, all of which unfortunately are proprietary payware).
> if the Python solution were unable to leverage its rich and well optimised standard-library
collections.Counter is a very straightforward collection backed by dict. The code isn't specially optimized; you can write the same code yourself with dict and even avoid a (probably negligible) amount of overhead.
Now, dict is optimized, but that's a builtin type. You don't need PSL.
The article provides profiling data of the optimised Python solution. If I'm reading it right, 90% of the time (1148 milliseconds out of a total of 1280) is spent executing the _collections._count_elements and str.split methods, both of which are Python standard-library methods presumably implemented in optimized C.
Of course, they're right to optimize this way, and it's to Python's credit that it's possible to leverage this approach so successfully using only the standard-library, but I think my earlier point stands.
Again, very straightforward, based on dict. Python itself does not function at all without the dict implementation since it's the underpinning of the object model.
str is a builtin type. Calling str.split() a standard library method is... okay.
That link shows that yes, "_count_elements" is implemented in Python, but right after it pulls in the C version if available (which it presumably is in CPython):
    try:  # Load C helper function if available
        from _collections import _count_elements
    except ImportError:
        pass
Thanks for the link, interesting that _count_elements is written in Python. At a glance it looks like the kind of thing that might benefit from being written in C, but presumably that's not the case. Perhaps because of the dynamic typing of both arguments?
> Python itself does not function at all without the dict implementation since it's the underpinning of the object model.
Sure, but I don't see the point here.
> Calling str.split() a standard library method is... okay.
For our purposes there's no reason to draw a distinction between standard-library functionality closely integrated into the language, and standard-library functionality that isn't. The relevant point is that computational work is being handed off to optimised C code.
Again, if you're optimising Python, the smart move is indeed to make maximal use of the optimised machinery in the standard-library (or for that matter some other dependable library). My point still stands: it would be interesting to look at a problem that isn't amenable to this approach, where the performance of the Python interpreter itself would be brought to the fore by necessity.
How easy it is to find such a problem will be a function of how good a job Python does of providing applicable optimised functionality in its standard-library. Perhaps something like computing matrix determinants? I don't know enough Python to say.
It can be interesting – especially if you also compare it to something like the PyPy JIT – but it runs afoul of the idiomatic angle since Python has had decades of recommended practice being to write code in Python and optimize the hotspots in C, which has led to very popular libraries like NumPy, PIL/Pillow, etc. becoming almost universal in community recommendations.
Right. They don't emphasise improving the performance of the Python interpreter precisely because they prefer to handle computational heavy-lifting in C instead. Python has been very successful with this approach. Java takes the opposite approach, promoting 'purity' and investing very heavily in JVM performance.
It would still be interesting to see for benchmark purposes.
Right, that's what I meant. If the C code has to cope with dynamic typing anyway, there might be little scope to speed things up.
The same isn't true for, say, string operations, where there might be plenty of opportunity to move tight loops from Python to C and speed things up considerably.
Truthfully, I don't think optimization was a factor in whether _count_elements() was written in Python or not. The CPython project doesn't prioritize optimization of the language or standard library unless there are clear performance gains to be made in the face of slow code. What most likely happened is someone wrote some code on a mailing list or bug report, and a maintainer included it as is.
Reading the bug report that introduced Counter[1], the author states it is the simplest implementation they came up with, and Guido and other maintainers prioritized simplicity.
(Too late to edit) benhoyt's comment points out that _count_elements has two implementations, one in Python and one in C. The Python version is used only if the C function is not found.
> I still like it, because (unlike C++) I can understand it
No article like that is complete without a little C++ critique :-)
> I think it’s the simple, idiomatic versions that are the most telling. This is the code programmers are likely to write in real life.
I very much agree with him on this statement. If people started writing "optimized" versions of algorithms (on that level), who would ever be able to read other people's code...
I saw a talk by Matt Godbolt demoing Compiler Explorer, showing the assembly for idiomatic C++ vs someone trying to outsmart the compiler. In all cases the idiomatic version was superior in performance and readability / maintainability.
The one thing I would say is: as you dig more into C++ perf, you can develop an intuition for the kinds of things the compiler can and can't optimize. For instance, the C++ example from the article uses `std::unordered_map`, which is just an absolute mess from a cache locality perspective—the best compilers today (or of the foreseeable future) can't do a thing to fix that. Improving the programmer's choice of data structure is just not on the radar. :(
std::iostream is going to be hopelessly slow, but I would like to point out that -O3 is more appropriate for templated C++ code.
It seems that the two hot spots are the std::string constructor (unsurprisingly) and the vector constructor.
Making string faster is not easy. I would check that gcc is configured to use C++11 strings with the small string optimization and not the older C++03 compatible refcounted string. Writing your own string or a custom allocator are options, but probably overkill for the problem.
The vector constructor is more surprising. It is possible that -O3 could help a bit there. Wrapping the map iterators with std::move_iterator would also avoid a lot of string copies.
std::move_iterator was my first thought too, but I'm not sure if it works, since the key type for std::unordered_map is defined to be const, so you may have to make a copy anyway.
Maybe use a vector-of-pointer-to-pair, since we're leaving the map around and we don't really need the vector to own anything.
I basically solved the same problem about a year ago. I was reading the Gormenghast trilogy, and for a considerable part of the book I stumbled upon rare English words whose meaning was unknown or hazy to me. I decided I needed a tool to prepare Anki decks with those words, but as I'm too lazy to manually log them, I thought I'd just use the same algorithm, take the most infrequently occurring words, and generate a CSV with definitions from GCIDE[0] to load into Anki.
I used Rust and thought it was the textbook case for a trie, but what I found is that crates.io had quite a few tries, not all of them working or ergonomic enough. Only one of the libraries I tried (huh) gave me a slight advantage over a HashMap, and a statistically insignificant one at that, which is rather disappointing, considering the program outputted sorted frequency groups.
I'm pretty pleased with the results. I considered adding stemming[1], silly me. Too hard of a problem to bother.
Setting LC_* to 'C' nukes locale awareness by saying "we only care about ASCII en-US", and thus goes back to the 'cheap' tolower that's also present in the C and C++ versions, and most likely in the awk as well - in contrast to, say, the Python version.
Finding the "lower case of a character" is immensely harder in the face of Unicode, because the table is way larger and there are no nice speed hacks like indexing into the ASCII table. JFGI: "tolower performance unicode".
If you notice that setting LC_ALL makes a difference in performance for you, you ought to be aware that you're no longer comparing the same capabilities.
I'm not sure how to constructively state that except for: one shouldn't compare ASCII and Unicode "tolower" in performance comparisons, as you end up comparing apples and oranges (at best; more like apples and snails) - which I did.
If the author uses code that uses "tolower" in job interviews, i.e., evaluates candidates based on their input WRT case normalization - and he even writes ("This is Unicode-aware ...") - he should know that Unicode awareness is not ubiquitous, and comes at quite a cost. Knowing about Unicode awareness, one would assume he'd be aware that he's not making a fair comparison.
Can you state the specific comparison that is not apples-to-apples in the OP?
The only one I see is the comparison between 'simple' and 'optimized'. But that's more about "what does a simple idiomatic solution look like" and what does an "optimized and possibly less simple" solution look like. That comparison isn't designed to be apples-to-apples in the way you're saying. The simple variant will use Unicode-aware casing in environments where that's the simple and natural thing to do.
IIRC, most of the 'optimized' programs are using ASCII casing. Python doesn't, but casing isn't even close to the bottleneck in that program.
I believe the author is fully aware of this, and just chose the ASCII version for all languages precisely so that we don't get into Unicode efficiency issues. Doing that wouldn't be objectively bad or anything, it just isn't the question the author is asking. It is still a fine question on its own terms.
Yes, I'm definitely aware of this, thanks -- and as a result, you can't really compare the "simple" and "optimized" versions. Most of the "simple" versions (except C) are Unicode-aware, most of the optimized versions not -- though I acknowledge I haven't stuck to that always, e.g., in the C case, when it's just "too hard".
Regarding the C++ version: instead of copying data to a vector and then sorting the vector, it might be more profitable to use std::multimap<size_t, string>. This is a sorted container (likely rb-tree based) that handles duplicated keys.
I don't think Go gets the credit it deserves (from a performance aspect). I've had to rewrite a good deal of C++ code in Go and most of the time, the Go code runs faster.
Been true of Java for decades. Most application programmers aren't churning out highly tuned C++, and their design is easily outperformed with a bump allocator and generational GC.
Optimized grep is as fast as the non-optimized grep. Seems odd.
I've seen ripgrep (in Rust, also by article contributor Andrew Gallant aka burntsushi) outperform grep in almost every way possible, both in benchmarks and in my experience.
I think grep (and wc) are included in that table as a sort of baseline. Like, "roughly speaking, what can we expect?" Or maybe, "what can we hope for?" grep (or ripgrep) isn't otherwise particularly relevant here.
In the case of grep and wc, the "optimized" variant is running with LC_ALL=C, which sets the locale to C. Presumably, the non-optimized variant is using the OP's default locale, which is maybe something like en_US.UTF-8. (Apologies for the US assumption, but what matters is that it's not the C locale.)
This is a common trick for speeding up GNU utilities when you're only working with ASCII data. It also works for things like `sort`.
In the case of grep, sometimes running with and without the locale will be about as fast, particularly if you're only searching for a simple ASCII literal pattern.
The optimized Python code seems to have a number of bugs:
The loop-ending condition might leave `remaining` unhandled (when there is no newline at the end of the file). (The fix should be simple: just move the check one line down.)
When there is no newline in some chunk, it would handle words incorrectly at the chunk boundaries. (But OK, with the constraint that lines cannot be longer than the chunk size, this should not happen. It could have been fixed easily anyway: just `remaining = chunk; continue` plus the other fix.)
Thanks for the feedback. I was trying to make that clear with "grep reference", but you're right, it's still ambiguous. I'll add a note to clear it up. Both "wc" and "grep" don't solve the problem, they're just comparisons to see how fast other programs can read the input (and in wc's case, tokenize into words). For grep, I'm using "grep foobar".
(The C code.) Using hcreate, and the fact that POSIX describes this interface which uses a hash table singleton, is just mind-boggling. :) So ugly, and you'd hope anything as non-reusable as that would be deprecated with strong wording.
Even the "optimized" C version is far from what an experienced C programmer would write if performance was paramount. General-purpose memory allocation, using hash tables with inherently bad spatial and temporal locality, using buffered I/O instead of mapping the file to memory.
You can't map an arbitrary stream into memory. That's one of the constraints of the challenge: it has to be able to work on a stream.
And even then, you can't memory map all kinds of files. This is what happens when you assume that you can:
    $ ag MHz /proc/cpuinfo
    $ grep MHz /proc/cpuinfo
    cpu MHz         : 2300.000
    cpu MHz         : 988.934
    cpu MHz         : 2300.000
    cpu MHz         : 800.044
    cpu MHz         : 2300.000
    cpu MHz         : 2300.000
    cpu MHz         : 2300.000
    cpu MHz         : 1100.949
In the optimized C program, allocation is not a bottleneck. Pretty much all allocation is done upfront and that's all you need. There is an allocation for writing a new word to the table, but that's also amortized and isn't a big factor in the performance of the program in this particular benchmark.
As for "bad spatial and temporal locality," can you be more specific? I guess the only thing I can see is perhaps inlining words smaller than some size into the same allocation as the hash table. But the hash table is otherwise one contiguous allocation.
> As for "bad spatial and temporal locality," can you be more specific?
One could use a lexicographic Btree-like structure with multiple characters per node, sharing consecutive cache-lines. For looking up the word "bar", you would traverse the "b" key from the root node, then its child "a" node, then the "r" child of the "a" node. Most of the nodes near the top would remain hot in the cache, and deeper less-visited nodes could be joined or compacted based on some criteria. Of course there are many subtleties, special cases and complications and the solution would be far more complex in terms of source code length.
With a hash table, you effectively compute a hash for every word, and then jump at arbitrary locations throughout the whole table.
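A plain per-character version of that traversal looks roughly like this in C (sketch only; the multi-character, cache-line-aware node layout described above would be considerably more involved):

    #include <stdio.h>
    #include <stdlib.h>

    struct trie_node {
        struct trie_node *child[26];   /* 'a'..'z' only, as in the StackExchange benchmark */
        long count;                    /* number of words ending at this node */
    };

    /* Walk root -> 'b' -> 'a' -> 'r' for "bar", creating nodes as needed,
     * and bump the count at the final node: one child-pointer dereference
     * per input byte instead of one hash + one arbitrary table access per word. */
    static void trie_add(struct trie_node *root, const char *word) {
        struct trie_node *n = root;
        for (const char *p = word; *p; p++) {
            int i = *p - 'a';
            if (!n->child[i])
                n->child[i] = calloc(1, sizeof *n->child[i]);
            n = n->child[i];
        }
        n->count++;
    }

    int main(void) {
        struct trie_node root = {0};
        trie_add(&root, "bar");
        trie_add(&root, "bar");
        printf("%ld\n", root.child['b'-'a']->child['a'-'a']->child['r'-'a']->count);  /* 2 */
        return 0;
    }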
You've traded hashing for a memory access per byte. I did something similar in the 'optimized-trie' program for Rust: https://github.com/benhoyt/countwords/blob/master/rust/optim... --- It ended up being slower. But, it is very much impacted by cache effects. So the next step here would indeed be to figure out how to reduce cache misses. But it's non-trivial.
Like, maybe your idea works. Maybe. I don't know. You'd have to try it. But it's not clear to me that it will. And it doesn't support "is far from what an experienced C programmer would write if performance was paramount" IMO.
> That's one of the constraints of the challenge: it has to be able to work on a stream.
I cannot find any stream constraint in either the article or the countwords repo. He just says "standard input" and his "test.sh" uses "<kjvbible_x10.txt".
You can mmap stdin/fd 0 after an `fstat(0,..)` no problemo. Can it fail and should you check errors? Sure (as can accessing stdin in any way, actually).
He even mentions memory-mapped IO as a further way to go in the article in the C part.
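Something like this sketch (the counting here is a trivial stand-in for the real algorithm; the point is just the fstat-then-mmap-else-stream shape):

    #include <ctype.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Trivial word count over a buffer, standing in for the real algorithm. */
    static long count_buffer(const unsigned char *p, size_t n) {
        long words = 0; int in_word = 0;
        for (size_t i = 0; i < n; i++) {
            if (isspace(p[i])) in_word = 0;
            else if (!in_word) { in_word = 1; words++; }
        }
        return words;
    }

    int main(void) {
        struct stat st;
        if (fstat(0, &st) == 0 && S_ISREG(st.st_mode) && st.st_size > 0) {
            void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, 0, 0);
            if (p != MAP_FAILED) {                     /* fast path: mapped regular file */
                printf("%ld\n", count_buffer(p, (size_t)st.st_size));
                munmap(p, (size_t)st.st_size);
                return 0;
            }
        }
        /* fall back to streaming for pipes, ttys, /proc files, failed mmap... */
        long words = 0; int c, in_word = 0;
        while ((c = getchar()) != EOF) {
            if (isspace(c)) in_word = 0;
            else if (!in_word) { in_word = 1; words++; }
        }
        printf("%ld\n", words);
        return 0;
    }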
> Memory: don’t read whole file into memory. Buffering it line-by-line is okay, or in chunks with a maximum buffer size of 64KB. That said, it’s okay to keep the whole word-count map in memory (we’re assuming the input is text in a real language, not full of randomized unique words).
A shorter but less precise way of saying that is, "make the program work on streams."
> You can mmap stdin/fd 0 after an `fstat(0,..)` no problemo. Can it fail and should you check errors? Sure (as can accessing stdin in any way, actually).
Of course. And when it fails, what do you do? You defer to something that can handle a stream! Which is exactly the problem in the OP.
I disagree. mmap will demand page into memory - typically reading 4K at a time off storage (though really the buffer cache in all these timings). Much less than 64K. Were the file bigger than RAM/were there competition, it would also evict earlier pages. It's really just another way to "buffer" without worrying about all the fuss in user space, and I don't think you're going to convince me otherwise. I mean, MAP_POPULATE might violate the idea, but that is a very special case of mmap.
Honestly, I think the author himself saying memory-mapped IO is a forward direction but "enough for now!" is pretty conclusive that he didn't think it was a forbidden optimization direction. I think you are over interpreting his early step-by-step style language explaining finite memory as a strict spec, or perhaps he shared an earlier draft of the article with you before he wrote that.
It's fine to fail over to a stream, but often mmap is faster, as I know you know.
If you mmap'd a file, how do you keep its maximum memory usage to 64KB? I actually don't know the answer to that question. It's definitely not the default. You don't get to control it. The OS does.
> It's fine to fail over to a stream, but often mmap is faster, as I know you know.
But that's exactly the point. You literally cannot use mmap in all cases, either because you just can't or because it's actually slower. And in those cases, you need to fail over to a streaming implementation. And that streaming implementation needs to be fast too. And that's what this benchmark is.
If you submitted a program that only used mmaps, and I sent a stream into that program, it would fail. So then you would need to modify the program to handle streams. And your strategy for dealing with mmaps couldn't be used. So then you'd need to optimize your handling of streams. Which is exactly what the programs in the OP are doing. So in the end, mmaps are a distraction for a benchmark like this, and I suspect that's why the OP didn't bother with them.
You keep ignoring the author literally opening it up as an avenue in:
>I’m sure there’s further you could go with the C version: investigate memory-mapped I/O, avoid processing byte-at-a-time, use a fancier data structure for counting, etc. But this is quite enough for now!
This subthread started with bluetomcat talking about the optimized C and what experienced C devs might do. That strikes me as fair game to open up discussion of alternative IO anyway even if the author hadn't already (which he clearly did, as quoted). So, that's two reasons it's worth bringing up, and I was never criticizing your program!
If what you are on about is "Who wins what scorecard against what arbitrary constraints" or "My hands were tied, really!" or whatever, then, sorry, but this benchmark is too uncontrolled for great answers even on its own terms. Various stdio-using things would be "out of constraints" if a system was configured with bigger than 64K default buffers anyway which is certainly possible, if unlikely, today. None of the sample programs or the test harness "check" for that. Some of the languages may not be able to ensure it. Using that to leapfrog to "only streams" is a stretch. Is CPU freq scaling controlled? Min of N trials to filter background noise/get repeatability? Even simple mean+-sdev? No, no, and no. And probably four more things.
Beyond all of that, I also don't think that's what this subthread was ever about. bluetomcat's opening was literally the opposite - equivalent to "real devs would do xyz implicitly independent of the arbitrary constraints posed if performance is paramount"..seemingly a follow up on my quote from the author. He can of course chime in if I misread that. Presumably, the author would have had to relax that already problematic 64K constraint when moving on to mmap (which moving on he might have done if the article were fewer languages or if he had just done it that way first in his open coded C).
The positions you seem dug into here seem to me to be "don't mention mmap to people questioning the posing of the contest even though the author did" or "you cannot default to mmap and fail over like fstat||mmap||do_streams||aiie_noStdInEven". Yes, which is faster varies by OS/situation. So? Maybe "experienced" C/whatever devs know their situation. (You use mmap in ripgrep...). Maybe these are all honest communication errors, but I don't think your positions are very tenable.
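To make the fstat||mmap||do_streams shape concrete, here is a minimal sketch of what I mean (my own, not anything from the article or its repo); count_words_in_buffer and count_words_in_stream are hypothetical stand-ins for whatever counting core a real program already has:

    /* Sketch: map a regular file when we can, otherwise stream. */
    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static void count_words_in_buffer(const char *buf, size_t len) {
        (void)buf; (void)len;   /* stand-in: tokenize and update the table */
    }

    static void count_words_in_stream(int fd) {
        char buf[64 * 1024];
        ssize_t n;
        /* stand-in: real code would carry a partial word across reads */
        while ((n = read(fd, buf, sizeof(buf))) > 0)
            count_words_in_buffer(buf, (size_t)n);
    }

    int main(void) {
        struct stat st;
        /* Only a regular, non-empty file can be mapped; a pipe on stdin
           fails the S_ISREG check and falls back to streaming. */
        if (fstat(STDIN_FILENO, &st) == 0 && S_ISREG(st.st_mode) && st.st_size > 0) {
            void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE,
                           STDIN_FILENO, 0);
            if (p != MAP_FAILED) {
                count_words_in_buffer(p, (size_t)st.st_size);
                munmap(p, (size_t)st.st_size);
                return 0;
            }
        }
        count_words_in_stream(STDIN_FILENO);
        return 0;
    }

On Linux you would probably also add a posix_madvise(POSIX_MADV_SEQUENTIAL) hint for this access pattern, but that's beyond a sketch.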
As I mentioned in a few places, performance conclusions here are harder than they might look. As for "distractions", one might say that about literally all the optimized variants, including their numbers in the table since the numbers are likely to change more. The article might be stronger to focus on only the simple variants of all the rest, leaving the optimized ones for the github repo and weird "contest rule" debates on the github issues.
Anyway, to add a little more actual information for passersby less dug in to defending some position, Nim's stdlib has a trie in the critbits module. So, that test is "in bounds" and an easy experiment someone might enjoy.
It's not about a scorecard. I didn't just dig my heels in and say, "language lawyering the OP says mmap isn't allowed so shut up." It's about what's a useful benchmark. I explained why mmap's are a distraction beyond just pointing to the OP's constraints. I don't think there's really much else to say and I disagree with your characterization of this thread.
> So ugly, and you'd hope anything as non-reusable as that is deprecated with a strong wording.
It isn't. It was only added in POSIX 2001, and it's still there without major changes in 2018. [0] (There are actually a lot of similar libraries in POSIX that don't include re-entrant forms.)
Thankfully, re-entrant versions are supplied as extensions by GNU, which is significantly less horrifying.
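For anyone who hasn't run into them, here's a minimal sketch of the glibc re-entrant variants (hcreate_r/hsearch_r/hdestroy_r). The count-kept-in-the-data-pointer bit is purely illustrative:

    #define _GNU_SOURCE
    #include <search.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* The table is a caller-owned value rather than the hidden
           per-process singleton; it must be zeroed before hcreate_r. */
        struct hsearch_data tab;
        memset(&tab, 0, sizeof(tab));
        if (hcreate_r(1024, &tab) == 0) {
            perror("hcreate_r");
            return 1;
        }

        ENTRY item = { .key = "example", .data = NULL };
        ENTRY *found;
        /* ENTER inserts the key if absent, otherwise returns the existing
           entry, so it doubles as find-or-create. */
        if (hsearch_r(item, ENTER, &found, &tab) == 0) {
            perror("hsearch_r");
            return 1;
        }
        /* Illustrative: bump a count stored directly in the data pointer. */
        found->data = (void *)((intptr_t)found->data + 1);

        printf("%s -> %ld\n", found->key, (long)(intptr_t)found->data);
        hdestroy_r(&tab);
        return 0;
    }

Since the table is just a struct you own, you can have as many as you like and pass them around - none of which is possible with plain hcreate/hsearch.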
I would have loved to see the Knuth version (which was literate Pascal with WEB, I think) compared as well, but I imagine that is a nightmare to build. I wonder if the source code is available anywhere; I can't recall ever seeing it. I think Knuth used a custom-made trie, and it would be interesting to see that compared to these hash-table-based solutions.
Also: this comparison honestly just makes me love AWK more and more. So little code needed! So concise, and pretty ok performance as well! AWK really is an underrated little language.
I think the point was to compare reasonably simple, idiomatic examples in the various languages. For example, the C++ example used `std::unordered_map` even though that isn't necessarily the fastest possible hash table for C++ even for general-purpose jobs, let alone specific ones.
Of course you could write significantly faster implementations in C (and probably in the other languages) by writing data structures from scratch especially for the job (and memory allocation strategies and maybe other stuff), and if you're the author of wc or grep then that is well worth doing because it will be used so many times relative to the amount of time it takes to write that code. The comparison of the idiomatic C program vs grep is basically proof of that. But the article seemed to be targeting more typical developers who just want to plug together some common existing components.
I do get a wry grin from the trie not being faster. That's always a data structure that feels like it will speed things up, but it really needs very specific conditions to do so.
Real life application for this exercise (I use a python version from time to time):
Count the frequency of words in a book, then get the most frequent words you don't know in Anki, so you can learn the words that will help you understand more of the text.
This is of course taken from the Fluent Forever book.
Example from Малкият принц (The Little Prince):
какво 62
са 61
ако 58
когато 56
ме 55
беше 53
а 51
планета 51
ли 49
човек 43
едно 42
бе 42
ден 42
който 40
I have a Python script that takes a book, does what you said, and also translates everything. Very helpful: with this plus a list of the 2K most popular words, I learned English from two books back in middle school.
Sure, here it is[0]. This one is for parsing Kindle clippings, but it's quite similar (except it doesn't sort all the words; I was usually using some service to convert mobi to txt before feeding it to the script). I should probably clean it up and switch the implementation to English (but I don't use it anymore).
I wonder why the C++ version opted for an unordered_map, copied into a vector and then sorted. I'd be curious to see if using an std::map would work better.
There is no C# optimized variant because nobody wrote one. There is no claim that the C# program is optimized. Look at the table: there are "simple" and "optimized" variants for each language (except C#). The "simple" variants are intentionally not supposed to be optimized, and instead, use something approximating idiomatic code for the given language.
You're micro-optimizing the wrong thing; most of the time taken in C# is in .Split(' ', StringSplitOptions.RemoveEmptyEntries) (which is Unicode instead of ASCII as in other languages, by the way) and also Console.ReadLine().
C# supports non-allocating splitting via Span / ReadOnlySpan; see this article about this very question:
The author named it “simple.cs” for presumably that reason and the “though not being a C# developer, there may be obvious things I’m missing” part makes me think they’d be happy to get your optimized contribution.
Totally fair - my point was mostly that it seemed pretty clear that the author didn’t have much C# experience and would gladly accept an expert’s optimized version.
Your result is very surprising, probably because you benchmarked on the very small example corpus, whereas the other languages were benchmarked on a much, much bigger corpus.
I'm also not entirely sure how one plans to compare runtime performance when the original article didn't really describe the hardware it ran on (at least not that I could see skimming it over). If you can get better runtime results running against ancient hardware of the same architecture, you may be able to assume you're below the lower bound of the benchmark system explored. If your runtimes are better in that case, you may be able to place high confidence that the Lisp example actually is more performant. I would definitely run it against the same inputs they used, since they provide them and describe how you can easily get/derive the input (some Project Gutenberg ebook concatenated 10 times).
I suppose you could also recreate all their examples to create your own baseline of runtime performance but that's a lot of work for what seems to be a not-very empirical benchmark (at least to me).
Disclaimer: I did not check runtime complexity of any of the implementations because I didn't really care and skipped straight to the performance results table.
On x10 (no optimization yet so consing is killing performance):
firmament: 10
firmament, 10
genesis 10
version 10
Evaluation took:
2.798 seconds of real time
2.813069 seconds of total run time (2.677126 user, 0.135943 system)
100.54% CPU
8,125,595,437 processor cycles
934,110,048 bytes consed
It's just testing the output.
For the benchmark, each variant is run 5 times with kjvbible.txt as input, and then the lowest execution time is used as the benchmark result.
Do you have the same exact hardware that the article is using? Otherwise, we can't say one way or another on that with the data you've provided.
I'm also pretty certain the article is benchmarking against the 10x copy file for the actual benchmarks.
See this example command in the article:
time $PROGRAM <kjvbible_x10.txt >/dev/null
So, even if you had the exact same hardware, I'm pretty sure your program would only be a bit faster than the unoptimized C# version. However, it's possible that your machine is a lot slower than what's used in the article, and your program is actually pretty fast -- but without more points of comparison, we just don't know. You haven't run the other benchmark programs on your hardware and posted the results.
It would be interesting to have the article author run the Lisp code on their machine for a real comparison - I'd be very interested to see the results. My machine is a Linux (Ubuntu) laptop. I just sent an email to the author with the Common Lisp code - we'll see if he's interested enough to check.
I missed the sample data set from the example. On the kjvbible.txt file, with no optimization:
...
WEB> (time (performance-count "/home/frederick/kjvbible.txt"))
the 64015
and 51313
of 34634
26879
to 13567
that 12784
in 12503
he 10261
shall 9838
unto 8987
for 8810
i 8708
...
Evaluation took:
0.365 seconds of real time
0.370322 seconds of total run time (0.338364 user, 0.031958 system)
101.37% CPU
1,060,005,621 processor cycles
106,297,040 bytes consed
I had a bug in my original code which alphabetized rather than sorted by count, so the sort line should be:
((keys (sort (alexandria:hash-table-keys map) (lambda (x y) (> (gethash x map) (gethash y map))))))
Beyond the comments other people left here, for the results to be comparable, you need to provide the result for at least one (but ideally more than one) of the solutions in the article running on your hardware... otherwise you could just have a really fast (or slow) machine.
I assume you're asking why the "simple" Rust version is slower than the "simple" Go version?
I wrote the simple Rust program. I've also been writing Go for a long time. It's tough to say precisely why, but here are some guesses:
* The simple variants, by virtue of being simple, do a lot of extra allocation. Go, because of its GC, might be able to do a bit better here. (I believe the Rust program also does more allocations than the Go program.)
* In Rust, all strings are UTF-8 validated. In Go, they are not. In Go, strings are only conventionally UTF-8.
* A good portion of these programs is spent interacting with a hashmap. Rust's default hashing algorithm is chosen to prevent HashDoS attacks[1] at the expense of slower hashing. I actually don't know whether Go's hashmap does the same. So I'm highlighting it here as a potential difference that perhaps someone else can elaborate on.
Strings are just slices of bytes in Rust too. Both languages use UTF-8 as their internal representation. The difference is that Rust requires its strings to be valid UTF-8. Go does not. Either choice is reasonable. The downside of requiring valid UTF-8 is that in order to read bytes from a file, you have to do a validation check on them to ensure they are UTF-8 before returning a string back. This is an additional cost.
But this is not the only additional cost in the Rust program. I outlined a few others.
> Isn't it possible to do the same with Rust for this benchmark?
This benchmark has two programs for each language. The "simple" and "optimized" variant. The "simple" version is supposed to be written in an idiomatic style for that language. Taking extra steps to make tweaks and optimize the code is, I think, against the spirit of the challenge. Obviously, this is a very fuzzy concept, and everyone can make up their own mind on the extent to which this framing is useful. (I think it is, personally, especially when you also allow for a second submission that tries to make the program fast.)
To clarify my question: can Rust read files and split their strings without checking UTF-8 correctness? That could speed up the algorithm.
In Go, I have used this [1] in the past when I had to validate UTF-8 encoding, but on hot paths where I'm sure the UTF-8 is valid (coming from sanitized database data, for example), I skipped that part.
Don’t downvote, it’s a really good question. Naively, I’d also have expected Rust to be faster, given the same algorithm. My guess is that without a GC, the Rust solution can’t delay deallocating all those temporary strings (both variants seem to allocate a string for each word, though I’m not sure if Go doesn’t simply share string slices). GCs free memory in batches, which is often more efficient, and sometimes don’t have to do it all on process exit. The Rust code meticulously deallocates each string individually.
No problem -- defaultdict is one of my favourite Python tools! Note that the code you posted is nice for a one-liner, however, it goes against my constraint of not reading the entire file into memory (important for very large files).
    counts = Hash.new(0)
    STDIN.each_line do |line|
      words = line.downcase.split
      words.each do |word|
        counts[word] += 1
      end
    end
    counts.sort_by { |_k, v| -v }.each do |word, count|
      puts "#{word} #{count}"
    end
Yeah. I want to check that locally... I don't feel like that's doing anything useful here.
For starters, ints aren't normally "allocated" (in the heap sense). Go is value oriented, not reference oriented.
I feel like the bigger change in the code was avoiding the allocation of the string for each word in the source text, and instead passing the byte slice directly to the point where it was used as the map key. Since strings are immutable, when you convert a byte slice to a string, it must be copied into a new immutable backing slice so that no one else can touch it. The exception is that the compiler optimizes map accesses where the key is a string type, but the code is converting a byte slice into a string at the point of using it as a map key. The compiler elides that conversion, making it a no-op, and instead just passes the byte slice in, since the hash map can make a copy of the string if it needs to, but it will otherwise avoid the unnecessary allocation and copying that converting this would otherwise require.
I feel like that's where the big difference is. Heap allocating the integers and storing references in the map doesn't feel like it's actually doing anything here, although storing pointer values in maps can be useful in certain scenarios.
No, map[string]int doesn't allocate ints. The reason the map[string]*int version is faster is that it avoids a "write" to the map when you're not adding new words -- it can just look up the pointer and increment what it points to. The Go compiler is not smart enough (yet) to optimize the former to avoid a map write. (That's my understanding, anyway.)
I don't trust myself to write a decent version, but if someone here wants to do some marketing for Nim I'm sure the code would be compact and easy to read, but could still end up quite fast here!
There are more string copies that can be optimized away in the C++ solution:
1. When filling the map with `word`, you can use emplace with std::move(word) instead of operator[] (after checking that it's not already in the map), because `word` is no longer needed after that.
2. Constructing the vector (for sorting) with all pairs from the map ends up copying the whole map's worth of data! You can store an iterator into the map instead of a copy of the key in each element. This may also make sort() faster, because it may be cheaper to move an iterator than a string during sort().
In the Python examples, you should iterate over sys.stdin.buffer (the underlying io.BufferedReader that sys.stdin wraps) to make them equivalent to the examples in other languages.
If you're not using multiple CPU cores, with either multiple processes or threads, for all the languages being compared, this blog post is not as useful as it could be.
I like it (and collapseos), but Scheme would be more amazing for that project, because, well, I once saw a Spaniard doing vector algebra in Forth exercises and my head nearly exploded.
Forth is ideal for composing new words, but sometimes playing with the stack looks like juggling.
I read it as counting source code words, in which case Linux Shell scripts really won this contest.
I'm not a linux/shell expert, but I really, really thought the shell script was the most performant answer... it was quite legible even to this Windows user, and only had one line of text.
Gprof is as good as deprecated nowadays. Perf is an amazing tool: convenient for simple call-graph profiling, and with a huge slew of features if you want to dig deeper.
The constraints do say that it is okay to assume lines are reasonable length. But yes, if you were making GNU wordfreq, that might not be something you want to assume. But at that point, it just depends on what you want your failure mode to be I suppose. greps for example will happily just gobble up as much memory as they can to fit a line into memory. :-)
What advantage would a trie's prefix searching give you in a word count task? Worse, you're following word.length() pointers, potentially all over your memory. You could collapse with Patricia or Merkle or whatever, but they'll all be worse than the one pointer lookup of a hash table (modulo collision chasing if you pick a bad hash/size) & all in the service of giving you a feature (prefix lookup) you don't need.
The comments explain why I think it's slower. The TL;DR is that using a trie requires more memory accesses than a hash table (per byte of input), and those memory accesses slow the whole enterprise down.
But, not all memory accesses are created equal. So perhaps a more clever representation that exploits locality better would do better. "Obvious" techniques for shrinking memory usage made a big difference and brought the performance of the trie program closer to C: https://github.com/benhoyt/countwords/pull/2
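For passersby, here's a rough sketch of the naive pointer-per-letter layout in C (the node layout is hypothetical, not the code in that PR). The thing to notice is the dependent pointer load per input byte:

    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical 26-way trie node: one child pointer per letter. */
    typedef struct node {
        struct node *child[26];
        int count;
    } node;

    static node *new_node(void) {
        node *n = calloc(1, sizeof(node));
        if (!n) abort();            /* error handling kept minimal */
        return n;
    }

    /* One dependent load per byte of the word: this is the per-byte
       memory-access cost discussed above. */
    static void add_word(node *root, const char *word) {
        node *n = root;
        for (; *word; word++) {
            int c = tolower((unsigned char)*word) - 'a';
            if (c < 0 || c >= 26)
                continue;           /* sketch: ignore non-letters */
            if (!n->child[c])
                n->child[c] = new_node();
            n = n->child[c];
        }
        n->count++;
    }

    int main(void) {
        node *root = new_node();
        add_word(root, "the");
        add_word(root, "The");
        printf("%d\n", root->child['t'-'a']->child['h'-'a']->child['e'-'a']->count);
        return 0;
    }

Shrinking the nodes (as in the PR above) mostly helps by letting more of them fit in cache, so those dependent loads stop missing all the way to DRAM.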
Very interesting, but why would you omit the two most used languages, Java and JavaScript in favor of obscure ones such as Rust and Forth that nobody uses?
Because they bring absolutely nothing new to the comparison. These languages are uninteresting in their feature set, and plain boring compared to Forth. Python and C++ already represent the "most used" languages category.
Also, not knowing Forth is a Bad Thing(tm) if you want to be a Real Programmer(tm). Same goes for Lisp and Prolog. They all have a place in the history and practice of programming, and dismissing them as "obscure languages that no one uses" misses the point.
If the author doesn’t know how to use C++ they should not have included it. It just leads to everyone commenting on that defective aspect of the article, instead of the rest of it.
The table at the end of the article, and most of the article, are meaningless. They are comparing the performance of grep and wc, programs written in C, with their own program, written in C. Python is also in the mix, and is also a program written in C. Most C programs are also valid C++ programs, so the top entry for grep should really be labelled "C/C++". I'm tired of this kind of comparison.
The other day someone posted that Rust was faster than C, and I asked for context because the article had no proof. My post was flagged.
This article shows exactly why C is nearly twice as fast as C++ and why Rust is in between. This is a good article; the other thread was a bad article and a bad thread.
Maybe it was negative, but how else would you talk about a problem with an article? Every benchmark I've seen with actual numbers and code had C++ faster, except for one, and we were able to check the source to see why.
Even this comment, while short, is much better! "doesn't have context" is, ironically, kind of context-free. What context did you want? Given what you just said, the answer is apparently "benchmarks." You don't even need to make your comment all that much longer to be much, much better (imho, I'm not dang, I don't own this place, heh)
> With the lack of context in this article I'm willing to bet rust isn't actually faster than C
into something like
> Without benchmarks, I'm willing to bet rust isn't actually faster than C.
Now that I write that out, it's even shorter! But it's way better. You've made a specific criticism of the article, rather than making a very abstract one.
I'd still say that something this short isn't likely to be upvoted, but it isn't likely to be downvoted like you were. Compare your comment to this one: https://news.ycombinator.com/item?id=26446830
While it does start with some "I like Rust," imho that wasn't even necessary. It's also more aggressive than you were, but even got some replies. Even though it contained the same sentiment.
I'm not sure if people think benchmarks are just the numbers, but I wanted to see the source so I could see why it was faster and under what conditions. C++ is basically an assembler, and saying you're faster than assembly is either completely untrue, or it's poorly written assembly that should be examined.
So essentially I wanted actual code with example numbers.
These things invariably report misleading results just because the author can't be bothered to learn to write idiomatic code in most of the languages.
Nobody is obliged to learn that many languages well, but truth demands scruples. If you are not interested in truth, what is the point of publishing one of these?
The result is predictable: whichever language the author already knows best wins.
In this one, for example, we get:
"There’s obviously a lot more pushing you could do with C++. However, I suspect it would end up getting more and more low-level and more C-like"
which is just wholly false.
There are reasons why the people who are the most serious about performance use C++: you can get C++ code as fast as the machine can physically do without giving up any abstraction or readability, and without avoiding powerful features. That is the core value proposition of the language: abstraction without penalty. It delivers, and not by accident.
To get top performance does require awareness of what operations are inherently slow for machines, and not writing code to require such operations. But that doesn't mean giving anything up.
Code that looks like C is no faster than C. Hence, to make code faster than C, you must make it look less like C.
Really, we don't expect an article like this to reveal all about how to write the best code in all languages.
Why not write the C++ program? It's not like the OP presented the C++ program as the shining example of what the best looks like. They inserted plenty of caveats.
And FWIW, I ported the optimized C program to Rust. They look very similar even though Rust is all about zero cost abstractions as well. So the OP's claim really isn't that ridiculous.
In the old days we said, "You can write Fortran in any language", meaning you can write bad code in any language. But that is not a reason to write bad code.
Idiomatic Rust, like idiomatic C++, is faster than C. So, your C-in-Rust runs like C, slow, not like good Rust or good C++, fast.
I don't know. But Ben Hoyt got a similar time to mine, too. I saw your GitHub PR. Moreover, looking at your program, I would expect it to be slower than the other optimized versions (Rust, C, Go), because it doesn't actually implement the optimizations in those programs! So not only is your code slow on two different compilers, but it also matches my prior.
You're the one who came out of the woodwork and said a bunch of nonsense. Maybe put in a little effort here to try to figure things out. Like, oh, I don't know, run the benchmarks on the other programs to compare relative timings. Like I did. Or show the commands you used to compile. Like I did. Or show the commands you used to run the program. Like I did. You want me to hold your beer? Fine. But put in the work.
I can't really wrap my head around these types of question.
Surely the correct answer is "use some established third party library that does this exact task well". Probably faster, better battle tested against weird input, localised etc etc than anything you can come up with by yourself.
Though obviously that is then useless if the interview is actually for someone who can write something like this for some super-secret internal thing that is nothing like the problem solved by the standard solutions available. But is that really the case? I have doubts.
If I was hiring though, I'd be more worried about hiring people who would write standard elements from scratch (mostly because that's more fun) than re-use standard building blocks to get a task done correctly.
e.g. for this specific question I'd be most impressed with someone who could talk about Spacy or some other standard library that can do a lot of things in this area very quickly but also be repurposed to do interesting new things. Do I need someone who can write Spacy from scratch or someone who can use Spacy?
Immediately optimising by ignoring non-ASCII and punctuation seems more like a red flag than a positive.
> Surely the correct answer is "use some established third party library that does this exact task well". Probably faster, better battle tested against weird input, localised etc etc than anything you can come up with by yourself.
The point of questions like this is to test basic understanding, not the ability to search on SO or similar.
> If I was hiring though, I'd be more worried about hiring people who would write standard elements from scratch (mostly because that's more fun) than re-use standard building blocks to get a task done correctly.
As long as the hire is able to fix/extend the 3rd party library on their own... this is exactly what you figure out with "these types of questions".
It seems almost disingenuous to me to miss the point of this type of interview question, because it is so obvious that the intention is “imagine a scenario where you need to do a high performance implementation of some very well-defined task, for which you don’t have a super fast utility function. Let’s use word frequency counting as an example and ignore the fact that such a problem is almost certainly already solved.” It’s implicit in the asking of the question that they aren’t actually in need of word counts on arbitrary bodies of text and need your help solving that problem. It’s a programming interview. Accept the premise and show them that you can program.
It does irk me too. Being a programmer requires understanding the complexities hidden behind a task, and this brushes them all aside as if they don't matter.
There are cases where your task is nice, simple, and well-defined, but word counting is not one of them. It's better to give a more realistic task that allows people to think about the premises.
Being a programmer involves a lot of things that are difficult to test for all at once in 45 minutes. Be irked all you want, but it seems counterproductive to pretend what’s being asked is in anyway unclear or unreasonable when you’re given a pure programming question with a clear spec. You’re supposed to write the code. That’s all.
My problem isn't that the spec is unclear, it's that the spec is silly. The task makes no sense, and selects for misusing tools. This makes it a bad interview question as given.
I’d argue that the problem does make perfect sense and isn’t a “misuse of tools” if you manage to understand the goal of the interviewer, which is “watch this person write code for a solution to a well-specified but arbitrary (perhaps even silly) problem.”
I was more viewing this in context of the article, which goes on to micro-optimize a version with an IMO unreasonable set of assumptions.
If it's instead explicitly given that this is a quick litmus test, the choices of the first pass implementation don't really matter, and you'll talk about complexities after, then sure.
... otherwise the proper answer is probably “I would recognize I’m sitting in a room with someone who has likely already put enough thought into this task that they could be considered an expert on the solution and I would consult them on the best solution rather than reinventing the wheel,” which would almost certainly not get you hired.
> use some established third party library that does this exact task well
I seriously doubt such a library exists for most languages (Javascript might be an exception because they seem to love writing one-line libraries). It's such a simple task!