Performance comparison: counting words in Python, Go, C++, C, Awk, Forth, Rust

unwind · on March 15, 2021

Very interesting, these kinds of articles are (to me) some of the best content we get on here.

I'm no expert in any of the languages used except maybe C, at least I'm quick to form opinions about C code. In C, I'm not so sure that there is a well-established "idiomatic" concept, since C is not so tightly bound to a community as more modern languages. At least that's my feeling.

That said, I was shocked to see that the example code did heap-allocations to store integers. Single integers, one per allocation, per "counting bucket". I would store the count in the data pointer without a second thought, I just "can't" make 4/8-byte heap allocations and still sleep at night. Especially after all the concern about processing the input in large blocks for performance! Casting an integer to/from a void pointer should be pretty safe according to my gut feel. If that turns out to be false, I would dump the libc's hash API altogether, since it makes me do heap allocations to store single integers.

I tried making the change locally, and the time taken (as measured with the OP's benchmarking program) dropped from 0.11 to 0.10 on my system (a Dell laptop running an Intel i5-7300). That's at least something.

benhoyt · on March 15, 2021

Yeah, I would have liked to avoid it too. But I think you have to allocate with the hcreate/hsearch API (for the "simple" C version). When I roll my own hash table with optimized.c I don't allocate for the integers.

unwind · on March 15, 2021

No, you can just jam the integer into the "data" field in the structure and skip that allocation.

As I said, and others have commented, it requires some complicated castery to avoid UB, but it's totally doable (and at least GCC does the right thing, while warning, with no advanced casts).

Thanks for an interesting article!

c-cube · on March 15, 2021

Afaik the only integer type it's safe to cast from and to pointers is intptr_t. Not exactly the type people reach to when writing the first "simple" version. Another comment also points out that hcreate (which is not posix, only GNU?) and the likes use a global hashmap? I can't see a lot of use cases for such a singleton datastructure.

fanf2 · on March 15, 2021

If you need to round-trip a pointer through an integer and back to the same pointer, you need to use intptr_t or uintptr_t. (Note that this is true for object pointers but not necessarily function pointers. There’s no integer type guaranteed to be big enough for a function pointer.)

If you want to round-trip an integer through a pointer and back to the same integer, you can use any integer type provided it is not bigger than intptr_t or uintptr_t.

scatters · on March 15, 2021

It just means that you need to cast via intptr_t, i.e. double casts `(void*)(intptr_t)i` and `(int)(intptr_t)p`. GLib has macros for it if you feel that's useful.

beached_whale · on March 15, 2021

At least in C++, it’s any integer type that is large enough. uintmax_t/intmax_t are guaranteed to be large enough

remexre · on March 15, 2021

hcreate is POSIX, hcreate_r (which allows having multiple hashmaps per process) is the GNU extension, iirc.

I thought POSIX added a bunch more safe pointer casts, but I don't know off the top of my head...

therealcamino · on March 15, 2021

I've been programming in C for decades and I'd never heard of hcreate. I'm sitting here wondering how the heck did that ever get standardized, allowing single hash table per process. What a bizarre feature.

c-cube · on March 15, 2021

I wonder if it's like insanities like SIGPIPE: it's kind of useful if you write IO tools in a few dozen/hundred lines of C. You don't need to carry around explicit context, or check if writes failed, or anything. Of course it sucks for anything else and we still have to deal with SIGPIPE .

chaostheory · on March 15, 2021

It would be nice if the post went further as in using code that either uses concurrency or parallelism.

burntsushi · on March 15, 2021

Someone on Lobsters suggested this too, and I don't think it would be particularly edifying. But maybe I'm wrong. Someone should go out and do that benchmark. Maybe we'll learn something. Here is what I wrote though:

Coming up with a benchmark is hard. Benchmarks like these are models. All models are wrong. But. Some are useful.

Follow you thinking to its logical conclusion. If you open up the parallel programming flood gates, now you need to do that for all of the programs. Now your effort to contribute a sample for each language goes up. Maybe enough where actually doing the benchmark isn't practical.

So let's play the tape. If someone told me, "sure, have fun, throw threads at this!" I'd probably say, "can I use memory maps?" And let's say, "yes, sure, go for it!" I say okay, here's my first crack at it:

1. Hope stdin is a file and open it as a memory map.

2. Split the memory map slice into N chunks where N is the number of logical CPU cores on the system.

3. Run approximately the same algorithm devised in existing code submissions in N different threads, with one thread per chunk.

4. At the end, join the results and print.

If that ends up being the best strategy overall, then your model has become over-complicated because performance now likely depends on step (3). Which is exactly what the existing benchmark is. Now, maybe some languages have a harder time passing data over thread boundaries than others and maybe that impacts things. But that doesn't apply to, say, C, Rust, C++ or Go.

If that's not the best strategy, then maybe there is a cleverer approach that involves synchronizing on one shared data structure. I doubt it, but maybe. If so, your benchmark turns into one that is very different. It's no longer just about data transformation, but now you have to worry about synchronization trickiness. And now you're way outside of a standard goroutine pool that you might have used in Go.

Building a good benchmark is an art. Sometimes the constraints look dumb and arbitrary. But you have to sit down and really look at it from a bunch of different angles:

* Do the constraints approximate some real world conditions?

* Do the constraints inhibit writing code that looks like real world code?

* Do the constraints inhibit optimizing the crap out of the code?

* Do the constraints make the benchmark difficult to create in the first place?

* Do the constraints permit easy and clear measurement?

There are probably more things that I didn't think about.

injinj · on March 16, 2021

Amusingly, I've done a multi-threaded version of the word counting program in order to test a shm kv store. I needed benchmark that created a lot of cross thread concurrent accesses to keys and I found a blog about this test. My version has serious constraints though, you have to create a shared memory map with enough space to hold all of the keys beforehand, as it doesn't resize the shm kv map as it runs.

This is the source for it:

https://github.com/raitechnology/raikv/blob/master/test/ctes...

The speedup of the multi-threaded version vs the single-threaded version is about linear. The single threaded version uses 2 threads, one to read stdin and one to hash the keys, the 16 threaded version uses one thread to read, 16 to hash.

$ time ctest -t 1 < ~/data/enwiki-p10p30303 18.88user 0.25system 0:09.58elapsed 199%CPU

$ time ctest -t 16 < ~/data/enwiki-p10p30303 8.08user 0.25system 0:00.49elapsed 1680%CPU

chaostheory · on March 15, 2021

Yes, benchmarks are hard. However, single thread benchmarks don't reflect the real world unless you're a student.

burntsushi · on March 15, 2021

TIL GNU grep is not in the real world.

Consider re-reading my comment. It anticipated your criticism and responded directly to it.

chaostheory · on March 15, 2021

Yes, you've basically gone over the details of why doing parallelized / concurrent performance comparisons between different programming languages is very difficult. I agree with most of what you have written. However, this doesn't change the fact that any comparison of performance between different languages, that ignores relevant concurrent and / or parallelization features, isn't very useful. This is especially true when the selling point for a few languages in this list is concurrency and / or parallization related features.

burntsushi · on March 16, 2021

No, what I'm saying is that it is useful. Because in a concurrent program, you'll almost certainly being using the single threaded techniques developed here. And that the fastest program will be the one that uses the fastest single threaded techniques. So it makes sense to simplify the model and just look at the thing that is going to dominate the runtime of the program.

If a "dumb" thread pool is indeed what ends up being the best concurrent solution, then Rust, Go, C and C++ (and probably most of the other languages) are going to have almost no problems implementing it. It's just Not That Interesting.

Now, maybe I'm wrong. Maybe there is something clever here to exploit in a concurrent program, and that some languages let you do that easier than others. I doubt it. It would be interesting if it turned out to be the case, but it would be a different benchmark. It wouldn't just be about how well a language can deal with simple data transformation tasks, but also about how it deals with tricky concurrency problems. But this doesn't seem like the kind of problem that needs such things. It complicates the model.

Like honestly, just saying that any benchmark that ignores parallelism "isn't very useful" is just totally ridiculous.

chaostheory · on March 16, 2021

> No, what I'm saying is that it is useful. Because in a concurrent program, you'll almost certainly being using the single threaded techniques developed here. And that the fastest program will be the one that uses the fastest single threaded techniques.

You have a point. Maybe I am wrong in saying that it's near useless.

> If a "dumb" thread pool is indeed what ends up being the best concurrent solution, then Rust, Go, C and C++ (and probably most of the other languages) are going to have almost no problems implementing it. It's just Not That Interesting.

Your argument makes sense if every language used the same form of concurrency or parallelism. Unfortunately, that isn't the case. Wouldn't the type of concurrency or parallelism affect performance?

> Now, maybe I'm wrong. Maybe there is something clever here to exploit in a concurrent program, and that some languages let you do that easier than others.

but concurrency / parallelism IS much easier in some languages vs others (did I misunderstand what you wrote?).

burntsushi · on March 16, 2021

> Wouldn't the type of concurrency or parallelism affect performance?

That's what I'm saying: almost certainly not IF the best concurrent solution to this problem is to chunk of the work and feed it off to a bog standard thread pool. Pretty much every language---even C---has tools that make this pretty easy to do.

And I don't see any reason why a dumb thread pool wouldn't be the best answer to this if parallelism were allowed. Like, the fact that Go has goroutines is just not going to help with anything for this kind of work-load. Goroutines help in more complex concurrent situations.

> but concurrency / parallelism IS much easier in some languages vs others (did I misunderstand what you wrote?).

When the extent of the parallelism is a dumb thread pool with no synchronization between them, then it's pretty easy to do that in just about any language.

chaostheory · on March 16, 2021

> That's what I'm saying: almost certainly not IF the best concurrent solution to this problem is to chunk of the work and feed it off to a bog standard thread pool.

That's the thing. This is not the case for all programming languages, even for some of the ones included this performance test. Python has a GIL. Distributing work via threads isn't going to do much unless it's something I/O intensive. Consequently, testing performance via a programming language's available methods to distribute work does matter. Special features like Goroutines also matter.

> When the extent of the parallelism is a dumb thread pool with no synchronization between them, then it's pretty easy to do that in just about any language.

The specific implementations of work distribution still matters for performance doesn't it? Different implementations yield varying levels of performance right?

burntsushi · on March 16, 2021

Python has a `multiprocessing` module that sidesteps the GIL.

You can question me all day. Go build the benchmark. Publish it. Have people scrutinize it. I have my prior based on building this sort of tooling for years and actually publishing benchmarks.

And you still continue to miss my point. Instead, we're in a hypothetical pissing match about code that doesn't exist. Either the parallelism problem is uninteresting, or your benchmark becomes about complex synchronization. The former complicates the benchmark for no good reason. The latter results in a different benchmark entirely.

chaostheory · on March 16, 2021

> Python has a `multiprocessing` module that sidesteps the GIL.

Yes and distributing work with multiple processes is not the same as distributing it via threads. There is a difference in performance in different implementations and variations of concurrency and parallelism.

> we're in a hypothetical pissing match about code that doesn't exist. Either the parallelism problem is uninteresting, or your benchmark becomes about complex synchronization.

Yes, the code doesn't exist because it's difficult to do, and that's what makes it interesting.

burntsushi · on March 16, 2021

> Yes and distributing work with multiple processes is not the same as distributing it via threads. There is a difference in performance in different implementations and variations of concurrency and parallelism.

I literally addressed that in my first comment: "Now, maybe some languages have a harder time passing data over thread boundaries than others and maybe that impacts things. But that doesn't apply to, say, C, Rust, C++ or Go."

> Yes, the code doesn't exist because it's difficult to do, and that's what makes it interesting.

Well, I mean, I haven't done it not because it's hard, but because it appears pointless to me. That's my prior. Show me I'm wrong. I'm happy to learn something new.

chaostheory · on March 16, 2021

> Well, I mean, I haven't done it not because it's hard, but because it appears pointless to me.

That's just your opinion and you haven't convinced me otherwise. Likewise, I don't think I'm going to convince you either.

> I literally addressed that in my first comment

Yes, but that doesn't mean that it's not an important point. Even if we were just comparing "C, Rust, C++ or Go.", different work distribution implementations have different sets of performance. It's hard, but it's still useful to capture that data because the real world doesn't run on a single thread or even a single core. If you're going to continue glossing over this, there's no point in continuing our discussion. We're just going in circles.

burntsushi · on March 16, 2021

I didn't say it wasn't useful to capture the data. Indeed, I've been saying, "happy to see it done and happy to learn something if I'm wrong." Look at what I said above:

> Now, maybe I'm wrong. Maybe there is something clever here to exploit in a concurrent program, and that some languages let you do that easier than others. I doubt it. It would be interesting if it turned out to be the case, but it would be a different benchmark.

So the fact that you keep trying to niggle at the specifics of parallelism tells me that you're completely and totally missing my point. You're missing the forest for the trees.

Looking back at the convo, it started by you saying, "it would be nice to see parallelism considered." Fine. I responded by saying why that particular rabbit hole is probably not be so useful, or would be measuring something totally different than the OP. To which, you kind of flipped the script and said, "However, single thread benchmarks don't reflect the real world unless you're a student." It would have been a lot better to just say, "I disagree, but I don't have time to do the benchmark myself, so we'll have to leave it at that."

There's a lot of goalpost shifting going on. Like, are we arguing whether a parallel benchmark could be useful? Or are we arguing about whether a single threaded benchmark is useless? They are two different arguments. The former gets my response above, navel gazing at why that model probably isn't particularly useful. But the latter is just crazytown.

Shorel · on March 15, 2021

This is definitely the most interesting part of the article for me:

Incidentally, this problem set the scene for a wizard duel between two computer scientists several decades ago. In 1986, Jon Bentley asked Donald Knuth to show off “literate programming” with a solution to this problem, and he came up with an exquisite, ten-page Knuthian masterpiece. Then Doug McIlroy (the inventor of Unix pipelines) replied with a one-liner Unix shell version using tr, sort, and uniq.

http://www.leancrew.com/all-this/2011/12/more-shell-less-egg...

35fbe7d3d5b9 · on March 15, 2021

I've always personally disliked McIlroy's oft-quoted conclusion (emph. mine):

> Knuth has shown us here how to program intelligibly, but not wisely. I buy the discipline. I do not buy the result. He has fashioned a sort of industrial-strength Fabergé egg—intricate, wonderfully worked, refined beyond all ordinary desires, a museum piece from the start.

Well, yes – absolutely.

If you read the initial column[1] pre-McIlroy's response, you'll see that Knuth is not presenting a word-counting solution, but instead the concept of literate programming and how intertwining code and prose can be better for the programmer. Knuth isn't trying to count words, he's trying to teach you a different way to develop, and he's doing it by showing an example that has been simplified so the average person can chew on it.

I'm not sure if the popular interpretation of "ivory tower Knuth versus tactical genius McIlroy" is a modern take or if that was how it was received when the column was published, but it feels very unfair to present it as such today.

[1]: https://doi.org/10.1145/5689.315644

jerf · on March 15, 2021

I have seen it observed that while the tr/sort/uniq approach has practical value, that if the question is how would you implement the features that tr/sort/uniq provide simply invoking them is not much of an answer. I don't think Knuth was unaware that there were existing tools that could solve the problem, the whole point was to demonstrate his literate programming approach at some semi-reasonable scale on a problem everyone could recognize. It's kinda like how pretty much every interview-sized question has some sort of off-the-shelf solution somewhere, because anything that fits into an interview has been solved thousands of times over... but that's not the point of interviews.

In modern terms, it would be something like complaining that $LANGUAGE's web server implementation is a grotesque bloated mess (and therefore $LANGUAGE isn't very good) because in shell all I have to do is run "nginx". There is a true and useful sense in which that is true, but there is also a true and useful sense in which that is utterly missing the point.

henrik_w · on March 15, 2021

Also interesting is that this problem is used as the example in the book "Exercises in Programming Style" by Cristina Videira Lopes. It shows this problem solved in 33 different programming styles, for example historical, function composition, data-centric etc.

I have written more about it here: https://henrikwarne.com/2018/03/13/exercises-in-programming-...

taeric · on March 15, 2021

There is a new version of that book. Thanks for bringing it to my attention, looks great!

felixr · on March 15, 2021

If you want to read the " ten-page Knuthian masterpiece": https://homepages.cwi.nl/~storm/teaching/reader/BentleyEtAl8...

I don't think it was linked from the article

_lx4l · on March 15, 2021

Last summer I was looking for a program that would tell me which shell commands last used in a specific directory. Because I had been reading "The Unix Programming Environment" it quickly occurred to me that the program I was looking for could easily be made from several existing commands. Only after I wrote the small script did I find that the solution already existed except it was >1000 lines of code, 20 github contributors, and utilized a sqlite database.

My one line has less features, but I'll take it.

themantri · on March 19, 2021

Could you please share the one-liner? Thanks!

_lx4l · on March 19, 2021

https://github.com/natethinks/jog/

taeric · on March 15, 2021

Sorta. That incident is also famous for being easy to misrepresent.

That said, if you are new to that, Programmer Pearls is a great series. And literate programming is neat to at least know of.

cb321 · on March 15, 2021

Knuth-McIlroy comes up a lot. Previous discussion [1]. For this example I can make a very similar Nim program [2] run almost exactly the same speed as `wc -w`, yet the optimized C program runs 1.2x faster not 1.15x slower than `wc -w` - a 1.4x discrepancy - bigger than many ratios in the table. So, people should be very cautious about conclusions from any of this.

[1] https://news.ycombinator.com/item?id=24817594

[2] https://github.com/c-blake/adix/blob/master/tests/wf.nim

chris_st · on March 15, 2021

Thanks for doing a nim version! I've been learning nim, and am very much a beginner, so I thought I'd tackle a simple version [0]. I'd appreciate any critiques... I don't know how idiomatic it is.

Honestly, the optimized version you created is kind of opaque; I don't want people to think that that's what "typical" nim looks like.

I'll do a pull request on the project for mine as simple (I'd have added your version as an optimized version, but I couldn't get it to compile).

On my machine, the simple-c version runs in 0.70 seconds, my simple nim runs in 0.73 seconds (and they produce the same output, which is nice :-), the nim executable is 95k.

[0]: https://github.com/csterritt/word_frequency_nim/blob/master/...

cb321 · on March 15, 2021

Sure. Feel free. I agree it's probably a bit advanced, and it breaks a rule or two of the "constraints" some people seem passionate about and needs to include punctuation to match results.

Feedback-wise, yours isn't so bad, though I haven't tested it. There is definitely a simpler one that looks roughly like the Python and probably runs a bit slower than yours. That would probably be a better/more fair candidate for the "simple" category, but it probably still has ok-ish performance.

cb321 · on March 15, 2021

This may be more fair for the "simple" case in Nim.

    import tables, strutils
    var counts = initCountTable[string](16384)
    for line in stdin.lines:
      for word in line.toLowerASCII.split:
        counts.inc word
    counts.sort            # SortOrder.Ascending to reverse
    for word, count in counts:
      echo word, " ", count

I compiled with `nim c -d:danger --gc:orc --panics:on` with gcc-10.2 on Linux-5.11 with nim-devel and file in /dev/shm. Runs in about 1.97x the time of "wc -w" for me (.454s vs .2309).

If we apply "BS-scaling" to the article table that would be 2.27*.2 = 0.454 sec on the author's machine which would make it twice as fast as the "simple C". Yet, if I actually run the simple C, I get 0.651 seconds, so only 651/454=1.43x "simple C" speed. This mismatch is, again, bigger than Go vs. RustB (0.38/0.28 = 1.35x). My only point is that you cannot just apply BS scaling and these results may very well fail to generalize across environments.

For what it's worth, literally every time I re-do anything in Rust in Nim, the Nim is faster. But benchmarks are like opinions...everyone has them and you should very much form your own, not delegate to "reputation". The best benchmark is actual application code. I make no positive general claims, but say this only because Rust people so often do. { EDIT: I think it is generally a big mistake to make many assumptions about prog.lang performance, especially if the assuming is on the "must be fast" side. }

chris_st · on March 15, 2021

Thanks! That's a much cleaner implementation, and I learned some cool nim. I got a note on GitHub with a similar implementation, so I went with theirs.

And, cool, "BS-scaling" is my new term of art :-)

josefx · on March 15, 2021

I think most of the uglynes in the c++ template case comes from the tool not resolving the standard types. Just replacing the std::string boilerplate reduces it by roughly half.

  0.00    0.00   32187/32187 std::vector<std::pair<std::string,int> >::vector<std::__detail::_Node_iterator<std::pair<std::string const, int>, false, true>,void> 
  ( 
     std::__detail::_Node_iterator<std::pair<std::string const, int>, false, true>, 
     std::__detail::_Node_iterator<std::pair<std::string const, int>, false, true> > const&
  ) [11]
  0.0    0.00    0.00   32187         void std::string::_M_construct<char*>(char*, char\*, std::forward_iterator_tag) [8]

So that looks like a std::vector<pair<string,int> > constructor taking two map iterators as input and an internal string member taking char pointer based iterators as input.

macgyverismo · on March 15, 2021

I was suprised at the C++ outcome, so I started testing myself. I was very amazed to find a 4~5x speedup when reading directly from file (std::ifstream) instead of reading stdin. This was on a 10.5.7 MBP 2014

[edit]

more like 9~10x actually...

  time { ./build/Release/countwords_ifstream; } | tail
  ...
  real 0m1.337s
  user 0m1.332s
  sys 0m0.022s

  time { ./build/Release/countwords_original < kjvbible_x10.txt; } | tail
  ...
  real 0m12.184s
  user 0m12.138s
  sys 0m0.044s

changes between these two:

  6a7
  > #include <fstream>
  8a10
  >  std::ifstream inp( "/Users/macgyverismo/Desktop/test/kjvbible_x10.txt" );
  13c15
  <     while (std::cin >> word) {
  ---
  >     while (inp >> word) {

cornstalks · on March 15, 2021

I dug into this once because reading from stdin on my Mac was so slow, but only when using std::cin. Using C APIs to read stdin was fast. Turns out libc++ on mac uses `getc` to read from stdin one character at a time. Madness.

darwingr · on March 16, 2021

Huh, well that makes a lot of sense. Three years ago when I was in University practicing for competitive coding we came across a limit in the performance (ms) we could get on even simple problems. The professor leading the practice poked away at his own solution to some problem for days because he couldn't get a better time than the top score online. Eventually he implemented his own versions of cin and cout. Needless to say he was pretty frustrated but there's just some problems that are hard to let go of without a solution.

w-m · on March 15, 2021

Similar here, reading the file from stdin on a 2020 MBP on macOS 11.2.2 is incredibly slow.

    int main() {
        std::ios::sync_with_stdio(false);
        std::string str((std::istreambuf_iterator<char>(std::cin)),
                         std::istreambuf_iterator<char>());
    }

takes around 8 seconds to run, no counting at all.

    int main() {
        std::ifstream f("kjvbible_x10.txt");
        std::string str((std::istreambuf_iterator<char>(f)),
                         std::istreambuf_iterator<char>());
    }

runs in ~270 miliseconds. Factor 30x, yikes.

c-cube · on March 15, 2021

Could be a lock on stdin that is acquired in a loop? Rust has such a lock and typically one should lock stdin at the beginning and read from within the critical section.

FAANG_dream · on March 15, 2021

Try this maybe? https://codeforces.com/blog/entry/6251

macgyverismo · on March 15, 2021

Asuming you mean 'std::ios::sync_with_stdio(false);', that was already present in the original code.

usefulcat · on March 15, 2021

I tried that on Linux and got at best a 5% improvement.

benhoyt · on March 15, 2021

Yeah, I'm on Linux too and tried that with no noticeable difference in results. There's something off about macOS's approach.

nerdponx · on March 15, 2021

Is that due to something at the OS or filesystem driver level?

jonathanstrange · on March 15, 2021

I've got to say I'm rather happy with Go's performance and verbosity of the unoptimized version in this micro-benchmark. For me it's just the right trade-off. Go was the right choice for my current project. I don't need it faster but wouldn't want my garbage-collected language to be substantially slower.

I suppose SBCL would be in a similar ballpark for this task, though, and would love to see an unoptimized SBCL version.

easytiger · on March 15, 2021

This test also doesn't take into account if this was a long runing process continually churning whats the effect on GC/RC pauses etc.

_ph_ · on March 15, 2021

The nice thing about the Go GC is, it runs completely in parallel to your program, so that there are no significant GC pauses any more (typically far below the millisecond range, rather microseconds).

Groxx · on March 16, 2021

Broadly, yes, Go's GC is quite fast and unobtrusive. They've done a good job tuning it for many purposes. On the post's programs though it's quite possible it doesn't come into play at all, as the compiler keeps things on the stack when possible.

But like all GCs, there are cases where it has issues. Anecdotally, I've seen multiple cases where "ballast"[1] reduced RPC response times by around 25%. A bit of GC tuning on some tests sped them up by over 10x (due to pathological thrashing - each test would go from just-below the GC threshold to just-over). Etc. Granted, it's usually on not-great code, but the point still stands - if you care about performance, GC cannot be ignored, regardless of how good its marketing is.

[1] https://blog.twitch.tv/en/2019/04/10/go-memory-ballast-how-i...

easytiger · on March 15, 2021

I'm aware, but alas in my domain anything above 50us is "noticeable" if it happens regularly

analognoise · on March 15, 2021

easytiger · on March 15, 2021

Something like that yes

_ph_ · on March 15, 2021

It might be a noticeable delay if you require quick response by the program, but won't have a significant impact on the total run time of the program.

easytiger · on March 16, 2021

It depends on your requirements.

Some domains might state a worst case for a single transaction or percentage of transactions that definitely exceeds what can be achieved here

jonathanstrange · on March 16, 2021

Ada with the Ravenscar profile enabled is a good choice for such scenarios. Together with a real-time operating system it allows you to satisfy hard real-time guarantees. As you certainly know, though, hard real-time constraints require hardware validation.

easytiger · on March 17, 2021

Yea, for a lot of requirements hard realtime isn't necessary and at least some older linux systems can be tuned to remove almost all jitter giving very good 99.9th percentile figures. I've even seen it done with zero alloc java

oefrha · on March 15, 2021

Text processing, without a single mention of Perl. How times have changed.

asicsp · on March 15, 2021

And, Perl's hash implementation can perform better compared to Awk's associative arrays. The `SCOWL-wl.txt` file used below was created using http://app.aspell.net/create. `words.txt` is from `/usr/share/dict/words`. Mawk is usually faster, but GNU Awk does better in this particular case.

    $ wc -l words.txt SCOWL-wl.txt
      99171 words.txt
     662349 SCOWL-wl.txt
     761520 total

    # finding common lines between two files
    # shorter file passed as the first argument here
    $ time gawk 'NR==FNR{a[$0]; next} $0 in a' words.txt SCOWL-wl.txt > t1
    real    0m0.376s
    $ time perl -ne 'if(!$#ARGV){$h{$_}=1; next}
                     print if exists $h{$_}' words.txt SCOWL-wl.txt > t2
    real    0m0.284s

mhd · on March 15, 2021

Would

    perl -lane '$c{lc $_}++ for @F;END{print "$_ $c{$_}" for sort {$c{$b}<=>$c{$a}||$a cmp $b} keys %c}' < input.txt

work?

faizshah · on March 15, 2021

Nice! I came up with:

perl -nle '$w{lc s/[[:punct:]]//r}++ foreach split(/\s+/,$_); print map{"$w{$_} $_ \n"} (sort{$w{$a} <=> $w{$b}} keys(%w)) if eof()' < kjvbible.txt

If anyone's curious this takes .47 seconds on my local machine whereas the optimized shell solution takes about 3.5 seconds.

Edit: It looks like mhd's usage of autosplit -a and @F are faster than split($_) but calling print foreach key is slower than my print map strategy.

So our solutions are combined to get:

perl -anle '$w{lc s/[[:punct:]]//r}++ foreach @F; END{print map{"$w{$_} $_ \n"} (sort{$w{$a} <=> $w{$b}} keys(%w))}' < kjvbible.txt

Which runs in .39 seconds. mhd's solution runs in .42 seconds. For reference the optimized C version takes .059 seconds (possibly cause of clang?), wc -w takes .023s and the optimized python version takes .288

FireTrader · on March 15, 2021

I haven't written perl in 5 years. Reading this evokes a rare sense of joy / Stockholm syndrome.

jlg23 · on March 15, 2021

If you interviewed for a perl job, you'd probably be disqualified for writing code that is too readable ;)

mhd · on March 15, 2021

I know this is meant as a joke, but I just use this to once again note that Perl actually had a very common linter/pretty printer way before it became all the rage (perltidy/perlcritic). I also wish other languages had a book like Perl Best Practices, that goes beyond just syntactical issues.

To be fair, I once came across a sentiment that was pretty close to "readability is bad". Taken from both the (bourne) shell and awk, perl has lots of short variables that determine parsing input and other matters, like "$/". You can then `use English` to have alternative names for that, but those might not be as well known ("$/" would be "$INPUT_RECORD_SEPARATOR", or "$RS" as a tribute to awk). So the line noise might be more common and thus more understandable than the "readable" version.

I don't quite buy that for e.g. "$INPUT_RECORD_SEPARATOR", but there's an argument to be made for "$RS" or "$ARG" being worse than the Asterix swear words.

jlg23 · on March 15, 2021

> I know this is meant as a joke, but I just use this to once again note that Perl actually had a very common linter/pretty printer way before it became all the rage (perltidy/perlcritic).

And all great perl hackers deliver readable code. But we still enjoy a round of perl golf once in a while. And honor demands, that, if you respond in perl golf style to an interview question, you make it count ;)

anthk · on March 15, 2021

Perl is readable, just indent it.

nerdponx · on March 15, 2021

What is the readable indented version of this?

   perl -anle '$w{lc s/[[:punct:]]//r}++ foreach @F; END{print map{"$w{$_} $_ \n"} (sort{$w{$a} <=> $w{$b}} keys(%w))}'

faizshah · on March 15, 2021

I wrote that as a one liner it’s not supposed to be readable. Here’s a readable version of essentially the same program:

https://perlmaven.com/count-words-in-text-using-perl

jlg23 · on March 15, 2021

I know, I still have my black belt in perl somewhere. The one-liner is readable, too.

Throw in some needless array-referencing-and-dereferencing so people can rightfully consider perl a write-only language...

nerdponx · on March 15, 2021

Unfortunate, because for all the badness of Perl, it shines in exactly this case of "read a potentially-infinite stream of text from stdin, process it, and dump another stream to stdout".

mfontani · on March 15, 2021

What would "all the badness of Perl" be?

nerdponx · on March 15, 2021

Chaotic standard library with cute-but-difficult-to-search-for names like "chomp", messy and complicated rules around sigils and scalar() for arrays and tables, lots of magic convenience like "$_" being optional in loops, s/// being "destructive" by default, and the need for some ugly hacks around handling subroutine arguments.

Perl has plenty of "good parts", too. But it is not an easy language to learn and even good idiomatic Perl code can be difficult to read. I don't blame people for favoring Python and Ruby.

mfontani · on March 15, 2021

I often see examples comparing one-liner worthy code written in one language with beautiful, linted, critique-passing indented code written in other languages, and on and on goes the Perl bashing based on examples of the former.

keanebean86 · on March 15, 2021

My only complaint is the dollar sign. Same with PHP. It take time to build the muscle memory for that.

cgh · on March 15, 2021

I've always believed Perl has an alignment of chaotic good.

outside1234 · on March 15, 2021

See the unreadable unmaintainable code above.

mhd · on March 15, 2021

My version? That's a one-liner, one shouldn't maintain those. It was mainly intended to show that a lot of the logic for simple cases is so common that perl (and some other languages) have those as command line switches.

That also makes it hard to win for total outsiders, though. This touches on core functionality, which often has quite a legacy behind it and thus its own vernacular. I can write `while (my $line = <STDIN>)` instead of `while(<>)`, but I can't get as easily rid of "chomp" being confusing to most people. A bit like car/cdr in Lisp.

But that happens even if you have plain english functions and no special cased syntax. Looking at the simple python version, it took some effort for me to remember what `Count.update(lst)` does, or that `most_common` without arguments returns all elements.

faizshah · on March 15, 2021

Honestly, when you don’t know perl it looks really unreadable but if you spend some time learning it it’s not that hard it’s just many of the operators are not google-able. The harder part, from experience, is maintaining perl scripts. A lot of perl scripts from the early 2000s are written in extremely hacky ways without modern code style sensibilities. In 2021 I don’t see any reason to use perl over python3 anymore, python3 is fast, widely used, easy to write, and easy to learn. Even the bioinformatics folks, some of the last perl holdouts, have been switching to python.

Since learning perl a couple years ago I personally only use it for one liners to replace sed/awk. If I do use perl in a script it’s usually in one liner form in a bash script to post process output from something like ripgrep.

jjav · on March 15, 2021

> In 2021 I don’t see any reason to use perl over python3 anymore

Regex handling is still far nicer in perl than anywhere else, so for any script which is primarily about string parsing with regular expressions, perl is the right tool for the job.

mfontani · on March 15, 2021

> In 2021 I don’t see any reason to use perl over python3 anymore

Is there, in 2021, a "use strict" equivalent for Python, or can one still misspell variable names and not be warned about it?

nerdponx · on March 15, 2021

You would get a NameError (or maybe an UnboundLocalError in some cases) if you use a nonexistent variable name in Python. At least, this has been true since 2013 when I started learning Python.

nerdponx · on March 15, 2021

For what it's worth, there's nothing in the Perl manual that says you have to write unreadable, unmaintainable code.

Although maybe it's hard to judge a programming language separately from the culture around that programming language.

svat · on March 15, 2021

There is a question on Stack Exchange about exactly this: https://codegolf.stackexchange.com/questions/188133/bentleys...

I haven't looked into it yet (whether the benchmarks are comparable) but it would be interesting to check the programs there against the ones in this post, and either update the benchmarks here or add new submissions to that question.

For a while the fastest program there was one in Rust, then out of curiosity I translated Knuth's program from 1986 into C++ and a simplified version based on it was even faster; currently a Rust translation of the idea (a custom trie using not too much memory) is the fastest one there. (In fact, the current fastest two Rust submissions there came after I linked to the question from HN in Feb 2020, in a couple of comments at https://news.ycombinator.com/item?id=22221592 and in the discussion on a "Donald Knuth was framed" article: https://news.ycombinator.com/item?id=22406070.)

Edit: I tried to do the comparison myself but one issue I ran into (apart from minor things like taking input from filename vs stdin, and printing top N words versus all, which are easily handled) is that the benchmark on the StackExchange question considers "words" to be made of alphabetic characters (a–z), and the trie solutions make use of that fact (a trie node has 26 children etc), while the benchmark from this post includes punctuation (e.g. it has "words" like "him," and "him." i.e. "him" followed by a comma or period are counted as separate words). I gave up at this point, but maybe someone with more energy could modify the benchmark here and try comparing them.

benhoyt · on March 15, 2021

Very interesting -- I hadn't seen that. Thanks for the link.

throwaway13337 · on March 15, 2021

For more language benchmarks of this sort, there is a pretty extensive list:

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

Here, they have code samples and multiple implementations of each task in each language.

It's a great resource for, at a glance, how performant a language is at basic cpu and memory intensive tasks and how simple the related code is.

It still has the issue that some code is unrealistically optimized for the language. And also that many solutions use standard libraries written in other languages. It's worth looking at the source, therefore.

_ph_ · on March 15, 2021

I am very sceptical of these benchmarks for the reasons you listed. The individual implementations are often too different to give a comparable score. And sometimes this is because of the rules. Like the requirement to use the language-provided hash tables, while C is free to implement an optimized hash table for the problem.

igouy · on March 15, 2021

You are very sceptical of which benchnmarks ?

For the benchmarks game, C is not free to implement an optimized hash table for the problem.

The hash table used is from "Klib: a Generic Library in C".

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

_ph_ · on March 15, 2021

Just looking at the first lines of the linked code, I see the following:

  // Define a custom hash function to use instead of khash's default hash
  // function. This custom hash function uses a simpler bit shift and XOR which
  // results in several percent faster performance compared to when khash's
  // default hash function is used.

quite contradicting your statement.

igouy · on March 15, 2021

Again, for the benchmarks game, C is not free to implement an optimized hash table for the problem.

Any of the programs are free to implement a "custom hash function" and many of them do.

hash table != hash function

_ph_ · on March 15, 2021

That is picking terms. Many languages don't allow to specify custom hash functions, so you can't specify the hash functions until you do write a custom hash table. A fair benchmark would allow for this, if it is allowed in C.

igouy · on March 16, 2021

The OP "Performance comparison: counting words in Python, Go, C++, C, Awk, Forth, Rust" does not allow external libraries BUT then makes an exception for C, and then makes a further exception -- In the optimized version we’ll roll our own hash table.

Wikipedia is your friend -- "A hash table uses a hash function to compute an index, also called a hash code …"

Which languages shown on the benchmarks game website 'don't allow to specify custom hash functions …" ? https://en.wikipedia.org/wiki/Hash_function

_ph_ · on March 16, 2021

I was not talking about the article but the benchmark site linked. Not sure what you want to say with the wikipedia citation. I have not made a comprehensive list, but the topic has come up when discussing one of the Go benchmarks. The Go language has a build-in hashtable, called a map, and this does not allow to specify custom hash functions. Of course rolling out a hashtable implementation which does allow this is prohibited by the benchmark rules. Probably intentionally, also like your distraction attempts are hinting to.

igouy · on March 16, 2021

So you say "Many languages don't allow to specify custom hash functions" but can only name one?

> Probably intentionally

Wow.

igouy · on March 16, 2021

Please share the widely agreed criteria to identify code we would all call "unrealistically optimized" ;-)

2 tasks — pidigits & regex-redux — explicitly allow use of standard libraries (GMP, PCRE2, RE2) written in other languages.

8 tasks do not.

(Those standard libraries were allowed because some language implementations provided arbitrary precision arithmetic or regex by wrapping those standard libraries.)

Of course, realistic use of some languages relies on calling standard libraries written in other languages.

tadkar · on March 15, 2021

There is a similar great project here [1] with the Hungarian Wikipedia corpus. Great workout for non English and maybe non-ascii operations.

The performance of Java there is super impressive. It should port relatively quickly to this file too...

[1] https://github.com/juditacs/wordcount

jeromeof · on March 15, 2021

Great, it would be nice if Java was included by the op

_old_dude_ · on March 15, 2021

No java ! here is my code golf version

  var map =
      new BufferedReader(new InputStreamReader(in, UTF_8))
          .lines()
          .flatMap(compile(" ")::splitAsStream)
          .collect(groupingBy(w -> w, () -> new TreeMap<>(reverseOrder()), counting()));

It obviously doesn't work without the imports

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.util.TreeMap;

  import static java.lang.System.in;
  import static java.nio.charset.StandardCharsets.UTF_8;
  import static java.util.Collections.reverseOrder;
  import static java.util.regex.Pattern.compile;
  import static java.util.stream.Collectors.counting;
  import static java.util.stream.Collectors.groupingBy;

killingtime74 · on March 15, 2021

Maybe graalvm native version could be competitive

MauranKilom · on March 15, 2021

Regarding speed of the C++ version: The fundamental mistake is to use std::unordered_map. The standard library map and set implementations (both ordered -> tree-based and unordered -> hash-based) are almost never the correct choice if you care about performance.

That's definitely a flaw in the language, but like with C you would just use a different implementation for this purpose.

usefulcat · on March 15, 2021

Although std::unordered_map is definitely not nearly as fast as a hash map can be in c++, that's probably not the main problem here. I would bet on iostreams and/or tolower being bigger problems.

Note that tolower is called for every /character/. I don't know everything it does, but I know it's doing some semi-esoteric locale stuff, and in general it's doing a lot more than a naive ascii implementation would. I seem to recall it's even doing a dynamic_cast in there somewhere. For each character of input.

I was curious, so I ran callgrind on it. Here are the top 10 offenders (by Ir, 'instructions retired'). For clarity I have elided many details. Total Ir is 5,647,099,795, so the stuff below accounts for greater than 50% of all Ir.

    1,079,594,033  std::istream& operator>>(std::istream&, std::string&)
      553,657,533  std::istream::sentry::sentry(std::istream&, bool)
      550,159,646  __cxxabiv1::__vmi_class_type_info::__do_dyncast(...)
      426,992,280  __dynamic_cast
      413,240,330  std::_Hash_bytes(void const*, unsigned long, unsigned long)
      383,294,340  tolower 
      230,374,111  std::string::_M_append(char const*, unsigned long) 
      223,653,549  /usr/include/c++/9/bits/stl_algo.h:main
      172,438,098  __strcmp_avx2
      164,226,680  std::ctype const& std::use_facet(std::locale const&)

jjnoakes · on March 15, 2021

To expand on this a bit, the C++ unordered map (hash table) uses basically a linked list to track the entries in each bucket (separate chaining for collisions) because the standard requires that operations that may resize the map don't invalidate iterators.

You pay for this with extra memory allocations and non-contiguous memory access.

In most low level benchmarks, other collision mechanisms (probing, etc) perform much better.

MaxBarraclough · on March 15, 2021

Would be interested to see a similar comparison for a problem where the Python solution is unable to offload a significant amount of the computation onto its standard-library.

Also, nice to see Forth included. I strongly suspect that if the Python solution were unable to leverage its rich and well optimised standard-library, it would be easily outpaced by Forth.

Also, the Forth solution used Gforth, which is far from the fastest Forth engine around. [0] It would likely have performed close to C if they had used a native-code-compiling Forth engine (SwiftForth, VFX Forth, or iForth, all of which unfortunately are proprietary payware).

[0] https://github.com/ForthHub/discussion/issues/88#issuecommen...

oefrha · on March 15, 2021

> if the Python solution were unable to leverage its rich and well optimised standard-library

collections.Counter is a very straightforward collection backed by dict. The code isn't specially optimized, you can write the same code yourself with dict and even avoid (probably a negligible amount of) overhead.

Now, dict is optimized, but that's a builtin type. You don't need PSL.

https://github.com/python/cpython/blob/2fe408497e6838b6fb761...

MaxBarraclough · on March 15, 2021

The article provides profiling data of the optimised Python solution. If I'm reading it right, 90% of the time (1148 milliseconds out of a total of 1280) is spent executing the _collections._count_elements and str.split methods, both of which are Python standard-library methods presumably implemented in optimized C.

Of course, they're right to optimize this way, and it's to Python's credit that it's possible to leverage this approach so successfully using only the standard-library, but I think my earlier point stands.

oefrha · on March 15, 2021

I practically linked to _collections._count_elements already (linked Counter.update instead because I thought linking to the call site is clearer), but here it is: https://github.com/python/cpython/blob/2fe408497e6838b6fb761...

Again, very straightforward, based on dict. Python itself does not function at all without the dict implementation since it's the underpinning of the object model.

str is a builtin type. Calling str.split() a standard library method is... okay.

benhoyt · on March 15, 2021

That link shows that yes, "_count_elements" is implemented in Python, but right after it pulls in the C version if available (which it presumably is in CPython):

   try:                                    # Load C helper function if available
       from _collections import _count_elements
   except ImportError:
       pass

Here is the C version (with a comment a little way down describing the "fast path advantages": https://github.com/python/cpython/blob/93d33b47af70ede473f82...

MaxBarraclough · on March 15, 2021

Well spotted, thanks.

MaxBarraclough · on March 15, 2021

Thanks for the link, interesting that _count_elements is written in Python. At a glance it looks like the kind of thing that might benefit from being written in C, but presumably that's not the case. Perhaps because of the dynamic typing of both arguments?

> Python itself does not function at all without the dict implementation since it's the underpinning of the object model.

Sure, but I don't see the point here.

> Calling str.split() a standard library method is... okay.

For our purposes there's no reason to draw a distinction between standard-library functionality closely integrated into the language, and standard-library functionality that isn't. The relevant point is that computational work is being handed off to optimised C code.

Again, if you're optimising Python, the smart move is indeed to make maximal use of the optimised machinery in the standard-library (or for that matter some other dependable library). My point still stands: it would be interesting to look at a problem that isn't amenable to this approach, where the performance of the Python interpreter itself would be brought to the fore by necessity.

How easy it is to find such a problem will be a function of how good a job Python does of providing applicable optimised functionality in its standard-library. Perhaps something like computing matrix determinants? I don't know enough Python to say.

acdha · on March 15, 2021

It can be interesting – especially if you also compare it to something like the PyPy JIT – but it runs afoul of the idiomatic angle since Python has had decades of recommended practice being to write code in Python and optimize the hotspots in C, which has led to very popular libraries like NumPy, PIL/Pillow, etc. becoming almost universal in community recommendations.

MaxBarraclough · on March 15, 2021

Right. They don't emphasise improving the performance of the Python interpreter precisely because they prefer to handle computational heavy-lifting in C instead. Python has been very successful with this approach. Java takes the opposite approach, promoting 'purity' and investing very heavily in JVM performance.

It would still be interesting to see for benchmark purposes.

acdha · on March 15, 2021

Agreed — and especially so if you could see how common it is that CPython performance tanks but PyPy (or Jython?) does not after the JIT kicks in.

heavyset_go · on March 15, 2021

> Perhaps because of the dynamic typing of both arguments?

It wouldn't matter, you'd be dealing with PyObject pointers in C anyway.

MaxBarraclough · on March 15, 2021

Right, that's what I meant. If the C code has to cope with dynamic typing anyway, there might be little scope to speed things up.

The same isn't true for, say, string operations, where there might be plenty of opportunity to move tight loops from Python to C and speed things up considerably.

heavyset_go · on March 15, 2021

Truthfully, I don't think optimization was a motivation for writing _count_elements() in Python or not. The CPython project doesn't prioritize optimization of the language or standard library unless there are clear performance gains to be made in the face of slow code. What most likely happened is someone wrote some code on a mailing list or bug report, and a maintainer included it as is.

Reading the bug report that introduced Counter[1], the author states it is the simplest implementation they came up with, and Guido and other maintainers prioritized simplicity.

[1] https://bugs.python.org/issue1696199

MaxBarraclough · on March 15, 2021

Interesting, thanks.

MaxBarraclough · on March 15, 2021

(Too late to edit) benhoyt's comment points out that _count_elements has two implementations, one in Python and one in C. The Python version is used only if the C function is not found.

musingsole · on March 15, 2021

I often wonder how fast I'd run a race if I shot my foot too.

bjarneh · on March 15, 2021

> I still like it, because (unlike C++) I can understand it

No article like that is complete without a little C++ critique :-)

> I think it’s the simple, idiomatic versions that are the most telling. This is the code programmers are likely to write in real life.

I very much agree with him on this statement. If people started writing "optimized" versions of algorithms (on that level); who would ever be able to read others peoples code...

wirthjason · on March 15, 2021

I saw a talk by Matt Godbolt demoing the compiler explorer showing showing the assembly between idiomatic C++ vs someone trying to outsmart the compiler. In all cases the idiomatic version was superior in performance and readability / maintainability.

s3cur3 · on March 15, 2021

Link to the (great) CppCon talk: https://www.youtube.com/watch?v=bSkpMdDe4g4

The one thing I would say is: as you dig more into C++ perf, you can develop an intuition for the kinds of things the compiler can and can't optimize. For instance, the C++ example from the article uses `std::unordered_map`, which is just an absolute mess from a cache locality perspective—the best compilers today (or of the foreseeable future) can't do a thing to fix that. Improving the programmer's choice of data structure is just not on the radar. :(

gpderetta · on March 15, 2021

std::iostream is going to be hopelessly slow, but I would like to point out that -O3 is more appropriate for templated C++ code.

It seems that the two hot spots are the std::string constructor (unsurprisingly) and the vector constructor.

Making string faster is not easy. I would check that gcc is configured to use C++11 strings with the small string optimization and not the older C++03 compatible refcounted string. Writing your own string or a custom allocator are options, but probably overkill for the problem.

The vector constructor is more surprising. It is possible that -O3 could help a bit there. Wrapping the map iterators with std::move_iterator would also avoid a lot of string copies.

thestoicattack · on March 15, 2021

std::move_iterator was my first thought too, but I'm not sure if it works, since the key type for std::unordered_map is defined to be const, so you may have to make a copy anyway.

Maybe use a vector-of-pointer-to-pair, since we're leaving the map around and we don't really need the vector to own anything.

gpderetta · on March 15, 2021

good point about the string being const! There are ways around that, but none pretty.

usefulcat · on March 15, 2021

-O3 makes basically no difference for this code. I tried it.

sesuximo · on March 15, 2021

Why copy the string at all? You have string in file. Mmap file and just store string view of it.

gpderetta · on March 15, 2021

you can't mmap a pipe unfortunately and that's a very valid use case. At the very least you would need two distinct code paths.

ZoomZoomZoom · on March 15, 2021

I basically solved the same problem about a year ago. I was reading Gormenghast trilogy and for a considerable part of the book I stumbled upon rare English words of unknown or hazy meaning for me. I decided I need a tool to prepare Anki decks with those words, bu as I'm too lazy to manually log them, I thought I just use the same algorithm, take the top most infrequently occurring words and generate a CSV with definitions from GCIDE[0] to load into Anki.

I used Rust and thought it's the textbook case for a trie, but what I found is that crates.io had quite a few tries but not all of them working or ergonomic enough. Only one of the libraries I tried (huh) gave me a slight advantage over a HashMap, although statistically insignificant, which is rather disappointing, considering the program outputted sorted frequency groups.

I'm pretty pleased with the results. I considered adding stemming[1], silly me. Too hard of a problem to bother.

0. https://gcide.gnu.org.ua/

1. https://en.wikipedia.org/wiki/Stemming

frankenst1 · on March 15, 2021

Here are my results with JavaScript (node.js v14.15.4) added:

    Language      | Simple | Optimized | Notes
    ------------- | ------ | --------- | -----
    `wc -w`       |   0.17 |      0.16 | `wc` reference; optimized sets `LC_ALL=C`
    `grep`        |   0.53 |      0.54 | `grep` reference; optimized sets `LC_ALL=C`
    Go            |   0.82 |      0.29 |
    C             |   0.83 |      0.20 |
    Rust B        |   1.10 |      0.28 | also by Andrew: bonus and custom hash
    JavaScript    |   1.29 |           | no readline
    JavaScript    |   1.81 |           | with readline
    Rust A        |   1.55 |      0.30 | by Andrew Gallant
    Python        |   2.14 |      1.17 |
    Ruby          |   3.59 |           |
    AWK           |   4.28 |      1.22 | optimized uses `mawk`
    C++           |   6.68 |      0.83 | "optimized" isn't very optimized
    Shell         |  42.60 |      9.50 | optimized does `LC_ALL=C sort -S 2G`

Btw: is there something wrong with my shell?

    //with readline
    const rl = require('readline').createInterface(process.stdin);

    const wordCounter = {};

    rl.on('line', (line) => {
        let words = line.toLowerCase().split(/\s+/).forEach(word => {
            wordCounter[word] = (wordCounter[word] || 0) + 1;
        });
    }).on('close',() => {
        let output = Object.entries(wordCounter)
            .sort(([,countA],[,countB]) => countB - countA)
            .map(wc => wc.join(" "))
            .join("\n");
        console.log(output);
    });

--------------------------------------------

    //without readline
    const wordCounter = {};
    let lastChunk = '';
    process.stdin.on('data', chunk => {
        let words = (lastChunk + chunk.toString().toLowerCase()).split(/\s+/);

        if (!words.length) return;
        lastChunk = words.pop();

        for (let word of words) {
            if (word) wordCounter[word] = (wordCounter[word] || 0)+1;
        }
    }).on('end',() => {
        if (lastChunk) wordCounter[lastChunk] = (wordCounter[lastChunk] || 0)+1;

        let output = Object.entries(wordCounter)
            .sort(([,countA],[,countB]) => countB - countA)
            .map(wc => wc.join(" "))
            .join("\n");
        console.log(output);
    });

ephaeton · on March 15, 2021

This shouldn't compare unicode-aware 'tolower' with ASCII-tolower ... sends false signals. "Optimizing" by setting LC_ to "C" .. oh my.

1f60c · on March 15, 2021

If you know something the author doesn't, I'm sure they'd appreciate a (constructive!) reaction blog post.

ephaeton · on March 15, 2021

Setting LC_* to 'C' nukes locale awareness by saying "we're only caring about ASCII-en-us" and thus goes back to the 'cheap' tolower that's also present in the C and C++ version, and most likely in the awk as well - in contrast to, say, the python version.

Finding the "lower case of a character" is immensely harder in face of unicode, because the table is way larger and there's no nice speed hacks by manipulating an index into the ASCII table. JFGI: "tolower performance unicode".

If you notice setting LC_ALL makes a difference in performance for you, you ought to be aware that now you're no longer comparing same capabilities.

I'm not sure how to constructively state that except for: One shouldn't compare ASCII and unicode "tolower" in performance comparisons, as you end up comparing apples and oranges (at best. More like apples and snails) - which I did.

If the author uses code that uses "tolower" in job interviews, i.e., evaluates candidates based on their input WRT case normalization - and he even writes ("This is Unicode-aware ...") - he should know that Unicode awareness is not ubiquitous, and comes at quite a cost. Knowing about unicode awareness, one would assume he'd be aware of not making a fair comparison.

burntsushi · on March 15, 2021

Can you state the specific comparison that is not apples-to-apples in the OP?

The only one I see is the comparison between 'simple' and 'optimized'. But that's more about "what does a simple idiomatic solution look like" and what does an "optimized and possibly less simple" solution look like. That comparison isn't designed to be apples-to-apples in the way you're saying. The simple variant will use Unicode-aware casing in environments where that's the simple and natural thing to do.

IIRC, most of the 'optimized' programs are using ASCII casing. Python doesn't, but casing isn't even close to the bottleneck in that program.

jerf · on March 15, 2021

I believe the author is fully aware of this, and just chose the ASCII version for all languages precisely so that we don't get into Unicode efficiency issues. Doing that wouldn't be objectively bad or anything, it just isn't the question the author is asking. It is still a fine question on its own terms.

benhoyt · on March 15, 2021

Yes, I'm definitely aware of this, thanks -- and as a result, you can't really compare the "simple" and "optimized" versions. Most of the "simple" versions (except C) are Unicode-aware, most of the optimized versions not -- though I acknowledge I haven't stuck to that always, e.g., in the C case, when it's just "too hard".

wmu · on March 15, 2021

Regarding the C++ version: instead of copying data to a vector and then sorting the vector, it might be more profitable to use std::multimap<size_t, string>. This is a sorted container (likely rb-tree based) that handles duplicated keys.

Also, as somebody else mentioned, iostreams are slow. Even simple reading of file can be several times slower than plain C (http://0x80.pl/notesen/2019-01-07-cpp-read-file.html).

jpcooper · on March 15, 2021

Maybe due to stdio synchronisation? This can be turned off.

https://en.cppreference.com/w/cpp/io/ios_base/sync_with_stdi...

usefulcat · on March 16, 2021

He did that:

    "This line makes it run almost twice as fast:
    ios::sync_with_stdio(false);"

_wldu · on March 15, 2021

I don't think Go gets the credit it deserves (from a performance aspect). I've had to rewrite a good deal of C++ code in Go and most of the time, the Go code runs faster.

debug-desperado · on March 15, 2021

Been true of Java for decades. Most application programmers aren't churning out highly tuned C++, and their design is easily outperformed with a bump allocator and generational GC.

cies · on March 15, 2021

Optimized grep is as fast as the non optimized grep. Seems odd.

I've seen ripgrep (in Rust, also by article contributor Andrew Gallant aka burntsushi) outperform grep in almost every way possible, both in benchmarks and in my experience.

burntsushi · on March 15, 2021

I think grep (and wc) are included in that table as a sort of baseline. Like, "roughly speaking, what can we expect?" Or maybe, "what can we hope for?" grep (or ripgrep) isn't otherwise particularly relevant here.

In the case of grep and wc, the "optimized" variant is running with LC_ALL=C, which sets the locale to C. Presumably, the non-optimized variant is using the OP's default locale, which is maybe something like en_US.UTF-8. (Apologies for the US assumption, but what matters is that it's not the C locale.)

This is a common trick for speeding up GNU utilities when you're only working with ASCII data. It also works for things like `sort`.

In the case of grep, sometimes running with and without the locale will be about as fast, particularly if you're only searching for a simple ASCII literal pattern.

rtb · on March 15, 2021

The "grep" and "wc" commands that are included in that table don't attempt to solve the problem. That wasn't clear to me on first reading.

I suppose they are included as a baseline for reading a file and for tokenising strings into words. See https://github.com/benhoyt/countwords/blob/eb2a8adf21c895907...

albertzeyer · on March 15, 2021

The optimized Python code seems to have a number of bugs:

The loop ending condition might miss non-handled `remaining` (when there is no new-line at the end of the file). (The fix should be simple. Just move the check one line below.)

When there is no newline in some chunk, it would handle the word incorrectly at the chunk boundaries. (But ok, with the constraint that lines cannot be longer than the chunk size, this should not happen. But this could have been fixed easily anyway. Just `remaining = chunk; continue` + the other fix.)

rtb · on March 15, 2021

What are the "grep" and "wc" lines doing in the results? I can't find them in his git repos.

Does he have a solution for the problem using grep, or is he just including "grep x <input" as a reference point for how fast you can read a file?

fmos · on March 15, 2021

The grep version searches the input for "foobar" [1]. Hence, it doesn't solve the problem.

[1] https://github.com/benhoyt/countwords/blob/eb2a8adf21c895907...

rtb · on March 15, 2021

Thanks! I think he should make that a lot clearer in the "results" table.

benhoyt · on March 15, 2021

Thanks for the feedback. I was trying to make that clear with "grep reference", but you're right, it's still ambiguous. I'll add a note to clear it up. Both "wc" and "grep" don't solve the problem, they're just comparisons to see how fast other programs can read the input (and in wc's case, tokenize into words). For grep, I'm using "grep foobar".

kzrdude · on March 15, 2021

(The C code.) Using hcreate and the fact that posix describes this interface which uses a hash table singleton, is just mind boggling. :) So ugly, and you'd hope anything as non-reusable as that is deprecated with a strong wording.

bluetomcat · on March 15, 2021

Even the "optimized" C version is far from what an experienced C programmer would write if performance was paramount. General-purpose memory allocation, using hash tables with inherently bad spatial and temporal locality, using buffered I/O instead of mapping the file to memory.

burntsushi · on March 15, 2021

You can't map an arbitrary stream into memory. That's one of the constraints of the challenge: it has to be able to work on a stream.

And even then, you can't memory map all kinds of files. This is what happens when you assume that you can:

    $ ag MHz /proc/cpuinfo
    $ grep MHz /proc/cpuinfo
    cpu MHz         : 2300.000
    cpu MHz         : 988.934
    cpu MHz         : 2300.000
    cpu MHz         : 800.044
    cpu MHz         : 2300.000
    cpu MHz         : 2300.000
    cpu MHz         : 2300.000
    cpu MHz         : 1100.949

In the optimized C program, allocation is not a bottleneck. Pretty much all allocation is done upfront and that's all you need. There is an allocation for writing a new word to the table, but that's also amortized and isn't a big factor in the performance of the program in this particular benchmark.

As for "bad spatial and temporal locality," can you be more specific? I guess the only thing I can see is perhaps inlining words smaller than some size into the same allocation as the hash table. But the hash table is otherwise one contiguous allocation.

bluetomcat · on March 15, 2021

> As for "bad spatial and temporal locality," can you be more specific?

One could use a lexicographic Btree-like structure with multiple characters per node, sharing consecutive cache-lines. For looking up the word "bar", you would traverse the "b" key from the root node, then its child "a" node, then the "r" child of the "a" node. Most of the nodes near the top would remain hot in the cache, and deeper less-visited nodes could be joined or compacted based on some criteria. Of course there are many subtleties, special cases and complications and the solution would be far more complex in terms of source code length.

With a hash table, you effectively compute a hash for every word, and then jump at arbitrary locations throughout the whole table.

burntsushi · on March 15, 2021

You've traded hashing for a memory access per byte. I did something similar in the 'optimized-trie' program for Rust: https://github.com/benhoyt/countwords/blob/master/rust/optim... --- It ended up being slower. But, it is very much impacted by cache effects. So the next step here would indeed be to figure out how to reduce cache misses. But it's non-trivial.

Like, maybe your idea works. Maybe. I don't know. You'd have to try it. But it's not clear to me that it will. And it doesn't support "is far from what an experienced C programmer would write if performance was paramount" IMO.

cb321 · on March 15, 2021

> That's one of the constraints of the challenge: it has to be able to work on a stream.

I cannot find any stream constraint in either the article or the countwords repo. He just says "standard input" and his "test.sh" uses "<kjvbible_x10.txt".

You can mmap stdin/fd 0 after an `fstat(0,..)` no problemo. Can it fail and should you check errors? Sure (as can accessing stdin in any way, actually).

He even mentions memory-mapped IO as a further way to go in the article in the C part.

burntsushi · on March 15, 2021

From the OP:

> Memory: don’t read whole file into memory. Buffering it line-by-line is okay, or in chunks with a maximum buffer size of 64KB. That said, it’s okay to keep the whole word-count map in memory (we’re assuming the input is text in a real language, not full of randomized unique words).

A shorter but less precise way of saying that is, "make the program work on streams."

> You can mmap stdin/fd 0 after an `fstat(0,..)` no problemo. Can it fail and should you check errors? Sure (as can accessing stdin in any way, actually).

Of course. And when it fails, what do you do? You defer to something that can handle a stream! Which is exactly the problem in the OP.

cb321 · on March 15, 2021

I disagree. mmap will demand page into memory - typically reading 4K at a time off storage (though really the buffer cache in all these timings). Much less than 64K. Were the file bigger than RAM/were there competition, it would also evict earlier pages. It's really just another way to "buffer" without worrying about all the fuss in user space, and I don't think you're going to convince me otherwise. I mean, MAP_POPULATE might violate the idea, but that is a very special case of mmap.

Honestly, I think the author himself saying memory-mapped IO is a forward direction but "enough for now!" is pretty conclusive that he didn't think it was a forbidden optimization direction. I think you are over interpreting his early step-by-step style language explaining finite memory as a strict spec, or perhaps he shared an earlier draft of the article with you before he wrote that.

It's fine to fail over to a stream, but often mmap is faster, as I know you know.

burntsushi · on March 15, 2021

If you mmap'd a file, how do you keep its maximum memory usage to 64KB? I actually don't know the answer to that question. It's definitely not the default. You don't get to control it. The OS does.

> It's fine to fail over to a stream, but often mmap is faster, as I know you know.

But that's exactly the point. You literally cannot use mmap in all cases, either because you just can't or because it's actually slower. And in those cases, you need to fail over to a streaming implementation. And that streaming implementation needs to be fast too. And that's what this benchmark is.

If you submitted a program that only used mmaps, and I sent a stream into that program, it would fail. So then you would need to modify the program to handle streams. And your strategy for dealing with mmaps couldn't be used. So then you'd need to optimize your handling of streams. Which is exactly what the programs in the OP are doing. So in the end, mmaps are a distraction for a benchmark like this, and I suspect that's why the OP didn't bother with them.

cb321 · on March 15, 2021

You keep ignoring the author literally opening it up as an avenue in:

>I’m sure there’s further you could go with the C version: investigate memory-mapped I/O, avoid processing byte-at-a-time, use a fancier data structure for counting, etc. But this is quite enough for now!

This subthread started with bluetomcat talking about the optimized C and what experienced C devs might do. That strikes me as fair game to open up discussion of alternative IO anyway even if the author hadn't already (which he clearly did, as quoted). So, that's two reasons it's worth bringing up, and I was never criticizing your program!

If what you are on about is "Who wins what scorecard against what arbitrary constraints" or "My hands were tied, really!" or whatever, then, sorry, but this benchmark is too uncontrolled for great answers even on its own terms. Various stdio-using things would be "out of constraints" if a system was configured with bigger than 64K default buffers anyway which is certainly possible, if unlikely, today. None of the sample programs or the test harness "check" for that. Some of the languages may not be able to ensure it. Using that to leapfrog to "only streams" is a stretch. Is CPU freq scaling controlled? Min of N trials to filter background noise/get repeatability? Even simple mean+-sdev? No, no, and no. And probably four more things.

Beyond all of that, I also don't think that's what this subthread was ever about. bluetomcat's opening was literally the opposite - equivalent to "real devs would do xyz implicitly independent of the arbitrary constraints posed if performance is paramount"..seemingly a follow up on my quote from the author. He can of course chime in if I misread that. Presumably, the author would have had to relax that already problematic 64K constraint when moving on to mmap (which moving on he might have done if the article were fewer languages or if he had just done it that way first in his open coded C).

The positions you seem dug into here seems to me "don't mention mmap to people questioning the posing of the contest even though the author did" or "you cannot default to mmap and fail over like fstat||mmap||do_streams||aiie_noStdInEven". Yes, which is faster varies by OS/situation. So? Maybe "experienced" C/whatever devs know their situation. (You use mmap in ripgrep...). Maybe these are all honest communication errors, but I don't think your positions are very tenable.

As I mentioned in a few places, performance conclusions here are harder than they might look. As for "distractions", one might say that about literally all the optimized variants, including their numbers in the table since the numbers are likely to change more. The article might be stronger to focus on only the simple variants of all the rest, leaving the optimized ones for the github repo and weird "contest rule" debates on the github issues.

Anyway, to add a little more actual information for passersby less dug in to defending some weird position like "", Nim's stdlib has a trie in critbits module. So, that test is "in bounds" and an easy experiment someone might enjoy.

Have a nice day.

burntsushi · on March 15, 2021

It's not about a scorecard. I didn't just dig my heels in and say, "language lawyering the OP says mmap isn't allowed so shut up." It's about what's a useful benchmark. I explained why mmap's are a distraction beyond just pointing to the OP's constraints. I don't think there's really much else to say and I disagree with your characterization of this thread.

shakna · on March 15, 2021

> So ugly, and you'd hope anything as non-reusable as that is deprecated with a strong wording.

It isn't. It was only added in POSIX 2001. Still there without major changes in 2018. [0] (There's actually a lot of similar libraries in POSIX, that don't include re-entrant forms.)

Thankfully, re-entrant versions are supplied as extensions by GNU, which is significantly less horrifying.

[0] https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/se...

drran · on March 15, 2021

It's not so ugly when you use C on a computer with less than 64Kb of RAM.

OskarS · on March 15, 2021

I would have loved to see the Knuth version (which was literate Pascal with WEB, I think) compared as well, but I imagine that is a nightmare to build. I wonder if the source code is available anywhere, I can't recall ever seeing it. I think Knuth used a custom-made trie, and it would be interesting to see that compared to these hash table based solutions.

Also: this comparison honestly just makes me love AWK more and more. So little code needed! So concise, and pretty ok performance as well! AWK really is an underrated little language.

Ballas · on March 15, 2021

But both wc and grep are written in C...

quietbritishjim · on March 15, 2021

I think the point was to compare the reasonably simplest idiomatic examples in the various languages. For example, the C++ example used `std::unordered_map` even though that isn't necessarily the fastest possible hash table for C++ for even general purpose jobs, let alone specific ones.

Of course you could write significantly faster implementations in C (and probably in the other languages) by writing data structures from scratch especially for the job (and memory allocation strategies and maybe other stuff), and if you're the author of wc or grep then that is well worth doing because it will be used so many times relative to the amount of time for you to write that code. The comparison of the idiomatic C program vs grep is basically proof of that. But the article seemed to be targetting more typical developers than just want to plug together some common existing components.

dcu · on March 15, 2021

https://lists.freebsd.org/pipermail/freebsd-current/2010-Aug...

dgellow · on March 15, 2021

You can see them as a baseline

Simplicitas · on March 15, 2021

exactly .. this is the part I don't understand

taeric · on March 15, 2021

I do get a wry grin from the trie not being faster. Feels like that is always a data structure that feels like it will speed things up, but really needs very specific settings to do so.

lionkor · on March 15, 2021

seriously, scanf in a loop, strdup, etc.?

the reason C can be so fast is because a SKILLED developer can write very optimized code. The C and C++ examples are not written well by any means.

All this article does is demonstrate that a less complex language is easier to write "good enough" code in.

methyl · on March 15, 2021

> All this article does is demonstrate that a less complex language is easier to write "good enough" code in.

Is Rust less complex than C?

ZoomZoomZoom · on March 15, 2021

Depends on your point of view. Rust has more complexity hidden in layers underneath the one a developer is working with directly.

mymatsubara · on March 16, 2021

The scanf and strdup are used in the unoptimized version, which I think is completely fair.

Shorel · on March 15, 2021

Real life application for this exercise (I use a python version from time to time):

Count the frequency of words in a book, then get the most frequent words you don't know in Anki, so you can learn the words that will help you understand more of the text.

This is of course taken from the Fluent Forever book.

Example from Малкият принц:

   какво 62
   са 61
   ако 58
   когато 56
   ме 55
   беше 53
   а 51
   планета 51
   ли 49
   човек 43
   едно 42
   бе 42
   ден 42
   който 40

machiaweliczny · on March 16, 2021

I have python script that takes on the book, does what you said and translates everything also. Very helpful, with this + list od 2K most popular words learned english in 2 books when in middle school.

Shorel · on March 17, 2021

Care to share it?

machiaweliczny · on March 17, 2021

Sure here it is[0]. This one is for parsing Kindle clippings but it's quite similar (except no sorting of all words, I was usually using some service for mobi to txt switch before feeding script). I should probably clean it up and switch implementation to english (but don't use it anymore).

[0] https://github.com/Machiaweliczny/KindleClippingsTranslator/...

Shorel · on March 17, 2021

Dziękuję bardzo!

dietrologia · on March 15, 2021

I wonder why the C++ version opted for an unordered_map, copied into a vector and then sorted. I'd be curious to see if using an std::map would work better.

pjmlp · on March 15, 2021

The C# is clearly not optimized, if the goal is to win micro-benchmarks.

- foreach has hidden allocations

- there are better data structures than Dictionary

- GetValueOrDefault has higher cost than a plain if

burntsushi · on March 15, 2021

There is no C# optimized variant because nobody wrote one. There is no claim that the C# program is optimized. Look at the table: there are "simple" and "optimized" variants for each language (except C#). The "simple" variants are intentionally not supposed to be optimized, and instead, use something approximating idiomatic code for the given language.

Someone1234 · on March 15, 2021

You're micro-optimizing the wrong thing, most of the time taken in C# is in the .Split(' ', StringSplitOptions.RemoveEmptyEntries) (which is UNICODE instead of ASCII as in other languages by the way) and also Console.ReadLine().

C# supports non-allocating splitting via the Span / ReadOnlySpan see this article about this very question:

https://www.meziantou.net/split-a-string-into-lines-without-...

You can also use the Win32 APIs to more efficiently hook the console input:

https://stackoverflow.com/questions/33342340/fast-reading-of...

You can change foreach to for or whatever, but after you deal with the major program's bottlenecks.

pjmlp · on March 15, 2021

Surely, I haven't bothered to list everything.

acdha · on March 15, 2021

The author named it “simple.cs” for presumably that reason and the “though not being a C# developer, there may be obvious things I’m missing” part makes me think they’d be happy to get your optimized contribution.

pjmlp · on March 15, 2021

I give my skills as free beer when it is relevant for me, otherwise there are just advices.

acdha · on March 15, 2021

Totally fair - my point was mostly that it seemed pretty clear that the author didn’t have much C# experience and would gladly accept an expert’s optimized version.

pjmlp · on March 15, 2021

Which is why I mentioned some low hanging fruit of possible improvements.

tester34 · on March 15, 2021

>there are better data structures than Dictionary

What built in you'd use?

andix · on March 15, 2021

Some structure, where you don't have to do a keynlookup twice, for incementing the number.

I would try it with a dictonary, that links to an array list with the counts.

gibsonf1 · on March 15, 2021

Unoptimzed version in Common Lisp (so has consing) (see performance against bible and bible x 10 later in this thread)

For kjvbible.txt at (unoptimized) 0.365 is 3rd fastest beating C, and 4th faststest beating optimized Go

(would be interesting to see how fast the optimized Lisp would be)

Common Lisp (sbcl.org):

(defmethod performance-count ((path-file string))

  (let ((map (make-hash-table :test 'equal)))

    (with-open-file (stream path-file :direction :input :if-does-not-exist nil)

      (when stream

        (loop for line = (read-line stream nil 'end)

       until (eq line 'end)

       do

       (let ((split (split-string #\space (string-downcase line))))

         (dolist (word split)

    (let ((index (gethash word map)))

      (if index

          (setf (gethash word map) (incf index))

        (setf (gethash word map) 1))))))

        (let ((keys (sort (alexandria:hash-table-keys map) (lambda(x y)(> (gethash x map)(gethash y map))))))

   (dolist (key keys)

     (format t "~A ~A~%" key (gethash key map))))))))

WEB> (time (performance-count "/home/frederick/performance-comparison.txt"))

the 4

foo 2

defenestration 1

paulgdp · on March 15, 2021

Your result is very surprising probably because you benchmarked on the very small example corpus whereas other languages were benchmarked on a much much bigger corpus

Frost1x · on March 15, 2021

Also not entirely sure how one plans to compare runtime performance when the original article didn't really describe the hardware it ran on that I could see skimming it over. If you can get better runtime performance results and run against ancient hardware of the same architecture, you may be able to assume you're below the lower bound of the benchmark system explored. If your runtimes are better in that case, you may be able to place high confidence that the LISP example in this case actually is more performent. Definitely would run it against the same inputs they used since they provide it and describe how you can easily get/derive the input (some Project Gutenberg ebook concatenated 10 times).

I suppose you could also recreate all their examples to create your own baseline of runtime performance but that's a lot of work for what seems to be a not-very empirical benchmark (at least to me).

Disclaimer: I did not check runtime complexity of any of the implementations because I didn't really care and skipped straight to the performance results table.

woliveirajr · on March 15, 2021

On a 4.3MB file:

https://github.com/benhoyt/countwords/blob/master/kjvbible.t...

duckerude · on March 15, 2021

That file repeated 10 times, even: https://github.com/benhoyt/countwords/blob/master/test.sh

gibsonf1 · on March 15, 2021

On x10 (no optimization yet so consing is killing performance):

firmament: 10

firmament, 10

genesis 10

version 10

Evaluation took:

  2.798 seconds of real time

  2.813069 seconds of total run time (2.677126 user, 0.135943 system)

  100.54% CPU

  8,125,595,437 processor cycles

  934,110,048 bytes consed

Sekhmet · on March 15, 2021

It's just testing the output. For benchmark each variant is run 5 times with kjvbible.txt as input and then the lowest execution time is used as benchmark.

gibsonf1 · on March 15, 2021

In that case, unoptimized lisp for kjvbible.txt at 0.365 is 3rd fastest beating C