$ time wc -w 100m
2266395 100m
real 0m4.568s
$ cc -o wc wc.c
$ time ./wc 100m
2343390 100m
real 0m0.511s
Of course, it disagrees on the answer, because I just used 100M of random data and it doesn't care about wide characters. It gives the same answer as GNU on plain ASCII text.
It's not faster because it's better, it's faster because it is doing less.
Alternatively, here's my entry for "Beating GNU wc with GNU wc":
$ dd if=/dev/urandom of=100m bs=1M count=100 2> /dev/null
$ time wc -w 100m
2319144 100m
real 0m4.543s
user 0m4.512s
sys 0m0.029s
$ time env LC_CTYPE=C wc -w 100m
2308032 100m
real 0m0.842s
user 0m0.842s
sys 0m0.000s
All of these articles are frustrating because they use different environments and test sets, and none of the ones I’ve read have posted the test sets up. Some people use random characters, some people use existing files. Some people use files of 1 MiB, some 100 MiB, some several GiB in size. Not only that, but the people programming the replacements don’t even normalize for differences in machine/processor capability by compiling both the competitors and GNU wc from scratch. The system wc is likely to be compiled differently depending on your machine. The multithreaded implementations are going to perform differently depending on whether you’re running Chrome when you test the app, etc.
This would easily be solved by using the same distribution as a live USB, sharing testing sets, and compiling things from scratch with predefined options, but nobody seems to want to go to that much effort to get coherent comparisons.
>none of the ones I’ve read have posted the test sets up
I think the earliest(?) entry[1] in this "series" (the one done in Haskell) had its test input shared alongside the source code, and it has been referenced in some others (often multiplied several times over, as the original isn't very big). Beyond that, it's ASCII only, so the contents matter little.
>nobody seems to want to go to that much effort to get coherent comparisons
I think nobody really wants to get any coherent comparisons, because this thing isn't really a competition between the entries themselves.
I tested this on a fresh install of Fedora 31, so I didn’t really see any benefit of running it on a LiveUSB. As I mentioned in the article, the wc implementation I used for comparison has been compiled locally with gcc 9.2.1 and -O3 optimizations. I’ve also listed my exact system specifications there. I’ve used the unprocessed enwik9 dataset (Wikipedia dump), truncated to 100 MB and 1 GB.
I understand your frustrations with the previous posts, but I’ve tried to make my article as unambiguous as possible. Do give it a read; if you have any further suggestions or comments, I’d be happy to hear them!
> This would easily be solved by using the same distribution as a live USB, sharing testing sets, and compiling things from scratch with predefined options, but nobody seems to want to go to that much effort to get coherent comparisons.
Alternatively, you could have a sort of "shootout CI server" where people upload their compiled binaries as Docker images and the CI server runs them against a random subset of a set of (hidden) fixed test datasets, averaging the results. (Random and hidden so that you can't just overfit to the test set; fixed so that it's still mostly measuring the same thing.)
I think the Netflix Prize sort of worked like this?
I actually disagree. If you look at the code written in the article, it's hardly unreadable or very code golf-y as you suggest. I think the author is really just trying to say that programmers shouldn't dismiss Go as a possible systems language in favor of C just because C has a reputation of being faster in all cases.
Are we really saying a garbage-collected language should be considered for a systems language? I know the GC is fast, but surely manually memory-managed languages are going to have fewer issues with pausing during execution, no?
Right, so when I’m trying to process a network packet in about a microsecond (a reasonable target for financial trading software), Go’s GC rules it out. Many low-level, high-performance systems language tasks aren’t feasible in Go for similar reasons.
That doesn’t make it a bad language, but it’s not universally applicable either.
But yeah, I certainly wouldn't use it for hard real-time tasks. If you can't tolerate missing deadlines ever, there is a very short list of acceptable tools.
But (and correct me if I'm wrong; it's not my area) HFT doesn't strike me as hard real-time? See also a sibling comment that asks about Jane Street's OCaml use.
It's not like GC pauses are happening constantly; Go programs don't allocate that much, and a well tuned program can go a long time between collections. It likely is appropriate for many soft or firm real time systems. And you can shut the GC off if there are sections where GC really must not happen:
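A minimal sketch of what that looks like (latencyCriticalWork is a hypothetical placeholder; debug.SetGCPercent is the actual knob):

package main

import (
    "runtime"
    "runtime/debug"
)

func main() {
    // SetGCPercent(-1) disables collections triggered by heap growth;
    // it returns the previous value so it can be restored afterwards.
    old := debug.SetGCPercent(-1)

    latencyCriticalWork() // hypothetical placeholder for the real work

    // Re-enable the collector and pay back the deferred work at a
    // moment of our choosing.
    debug.SetGCPercent(old)
    runtime.GC()
}

func latencyCriticalWork() {}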
That's how those who use Java do it: write their code to minimize allocations, and then configure the JVM with a GC threshold that's bigger than their whole day's allocations, and manually GC at a suitable time.
Last I looked (a while ago) it seemed like there wasn't quite enough control on the Go GC to do this effectively. It's very likely that's been fixed by now.
"The Singularity research codebase and design evolved to become the Midori advanced-development OS project. While never reaching commercial release, at one time Midori powered all of Microsoft’s natural language search service for the West Coast and Asia."
I think there are already a few GC'd/JITted components in mainstream kernels (BPF, Lua drivers). So in some ways, it's not a crazy idea. Now, I'm not advocating for Java drivers.
In the sense that the Go authors used "systems language", i.e. for writing things like servers and command line tools, sure why not? Go's GC pauses are extremely short (under 1ms). That shouldn't really affect much.
Source? People write CLI tools in Ruby and Node.js these days, so I really doubt that anybody would use the term "systems language" to refer to languages well-suited to building CLI tools.
As others mentioned "systems language" isn't well defined. If you want to define a "systems program" as one with hard real-time requirements, then no, Go isn't a very good choice; however, the article defines it differently (such that `wc` satisfies), and demonstrates that Go satisfies that definition.
That said, it's better to use terms that map better to a set of requirements, such as "hard-realtime" or "soft-realtime".
I'm not sure this proves that Go is a possible systems language, since it could equally be that the explanation is just "wc doesn't need to be written in a systems language to be fast". For an executable that runs, processes some stuff, and then terminates in the span of a second or so, you're not going to incur much slowdown from the GC cleaning stuff up, but for something long-running, it could make more of a difference.
I could be wrong, but it seems like it wouldn't be difficult to outperform these implementations with a similarly hacky C implementation. I expect that blog post in the coming days.
Hey, author here. You are absolutely right - Go would have a hard time outperforming finely tuned C. Rather, the article was more focused on exploring concurrency mechanisms in Go. The (admittedly clickbait) title was more of a reference to the articles before it than anything else.
Author here. Why do you feel it is code golf? My primary focus when writing it was readability - I'm sure I could do it in much less than 70 lines if I had to. If you have any suggestions for improving the readability of my implementation, do let me know!
The version you are comparing against is ancient and was written for a low memory footprint -- it only uses something like 10KB of data. It also has a ton more features -- like command-line parsing, the ability to read from a file on the command line or stdin, and locale-specific recognition of whitespace.
Argument parsing is hardly the bottleneck when you're scanning 1 GB of data. The reason I didn't implement the rest of the features was to keep the implementation light and readable, and I don't think a fully compliant wc implemented in Go with these patterns would be significantly slower.
As for having a low-memory footprint, 2 of the 4 implementations I mentioned (one single core, one multi-core) consume less memory than wc, and still run much faster.
Yeah, let me just multithread and hand-optimize the piss out of this Go program until it's marginally faster than it would have been in C.
Reminds me of the argument that venison tastes better than beef. The argument roughly being, "If you shoot the deer right, drag it home right, gut it right, skin it right, tenderize it right and cook it _just_ right, it'll be _almost_ as good as store-bought frozen beef"
Author here. I think you should go through the article again. I think it's quite readable, and there are no "hand-optimizations" as you say. Also, the single-core implementation was already faster than the C version - the multithreaded version was only done to explore different methods of concurrency in Go.
To be fair the author does say "it has since turned into a game of trying to take on the venerable wc with different languages". Of course the real message is that the original wc isn't particularly efficient and it can be beaten in many different languages - including C.
Speedy CPUs with lots of cores are cheap enough that the time taken for garbage collection may very well be worth it.
On the other hand, I do still love code golf and other speed/size/wonkiness competitions. It's a lot like the early demo scene whose sole effort was to show how much they could do with very limited resources.
How many times have the tests been run? Once? First for C, then for Go? What about disk caching then?
What I can see there is that, for the same algorithm, the Go version was not faster, just comparable. The memory overhead could be caused by the size of the program, as wc has more functionality.
Then a totally different algorithm is compared, with the false claim "this way a Go implementation is faster than a C one". Sure it is, but this is a different implementation of a different algorithm. A fair comparison would be implementing the same algorithm in C and comparing those. I assume the difference wouldn't be huge.
So, generally, I don't think it's a fair comparison.
The tests were run 10 times, and I used the median value. There wasn't much variance between runs, so I don't think disk caching played much of a role here.
Being a garbage collected language with a runtime, Go certainly cannot match the performance of C, and it was never my point to prove otherwise - obviously, for the same algorithm, the C implementation would be faster. Instead, I was exploring Go to highlight how simple it is to write safe, concurrent code in it.
wc seems like a bad example because it’s basically IO-bound. The title should read, “Go’s built-in bufio reader is faster than raw reads on the file descriptor.” Which it should be, because that’s the point of bufio.
No, it's not faster. They are comparing a Go version that does not do character decoding (which is necessary for correctly counting the number of words in the presence of non-ASCII whitespace) to a C version that does decode characters (and matches them against a larger character set with iswspace).
This can be easily shown by counting words and lines with wc separately. Word counting decodes characters (to find non-ASCII whitespace that may separate words), whereas line counting just looks for ASCII line separators:
$ time wc -w wiki-large.txt
17794000 wiki-large.txt
wc -w wiki-large.txt 0.48s user 0.02s system 99% cpu 0.496 total
$ time wc -l wiki-large.txt
854100 wiki-large.txt
wc -l wiki-large.txt 0.02s user 0.01s system 99% cpu 0.034 total
So, without character decoding, looking at every byte is ~15 times faster. So, if you'd compile wc without multibyte character support (which would be a fair comparison), it would probably beat Go without any parallelization.
Author here. I have addressed this in the article. The bufio-based implementation was the first one, and it was actually slower.
In the second section, I was able to surpass the performance of the C implementation - by using a read call with a buffer. As I mentioned in the article, the C implementation does the same, and in the interest of a fair benchmark, I set equal buffer sizes for both.
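Roughly, the pattern looks like this (a simplified sketch, not the exact code from the article; the 16 KB buffer size and ASCII-only whitespace test here are arbitrary choices for the sketch):

package main

import (
    "fmt"
    "os"
)

func main() {
    f, err := os.Open(os.Args[1])
    if err != nil {
        panic(err)
    }
    defer f.Close()

    // Fixed-size buffer, reused across plain Read calls instead of
    // going through a bufio.Reader.
    buf := make([]byte, 16*1024)
    words, inWord := 0, false
    for {
        n, err := f.Read(buf)
        for _, b := range buf[:n] {
            // ASCII-only whitespace test, like the fast path discussed here.
            space := b == ' ' || b == '\t' || b == '\n' || b == '\r'
            if !space && !inWord {
                words++
            }
            inWord = !space
        }
        if err != nil { // io.EOF terminates the loop
            break
        }
    }
    fmt.Println(words)
}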
I don't know Go, so I am not sure what file.Read does exactly, but wc on my system, which uses the same wc.c as Penner's article, is doing blocking reads in a naive way, which I argue makes it somewhat of a paper target.
Maybe the Linux wc version you have is better; I don't know. They do exist, and Penner's article gave a link to one. I am not sure, as you didn't link to the source of your wc.
(Edit: I just noticed you did use the OSX wc, you're just running it on Fedora. Sorry about that.)
But in any case, using just wall-clock time can be deceptive. Here is what I mean: on my machine, a somewhat old Mac Mini with a spinning disk, user CPU is only ~1/5 of the real time. The rest is spent waiting for the OS or the disk.
% for ((i=0;i<250;i++)); do cat /usr/share/dict/words >> a; done
% /usr/bin/time wc -l a
58971500 a
1.24 real 0.28 user 0.64 sys
> I hope it demonstrates that Go can be a viable alternative to C as a systems programming language.
Are you kidding me? This is nothing close to a systems programming language. This isn't much of a comparison at all. wc is a very simple case that doesn't match the complexity of real-world programs. Go comes close here because you're not using the high-level abstractions that make it useful in the real world. GNU coreutils also tend to focus on having tons of features (compared to BSD/busybox/plan9/other), which can slow them down. If you really want to get competitive, I bet an AVX-512 implementation would be fastest, and that's more doable in C, but this is a bogus comparison in any case. It's just people doing this because they like a specific language.
Furthermore, this isn't "systems programming"; he just replaced one "userland" program with another. But I blame the Go team for turning "systems programming" and "realtime programming" into useless buzzwords to promote their language.
Yeah, when I first looked at Go, that was the first FU for me. Go isn't a systems or real-time programming language. It's a managed language with training wheels. Which is okay by me. Lying about it, though, is not okay.
The second FU is that they claim decisions they made based on personal preferences were technical ones. That's a very insidious lie that programmers make all the time. Insidious because it destroys trust between programmers and managers.
> Go can be a viable alternative to C as a systems programming language
Stop this nonsense, please. This does not show anything even remotely close to the systems programming capability of Go. Write a device driver in Go that performs without lagging, benchmark it, and then come back.
"systems programming" is not well defined. While you and I probably prefer a similar definition, comments that criticize TFA for not using our preferred definition are really boring. Let's just say that Go is a fine language for `wc` (and not a very good language for device drivers) and move on.
They never said that they had to use C because Go is not a systems language, so your assertion looks wrong. They wanted to avoid using unsafe.
In C everything is unsafe, by the way -- so does that make it less of a systems language?
No, Go isn't a systems language because, for one, you don't have direct control over memory when you need it. For instance, Go doesn't even have the volatile keyword, which is essential in many cases when interfacing with hardware. The paper you linked laments this as well.
Author here. If you read the article carefully, you'll see that the 70 lines I used to outperform wc was a single-threaded implementation. I multi-threaded it later for overkill.
Yeah, if that figure includes the executable, then it's also probably including the whole runtime (scheduler, GC, etc) since Go programs statically link the runtime by default. In that case, 2MB isn't so bad (especially considering glibc is ~10MB [and with no scheduler or GC!] iirc).
Another article in the series beating C by moving the goal posts. My comment on the original article about the Haskell version on lobste.rs:
Keep in mind in all comparisons to GNU wc that it does extra work, detecting multi-byte characters and decoding multi-byte characters if present, to correctly count the number of words. perf shows a significant amount of time being spent in multibyte character handling. If you trigger a code path that does not do decoding beyond the byte level, it’s much faster:
$ time wc wiki-large.txt
854100 17794000 105322200 wiki-large.txt
wc wiki-large.txt 0.42s user 0.02s system 99% cpu 0.438 total
$ time wc -l wiki-large.txt
854100 wiki-large.txt
wc -l wiki-large.txt 0.02s user 0.02s system 98% cpu 0.034 total
(wc -l looks at every byte, but does no decoding.)
From a quick glance, this is also where the 'Haskell beats C' article fails. It’s comparing apples to oranges: the ByteString implementation does not do the same as GNU/macOS wc and returns incorrect results in the presence of non-ASCII whitespace. The article incorrectly states that wc will handle input as ASCII. Unless you use a locale without multi-byte character support, macOS wc uses the combination of mbrtowc and iswspace.
Author here. This is not true - I included a link to the manpage (https://ss64.com/osx/wc.html) in the article to avoid this confusion. I did not use GNU wc; I used the OS X one, which, by default, counts single byte characters. From the manpage:
> The default action is equivalent to specifying the -c, -l and -w options.
> -c The number of bytes in each input file is written to the standard output.
> -m The number of characters in each input file is written to the standard output. If the current locale does not support multi-byte characters, this is equivalent to the -c option.
Moreover, I also mentioned in the article that I was using us-ascii encoded text, which means that even -m would have treated it as ASCII text.
It is not about the character count, but the word count. wc decodes characters to find non-ASCII whitespace as word separators. If you read further in the same man page:
> White space characters are the set of characters for which the iswspace(3) function returns true.
That your text is ASCII encoded does not matter, since ASCII is a subset of UTF-8. So, at the very least, you need an extra branch to check that a byte's value is smaller than 128 (since any byte that does not start with a zero bit is part of a multi-byte sequence in UTF-8).
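In Go terms, the branch I'm describing looks something like this (a sketch, not the actual wc source; countWords is just an illustrative helper):

package main

import (
    "fmt"
    "unicode"
    "unicode/utf8"
)

// countWords takes the cheap single-byte path for ASCII (< 0x80) and
// falls back to full UTF-8 decoding otherwise -- the cost wc pays to
// recognize non-ASCII whitespace.
func countWords(buf []byte) int {
    words, inWord := 0, false
    for i := 0; i < len(buf); {
        var space bool
        if buf[i] < 0x80 { // single-byte fast path
            space = buf[i] == ' ' || buf[i] == '\t' || buf[i] == '\n' ||
                buf[i] == '\r' || buf[i] == '\v' || buf[i] == '\f'
            i++
        } else { // multi-byte fallback: decode, then test the rune
            r, size := utf8.DecodeRune(buf[i:])
            space = unicode.IsSpace(r)
            i += size
        }
        if !space && !inWord {
            words++
        }
        inWord = !space
    }
    return words
}

func main() {
    // U+00A0 (no-break space) only separates words on the decoding path.
    fmt.Println(countWords([]byte("hello\u00a0world again"))) // 3
}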
You are right! So that's a Darwin oddity. Still, a wide-char function, iswspace, is called in that code path, which adds function-call overhead in a tight loop.
If domulti is not set, the wide char function is not called as far as I can tell. Why would it? It's explicitly meant not to do wide char stuff in that case.
FWIW, when this was going around for the first time I took this Darwin version of wc and experimented with setting domulti to const 0, statically removing all paths where it might do wide character stuff. I didn't measure any performance difference to just running it unmodified.
$ time ./wc ../wiki-large.txt
854100 17794000 105322200 ../wiki-large.txt
./wc ../wiki-large.txt 0.47s user 0.02s system 99% cpu 0.490 total
$ time ./wc2 ../wiki-large.txt
854100 17794000 105322200 ../wiki-large.txt
./wc2 ../wiki-large.txt 0.28s user 0.01s system 99% cpu 0.293 total
Remove unnecessary branching introduced by multi-byte character handling [1]. This actually resembles the Go code pretty closely. We get a speedup of 1.8x:
$ time ./wc3 ../wiki-large.txt
854100 17794000 105322200 ../wiki-large.txt
./wc3 ../wiki-large.txt 0.25s user 0.01s system 99% cpu 0.267 total
If we take the second table from the article and divide the C result (5.56) by 1.8, the C performance would be ~3.09, which is faster than the Go version (3.72).
Edit: for comparison, the Go version from the article:
$ time ./wcgo ../wiki-large.txt
854100 17794000 105322200 ../wiki-large.txt
./wcgo ../wiki-large.txt 0.32s user 0.02s system 100% cpu 0.333 total
So, when removing the multi-byte whitespace handling, the C version is indeed faster than the (non-parallelized) Go version.
Thanks, I finally understood what you are saying. Indeed, the code uses iswspace to test all characters, wide or normal. Strange design choice. For whatever it's worth, even just changing
if (iswspace(wch))
to something like
if (domulti && iswspace(wch))
...
else if (!domulti && isspace(wch))
...
got something like a 10% speedup on my machine. And replacing isspace with an explicit condition like yours is much faster still. I checked, isspace is macro-expanded to a table lookup and a mask, but apparently that's still slower than your explicit check. I'm a bit surprised by this but won't investigate further at the moment.
It has the worst of both worlds: it incorrectly counts the number of words when there is non-ASCII whitespace (since mbrtowc is not used), but it pays the penalty of using iswspace. It also does not conform to POSIX, which states:
The wc utility shall consider a word to be a non-zero-length string of characters delimited by white space.
[...]
LC_CTYPE
Determine the locale for the interpretation of sequences of bytes of text data as characters (for example, single-byte as opposed to multi-byte characters in arguments and input files) and which characters are defined as white space characters.
This is excellent for showing how to do X in Y language. But like most benchmarks - it's a bit silly.
If speed were the priority, you could parse only 1% of the file, then multiply the values by 100, assuming the rest of the file looks alike. Do you really care to know the file has 1337 words rather than ca. 1300 words?
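A toy sketch of that estimate (assuming the first 1% of the file is representative, which it often isn't):

package main

import (
    "bufio"
    "fmt"
    "io"
    "os"
)

func main() {
    f, err := os.Open(os.Args[1])
    if err != nil {
        panic(err)
    }
    defer f.Close()

    info, err := f.Stat()
    if err != nil {
        panic(err)
    }

    // Count words in the first 1% of the file only...
    sc := bufio.NewScanner(io.LimitReader(f, info.Size()/100))
    sc.Split(bufio.ScanWords)
    words := 0
    for sc.Scan() {
        words++
    }

    // ...then extrapolate to the whole file.
    fmt.Printf("~%d words (extrapolated)\n", words*100)
}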