$ time wc -w 100m
2266395 100m
real 0m4.568s
$ cc -o wc wc.c
$ time ./wc 100m
2343390 100m
real 0m0.511s
Of course, it disagrees on the answer, because I just used 100M of random data and it doesn't care about wide characters. It gives the same answer as GNU on plain ASCII text.
It's not faster because it's better, it's faster because it is doing less.
Alternatively, here's my entry for "Beating GNU wc with GNU wc":
$ dd if=/dev/urandom of=100m bs=1M count=100 2> /dev/null
$ time wc -w 100m
2319144 100m
real 0m4.543s
user 0m4.512s
sys 0m0.029s
$ time env LC_CTYPE=C wc -w 100m
2308032 100m
real 0m0.842s
user 0m0.842s
sys 0m0.000s
All of these articles are frustrating because they use different environments and test sets, and none of the ones I’ve read have posted the test sets up. Some people use random characters, some people use existing files. Some people use files of 1 MiB, some 100 MiB, some several GiB in size. Not only that, but the people programming the replacements don’t even normalize for differences in machine/processor capability by compiling both the competitors and GNU wc from scratch. The system wc is likely to be compiled differently depending on your machine. The multithreaded implementations are going to perform differently depending on whether you’re running Chrome when you test the app, etc.
This would easily be solved by using the same distribution as a live USB, sharing testing sets, and compiling things from scratch with predefined options, but nobody seems to want to go to that much effort to get coherent comparisons.
>none of the ones I’ve read have posted the test sets up
I think the earliest(?) entry[1] in this "series" (the one done in Haskell) had its test input shared alongside the source code, and it has been referenced in some others (often multiplied several times over, as the original isn't very big). Beyond that, it's ASCII only, so the contents matter little.
>nobody seems to want to go to that much effort to get coherent comparisons
I think nobody really wants to get any coherent comparisons, because this thing isn't really a competition between the entries themselves.
I tested this on a fresh install of Fedora 31, so I didn’t really see any benefit of running it on a LiveUSB. As I mentioned in the article, the wc implementation I used for comparison has been compiled locally with gcc 9.2.1 and -O3 optimizations. I’ve also listed my exact system specifications there. I’ve used the unprocessed enwik9 dataset (Wikipedia dump), truncated to 100 MB and 1 GB.
I understand your frustrations with the previous posts, but I’ve tried to make my article as unambiguous as possible. Do give it a read; if you have any further suggestions or comments, I’d be happy to hear them!
> This would easily be solved by using the same distribution as a live USB, sharing testing sets, and compiling things from scratch with predefined options, but nobody seems to want to go to that much effort to get coherent comparisons.
Alternatively, you could have a sort of "shootout CI server" where people upload their compiled binaries as Docker images and the CI server runs them against a random subset of a set of (hidden) fixed test datasets, averaging the results. (Random and hidden so that you can't just overfit to the test set; fixed so that it's still mostly measuring the same thing.)
I think the Netflix Prize sort of worked like this?
I actually disagree. If you look at the code written in the article, it's hardly unreadable or very code golf-y as you suggest. I think the author is really just trying to say that programmers shouldn't dismiss Go as a possible systems language in favor of C just because C has a reputation of being faster in all cases.
Are we really saying a garbage-collected language should be considered for a systems language? I know the GC is fast, but surely manually memory-managed languages are going to have fewer issues with pausing during execution, no?
Right, so when I’m trying to process a network packet in about a microsecond (a reasonable target for financial trading software), Go’s GC rules it out. Many low-level, high-performance systems language tasks aren’t feasible in Go for similar reasons.
That doesn’t make it a bad language, but it’s not universally applicable either.
But yeah, I certainly wouldn't use it for hard real-time tasks. If you can't tolerate missing deadlines ever, there is a very short list of acceptable tools.
But (and correct me if I'm wrong; it's not my area) HFT doesn't strike me as hard real-time? See also a sibling comment that asks about Jane Street's OCaml use.
It's not like GC pauses are happening constantly; Go programs don't allocate that much, and a well tuned program can go a long time between collections. It likely is appropriate for many soft or firm real time systems. And you can shut the GC off if there are sections where GC really must not happen:
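A minimal sketch of what that looks like (latencyCriticalWork is a hypothetical placeholder; debug.SetGCPercent is the actual knob):

package main

import (
    "runtime"
    "runtime/debug"
)

func main() {
    // SetGCPercent(-1) disables collections triggered by heap growth;
    // it returns the previous value so it can be restored afterwards.
    old := debug.SetGCPercent(-1)

    latencyCriticalWork() // hypothetical placeholder for the real work

    // Re-enable the collector and pay back the deferred work at a
    // moment of our choosing.
    debug.SetGCPercent(old)
    runtime.GC()
}

func latencyCriticalWork() {}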
That's how those who use Java do it: write their code to minimize allocations, and then configure the JVM with a GC threshold that's bigger than their whole day's allocations, and manually GC at a suitable time.
Last I looked (a while ago) it seemed like there wasn't quite enough control on the Go GC to do this effectively. It's very likely that's been fixed by now.
"The Singularity research codebase and design evolved to become the Midori advanced-development OS project. While never reaching commercial release, at one time Midori powered all of Microsoft’s natural language search service for the West Coast and Asia."
I think there are already a few GC'd/JITted components in mainstream kernels (BPF, Lua drivers). So in some ways, it's not a crazy idea. Now, I'm not advocating for Java drivers.
In the sense that the Go authors used "systems language", i.e. for writing things like servers and command line tools, sure why not? Go's GC pauses are extremely short (under 1ms). That shouldn't really affect much.
Source? People write CLI tools in Ruby and Node.js these days, so I really doubt that anybody would use the term "systems language" to refer to languages well-suited to building CLI tools.
As others mentioned "systems language" isn't well defined. If you want to define a "systems program" as one with hard real-time requirements, then no, Go isn't a very good choice; however, the article defines it differently (such that `wc` satisfies), and demonstrates that Go satisfies that definition.
That said, it's better to use terms that map better to a set of requirements, such as "hard-realtime" or "soft-realtime".
I'm not sure this proves that Go is a possible systems language, since it could equally be that the explanation is just "wc doesn't need to be written in a systems language to be fast". For an executable that runs, processes some stuff, and then terminates in the span of a second or so, you're not going to incur much slowdown from the GC cleaning stuff up, but for something long-running, it could make more of a difference.
I could be wrong, but it seems like it wouldn't be difficult to outperform these implementations with a similarly hacky C implementation. I expect that blog post in the coming days.
Hey, author here. You are absolutely right - Go would have a hard time outperforming finely tuned C. Rather, the article was more focused on exploring concurrency mechanisms in Go. The (admittedly clickbait) title was more of a reference to the articles before it than anything else.
Author here. Why do you feel it is code golf? My primary focus when writing it was readability - I'm sure I could do it in much less than 70 lines if I had to. If you have any suggestions for improving the readability of my implementation, do let me know!
The version you are comparing against is ancient and was written for a low memory footprint -- it only uses something like 10KB of data. It also has a ton more features -- like command-line parsing, the ability to read from a file on the command line or stdin, and locale-specific recognition of whitespace.
Argument parsing is hardly the bottleneck when you're scanning 1 GB of data. The reason I didn't implement the rest of the features was to keep the implementation light and readable, and I don't think a fully compliant wc implemented in Go with these patterns would be significantly slower.
As for having a low-memory footprint, 2 of the 4 implementations I mentioned (one single core, one multi-core) consume less memory than wc, and still run much faster.
Yeah, let me just multithread and hand-optimize the piss out of this Go program until it's marginally faster than it would have been in C.
Reminds me of the argument that venison tastes better than beef. The argument roughly being, "If you shoot the deer right, drag it home right, gut it right, skin it right, tenderize it right and cook it _just_ right, it'll be _almost_ as good as store-bought frozen beef"
Author here. I think you should go through the article again. I think it's quite readable, and there are no "hand-optimizations" as you say. Also, the single-core implementation was already faster than the C version - the multithreaded version was only done to explore different methods of concurrency in Go.
To be fair the author does say "it has since turned into a game of trying to take on the venerable wc with different languages". Of course the real message is that the original wc isn't particularly efficient and it can be beaten in many different languages - including C.
Speedy CPUs with lots of cores are cheap enough that the time taken for garbage collection may very well be worth it.
On the other hand, I do still love code golf and other speed/size/wonkiness competitions. It's a lot like the early demo scene whose sole effort was to show how much they could do with very limited resources.
How many times have the tests been run? Once? First for C, then for Go? What about disk caching then?
What I can see there is that, for the same algorithm, the Go version was not faster, just comparable. The memory overhead could be caused by the size of the program, as wc has more functionality.
Then a totally different algorithm is compared, with the false claim "this way a Go implementation is faster than a C one". Sure it is, but this is a different implementation of a different algorithm. A fair comparison would be implementing the same algorithm in C and comparing those. I assume the difference wouldn't be huge.
So, generally, I don't think it's a fair comparison.
The tests were run 10 times, and I used the median value. There wasn't much variance between runs, so I don't think disk caching played much of a role here.
Being a garbage collected language with a runtime, Go certainly cannot match the performance of C, and it was never my point to prove otherwise - obviously, for the same algorithm, the C implementation would be faster. Instead, I was exploring Go to highlight how simple it is to write safe, concurrent code in it.
wc seems like a bad example because it’s basically IO-bound. The title should read, “Go’s built-in bufio reader is faster than raw reads on the file descriptor.” Which it should be, because that’s the point of bufio.
No, it's not faster. They are comparing a Go version that does not do character decoding (which is necessary for correctly counting the number of words in the presence of non-ASCII whitespace) to a C version that does decode characters (and matches them against a larger character set with iswspace).
This can be easily shown by counting words and lines with wc separately. Word counting decodes characters (to find non-ASCII whitespace that may separate words), whereas line counting just looks for ASCII line separators:
$ time wc -w wiki-large.txt
17794000 wiki-large.txt
wc -w wiki-large.txt 0.48s user 0.02s system 99% cpu 0.496 total
$ time wc -l wiki-large.txt
854100 wiki-large.txt
wc -l wiki-large.txt 0.02s user 0.01s system 99% cpu 0.034 total
So, without character decoding, looking at every byte is ~15 times faster. So, if you'd compile wc without multibyte character support (which would be a fair comparison), it would probably beat Go without any parallelization.
Author here. I have addressed this in the article. The bufio-based implementation was the first one, and it was actually slower.
In the second section, I was able to surpass the performance of the C implementation - by using a read call with a buffer. As I mentioned in the article, the C implementation does the same, and in the interest of a fair benchmark, I set equal buffer sizes for both.
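Roughly, the pattern looks like this (a simplified sketch, not the exact code from the article; the 16 KB buffer size and ASCII-only whitespace test here are arbitrary choices for the sketch):

package main

import (
    "fmt"
    "os"
)

func main() {
    f, err := os.Open(os.Args[1])
    if err != nil {
        panic(err)
    }
    defer f.Close()

    // Fixed-size buffer, reused across plain Read calls instead of
    // going through a bufio.Reader.
    buf := make([]byte, 16*1024)
    words, inWord := 0, false
    for {
        n, err := f.Read(buf)
        for _, b := range buf[:n] {
            // ASCII-only whitespace test, like the fast path discussed here.
            space := b == ' ' || b == '\t' || b == '\n' || b == '\r'
            if !space && !inWord {
                words++
            }
            inWord = !space
        }
        if err != nil { // io.EOF terminates the loop
            break
        }
    }
    fmt.Println(words)
}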
I don't know Go, so I am not sure what file.Read does exactly, but wc on my system, which uses the same wc.c as Penner's article, is doing blocking reads in a naive way, which I argue makes it somewhat of a paper target.
Maybe the Linux wc version you have is better; I don't know. They do exist, and Penner's article gave a link to one. I am not sure, as you didn't link to the source of your wc.
(Edit: I just noticed you did use the OSX wc, you're just running it on Fedora. Sorry about that.)
But in any case, using just wall-clock time can be deceptive. Here is what I mean: on my machine, a somewhat old Mac Mini with a spinning disk, user CPU is only ~1/5 of the real time. The rest is spent waiting for the OS or the disk.
% for ((i=0;i<250;i++)); do cat /usr/share/dict/words >> a; done
% /usr/bin/time wc -l a
58971500 a
1.24 real 0.28 user 0.64 sys
> I hope it demonstrates that Go can be a viable alternative to C as a systems programming language.
Are you kidding me? This is nothing close to a systems programming language. This isn't much of a comparison at all. wc is a very simple case that doesn't match the complexity of real-world programs. Go comes close here because you're not using the high-level abstractions that make it useful in the real world. GNU coreutils also tend to focus on having tons of features (compared to BSD/busybox/plan9/other), which can slow them down. If you really want to get competitive, I bet an AVX-512 implementation would be fastest, and that's more doable in C, but this is a bogus comparison in any case. It's just people doing this because they like a specific language.
Furthermore, this isn't "systems programming"; he just replaced one "userland" program with another. But I blame the Go team for turning "systems programming" and "realtime programming" into useless buzzwords to promote their language.
Yeah, when I first looked at Go, that was the first FU for me. Go isn't a systems or real-time programming language. It's a managed language with training wheels. Which is okay by me. Lying about it, though, is not okay.
The second FU is that they claim decisions they made based on personal preferences were technical ones. That's a very insidious lie that programmers make all the time. Insidious because it destroys trust between programmers and managers.
> Go can be a viable alternative to C as a systems programming language
Stop this nonsense, please. This does not show anything even remotely close to the systems programming capability of Go. Write a device driver in Go that performs without lagging, benchmark it, and then come back.
"systems programming" is not well defined. While you and I probably prefer a similar definition, comments that criticize TFA for not using our preferred definition are really boring. Let's just say that Go is a fine language for `wc` (and not a very good language for device drivers) and move on.
They never said that they had to use C because Go is not a systems language, so your assertion looks wrong. They wanted to avoid using unsafe.
In C everything is unsafe, by the way -- so does that make it less of a systems language?
No, Go isn't a systems language because, for one, you don't have direct control over memory when you need it. For instance, Go doesn't even have the volatile keyword, which is essential in many cases when interfacing with hardware. The paper you linked laments this as well.
Author here. If you read the article carefully, you'll see that the 70 lines I used to outperform wc was a single-threaded implementation. I multi-threaded it later for overkill.
Yeah, if that figure includes the executable, then it's also probably including the whole runtime (scheduler, GC, etc) since Go programs statically link the runtime by default. In that case, 2MB isn't so bad (especially considering glibc is ~10MB [and with no scheduler or GC!] iirc).
Another article in the series beating C by moving the goal posts. My comment on the original article about the Haskell version on lobste.rs:
Keep in mind in all comparisons to GNU wc that it does extra work, detecting multi-byte characters and decoding multi-byte characters if present, to correctly count the number of words. perf shows a significant amount of time being spent in multibyte character handling. If you trigger a code path that does not do decoding beyond the byte level, it’s much faster:
$ time wc wiki-large.txt
854100 17794000 105322200 wiki-large.txt
wc wiki-large.txt 0.42s user 0.02s system 99% cpu 0.438 total
$ time wc -l wiki-large.txt
854100 wiki-large.txt
wc -l wiki-large.txt 0.02s user 0.02s system 98% cpu 0.034 total
(wc -l looks at every byte, but does no decoding.)
From a quick glance, this is also where the 'Haskell beats C' article fails. It’s comparing apples to oranges: the ByteString implementation does not do the same as GNU/macOS wc and returns incorrect results in the presence of non-ASCII whitespace. The article incorrectly states that wc will handle input as ASCII. Unless you use a locale without multi-byte character support, macOS wc uses the combination of mbrtowc and iswspace.
Author here. This is not true - I included a link to the manpage (https://ss64.com/osx/wc.html) in the article to avoid this confusion. I did not use GNU wc; I used the OS X one, which, by default, counts single byte characters. From the manpage:
> The default action is equivalent to specifying the -c, -l and -w options.
> -c The number of bytes in each input file is written to the standard output.
> -m The number of characters in each input file is written to the standard output. If the current locale does not support multi-byte characters, this is equivalent to the -c option.
Moreover, I also mentioned in the article that I was using us-ascii encoded text, which means that even -m would have treated it as ASCII text.
It is not about the character count, but the word count. wc decodes characters to find non-ASCII whitespace as word separators. If you read further in the same man page:
> White space characters are the set of characters for which the iswspace(3) function returns true.
That your text is ASCII encoded does not matter, since ASCII is a subset of UTF-8. So, at the very least, you need an extra branch to check that a byte's value is smaller than 128 (since any byte that does not start with a zero bit is part of a multi-byte sequence in UTF-8).
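In Go terms, the branch I'm describing looks something like this (a sketch, not the actual wc source; countWords is just an illustrative helper):

package main

import (
    "fmt"
    "unicode"
    "unicode/utf8"
)

// countWords takes the cheap single-byte path for ASCII (< 0x80) and
// falls back to full UTF-8 decoding otherwise -- the cost wc pays to
// recognize non-ASCII whitespace.
func countWords(buf []byte) int {
    words, inWord := 0, false
    for i := 0; i < len(buf); {
        var space bool
        if buf[i] < 0x80 { // single-byte fast path
            space = buf[i] == ' ' || buf[i] == '\t' || buf[i] == '\n' ||
                buf[i] == '\r' || buf[i] == '\v' || buf[i] == '\f'
            i++
        } else { // multi-byte fallback: decode, then test the rune
            r, size := utf8.DecodeRune(buf[i:])
            space = unicode.IsSpace(r)
            i += size
        }
        if !space && !inWord {
            words++
        }
        inWord = !space
    }
    return words
}

func main() {
    // U+00A0 (no-break space) only separates words on the decoding path.
    fmt.Println(countWords([]byte("hello\u00a0world again"))) // 3
}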
You are right! So that's a Darwin oddity. Still, a wide-char function, iswspace, is called in that code path, which adds function-call overhead in a tight loop.
If domulti is not set, the wide char function is not called as far as I can tell. Why would it? It's explicitly meant not to do wide char stuff in that case.
FWIW, when this was going around for the first time I took this Darwin version of wc and experimented with setting domulti to const 0, statically removing all paths where it might do wide character stuff. I didn't measure any performance difference to just running it unmodified.
$ time ./wc ../wiki-large.txt
854100 17794000 105322200 ../wiki-large.txt
./wc ../wiki-large.txt 0.47s user 0.02s system 99% cpu 0.490 total
$ time ./wc2 ../wiki-large.txt
854100 17794000 105322200 ../wiki-large.txt
./wc2 ../wiki-large.txt 0.28s user 0.01s system 99% cpu 0.293 total
Remove unnecessary branching introduced by multi-byte character handling [1]. This actually resembles the Go code pretty closely. We get a speedup of 1.8x:
$ time ./wc3 ../wiki-large.txt
854100 17794000 105322200 ../wiki-large.txt
./wc3 ../wiki-large.txt 0.25s user 0.01s system 99% cpu 0.267 total
If we take the second table from the article and divide the C result (5.56) by 1.8, the C performance would be ~3.09, which is faster than the Go version (3.72).
Edit: for comparison, the Go version from the article:
$ time ./wcgo ../wiki-large.txt
854100 17794000 105322200 ../wiki-large.txt
./wcgo ../wiki-large.txt 0.32s user 0.02s system 100% cpu 0.333 total
So, when removing the multi-byte whitespace handling, the C version is indeed faster than the (non-parallelized) Go version.
Thanks, I finally understood what you are saying. Indeed, the code uses iswspace to test all characters, wide or normal. Strange design choice. For whatever it's worth, even just changing
if (iswspace(wch))
to something like
if (domulti && iswspace(wch))
...
else if (!domulti && isspace(wch))
...
got something like a 10% speedup on my machine. And replacing isspace with an explicit condition like yours is much faster still. I checked, isspace is macro-expanded to a table lookup and a mask, but apparently that's still slower than your explicit check. I'm a bit surprised by this but won't investigate further at the moment.
It has the worst of both worlds: it incorrectly counts the number of words when there is non-ASCII whitespace (since mbrtowc is not used), but it pays the penalty of using iswspace. It also does not conform to POSIX, which states:
The wc utility shall consider a word to be a non-zero-length string of characters delimited by white space.
[...]
LC_CTYPE
Determine the locale for the interpretation of sequences of bytes of text data as characters (for example, single-byte as opposed to multi-byte characters in arguments and input files) and which characters are defined as white space characters.
This is excellent for showing how to do X in Y language. But like most benchmarks - it's a bit silly.
If speed were the priority, you could parse only 1% of the file, then multiply the values by 100, assuming the rest of the file looks alike. Do you really care to know the file has 1337 words rather than ca. 1300 words?
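A toy sketch of that estimate (assuming the first 1% of the file is representative, which it often isn't):

package main

import (
    "bufio"
    "fmt"
    "io"
    "os"
)

func main() {
    f, err := os.Open(os.Args[1])
    if err != nil {
        panic(err)
    }
    defer f.Close()

    info, err := f.Stat()
    if err != nil {
        panic(err)
    }

    // Count words in the first 1% of the file only...
    sc := bufio.NewScanner(io.LimitReader(f, info.Size()/100))
    sc.Split(bufio.ScanWords)
    words := 0
    for sc.Scan() {
        words++
    }

    // ...then extrapolate to the whole file.
    fmt.Printf("~%d words (extrapolated)\n", words*100)
}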