Argument parsing is hardly the bottleneck when you're scanning 1 GB of data. I left out the rest of wc's features to keep the implementation light and readable, and I don't think a fully compliant wc written in Go with these patterns would be significantly slower.
As for a low memory footprint: 2 of the 4 implementations I mentioned (one single-core, one multi-core) use less memory than wc and still run much faster.