How we found that the Linux nios2 memset() implementation had a bug

userbinator · on April 23, 2016

We were quite surprised to find a bug in some common code for the NIOS II architecture: we were assuming it would have already been tested on enough platforms and with enough compilers/situations to not have such issues.

I'm betting on the vast majority of memset() uses being to zero memory, so it might've not been tested much with other fill patterns.

Compared to MSVC, I've found GCC's inline-assembler feature a huge pain to use correctly and the few times it was needed, I decided to use a separate file with pure Asm instead (also with Intel syntax) and link that in. Supposedly it's more powerful, but having to essentially understand part of the register allocation process just to use it is a big turn-off. In contrast, MSVC "just works" --- the syntax is far more pleasant and it seems to know automatically what gets modified by the code you added.

I think basic operations like these, bulk fill and copy, are best performed be specific instructions that the hardware can do in the best way for the implementation. The nios2 memset() is over two dozen instructions, most of which are just shuffling bits around. Intel did this right with REP MOVSB/STOSB.

to3m · on April 23, 2016

There was exactly this bug in Android for a long time - memset always cleared the memory to zero. I can't find a good link to this, but try maybe https://review.source.android.com/#patch,sidebyside,14699,1,..., not that Firefox will let me connect to check.

As for the rules for VC++, they are conservative - see https://msdn.microsoft.com/en-us/library/k1a8ss06.aspx. In general, you're right that it's much nicer to use than the shit gcc foists on you, but these conservative rules are symptomatic of its reluctance to integrate C and asm very well.

That's one problem gcc doesn't have - my complaint isn't that you can, just that it makes you do it the hard way. But that's totally unnecessary... a few years ago I used CodeWarrior for PowerPC. It was very pleasant, and did a good job of making it easy to have C and asm coexist. You allocated registers by declaring variables as `register', and the compiler would figure out how to put them somewhere sensible on entry to each asm block, and spill the ones that were modified, if that even proved necessary. You could leave one asm block having just modified a register variable, assert that it had done what you expected, then start another one, along the lines of:

    register int fred=0;
    asm addi fred,1
    assert(fred==1);
    asm addi fred,1

I wish VC++ would do something like that; you can sort-of do it, but it's impossible to request that a given value goes somewhere in particular (register, frame relative, etc.) and this can stymie you sometimes. But they decided to remove inline assembly language altogether for x64, so that's one way of solving the problem. Still nicer than what gcc does.

pjmlp · on April 24, 2016

I don't have any need to type inline Assembly in the type of work I do, but used to back in the MS-DOS days.

The inline Assembly syntax of PC compilers always felt natural and easy to remember.

Every time I look at gcc inline Assembly, I get this feeling it is impossible to get right without having the manual page always open.

merb · on April 23, 2016

Actually Altera the company who is selling NIOS II is a Intel company. So maybe they should just share some engeneering details between them..

wsfull · on April 23, 2016

The only problem is if one uses UNIX instead of Windows to do development.

Fortunately there are still some people doing development on UNIX and sharing their software tools.

http://cr.yp.to/highspeed/fall2006.html

to3m · on April 23, 2016

Here's another horrid gcc inline asm bug: http://robertoconcerto.blogspot.co.uk/2013/03/my-hardest-bug...

As a general rule, to which the usual caveats for general rules apply, when it comes to gcc-style inline asm, just say no. Demand more from your tools.

DSMan195276 · on April 24, 2016

Personally I don't find inline-asm with gcc to be all that bad. It is definitely a very nice way of accessing other-wise inaccessible CPU instructions - and those tend to be generally trivial to implement (IE. Single asm instruction, along with some input and output operands). I admit that on the rare occasion that I do use it, I almost always go reread the documentation. But still, the compiled end result is tons better then having to call a completely separate function in your code every time you just want to use one single CPU instruction.

That said, for longer pieces of code, I definitely avoid inline-asm and just use a separate assembly file. There's no debating it's a cleaner solution when you have more then one or a few lines of assembly. And really, besides those obscure CPU instructions, your written assembly is usually not much different, then the stuff generated from some C code. It's worth using C wherever it's viable.

semi-extrinsic · on April 23, 2016

To be fair, that's a horrid gcc inline asm bug coupled with a crazy CPU design (for the PS2). Still a good story though.

atomic_lock · on April 23, 2016

  >  Stores in %4 the initial pattern shifted left by 8 bytes.

Shouldn't that be 8 bits?

Very interesting write up

asveikau · on April 23, 2016

It's been my experience that it's easy to write GCC inline asm that works the first few times you use it, but subtly breaks depending on what the optimizer does with the code around it. Those input/output/clobber arguments can be tricky to get right and wrongness can be hard to spot.

pja · on April 23, 2016

Yeah, and as one of the comments to that blog post says, for bit twiddling operations like this, modern compilers are good enough to generate the optimal assembler the vast majority of the time: If you write the equivalent C code

    uint32 fill16 = (fill << 8) | fill;
    uint32 fill32 = (fill16 << 16) | fill16;

(copied from the comment, so hopefully correct!) you get the benefit of a) not having to fiddle with abstruse register allocation semantics but b) you’re letting gcc use its register allocator to set things up optimally based on the surrounding optimised code.

My own experience with writing memset / memcpy equivalents 10 years ago on x86 was that gcc did a better job of it. Don’t try and outsmart the compiler for this kind of common case unless you know the compiler generates inefficient code - compiler writers have generally got there before you.

slededit · on April 24, 2016

No compiler (MSVC, gcc, icc) outputs the "bts" instruction for operations like: bitfield[val / 32] = val % 32; which could implicitly perform the modulo.

Using the intrinsic provided significant performance improvements, and we got still more when the rest of the inner loop was rewritten purely in assembly.

pkaye · on April 23, 2016

Doing firmware development, I do find many of the C compilers generate sub-optimal bit field type manipulation code.

pklausler · on April 23, 2016

uint32 fill32 = 0x01010101 * fill;

Drdrdrq · on April 24, 2016

Isn't this exactly the kind of function that you would want thoroughly covered with unit tests?

atomic_lock · on April 26, 2016

Altera's software quality isn't exactly shining. Lots of unresolved bugs, badly documented compiler macros, ...