What I learned from my first C coding challenge (jasonmooberry.com)
136 points by jasonmoo on Sept 4, 2012 | 83 comments



(I realise the poster came up with a different solution ultimately, but this bears repeating.)

Don't hand-roll standard library functions. You'll very likely regret it later.

In particular, as hardware advances, OS vendors often update their implementations with optimisations specific to new hardware platforms. This allows improvements for all programs that consume those interfaces, and in some cases prevents regressions when the hardware changes.

In addition, you really don't want to be the one that introduces a security bug into your program because it turns out your optimisation work didn't account for a scenario that the standard library function does.

I'm aware that there may be cases where some believe it's appropriate; don't give in to temptation. It's a bad idea to duplicate the standard library.


Your advice is good but a bit too generalized here.

The reason why the standard library functions, qsort in particular, were performing worse in this case is not the implementation of the functions themselves. Rather, it's that a call to a foreign function practically inhibits all compiler optimizations that could happen around the call site. So the "overhead" of the function call is a lot more expensive than any differences in the function implementations. In the case of qsort with its function pointer callbacks, the effect is especially devastating. Benchmark qsort vs. C++ std::sort to see how much (a lot).

So when writing performance-critical C code, it's a perfectly good option to re-write a standard library function, especially a simple one like this. Make it an inline function in a header file to make sure the compiler can optimize.
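
A minimal sketch of that pattern (names are hypothetical, not the author's code): put the comparison in a header as a static inline function so the compiler sees the body at every call site, instead of calling out to libc or through qsort's function-pointer callback.

  /* str_util.h -- hypothetical header */
  #ifndef STR_UTIL_H
  #define STR_UTIL_H

  /* Visible to the compiler at every call site, so it can be inlined and
     optimized together with the surrounding code. */
  static inline int my_strcmp(const char *a, const char *b)
  {
      while (*a == *b && *a != '\0') {
          ++a;
          ++b;
      }
      return (unsigned char)*a - (unsigned char)*b;
  }

  #endif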

If your compiler can do link-time optimization across these function calls, then you don't have to re-write them. Until LTO becomes mainstream, sometimes you have to.


But that's sort of my point; today's optimisation can be tomorrow's performance regression, security bug, or be outdone by an update by the OS vendor.

It's very rare that a program's performance is being held back by a standard library function (speaking of libc here). I remain highly skeptical that the algorithm isn't the real issue instead of the implementation of standard library functions.

It's just not worth it.


> But that's sort of my point; today's optimisation can be tomorrow's performance regression, security bug, or be outdone by an update by the OS vendor.

Code should be written for today, not for tomorrow. Sure, link time optimization may cure the issue but it's not mainstream yet.

Software has to be maintained, it can be changed later if a performance regression or a bug is found.

> It's very rare that a program's performance is being held back by a standard library function (speaking of libc here). I remain highly skeptical that the algorithm isn't the real issue instead of the implementation of standard library functions.

Benchmark it and see. I've done several benchmarks like this, and in general proper function calls can be surprisingly costly -- not because the function itself is expensive, but because the function call also inhibits compiler optimizations at the call sites.

Any libc function is probably faster than a hand-written naive implementation, but that only applies when you have large inputs, like megabytes' worth of almost-equal strings. A good optimizing compiler can (and in this case, will) make a naive loop implementation perform better.

> It's just not worth it.

The benchmark says it is worth it. Why argue against it?


> But that's sort of my point; today's optimisation can be tomorrow's performance regression, security bug, or be outdone by an update by the OS vendor.

>It's very rare that a program's performance is being held back by a standard library function (speaking of libc here). I remain highly skeptical that the algorithm isn't the real issue instead of the implementation of standard library functions.

In the last year, we encountered a bug with Solaris Sun Studio 12 on x64 where memcpy wasn't automatically inlined, forcing a full function jump every time it was invoked. That was a major performance hit, and it forced us to switch to an internal implementation (which is normally worse on Solaris). IIRC, we didn't have much luck getting a patch out of Oracle for the issue.

So no, this really isn't true. In an ideal world, it would be.


How about "use the standard library function unless proven guilty"? Sure, compiling with Solaris Sun Studio 12 on x86 may be a loss, but what about SPARC? Or GCC on Linux?


SPARC is pretty much a lost cause for us. I believe that we have had better performance with the system memcpy over anything we've written.

Linux GCC was all over the place, depending on the Red Hat Enterprise version. IIRC, up to RHEL 4 our internal code worked better, but with 5 and 6 the included memcpy is generally better.

If you're writing code that has serious performance requirements, experimentation is key. There's absolutely no guarantee that the library call will be better than a hand-rolled one.


If you are forced to write your own implementation, I would rather switch to it completely. IMO it's better to be bold and get problems detected through wide adoption of a function than to hide it behind an edge case where problems might lurk.


Whether it was actually a bug is unclear from your description. By default, Sun Studio intentionally doesn't inline functions defined in system header files unless specifically requested.

It's also at the discretion of the compiler whether to permit some functions to be inlined. The compiler man page outlines this caveat, and mentions that inlining standard library functions is discouraged as it can cause errno to become unreliable.

Finally, there's also a question as to whether (again) there was a bad algorithm being used as opposed to the fault being with a standard library function. Yes, it's possible there was a performance pathology with the particular use case you have, but there's almost always a better way to resolve an issue like that than hand-rolling a standard library function which inevitably causes unexpected issues.


This is good advice for a novice, but I have to say from my experience I pretty vehemently disagree.

I suppose there's a sort of spectrum of programmer personality types that may help to explain the disagreement. Please forgive me if I am prone to hyperbole in my description.

One type of programmer says "never rewrite a function if it exists in a library". They do this in part because they are not confident in their own abilities (see disclaimer about hyperbole), and generally trust code if it's written by somebody else.

Another type is more of a cowboy. The cowboy wants to write everything from scratch, for the opposite reasons - because they don't trust code not written by them.

Someone who reliably adheres to only one of these extremes at all times is probably not a good person to work with. However, both of these motivations do have some basis in fact and practicality. On the one hand, use of libraries gets the job done quickly and can save you some headaches. On the other hand, there are some pretty bad to horrible libraries out there, many of them quite popular; introducing dependency on the library can add considerable complexity, code size, and weaken your own "whole-program" understanding of what's going on. I'd say that being distrustful of dubious library dependencies is a good thing.

In practice, even your libc is not necessarily going to be as well maintained as you suggest. The people who maintain such libraries might not have the motivation to do the sort of periodic performance tweaks and updating for the current era that you ascribe to them. They might have a working implementation and decide to stick with it as long as possible. (In my experience this is more common than your rosy picture.) That's fine for them but it doesn't mean the library will work for you, or help you meet your performance goals. One problem with your side of the spectrum is that the people who write the libraries aren't always perfect or even good authorities. You have to know when to let go of your trust in them.

And especially, if I really need something to be faster and it's something that I can spend a few hours on, benchmark, and either decide to take or maybe just revert and go back to the library... Well then yes, I'm going to take a crack at it. One can always revert it later.


Except we're not talking about any old library function here. We're talking about standard library functions such as memcpy, etc. Those functions are optimised specifically for different hardware platforms on every major operating system.

For example, Solaris, Linux, and Windows all feature versions of memcpy that are specifically optimised for different hardware platforms. Intel in particular supplies these optimisations directly to the OS vendors for a number of operating systems.

My advice only applies to the standard library (really, just libc). It's not intended to be, nor is it applicable to, anything else for the purposes of this discussion.

For example, in Solaris alone (an older version of OpenSolaris actually, but it gets the point across), there are seven easily found versions of memcmp:

  http://src.opensolaris.org/source/search?q=&project=onnv&defs=memcmp&refs=&path=libc&hist=
Six of those versions are written in assembler specifically for a particular hardware platform. One is a generic C version. Furthermore, as an example, the amd64 version of memcmp alone has optimisations for at least 14 different cases. Everything from 3DNow! optimisations to optimisations based on data size and alignment.

Also, as for your concerns that OS vendors don't update the standard library functions as often as you think -- you're wrong. I know for certain that Linux, Windows, and Solaris all receive continual updates for new hardware platforms for their standard library. There's a reason you hear these vendors constantly talking about their SPEC benchmark numbers or the like.

So again, as far as standard libraries are concerned -- don't do it; it's not worth it.


While I agree on the general sentiment of your post, I remember that in the case of memcpy, at least the GCC implementation is a "most general case" implementation which tries to have good performance in average usage.

I remember reading about this some time ago when I was doing some tinkering. I read that, for example, John Carmack decided to implement his own memcpy tailored to the games they were programming. Additionally, I read about different alternative implementations of memcpy which were better than the standard one for specific domains. I even experimented with using Microsoft Detours to replace the vanilla memcpy with a faster version.


> Six of those versions are written in assembler specifically for a particular hardware platform. One is a generic C version. Furthermore, as an example, the amd64 version of memcmp alone has optimisations for at least 14 different cases. Everything from 3DNow! optimisations to optimisations based on data size and alignment.

None of those optimizations matter in this case because the average query string parameter name is only a few characters in length and an inlined naive loop with compiler optimizations will be faster.


> standard library functions as often as you think -- you're wrong. I know for certain that Linux, Windows, and Solaris all receive continual updates for new hardware platforms for their standard library.

Well I was actually speaking from experience here. I might be more willing to believe you for something like memcpy(), but as an example (and the one I was thinking of), the Windows CRT is generally in pretty poor shape. I've also seen some crufty things in libc trees from some of the *BSDs.


Implementing something using as little foreign code as reasonably possible can actually be a good way to make it correct and secure. You write it once and prove it's correct. After that, you're done, no matter how the libraries around you change. If a library function gets a bug or gets slower, your code is unaffected. And your code likely won't get significantly slower as you update the compiler (it's more likely it will get faster).


Also note that only the most basic library functions are likely to be optimized specifically for your platform (e.g. in assembly). Mostly basic memory stuff like memcpy, strcpy, memcmp, strcmp, strchr, ... Anything more complex can usually be made very fast with just pure C code, comparable to the implementations that come with your compiler.


I think this is a valid point. I usually try to rely on the optimizations that are already done for me. I think in this case the strcmp function may be doing some extra work based on locale/collation that I don't need, which is what makes my code faster. But point well taken.


It's interesting that he points out that testing for zero is cheaper than comparing two numbers. Interesting because this might not always be the case.

For a quick test I used the conditions (i=0; i < NUMBER; i++) and (i=NUMBER; i != 0; i--). Without optimizations turned on, gcc translates both of them to cmpl (compare long) operations followed by a jne (jump if not equal). So at the assembly level they're absolutely identical, except that in one case you're testing against zero whilst in the other you're testing against the number. Fine, so we'll have to dig deeper.

And this is where it gets complicated. This optimization depends entirely on the inner workings of the ALU. Theoretically a test against zero is cheaper because it needs no subtraction at all: the value's own sign and zero bits already give the answer, whereas comparing two arbitrary numbers means computing their difference first and then checking its sign and zero bits. On a hardware level such operations might be parallelized inside the ALU though, so a comparison of two numbers might actually take exactly as many clock cycles as the comparison against zero.

It's however an interesting excursion into how such basic things work. And concerning my test: They're both about equally fast on my machine. I've done multiple runs with no clear winner.


The second version (test against zero) benefits from the fact that the subtraction instruction sets the Z flag automatically. So the end of the second loop is something like "sub i,1" followed by "jnz top".

The most straightforward implementation of first loop would be "add i,1" followed by "cmp i,NUMBER" followed by "jb top" (or "jl top" if i is signed). It's an extra instruction which may be even slower, depending on specific CPU and surrounding code, if NUMBER is a literal or memory access (as opposed to register value).

My guess is that your compiler will produce code that takes advantage of the savings if you turn on optimizations, but you might have to actually do something in the body of the loop to keep the compiler from optimizing the loop completely out (or tweak flags to enable/disable specific optimization techniques). Adding a loop body will make the timing difference less noticeable, as the single-instruction savings of the zero-test version becomes a smaller percentage of the total time per iteration.

Disclaimer: This is how I'd write assembly by hand. A compiler could conceivably do something quite different, depending on the specific application code, compiler, version and flags. I also haven't yet taught myself 64-bit x86 assembly code, although I understand it's rather similar to 32-bit x86 with more registers available.


The compiler optimizing out the loop completely is one of the reasons why I didn't turn on optimizations here, although there was a loop body. gcc is smart enough to optimize loops with a fixed outcome away in some cases. I probably should have written a more complex loop body.

That said, it's very likely that the compiler will make sense of such an optimization and produce the quicker assembly code like you just wrote it.

What I wanted to say with my post was really that things like these are heavily dependent on the architecture and that comparisons could be optimized in hardware.


> The compiler optimizing out the loop completely is one of the reasons why I didn't turn on optimizations here

Use a non-constant value for the loop and put a dummy load inside to stop the compiler from doing loop unrolling and/or dead code elimination.

I often use time(), rand() or even argc to get dummy values when looking at assembly from compiler optimizations.
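
A minimal sketch of the technique (the loop body is arbitrary; the point is that the bound and the data aren't compile-time constants and the result is observable):

  #include <stdio.h>
  #include <time.h>

  int main(int argc, char **argv)
  {
      (void)argv;
      unsigned n = 100000000u + (unsigned)argc;   /* non-constant bound */
      unsigned x = (unsigned)time(NULL);          /* non-constant data */
      unsigned acc = 0;

      for (unsigned i = n; i != 0; i--)
          acc ^= x + i;                           /* dummy work to keep */

      printf("%u\n", acc);    /* observable result prevents elimination */
      return 0;
  }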


One can just write the loop as

  for (..; ..; ..) asm("");
GCC doesn't introspect the asm statements, and so it leaves the loop there. (At least, it did for me, even with -O3.)


(Just as a note, the count-down condition is normally (i = NUMBER; i--; ); the condition you have gives different results: i is 1 larger than one expects.)

> Without optimizations turned on, gcc translates both of them to cmpl operations followed by a jne

Why would you expect anything different? It isn't doing any optimisations, so it is doing the naive method of transliterating the C to ASM.

The actual test is when one turns optimisations on. The code generated by GCC -O3 for your two conditions is (respectively):

  .L21:
        addl    $1, %eax
        cmpl    %edx, %eax
        jne     .L21
and

  .L13:
        subl    $1, %eax
        jne     .L13
And for the condition I gave above

  .L7:
        subl    $1, %eax
        cmpl    $-1, %eax
        jne     .L7
You can draw your own conclusions from that, but clearly your count-down method uses fewer instructions.

(Also, one can test for 0 extremely quickly: just check that all the bits are 0.)


AT&T syntax makes me want to gouge my eyes out.

You have to add these suffixes "l" or "w" to all your instructions to tell the assembler how big they are, but of course if you're using EAX you're talking about a 32-bit register; if you wanted to talk about a 16-bit register, you'd say AX, or AL (or even AH) for 8-bit. An assembler that isn't smart enough to figure out something that simple and obvious is a sorry tool indeed.

And then there's the fact that operands are backwards. "subl $1,eax" translates to "eax -= 1" in C-like syntax. Why switch the order of operands? It's completely unnecessary and confusing. It's like someone inventing a programming language where integer literals were negative if specified with no sign, and only positive if you put a + on them -- technically possible, but the sort of thing that serves no useful purpose and defies all prior conventions, common sense, and sanity.

And putting dollar signs on constants and percent signs on registers? I can't even imagine any possible explanation for that, other than a lazy parser author who didn't care about making extra work for their users. It's the type of corner-cutting that you'd expect to find in some prototype thrown together by a single person in a single evening.

Please, for the sake of your own sanity and those around you, use Intel syntax instead!


It depends on the architecture. I know that a register load on the Motorola 68k CPU will set the flags (Z for zero, N for negative) while a register load on the Intel x86 line doesn't change the flags and thus, a comparison needs to be made.

Both architectures, however, include special instructions to handle counting loops (for instance, for (i = S; i != E; i += step)) that count downwards, but they have slightly different ending conditions (Intel ends on 0, Motorola ends on -1), and the Intel one is really specialized (it requires the use of the CX (ECX, RCX) register; on the Motorola you can use any of the data registers). That makes it kind of hard to exploit from portable C code---your best bet is to not use the for index in the body of the loop and hope the compiler can transform it to use the specialized instructions.

If speed is really critical, you need to check the output from the compiler, and make sure you always measure twice (and keep the original code around as a reference when the hardware changes).


There is no need to compare - ORing the register with itself will suffice to set the Z flag. This is more concise on x86 - I think it's 2 bytes to OR a 32-bit register with itself, vs 3 for a compare with a (sign-extended) 8-bit immediate zero.


Any more detailed information about the coding challenge?

For example, the repo references a 5Murls.txt file, but it isn't part of the repo. The blog says it needs to "output in a standardized format:", but doesn't specify what "standardized" means, nor does the current code actually output anything (the printf is disabled). Does it specifically need to go to stdout, or just exist somewhere in memory? Does it require a char*, or will this char* be instantly hashed and tossed away? Does the challenge forbid the use of threads/cpus/workers?

In the blog it says: "In this particular context the plugin architecture we were writing against allowed for returning the original string or a new malloc’d string". Are you allowed to muck with the original string? The code on github has a function that takes a const char*, forcing a malloc, but it could easily just re-write the string in place if allowed.

edit: As for sorting the params (based upon your comments about the common usage), I'm pretty sure there is a way to do this without any string comparisons at all. Post a sanitized 5Murls.txt file and I'll give it a go at making a patch.


The best algorithm depends upon the data presented. What is:

- The average number of parameters

- The % that are already sorted

- The % that don't have keys that start with the same letter


It also depends on the assumptions we make about the problem.

- Do we have to validate the URL strings in any way?

- What do we need to do with duplicate names?

- Are cases with small parameter names and/or small numbers of parameters common?

- How common are parameters with lengthy common prefixes?

- Is there a bound on the length of the longest common prefix of two different parameter names?

Tuning or special-casing with common cases in mind can often produce big wins.


I made a varnish urlsort module a while ago: https://github.com/cyberroadie/varnish-urlsort

Here my blog post about it: http://cyberroadie.wordpress.com/2012/01/05/varnish-reorderi...

It parses the url and every token gets added to a binary tree, which then gets traversed in order to get the parameters alphabetically.

It's been used in production in several companies I worked for, so tried and tested :-)

Olivier


Interesting. Do you have any benchmarks handy to show how much of a difference this makes? I'd wonder if the dynamic allocations necessary to build the tree might undercut the lookup advantages.


As a note to the author about separating the arrays instead of using a structure, it was probably an alignment issue? Check out http://en.wikipedia.org/wiki/Data_structure_alignment


More specifically, a pointer in x86-64 (and most other 64-bit architectures) is going to be 8-byte aligned, which means that a struct of an 8-byte pointer plus a 2-byte short will actually consume 16 bytes.

Alternatively, keeping them in separate arrays removes the alignment overhead, and each pair of values will consume 10 bytes. That's a substantial savings when you're trying to fit into something like a 32KB L1 data cache.

Another technique that can help you fit as much useful data as possible into L1/L2 cache is to store offsets rather than raw pointers. Using a 2-byte unsigned short for an offset rather than an 8-byte pointer would save another 6 bytes of cache per entry, assuming you're happy limiting your URL size to 64 kilobytes. The extra address arithmetic should be essentially free on a modern pipeline.
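
A quick demonstration of the numbers above (type names are made up for illustration; sizes assume a typical 64-bit ABI where pointers are 8-byte aligned):

  #include <stdio.h>

  struct ptr_and_len {
      char *start;            /* 8 bytes */
      unsigned short len;     /* 2 bytes, then 6 bytes of padding */
  };

  struct off_and_len {
      unsigned short off;     /* offset into the URL, limits it to 64KB */
      unsigned short len;
  };

  int main(void)
  {
      printf("struct of pointer + short:  %zu bytes\n",
             sizeof(struct ptr_and_len));                       /* 16 */
      printf("parallel arrays, per entry: %zu bytes\n",
             sizeof(char *) + sizeof(unsigned short));           /* 10 */
      printf("struct of two offsets:      %zu bytes\n",
             sizeof(struct off_and_len));                        /*  4 */
      return 0;
  }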


I was hoping someone would fill in the details; it's been quite a while since I've dealt with anything like that. Thanks!


AoS vs SoA strikes again?


Thanks guys. This is great info. I messed around with different types for the length value in the struct. Figured something like this was involved.


> Pointer arithmetic I have always heard of pointer arithmetic but now I understand why it’s so useful. Taking a pointer to the first element in an array and incrementing it to iterate through it’s elements is elegant and more efficient than taking an integer i and using it as an array index. I wish more languages afforded this.

Trust me, no you don't. Pointer arithmetic is one of the easiest things to screw up in C, and the raw form can be very dangerous to work with.

For sure, treating a pointer as an integer and incrementing by a fixed number of bytes each time is fine, but more often than not languages have this implemented, just as a JIT or compiler optimization.


> For sure, treating a pointer as an integer and incrementing by a fixed number of bytes each time is fine, but more often than not languages have this implemented, just as a JIT or compiler optimization.

And here, you've inadvertently pointed out one of the major pitfalls (and amazing virtues) of pointer arithmetic: in C, pointers increment by the size of their data type. You don't add some number of bytes to the pointer, you add the number of elements by which to increment.

In other words, incrementing a uint8_t* will add one byte to the pointer address, but a uint32_t* will add four bytes, and a pointer to a 64-byte struct will add 64 bytes.

Here's an example. This code:

  #include <stdio.h>
  #include <inttypes.h>
  
  struct foo {
  	char bar[67];
  };
  
  int main()
  {
  	struct foo monkey[60];
  	struct foo *zoochild;
  
  	printf("sizeof foo: %zu\n", sizeof(struct foo));
  	printf("sizeof monkey: %zu\n", sizeof(monkey));
  
  	zoochild = monkey; // child points at first monkey
  	zoochild++; // child points at second monkey
  	printf("(monkey + 1) - monkey: %zd\n", zoochild - monkey);
  
  	printf("(bytes) (monkey + 1) - monkey: %zd\n", (uint8_t *)zoochild - (uint8_t *)monkey);
  
  	return 0;
  }
produces this result

  $ gcc -O4 -Wall -Wextra -pedantic -std=c99 ptr.c -o ptr
  $ ./ptr
  sizeof foo: 67
  sizeof monkey: 4020
  (monkey + 1) - monkey: 1
  (bytes) (monkey + 1) - monkey: 67


You are correct, of course. But he is correct in saying that this method is elegant. Thus, iterators, which, in a much safer form, can be found in Java and Python and many other places.


Good read. Reminds me of the oldie but goodie http://ridiculousfish.com/blog/posts/old-age-and-treachery.h...


Interesting find.


I'm trying to picture a non-synthetic program where normalizing a URL is a significant part of the inner loop and I'm drawing a blank. Can someone help me out?


Here's a similar case: Ranking of poker hands. That can involve many of the same steps, like sorting a small list of inputs (5 or 7 cards) into a normalized configuration (three-of-a-kind is represented as AAABC where B > C) and taking a hash to memoize the result. And you sure might want that in a tight inner loop if you're writing a poker simulator or AI where you want to crunch a few billion iterations.


Actually, for a poker hand evaluator the last thing you want to do is sort the hands. There's a small enough number of unique card combinations that you can arrange it so that you look up in pre-computed tables more often than not.

One clever trick is assigning a prime to each card value. By multiplying the values together you get a unique number representing each hand without needing to sort.

There's a great description of an algorithm at http://www.suffecool.net/poker/evaluator.html.
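
A hedged sketch of the prime trick (card encoding and names are made up for illustration): each of the 13 ranks gets a prime, and the product of a 5-card hand's primes identifies its rank multiset without sorting. 41^5 is about 116 million, so the product fits in 32 bits and can key a precomputed lookup table.

  #include <stdint.h>

  static const uint32_t rank_prime[13] = {
      2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41
  };

  static uint32_t hand_key(const int ranks[5])   /* ranks in 0..12 */
  {
      uint32_t key = 1;
      for (int i = 0; i < 5; i++)
          key *= rank_prime[ranks[i]];
      return key;    /* same multiset of ranks => same key, any order */
  }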


They want to use a normalized URL as the hash key for a cache system.

Depending on exactly what type of hashing they are doing, the normalization could be a significant fraction of the lookup time.

If you are going to get all micro-optimized, you should probably combine the normalization and hash steps, as there is likely some string parsing / building that could be eliminated.
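
A hedged sketch of what combining the two steps might look like, assuming the sort produces an array of parameter pointers and lengths (all names here are hypothetical, and FNV-1a is used only as an example hash): fold the bytes of each parameter into a running hash in sorted order, skipping the intermediate normalized string entirely.

  #include <stdint.h>
  #include <stddef.h>

  static uint64_t hash_sorted_params(const char **params, const size_t *lens,
                                     size_t count)
  {
      uint64_t h = 14695981039346656037ULL;         /* FNV-1a offset basis */
      for (size_t i = 0; i < count; i++) {
          for (size_t j = 0; j < lens[i]; j++) {
              h ^= (unsigned char)params[i][j];
              h *= 1099511628211ULL;                /* FNV-1a prime */
          }
          h ^= '&';                                 /* separator byte */
          h *= 1099511628211ULL;
      }
      return h;
  }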

Also: it's a pretty fun exercise.


Very interesting. The author at first seemed not that familiar with C, but in the end he got a good result.

It also depends on whether there's pre-validation (even with C you're never sure; for example, reading that he used strcpy makes my head go "security hole", but this is a proof-of-concept so it's ok).

The best thing to keep in mind is that "There is no such thing as the fastest code"

Edit: strcmp to strcpy (or any str C function)


Reading strcmp() makes you think "security hole"? Why?


It does not apply here, but strcmp is, due to the early bailout that any sane implementation uses, susceptible to timing attacks, as in http://www.cs.rice.edu/~scrosby/slides/SCISS-latency.1.ppt.


A timing attack against URL query strings as processed by a normalizing cache proxy?


Quoth the parent: "It does not apply here"


Sure. The flaw here isn't strcmp. In fact, most crypto compares don't use strcmp, even in naive code; an HMAC-SHA1 MAC, for instance, is an array of 8-bit bytes, not the hex string that programs encode them into for human consumption. "memcmp" is the normal culprit.

Timing attacks aren't a flaw in memcmp or strcmp. Touching every byte of a string is stupid behavior in the overwhelming majority of cases.


Unless, of course, you're Nintendo: http://wiibrew.org/wiki/Signing_bug


Good find. Again, notice how the problem here isn't strcmp or its timing behavior; it's mistakenly using ASCIIZ strings to hold ciphertext.


Actually I meant strcpy, but the point stands.

Point your string function at a non-null-terminated byte buffer, for example. At the very least you can crash the app.

What's recommended (one of the recommendations) is to use the 'n' functions, like strncmp, that take a maximum size.


I'm sorry, I should have been more direct. No, I meant to say: using "strcmp" in a typical C program is not a security flaw. It was clear to me you were thinking of "strcat" or "strcpy".

Using strncmp in this situation makes very little sense and is probably more dangerous. The lengths given to strncmp() are inevitably going to be derived from something else that requires a NUL terminator. Meanwhile, strncmp() leaves you open to logic flaws where you compare too few bytes.


It doesn't take the length/number of characters to compare as an argument, relying instead on the null terminator, meaning it's susceptible to a buffer overflow attack.

Moral of the story, if you're going to use C strings, use the strn* variants.


No, strcmp is not susceptible to buffer overflow attacks.


https://buildsecurityin.us-cert.gov/bsi-rules/home/g1/847-BS...

If passed an unterminated string, the function will at least fail. As for how much you could exploit from that -- I guess I exaggerated.


That's not a buffer overflow.

This whole subthread of picking on the guy's implementation because of "strcmp" is pretty silly. There are times where strcpy() is safe to use, but most of the time it's a red flag. There are conceivably times when strcmp() is unsafe to use, but to a professional reviewer, it is very rarely a red flag.

I should have just come right out and said that, rather than begging for the rationale for picking on strcmp().


I prefer strcpy to strncpy, and make sure that the string will fit before the call. I use strncpy if I want to copy n characters from s2 to s1. strncpy can give a false sense of security IMO, since it may not add a terminating zero -- for example when using strncpy with strlen. The BSD strlcpy does always null-terminate the string, but it's non-standard.
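
A minimal sketch of the two approaches being contrasted (buffer size and function names are hypothetical):

  #include <string.h>

  enum { BUF_SIZE = 64 };

  /* Check the length first, then plain strcpy is safe. */
  int copy_if_fits(char dst[BUF_SIZE], const char *src)
  {
      if (strlen(src) >= BUF_SIZE)
          return -1;              /* caller decides what to do */
      strcpy(dst, src);
      return 0;
  }

  /* strncpy does NOT write a terminating NUL when src has BUF_SIZE - 1 or
     more characters, so the terminator has to be added by hand. */
  void copy_truncated(char dst[BUF_SIZE], const char *src)
  {
      strncpy(dst, src, BUF_SIZE - 1);
      dst[BUF_SIZE - 1] = '\0';
  }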


strcmp() should be the least of security concerns. The input buffer in test.c is directly passed to the url_sort() function => Buffer overflow attacks.


The hand rolled strcmp almost certainly performed better than the native one merely because it can be inlined; consider using rep cmpsb or directly including another assembly version of strcmp (though since the strings are short the latter's overhead might not be worth it). Ditto memcpy.

Perhaps also consider using a radix sort?


Good work. One obvious speed-up is, if possible, removing the use of dynamic memory. Simply supply a static output buffer to url_sort so that it doesn't have to allocate/free memory for the resulting URL.

That took my execution for 5M lines of (not really representative but it's all I could be bothered to sort out at 11pm):

  /test.php?b=b&c=bb&a=3&a=4
down from 5.05s to 4.49s (so down 10%) on an old 1.5GHz Athlon Linux box that whirrs away in the corner of my flat. (Down to 2.81s when compiled with -O4).

(I assume you're compiling with -O4, as that's the only way I could make it worthwhile using your str_compare() function over libc's strcmp().)

The only other thing I can think of that may speed it up more is a kind of binary search on the insertion as that is currently just a flat loop that does a comparison against each param; and therefore O(n).

In a pessimal case where you've already got 30 params and your 31st is going just before the tail (but not as the tail, so you can't shortcut straight to adding it as the tail), then it'll do a whole bunch of string comparisons on entries 1, 2, 3, ..., 30 before finding the right place. It'd be better to do them on, say, 1, 15, 23, 27, 29, 30. (This will only be beneficial when you have more than a certain number of params.) An O(log n) algorithm will hopefully outweigh the slight expense of having to compute the midpoints.

But, given you've got to 2M urls/sec, do you really need to eke out any more performance? That's an awful lot of traffic if you think each HTTP request will be about 500 bytes (including TCP headers, HTTP headers, etc). 2M * 500 bytes * 8 bits = ~8Gbps, and that's just the incoming traffic from established connections.

For the binary search suggestion, I get to point out some minor nitpicks:-

* Not all C compilers accept C++ style comments (gcc is way too lenient by default)

* Not all C compilers accept local variables to be defined mid-function

* unsigned long is not the same as size_t (fun fun fun when porting to different word sizes)

* If you ever have to port this code to work on a different CPU you may find yourself spending time adding lots of casts to (ptrdiff_t) to avoid it breaking due to unexpected sign extensions/etc.

These may seem OTT for a project like this, but it's not trivial when you've continued those practices for years (and new devs have copied the house style) and have a bunch of projects totalling 20MLOC+ that you then need to port to a new compiler on a new architecture (with a different word size) and the code is riddled with assumptions about the old compilers and architectures. Some things can be automated, but spending days moving variable definitions to the top of functions and fixing the types of variables that hold return values of strlen() in a seemingly endless list of thousands of files gets boring after a very short period. Me? Bitter?

"gcc -Wall -ansi -pedantic" and lint should be your friends.


// comments and mid-function variables are C99 standard. The platforms that don't support them are probably not worth supporting yourself, and are not likely to be seen running Varnish. This is the same mindset that held Mozilla back for ages and ages because the HP-UX compiler from 1987 didn't support some C++ feature, so nobody was allowed to use it even in 1997.


True, but not using // comments and mid-function variables is not quite the same thing as not using a whole feature.

I just know how much work there is to do, at short notice, when a customer comes along offering $$$ for a version of your product on an architecture that you hadn't originally considered. It was definitely worth the hassle of porting and cleaning up to code to provide them with it, but it would have been so much easier if we'd kept it clean from the start.


Awesome code review. Thank you.


[EDIT - Fix unintentional italics]

Another ~5% off by getting rid of the branch inside the str_compare() loop. Here was your original:-

  static inline int str_compare( const char *a, const char *b )
  {
        for(;*a == *b; ++a, ++b)
        {
                if (*a == '\0' || *b == '\0')
                {
                        return( 0 );
                }
        }
        return( *a - *b );
  }

Inside the loop, if *a is NUL then so is *b, so you don't need to test both for NUL. Also, if *a and *b are both NUL then ( *a - *b ) will be 0. You can just iterate through the strings while *a and *b are equal and non-NUL, and then return the difference between the two characters, so:-

  static inline int str_compare( const char *a, const char *b )
  {
        for(; (*a == *b) && *a; ++a, ++b);
        return( *a - *b );
  }
[ C-Holy-War ] Personal style preferences: I prefer while loops, so much so that it has been years since I've purposely written a for loop in C, and writing C is what I do for my job. For loops like that end up looking like line noise; I'd find the following much easier on the eye and, more importantly, more immediately obvious to anyone else reading the code as to what it does:-

  static inline int str_compare( const char *a, const char *b )
  {
        while( ( *a == *b ) && *a )
        {
                a++; b++;
        }
        return( *a - *b );
  }
But, as I said, that's my own style preference. There's nothing wrong with using for loops; it's a perfectly acceptable implementation.


Hey, thanks for the suggestions. One thing that I found is that testing both strings for null actually performs faster, because you may have parameters of different lengths and bailing on the null character of the shorter one is ideal. And actually testing for an ampersand would probably bail even earlier in cases of duplicate params.

And maybe you can shed some light on this. I am a fan of the while loop too, but I found the for loop to be more performant. Is it possible that there are optimizations for variables touched in the for(;;) definition that do not exist for the while() definition?


How is testing for a terminator the right approach? The comparator is only supposed to look at a single parameter, and the only null byte in the url string is at the end...


Both implementations have exactly the same outcome for all inputs; it's just that one has less code (well, code that is less complex) inside the loop and is therefore more pipeline-friendly and/or just executes faster.

Yes the string comparator is comparing items like:

  "a=3&a=4" and "a=4"
but it doesn't really matter since it'll never put such items in the wrong order. It's a quirk of passing the original url as 'const char *' so that it can't be modified (to temporarily replace the & symbols with NULs) and the OP was trying to avoid the cost of copying the input string except when creating the output string.

You should only look at the next byte in either string if both of the current bytes are non-NUL. If both are equal, and one is non-NUL then the other is non-NUL and so you can continue.

If they're non-equal then you stop and return the difference between the two which will be +ve or -ve.

If they're equal and one is NUL then you stop and return the difference which will be 0 since they are equal.


I'm aware that it will return the right answer, but I was worried that it may look at too much of the url string in order to do so.

Consider the (admittedly perverse) input "test.php?a=1&a=1&a=1&a=1&a=1". At each comparison, the comparator you suggest will look at the entire remaining part of the string.

Maybe duplicate elements are considered invalid input and aren't a concern, I'm not sure. But fixing it is not hard and doesn't require fiddling about with copying strings or adding terminating nulls: just find and remember the parameter length once, and use a length-based comparator.


True, and it's a simple change to stop the comparison at either a NUL or a '&', so that it doesn't spend that extra unnecessary time.
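
A minimal sketch of that change (a variation on the str_compare() above, not the OP's code): treat '&' the same as the end of the string, so each comparison stops at the end of the current parameter instead of running on to the end of the URL.

  static inline int param_compare(const char *a, const char *b)
  {
      while (*a == *b && *a != '\0' && *a != '&') {
          ++a;
          ++b;
      }
      /* Map '&' and NUL to the same terminator so "a=1&b=2" and "a=1"
         compare as equal parameters. */
      int ca = (*a == '&') ? 0 : (unsigned char)*a;
      int cb = (*b == '&') ? 0 : (unsigned char)*b;
      return ca - cb;
  }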

The other two improvements I'd be looking to do are:

1) try to rework the insertion sort to make it use a binary search (if over a certain threshold that is yet to be determined by experimentation). Should be faster than looping through each element each time.

2) Not automatically extending the tail each time (where insertion somewhere between head and tail is required). The element being added may be one from the head, so it's best to work out where it goes first and then pick the extension direction that requires the fewest elements to be shuffled up. Also, the shuffling can be done in bulk using memmove() (you can't use memcpy since the areas of memory will overlap when shuffling up by more than one position); this should also be faster than doing it one element at a time.
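
A minimal sketch of the binary search plus bulk memmove() idea, assuming a sorted array of pointers to the parameters found so far with room for one more entry (names are hypothetical, not the OP's code; the comparator is the str_compare() from earlier; choosing to extend towards the head instead, when that would move fewer elements, is left out for brevity):

  #include <stddef.h>
  #include <string.h>

  static inline int str_compare(const char *a, const char *b)
  {
      for (; (*a == *b) && *a; ++a, ++b);
      return *a - *b;
  }

  /* Binary search finds the insertion index in O(log n) comparisons; a
     single memmove() then shifts the tail up in bulk rather than moving
     one element at a time. */
  static void insert_param(const char **params, size_t *count, const char *p)
  {
      size_t lo = 0, hi = *count;

      while (lo < hi) {
          size_t mid = lo + (hi - lo) / 2;
          if (str_compare(params[mid], p) <= 0)
              lo = mid + 1;
          else
              hi = mid;
      }
      memmove(&params[lo + 1], &params[lo],
              (*count - lo) * sizeof params[0]);
      params[lo] = p;
      (*count)++;
  }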

Anyway, I'm off on holiday, so I've got about another 15 minutes to look at this and I'm not going to get far. If I remember, I'll try and check up on the status of it on github when I get back.


Your insertion sort scans downwards from the highest populated index, bumping elements up as it goes. Given that you have space on both sides, I'm surprised that you didn't start in the middle and pick a scan direction based on the comparison result there.


When I read that, I feel OLD !


We're getting ready to deploy Varnish at (non-devops) work. Not having control over production and having the potential for inline C in the config was already making me a bit nervous. This thread is not helping!


Varnish has actually been a great tool for us. The 3rd-party url-sort plugin that we tried was not: it had a memory leak that caused the server to crash periodically. Once the url-sort module that we're building as part of this code challenge is tested and fully formed, we'll make it available as open source.


I'd never come across Duff's Device previously, and after staring at it for a few minutes it made horrifying sense. I can't tell whether it's a piece of true evil or true genius.
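
For anyone else who hasn't seen it, here's the memcpy-style adaptation that is usually shown (the original wrote to a fixed memory-mapped register instead of incrementing `to`); it assumes count > 0:

  #include <stddef.h>

  /* The switch jumps into the middle of the unrolled do/while loop to
     handle the remainder, then the loop continues eight copies at a time. */
  static void duff_copy(char *to, const char *from, size_t count)
  {
      size_t n = (count + 7) / 8;

      switch (count % 8) {
      case 0: do { *to++ = *from++;
      case 7:      *to++ = *from++;
      case 6:      *to++ = *from++;
      case 5:      *to++ = *from++;
      case 4:      *to++ = *from++;
      case 3:      *to++ = *from++;
      case 2:      *to++ = *from++;
      case 1:      *to++ = *from++;
              } while (--n > 0);
      }
  }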


More like evil genius… Variations of this trick have been a common optimization in embedded systems for years. (Stuff like this is why “premature optimization” is considered such a bad thing.)

On that note, I find it interesting that the article writer wanted to see more languages support pointer arithmetic. Language mechanics like that have developed a reputation for being “unsafe” because they enable common programming errors like buffer overflows and the like. Depending on the level of abstraction between the language and the underlying iron, it might not even be possible to do stuff like this anymore. (Some modern languages/VMs/compilers use the more abstract concept of “references” over “pointers.” Even Apple’s flavor of Objective-C, which AFAICT is still using pointers as an implementation detail, prefers to couch documentation and API conventions in the more abstract terms to deter some of the direct memory access abuses that were discovered during the Classic-to-Carbon Mac migration way back when…)


I had this same experience. Literally giggled a little bit on the subway when it hit me.


Nice, I remember my first coding challenge in C. Nice result for a first attempt.


Thanks!


A great read. I am curious about your double-wide insertion sort algorithm.


Thanks. I haven't seen it anywhere else, but can't really be sure I came up with it. Just seemed like a good optimization.



