Redis on the Raspberry Pi: Adventures in unaligned lands (antirez.com)
201 points by bjerun on July 14, 2017 | 59 comments



I never deal with such low level issues, so I don't have to read this, but... reading these posts by antirez is such a joy. He makes this topic so clear and understandable, he doesn't assume much, he doesn't use overly complex explanations, he just "says it like it is" :-)

Thanks!


++ :)


I fondly remember unaligned access faults "back in the day" with FreeBSD/alpha. We implemented a fixup for applications, but not for the kernel. I seem to recall that even though x86 could deal with unaligned accesses, it caused minor performance problems, so fixing alignment issues on alpha would benefit x86 as well.

Most (definitely not all) of the misalignment problems were in the network stack, and were centered around the fact that Ethernet headers are 14 bytes, while nearly all other protocols had headers that were a multiple of at least 4 bytes.
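
To make that concrete: with a receive buffer that starts 4-byte aligned, everything after the 14-byte Ethernet header lands 2 bytes off. A minimal sketch in C (the aligned attribute is a GCC/Clang extension):

    #include <stdint.h>
    #include <stdio.h>

    /* The 14-byte Ethernet header pushes every following header
       off 4-byte alignment. */
    struct eth_hdr {
        uint8_t  dst[6];
        uint8_t  src[6];
        uint16_t ethertype;   /* 14 bytes total, no padding needed */
    };

    int main(void) {
        uint8_t frame[64] __attribute__((aligned(4)));
        /* The IPv4 source address lives 12 bytes into the IP header,
           i.e. at frame offset 14 + 12 = 26, which is 2 mod 4. */
        long off = 14 + 12;
        printf("eth header: %zu bytes; IP src addr at offset %ld (mod 4 = %ld)\n",
               sizeof(struct eth_hdr), off, off % 4);
        /* A common fix: DMA the frame in at buffer + 2 so the IP
           header comes out 4-byte aligned. */
        (void)frame;
        return 0;
    }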

I've said it before, and I'll say it again: If I had a time machine, I would not kill Hitler. I'd go back to the 70s and make the ethernet header be 16 bytes long, rather than 14.


Why in god's name did they make it 14?!


Ethernet was invented in 1973 and the first 32-bit processors were available in 1979.

While you've got the time machine, can you fix it so that "network byte order" and Intel endianness are the same too?


Or rather keep Intel from munging the order their processors write bytes in.


What?


The 32-bit VAX-11/780 was introduced in 1977 and IEEE 802.3 was not finalized until 1983, so they could/should have done something about it, I think.

Come to think of it, the VAX was little-endian (like Intel).


It's all they needed. 6 bytes per address, and 2 more bytes to mark the protocol. Back in the 70s and 80s memory was very expensive and developers bent over backwards to save bytes everywhere. This is also why IP addresses are only 32 bits long, even though they knew that it wouldn't be enough if the protocol went global.

Hindsight is 20/20, and a lot of the time people don't appreciate the constraints these old systems had. This was being developed a decade before the Commodore 64 came out with its luxurious 64 kilobytes of memory (39k usable).


They didn't feel like they needed those 2 bytes and, hey, why waste space?

Also, was the "byte" standardized at the time? Didn't they still have systems working with non-8-bit bytes, nibbles, and 2-byte boundaries?


From https://en.wikipedia.org/wiki/Byte#History - a byte was in the process of getting standardized at around the same time.


There is a funny mode on ARM processors (turned on by default in some images) that causes unaligned reads to silently return bogus data (just incrementing a kernel counter).
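
On ARM Linux that counter (and the policy: ignore, fix up, or signal) lives in /proc/cpu/alignment; a minimal sketch to dump it, assuming that file is present on your kernel:

    #include <stdio.h>

    /* Print the ARM kernel's unaligned-access counters and current
       policy. Writing a mode number to the same file changes the
       policy (see Documentation/arm/mem_alignment in the kernel tree). */
    int main(void) {
        FILE *f = fopen("/proc/cpu/alignment", "r");
        if (!f) { perror("fopen"); return 1; }
        char line[128];
        while (fgets(line, sizeof line, f))
            fputs(line, stdout);
        fclose(f);
        return 0;
    }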

PowerPC, and really, most non-x86 architectures, do this one way or another.


PowerPC (and POWER) has reasonable hardware support for unaligned memory access, at least for 32-bit data, and if the data is in the data cache. Depending on the processor, the exceptions that reach the OS can be more or less frequent.

ARM v6-A and later (except for some microcontrollers, like the Cortex-M0, that don't support hardware unaligned access at all and trigger an exception instead) are similar to the Intel x86 case: there is transparent hardware support for unaligned memory access. The exception is SIMD, where x86 can raise exceptions too, when the alignment-required load/store opcodes are used on unaligned addresses.

For software that makes intensive use of unaligned data access, e.g. data compression algorithms doing string search, PowerPC, ARM v6-A (and later ARM application processors), newer MIPS with hardware support for unaligned memory access, and Intel are pretty much the same (i.e. b = *(uint32_t *)(a + 23) will take 1-2 cycles, without requiring a memcpy(&b, a + 23, sizeof(uint32_t))).
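
The portable way to write that access is to hide it behind memcpy, which compilers on these targets fold into a single load; a sketch:

    #include <stdint.h>
    #include <string.h>

    /* Portable unaligned 32-bit read: on x86, ARM v6-A+ and POWER this
       compiles down to one plain load; on strict-alignment targets it
       becomes byte loads instead of a fault. Unlike the raw cast, it is
       also free of undefined behavior in C. */
    static inline uint32_t load_u32(const void *p) {
        uint32_t v;
        memcpy(&v, p, sizeof v);
        return v;
    }

    /* Usage, matching the example above: b = load_u32(a + 23); */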

For SIMD, though, there is no transparent fix, although there are specific opcodes for aligned and unaligned memory access (e.g. aligned load/store vs. unaligned load/store).


I would say that ARM v6 and later is a major step forward, but it is v8 that finally seems similar to Intel. v6 could deal only with single load/store unaligned instructions; things like accessing a double or multiple words with the same instruction would raise an exception.


Accessing unaligned 64 bit data in 32 bit ARM mode can generate exceptions, even in ARM v8 CPUs when running code in 32 bit mode. Full unaligned memory access for 16/32/64/128 bits is only guaranteed in AArch64 mode, if I recall correctly.


For 64-bit ARM (AArch64), load-exclusive/store-exclusive and load-acquire/store-release require aligned addresses. (Seems reasonable to me; trying to handle atomic accesses to unaligned data is no fun.) You also get a fault for any kind of unaligned access to Device memory, but you're not going to have that unless you're the kernel, and unaligned accesses to hardware registers are definitely not going to work out very well...

(The rules are all fairly clearly documented in the Architecture Reference Manual.)


Yes, I think you are right.


I'm probably the only weirdo that thinks this, but if you support byte-addressing you'd better as well be happy with byte-alignment. Atomics being the only place where it's reasonable to be different.

Which brings me to padding. I wonder what percentage of memory of the average 64-bit user's system is padding? I'm afraid of the answer. The heroes of yesteryear could've coded miracles in the ignored spaces in our data.
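
For a rough idea, C lets you measure the padding directly with sizeof; ordering members from largest to smallest recovers most of it. A small sketch:

    #include <stdint.h>
    #include <stdio.h>

    struct wasteful {      /* typically 24 bytes on LP64 */
        char     tag;      /* 1 byte + 7 bytes padding   */
        uint64_t id;       /* 8 bytes                    */
        char     flag;     /* 1 byte + 7 bytes padding   */
    };

    struct frugal {        /* typically 16 bytes on LP64 */
        uint64_t id;
        char     tag;
        char     flag;     /* + 6 bytes of tail padding  */
    };

    int main(void) {
        printf("wasteful: %zu bytes, frugal: %zu bytes\n",
               sizeof(struct wasteful), sizeof(struct frugal));
        return 0;
    }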


> if you support byte-addressing you'd better as well be happy with byte-alignment

All ARM processors do this. The concept is called "natural alignment" and it's pretty common on non-x86. See e.g. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.... . The problem here is that a lot of code written for x86 wants more than that, e.g. byte addressing for non-byte-wide values.


I understand. What I mean is that if your word-size is not your addressing-size, you'd better not have a concept of mis-aligned accesses. It's trouble you brought on all by yourself.


The Cray did this, IIRC, and the result was that char pointers were extra fat because they needed to include the word address and the byte address within the word. That's not an efficiency improvement.


Alignment requirements are and have historically been very common -- you can see them on the PDP-11, the 680x0, and so on. It's only because a few very popular architectures like x86 have had very loose or no alignment requirements that we've ended up with a lot of code that assumes there is no alignment requirement, and this has dragged other architectures down the "we need to support this" path. If your architecture faults on misaligned accesses it's really not hard to deal with -- you have to be doing something a bit odd to even run into the problem usually.


I can understand the historical requirements for alignment, the necessary transistors, what not. But, much like branch-delay slots, there is no modern reason to expose this to the programmer. Of course, I gave an exception to atomics, but if you will, they're like memory-mapped communication, and now that all I/O is memory-mapped, with no concept of ports, the (ordering) semantics of memory access becomes really important.

I'm also the weirdo that feels process isolation, memory management, and I/O mechanisms need a rethink. But that's something that would take me forever to get into.

One thing I will say, though, is that alignment issues "infect" everything. Assume your architecture doesn't allow misaligned access. Now all your data has to be naturally aligned. Your structs now have to be aligned to the alignment of the largest sub-structure within them. This is all because code is alignment-sensitive. Given a pointer to a struct, generic code is unnecessarily larger. And why would we care? Communication, of course. If we're exchanging data between systems then idiosyncrasies such as this suddenly become globally visible.
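
For the communication case, C compilers do offer packed layouts; the catch is exactly the "infection" above: once a field may be misaligned, every access to it has to be alignment-safe. A sketch using the GCC/Clang packed extension (the struct name and fields are made up for illustration):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* A padding-free wire format. The compiler now knows 'addr' may be
       misaligned (it sits at offset 1) and must emit safe access code. */
    struct __attribute__((packed)) wire_record {
        uint8_t  version;
        uint32_t addr;
        uint16_t port;
    };

    /* Reading a field out of a raw buffer without assuming alignment,
       e.g. uint32_t a = read_addr(packet_bytes); */
    static uint32_t read_addr(const uint8_t *buf) {
        uint32_t v;
        memcpy(&v, buf + offsetof(struct wire_record, addr), sizeof v);
        return v;
    }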

Endianness must be little. Byte-alignment is a non-issue, and network bit order should be from bit zero up, with any upper-layer need, say for cut-through forwarding, expressed as a data ordering requirement; so, for example, an IPv4 address is not a blind 32-bit word, but specifies the structure of those 32 bits.


Even today, allowing unaligned accesses is still not free -- there is an implementation cost in transistors and in design complexity. There's a tradeoff here, as usual. There are a lot of places within a CPU architecture where there's a choice of "do we handle this in hardware, at the cost of having to have more hardware, or do we say it's software's job to deal with this, and hardware provides either nothing or just some helpful tools". You can see this for instance in whether software has to perform icache/dcache maintenance vs. the CPU doing a lot of snooping to present the illusion of a completely coherent system; in whether hypervisor virtual machine switching is done with a single "switch all my state" operation or by letting hypervisor software switch register state itself; and in many other places. x86 has in my view generally ended up on the "handle things in hardware and make software's life easier" side, which it's been able to do because its natural territory is desktop/server, where extra transistors don't hurt much. Other architectures tend towards different points on this spectrum because their constraints differ -- in embedded systems the extra power and area cost of more transistors can really matter. "Tend to prefer that software do something" is also a strand of the original RISC philosophies.

Practically speaking, the world is not going to converge on a single endianness or on a no-alignment-restrictions setup any time soon, so we have to deal with the world as it is. If you're programming in a sensible high-level language, it will deal with this kind of low-level nit for you. If you're programming in a low-level language (like C), well, I think you wouldn't be doing that if you didn't have fun at some level in feeling like you had a mastery of the low-level nits :-)


You sound like an architecture person (I'm not, btw), so maybe you can give the lowdown on this.

Why registers? I haven't studied the Tomasulo algorithm in any detail, but if you're going to do "register renaming", why have registers at all? You could, for example, treat memory as an if-needed-only backing store, and then add a "commit" instruction that commits memory (takes an address, or a range). Sure, you need to make changes to how you do mm i/o and protection, but at a basic level: why registers?

I'm glad FPGA's are becoming a thing, and I think we're about a decade or two away from ASICs as a service, because if you're not beholden to tradition, you really can work some magic. Of course I'll be pretty rusty by then, but who knows, maybe medicine will keep me feisty.


Because that would require longer instructions and thus more memory.

Instructions on a CPU look something like the following (this is based on MIPS, since x86 is a mess): the first 6 bits are the opcode, and the rest is instruction-specific. For add, the next 15 bits are 5 bits for each of the two source registers and the destination register, and the remaining bits select variants (whether to trap on overflow, for example).
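
A sketch of that MIPS R-type layout, encoding add rd, rs, rt (opcode 0, function code 0x20; 0x21 is addu, the no-trap variant):

    #include <stdint.h>
    #include <stdio.h>

    /* MIPS R-type: 6-bit opcode | 5-bit rs | 5-bit rt | 5-bit rd |
       5-bit shift amount | 6-bit function code. */
    static uint32_t mips_add(unsigned rd, unsigned rs, unsigned rt) {
        return (0u << 26)    /* opcode 0: R-type           */
             | (rs << 21)
             | (rt << 16)
             | (rd << 11)
             | (0u << 6)     /* shift amount, unused here  */
             | 0x20u;        /* funct 0x20: add            */
    }

    int main(void) {
        /* add $8, $9, $10  ->  0x012A4020 */
        printf("0x%08X\n", mips_add(8, 9, 10));
        return 0;
    }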

If instructions instead worked only on memory, each one would be followed by all three memory addresses. (They would also have room for a lot more possible opcodes -- but there isn't enough space on a CPU to implement that many instructions anyway, so who cares.) This means that every CPU instruction needs to read three times as much memory before doing anything. Worse, most of those operands are pointers: when you compile the code you don't know those addresses, so in most cases you read the instruction from the program, then go back to the stack to read the addresses of the next values, then read those locations. That is a lot of memory access, and memory access is expensive. You can say "just use caching", but cache is expensive and now you need three times as much -- too big for the fast level-one cache, so now you are expanding the level-two cache and seeing a lot more misses in the level-one cache.

The above would all be okay, but it turns out that, given enough registers (x86 fails here), in most cases you are operating on the same set of values all the time (indeed, the stack locations each of the above refers to are probably a small set of variables), so if the compiler is careful it can manage all that. The compiler has better information on when things need to be committed to memory anyway, so let it handle that.


Not really. The TMS9900 [1] used memory as registers and had a fairly compact instruction set. Yes, it does have registers, but only three (a program counter, a status register, and a pointer to the current "register set" in memory). At the time, it was regarded as a slow machine, probably because of all the memory-to-memory operations.

[1] https://en.wikipedia.org/wiki/Texas_Instruments_TMS9900


Sure, this is one technique. You could also, akin to jump instructions, have a concept of data locality, versus instruction locality. You can do this in a lot of ways without resorting to something like segmentation, which everybody hates. Trivial would be something like a current "data pointer", which would see useful implicit updates, as well as explicit ones (akin to a long jump).


Not all CPUs do register renaming, and almost all architectures will have started out being defined for a CPU which didn't do renaming. Even today, lower end CPUs (think the embedded market) don't do register renaming. If you want your architecture to be able to cover down to the low end then anything that drops the idea of a register file is a non-starter. Also, a register-based architecture is well understood, in terms of how to implement it effectively, how to exploit it in compiler design, and how to hand-code assembly for it when necessary. You need a really strong argument to justify taking the weird and innovative route, usually.


Accessing memory locations ending in 0x7? Gather round the campfire folks, James Mickens has a story to tell: https://www.usenix.org/system/files/1311_05-08_mickens.pdf


> Redis is adding a “Stream” data type that is specifically suited for streams of data and time series storage, at this point the specification is near complete and work to implement it will start in the next weeks.

This sounds like it could be really exciting. Is there anywhere I can find out more?

Specifically, I've been struggling to find an appropriate backend for HTTP Server-Sent Events, could this feature help with that?


Hello, please check my two Redis Conf 2017 talks on youtube. There is info about Streams.


Thanks antirez! This looks exactly like the feature I've been searching for. :)

For posterity, here's the referenced videos:

General overview: https://youtu.be/U7J33pd3hLU?t=23m54s

Implementation details: https://youtu.be/Wzy8dIjsY6Y


Did my enhancement make it into the skip list implementation being used for the STREAM type? I am hoping it would be in place before you publish benchmarks for it.

https://github.com/antirez/redis/pull/3889


Hello, very interesting! I missed this, just commented on the issue. The Streams are not based on skiplists, but instead will be implemented using http://github.com/antirez/rax


Thanks for having a look! I only read the early proposals for the data structure behind Streams and haven't had a chance to go over the final implementation. I hope to dive into the source this week and make more contributions down the road!


Here's a discussion on reddit. There's a link to the proposal on github, too.

https://www.reddit.com/r/redis/comments/4mmrgr/stream_data_s...


I'm pretty sure I saw implementations that used the existing publish subscribe mechanism in Redis to handle it and seemed happy with it. I have no personal experience with it though.


Recently I've been doing a lot of low-level work with ARMv7-M microcontrollers (specifically, NXP's Kinetis Cortex-M4 chips) and was quite pleased to find out that they are pretty lenient about unaligned accesses. To quote from the ARM Cortex-M4 Processor Technical Reference Manual:

"Unaligned word or halfword loads or stores add penalty cycles. A byte aligned halfword load or store adds one extra cycle to perform the operation as two bytes. A halfword aligned word load or store adds one extra cycle to perform the operation as two halfwords. A byte-aligned word load or store adds two extra cycles to perform the operation as a byte, a halfword, and a byte. These numbers increase if the memory stalls."

However, multi-word memory instructions (LDRD, STRD, LDM, STM, etc.) always require their arguments to be word-aligned.
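
Those penalty counts can be checked with the ARMv7-M DWT cycle counter; a rough bare-metal sketch, assuming a Cortex-M4 where the cycle counter is implemented and accessible (the addresses are the architecturally fixed ones, and single-load timings like this are approximate, since reading CYCCNT has overhead of its own):

    #include <stdint.h>

    /* ARMv7-M debug registers, at their architecturally fixed addresses. */
    #define DEMCR      (*(volatile uint32_t *)0xE000EDFCu)
    #define DWT_CTRL   (*(volatile uint32_t *)0xE0001000u)
    #define DWT_CYCCNT (*(volatile uint32_t *)0xE0001004u)

    static uint8_t buf[64];

    /* Cycles for one word load at the given offset; offsets 1 and 2
       exercise the byte- and halfword-aligned cases quoted above. */
    uint32_t time_load(unsigned offset) {
        DEMCR |= (1u << 24);        /* TRCENA: enable the DWT block  */
        DWT_CTRL |= 1u;             /* CYCCNTENA: start the counter  */
        uint32_t start = DWT_CYCCNT;
        volatile uint32_t v = *(volatile uint32_t *)(buf + offset);
        uint32_t cycles = DWT_CYCCNT - start;
        (void)v;
        return cycles;
    }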


Great article; this project just begs for the name Redisberry Pi.


In a future project I might be interested in using Redis for queuing jobs, so it's very handy to know early about the main issues I could run into when developing.


Could Rust's typesystem catch unaligned pointer dereferences?


Sort of, Rust is supposed to make references to packed structure members unsafe, but currently doesn't. An RFC was accepted to change the behavior but it has not been fully implemented. Here's the tracking issue: https://github.com/rust-lang/rust/issues/27060


Considering that dereferencing a pointer after doing some arithmetic on it can only be done within unsafe blocks, I would say you are at least warned about it. But it will happily compile.


Wondering what kind of performance overhead is caused by letting the kernel handle unaligned access vs. fixing the software to actually always use aligned access?


Nice article!


OT: Is blattimwind shadow banned?


? I see his post


Probably someone vouched for it.


No, but posting while green will usually get your comment downvoted to oblivion, even if you are erudite and contribute to the conversation.

Turn on "show dead comments" and see how many greens are deleted. I've screenshotted many examples.


Is this cause (i.e. people downvote greens out of prejudice) or effect (greens often being created to shitpost)?

And to concentrate all my meta in one place... Is shadow banning a thing at HN? I thought they just, well, banned you.


OT - new user here, what does green mean? I assumed it was staff/moderators.


It means new user (hence green color).


I just assumed green meant admin or some kind of paying customer.

I'd venture to guess green users could use this ambiguity to play subtle rhetorical tricks on users with a moderate number of points here.


Your username will appear green till you get a certain amount of karma/upvotes.


I thought it was based on time.


Hmm, not sure. My username appeared green till I got 50/100 upvotes. I think that makes sense, since the number of upvotes on HN usually indicates that the person makes valid comments and is not trolling.


They added shadowbanning some time ago and later added the ability for users (the rest of us) to vouch for dead posts to revive them. There were reasons at the time they did this that the admins could probably explain better.


I doubt it's downvotes. Probably cases of the spam/ringvoting detector gone wrong.



