NanoLog – a nanosecond scale logging system for C++ (github.com/platformlab)
144 points by tzm on Sept 1, 2019 | 63 comments



I spent a lot of time trying to build a fast logging system in my last couple of jobs. The basic lesson (and I'm only talking about C/C++/C# here) is that you will spend most of your time formatting strings if you do your file I/O asynchronously.

Since this system has a preprocessor mode, I assume they learnt the same lesson.

The bigger lesson is that it really doesn't matter how many millions of logs you can generate per second if you don't have the infrastructure to store and analyze them easily. No one is going to enjoy digging through these things and the more you generate, the harder it is to extract meaningful information.

In other words, pay a lot more attention to what you'll do with the logs rather than how fast you can spew them out. I have rarely needed more than a few thousand log messages/sec even for a loaded MMO server. I spent way more time creating ways to look at these logs and making them accessible in near real time.


Your point that logging with care saves more cycles than logging efficiently is very much right.

That said, here's how I've handled efficient logging in realtime threads since the old days of 100MHz processors: postpone the formatting to a non-realtime thread. Assume your log_printf() will have no more than 10 int-sized args. Grab and save the 10 "ints". When doing the format in the non-realtime thread, pass in the 10 ints. Obviously this won't work with every CPU arch, but it does work fine for all the most common ones. Now every time you do a log_printf(), 1 pointer and 10 ints are copied to a buffer, which is very quick.

The catch is that your format string must not be changed (e.g. use a string literal). Any strings you want to print must also be constant.
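A minimal sketch of the idea, assuming at most 10 int-sized args and a string-literal format (names are made up):

    #include <cstdarg>
    #include <cstdio>

    // One fixed-size record per log call: the format pointer plus up to 10 ints.
    struct LogRecord {
        const char* fmt;   // must be a string literal so the pointer stays valid
        int args[10];
    };

    // Realtime thread: copy the pointer and the raw "ints" into a record.
    // Reading arguments the caller didn't actually pass is technically undefined,
    // but as noted above it works on the common calling conventions.
    void log_capture(LogRecord* out, const char* fmt, ...) {
        out->fmt = fmt;
        va_list ap;
        va_start(ap, fmt);
        for (int i = 0; i < 10; ++i)
            out->args[i] = va_arg(ap, int);
        va_end(ap);
    }

    // Non-realtime thread: do the expensive formatting later. Excess arguments
    // beyond what the format string consumes are simply ignored by snprintf.
    void log_render(const LogRecord& r, char* buf, int n) {
        snprintf(buf, n, r.fmt,
                 r.args[0], r.args[1], r.args[2], r.args[3], r.args[4],
                 r.args[5], r.args[6], r.args[7], r.args[8], r.args[9]);
    }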


> is that you will spend most of your time formatting strings if you do your file I/O asynchronously.

That's why you are supposed to do your formatting in the background thread :).

> doesn't matter how many millions of logs you can generate per second

Usually the issue is not generating millions of lines of logs (in which case dispatching to a background thread just adds overhead), but being able to log with minimal overhead on the surrounding code.


> That's why you are supposed to do your formatting in the background thread :).

Which then has the drawback that every argument that gets transferred to the background thread must first be heap allocated or static - instead of only the final buffer being that. My gut feeling is that the additional allocation and memory management costs typically outweigh the cost of formatting in the current thread. Even if one pools all logging buffers, fewer buffers are needed when only formatted buffers are transferred.


Normally you would copy the arguments to be formatted. This is usually fewer bytes than the formatted output.


Or you'd copy them.

Unless you do something dumb (like copy a vector when you only need to print the size), copying is significantly faster than string formatting.


I think what OP is hinting at here is cache invalidation/eviction caused by the additional thread's processing and memory operations. If your requirements are down to nanosecond granularities then cache misses are probably being measured and noticeable. A third party logging thread doing memory copies and other log processing sounds like a fine way to unintentionally evict a bunch of cache entries. You might be able to mitigate this to some extent with CPU affinities for threads I suppose. Another option would be to move logging to the network and record packets in/out with an out of band monitoring solution.


Logging the raw structured log entry and off-loading any string formatting to another system is a very interesting idea. Just push the whole "friendly message" stuff off to other logging infrastructure where the latencies matter less.


Well yes, using another thread is only a good idea if you have an extra core available.

L1 cache is per-core anyway, and I doubt L2 is going to hurt much.


At a few places I've worked, logging and util stuff had its own core to prevent L1/L2 cache pollution (either using threads or shared memory, just as long as you got it out of the hot path).
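For reference, pinning the logging thread on Linux can be as simple as this (core number and function name are just illustrative):

    #include <pthread.h>
    #include <sched.h>
    #include <thread>

    // Pin a background logging thread to one core so its memory traffic stays
    // off the cores running the hot path (Linux/glibc specific).
    void pin_to_core(std::thread& t, int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
    }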


Yes, but you need a location to copy them to. They can't be on the stack, since the logging task runs asynchronously. So each captured argument must be heap allocated in some form. E.g. passing two strings and one integer as arguments might require 3 additional heap allocations, where the malloc/free overhead might outweigh the string formatting costs. You can try to optimize here with specialized allocators (e.g. arenas) or by generating dedicated structures for each logging callsite so that all arguments can be carried inside a single heap-allocated struct (instead of 3 here). But that might lead to other disadvantages, e.g. code bloat.


That would be a terrible implementation strategy. You would simply copy each argument into the communication ring buffer. You can also use specialised strategies for some types, for example string contents can be transmitted inline instead of copying the strings themselves.
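A rough sketch of what "copy each argument into the ring buffer" can look like for a couple of common types (the framing and names are made up, not NanoLog's actual encoding):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <string_view>

    // An int argument is stored as its raw bytes.
    std::size_t encode(std::byte* dst, int v) {
        std::memcpy(dst, &v, sizeof v);
        return sizeof v;
    }

    // A string argument is stored inline: length prefix followed by the
    // characters, so the background thread never dereferences the caller's
    // pointer (which may be a stack buffer that is long gone by then).
    std::size_t encode(std::byte* dst, std::string_view s) {
        const uint32_t len = static_cast<uint32_t>(s.size());
        std::memcpy(dst, &len, sizeof len);
        std::memcpy(dst + sizeof len, s.data(), len);
        return sizeof len + len;
    }

    // Hot path: given space already reserved in the ring, copy each argument.
    template <typename... Args>
    std::size_t encode_all(std::byte* slot, Args... args) {
        std::size_t off = 0;
        ((off += encode(slot + off, args)), ...);   // fold over the arguments
        return off;
    }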


On my logger the thread queue was (is) node-based, so it was just a matter of making one bigger contiguous allocation and placing the log entry and the data copies there contiguously. One allocation.


Good point


Shared memory to a background process is better, in that it can survive most classes of process crash.


Yes it is significantly more robust, the downside is that you have to copy everything, including static vars and format strings. An option is to fork early on process startup and possibly share some memory.

We simply catch all signals and do a best effort attempt to flush the queue before aborting.


I've dealt with this problem too, mostly in trying to log/debug real-time media processing where a log trace equivalent would be long chunks of floating point data at multiple trace points in the data flow.

I came to the same conclusion. Raw bandwidth to your logging endpoint is meaningless when you start dealing with big chunks of information, which is common sense I guess. You need a fast way to digest it into whatever format works for the person reading the log later, and it's tough to make it abstract since how data is interpreted can be quite varied even in the same application.


> you will spend most of your time formatting strings if you do your file I/O asynchronously

Which is why good logging systems try to defer as much formatting logic as possible until the last possible moment (ideally render time).

I wish more logging frameworks would log in some efficient intermediary serialization format like Cap'nProto or Protobuf.


This is, of course, true. Many fail with the mindset that if only they can build big enough haystacks, the needles will coalesce like shiny seaglass upon the shore, because surely they will hire the algorithmic wizards, and oh yes.. AI(!!) and ML(!!).

Still, this really could be useful in scientific applications where the goal is to capture extremely fast events in detail... super-colliders, fast-fusion events, etc...


That's why I open comments first.. hypetaker +1


The code overall is pretty clever, but at the core of all this they completely ignore strict aliasing to dump stuff into a char* buffer indiscriminately... Look at this function:

https://github.com/PlatformLab/NanoLog/blob/master/runtime/N...

     T argument = *reinterpret_cast<T*>(*in);
     ...
     uint32_t stringBytes = *reinterpret_cast<uint32_t*>(*in);
A total of 6 reinterpret_casts in that file alone.

I didn't see any indication that you have to turn off strict aliasing to use this library (and would you really want to make all your remaining code slower just so your logging has better benchmark scores?). Which makes all this code UB as far as the C++ standard is concerned (no, their allocator does not magically placement-new the correct object types into the right places).

They could just inspect their T values as char* and/or memcpy them around instead of all this pointer aliasing (well, as long as their values are trivial types - I couldn't figure out whether you can log non-trivial stuff here?) at virtually no additional cost.
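For what it's worth, the memcpy alternative is tiny and typically compiles down to the same load (just a sketch, not their actual code):

    #include <cstring>
    #include <type_traits>

    // Well-defined replacement for `T argument = *reinterpret_cast<T*>(*in);`
    // for trivially copyable T: copy the bytes out instead of aliasing them.
    template <typename T>
    T read_unaligned(const char* src) {
        static_assert(std::is_trivially_copyable_v<T>);
        T value;
        std::memcpy(&value, src, sizeof value);
        return value;
    }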

Edit: And this right here convinces me that I definitely wouldn't want to use this library in production... Adding a volatile to fix a threading bug is just a big no-no.

https://github.com/PlatformLab/NanoLog/commit/e9691246ede6da...


> Adding a volatile to fix a threading bug is just a big no-no.

The fence instructions also don't make much sense. Not sure why one would not use std::atomic in a C++17-only library.
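Something like this would express the same single-writer flag without UB (a generic sketch, not the actual NanoLog member):

    #include <atomic>

    std::atomic<bool> shouldExit{false};   // shared flag; no volatile needed

    // Any thread: request shutdown.
    void request_exit() {
        shouldExit.store(true, std::memory_order_release);
    }

    // Background logging thread: poll until told to exit.
    void background_loop() {
        while (!shouldExit.load(std::memory_order_acquire)) {
            // ... drain staging buffers, write to disk ...
        }
    }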


You'd think so, but in fact reinterpret_cast from char* (and unsigned char, and std::byte) is explicitly allowed by the type aliasing rules.


To the best of my knowledge, casting to char* is totally fine (inspecting an object as bytes) under certain constraints.

What is not fine is pretending that an object lives at a position in memory where it does not (i.e. treating bytes of memory as some object through a reinterpret_cast away from char* ).

Edit: To clarify, casting away from char* is of course allowed if you cast to whatever object type actually lives at the pointed-to location.


> (i.e. treating bytes of memory as some object through a reinterpret_cast away from char*)

That's how people have used C and C++ for decades in low-level work, and it still works exactly as you'd expect, so stop saying "don't do it" because you're only encouraging the compiler-writer-UB-optimisation-nonsense crowd to make things even worse. It's already bad enough that they think the Holy Standard is the only thing that matters.

I believe Linus has several memorable rants on this topic already.


Doesn't matter anymore, Clang and GCC will both make any and all optimizations they can find by assuming you will never hit UB, including just erasing all UB code. You can dislike it, but that's just the reality now, and ignoring it won't make it go away. The only way to change it would be a change in the standard.


If I remember correctly, it is fine if there actually was an object of that type (or a "similar" type) in that location. So casting to char* and back is fine, casting to char and then to int-type to inspect multiple bytes is fine as long as alignment plays correctly, ...

It seems like the function only casts to T* if the input pointer hasn't been changed (the paths that do modify it return early), so there's a value there?


> it is fine if there actually was an object of that type (or a "similar" type) in that location.

That is correct. See my other nearby reply.

> So casting to char* and back is fine

Indeed.

> casting to char and then to int-type to inspect multiple bytes is fine as long as alignment plays correctly, ...

Only if the place in memory you are pointing to actually has a live int object. In particular, inspecting the byte representation of an object is fine (including copying the bytes elsewhere, which is why they could just use memcpy instead of reinterpret_cast).

See also http://eel.is/c++draft/basic.types#2 and surroundings.


> casting to char and then to int-type to inspect multiple bytes is fine as long as alignment plays correctly

It is only fine if an object of that exact int type actually existed at that address. Remember, the UB usually isn't the reinterpret_cast but the dereference.


I thought the int-types had a special exception making that always possible (if there's a bunch of bytes, you can turn them into an int), but I could very well be wrong on that detail.


Char is the only exception.

And now std::byte IIRC


I had thought that, as long as the object in that location is valid (e.g. constructed with placement new and not since destructed, with the `char` as a pointer to the storage), then accessing it through the `char` is valid -- but maybe that's only true for `void*` now that I think about it. Either way, fair enough, placement new probably hasn't happened (haven't read the rest of the code).


char* is a universal memory format; you can specialize from it into anything you want


That's not how the C++ object model works. Objects in C++ have clearly defined lifetimes. If you reinterpret_cast<int*>(someCharBuffer) and then dereference that, you have Undefined Behavior unless an int object is alive at that exact location.

You can do

    int value[2] = {0, 0};
    char* ptr = reinterpret_cast<char*>(value) + sizeof(int);
    (*reinterpret_cast<int*>(ptr))++;
or (given knowledge about compiler padding):

    struct XY { double x; int y; };
    XY xy = {};
    char* ptr = reinterpret_cast<char*>(&xy) + sizeof(double);
    (*reinterpret_cast<int*>(ptr))++;
or even (placement new):

    unsigned char* storage = new unsigned char[12];
    unsigned char* buf = storage + 4;
    new (buf) int(0); // Create an int object in the buffer.
    (*reinterpret_cast<int*>(buf))++;
    delete[] storage; // delete the original pointer, not the offset one
But in each case an int object has had its lifetime begin before you can treat the bytes in memory like an int. Everything else is UB.


You cannot do the last unless you know that the argument of placement new is suitably aligned to receive an int object, however.


What's your problem with the volatile? Using it to prevent register caching of a value or to prevent optimizing out repeated reads is perfectly fine. That is how it is supposed to be used.

The bug was that one thread was reading a value that gcc thought wouldn't be updated between reads. It is even used properly by grabbing a copy of the volatile value instead of reading it repeatedly.

edit: I also don't understand your aliasing issues. A `char*` can point at anything and you are allowed to cast back out of it to the correct type. It is casting to int because the string has its length prepended (the var names are "size" and "nibble").


> Using it to prevent register caching of or a value or prevent optimizing out repeated reads is perfectly fine. That is how it is supposed to be used.

The code has a race condition, plain and simple. volatile does not lead to thread synchronization, and that function is (as gcc correctly identified) nonsensical in a single-threaded world. The code expects that a different thread writes to the variable at the same time that this thread might read from it - that's the definition of a race condition, which is straight up UB (volatile or not).

You are trying to reason about compiler optimizations from the perspective of the machine and memory model you know ("prevent register caching of a value" etc.). That's wrong though - compiler optimizations operate on the abstract machine that C++ is defined on. In the abstract machine, this code has a race condition (the absence of which is guaranteed to compilers by the C++ standard) and so optimizations could easily violate whatever mental model you are using to conclude the code is fine. The code will likely break again some time in the future, just like it broke this time for a newer compiler version. Or maybe it compiles to subtly broken code right now, who knows!

> I also don't understand your aliasing issues. A `char*` can point at anything and you are allowed to cast back out of it to the correct type.

int is not the "correct type" because no int object lives at the pointed-to location after the cast. They could just memcpy and everything would be fine, but writing to/reading from type aliased pointers (other than char* and friends, in specific circumstances) is UB.

Here is a good description (in the context of C, not C++) of strict aliasing: https://stackoverflow.com/questions/98650/what-is-the-strict...


> The code expects that a different thread writes to the variable at the same time that this thread might read from it - that's the definition of a race condition,

No, that isn't. In one thread it reads a value over and over. Another thread will periodically set it. This isn't a race condition and requires no synchronization. One writer thread only, and you're fine. If the code was doing more, I didn't see it.

I just wrote something very similar yesterday where I had a counter that is read over and over and an async handler updated it. I had to make it volatile to prevent the spin on it from optimizing out the load.

Yes, in low-level programming you have to know about the hardware, what a register is, what a cache is, etc. C++ has a very loose (almost nonexistent) abstract model. This isn't UB.

> int is not the "correct type" because no int object lives at the pointed-to location after the cast

The string had its length prepended to it. There was an int there. Also, those aliasing rules are more about optimization issues, not so much about alignment issues (e.g., if you are only writing for Intel/AMD hardware, alignment only really matters for SIMD, and there is no penalty on unaligned access, with a few small exceptions that didn't apply here).


> No, that isn't. In one thread it reads a value over and over. Another thread will periodically set it. This isn't a race condition and requires no synchronization. One writer thread only, and you're fine.

Sorry, wrong term by me. I meant data race, not race condition (although many would regard the former as a form of the latter). A data race is two or more threads accessing the same location without synchronization, where at least one of them is modifying the value. A data race results in undefined behavior. Here are the relevant sections from the C++ (draft) standard:

http://eel.is/c++draft/intro.multithread#intro.races-21

http://eel.is/c++draft/intro.multithread#intro.races-2

> I just wrote something very similar yesterday where I had a counter that is read over and over and an async handler updated it.

If you wrote this code in C++, you created a data race and your program has undefined behavior.

> I had to make it volatile to prevent the spin on it from optimizing out the load.

volatile may prevent the load from being optimized out, but it does not prevent the data race (because it does not prevent instruction reordering). Your program still has undefined behavior.

Further reading:

https://www.kernel.org/doc/html/latest/process/volatile-cons...

> C++ has a very loose (almost nonexistent) abstract model

I don't know what you mean by "loose" or "almost nonexistent", but here are a few sections specifying the abstract machine that C++ is defined on:

http://eel.is/c++draft/basic.memobj

http://eel.is/c++draft/basic.exec

> There was an int there.

No. Just because you operate on the memory as if it was an int doesn't create an int there.

> Also, those aliasing rules are more about optimization issues

Correct! And those optimization issues are the primary way in which UB manifests itself in your program.


It seems like you're one of the UB brigade that demands rigid adherence to the standard because some 8-bit microprocessor might not be able to read a 32-bit value atomically. Nobody cares about that, and when writing high-performance code, you write for the hardware you target and care about.

Reading and writing to the same int in memory is atomic for every hardware platform I care about. Even alignment issues don't matter for anything this is going to be run on.

> No. Just because you operate on the memory as if it was an int doesn't create an int there.

Don't be dumb. They wrote an int to the beginning of the buffer and then had to cast back to it, like every piece of networking or file code that has to save a binary number.


You can leverage the CPU's atomic operations such as compare-and-set and compare-and-swap, as well as the properties of the memory subsystem, to achieve thread safety without an operating system mediating the interaction. Here's a blog post from some people that have street cred to give you some idea of the potential performance benefits: https://mechanical-sympathy.blogspot.com/2013/08/lock-based-...
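A minimal sketch of that style, where producers claim space in a shared buffer by CAS-ing a cursor forward (wrap-around and the reader side are omitted):

    #include <atomic>
    #include <cstdint>

    std::atomic<uint64_t> writeCursor{0};   // byte offset of the next free slot

    // Lock-free reservation: CAS the cursor forward by the entry size.
    // No mutex and no syscall on the hot path.
    uint64_t reserve(uint64_t nbytes) {
        uint64_t old = writeCursor.load(std::memory_order_relaxed);
        while (!writeCursor.compare_exchange_weak(
                old, old + nbytes,
                std::memory_order_acq_rel, std::memory_order_relaxed)) {
            // `old` is refreshed by the failed CAS; just retry.
        }
        return old;   // offset where this thread may now write its entry
    }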


Author of mini-async-log and mini-async-log-c here.

https://github.com/RafaGago/mini-async-log

https://github.com/RafaGago/mini-async-log-c

I wrote both some time ago and AFAIK they were the fastest. How does NanoLog compare against them?

At least before the source-file preprocessor, NanoLog was slower.

Here is a benchmark project I had that uses an old version of NanoLog; maybe you can PR an update:

https://github.com/RafaGago/logger-bench


As comments can't be edited: there are two NanoLog projects on GitHub. They may not be the same one.


I did something very similar to the “preprocessor” version of this back in 2001 for Microsoft application crash reporting (the problem then being upload size, not so much logging overhead). It’s very cool to see how you can do the whole thing with constexpr now. As I recall, people really didn’t like using the preprocessor!


The description says "the compilation builds a version of the NanoLog library that is non-portable, even between compilations of the same application". They also mention that a decompressor application also gets built and can be used to reconstitute a human-readable version of the log file. My question is, is the decompressor application also tightly coupled to a particular compilation? It sounds like the answer is yes, but I'd really like to be wrong about that.

That would really suck for cases where you want to store old log files for later use. You'd have to also store a copy of the decompressor program that corresponds to each log file. Or you'd have to convert to human-readable format before storing the log files, which loses a number of the benefits of the binary log format (compact size, faster to parse).


In the related paper[1], this paragraph (5.1.2) strongly suggests that the decompressor is non-portable as well:

> NanoLog performs best when there is little dynamic information in the log message. This is reflected by staticString, a static message, in the throughput benchmark. Here, NanoLog only needs to output about 3-4 bytes per log message due to its compaction and static extraction techniques.

1: https://www.usenix.org/conference/atc18/presentation/yang-st...


Did you skip over the section where they say to use the C++17 version that doesn't have this issue?

I looked in the paper to find the explanation for why the preprocessor version is non-portable; there wasn't anything clearly written, but it seems to be related to how it assigns unique IDs to log statements.


I built a similar system with a focus on profiling events with extremely low overhead. When taking a timestamp, the processor cycle count register is read and pushed into a ring buffer along with a tiny description including a static string label. A background thread periodically writes the buffer contents to disk. An offline tool visualizes the recorded events and intervals over time and can do a bit of statistics.
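Roughly like this, heavily simplified (single-threaded sketch, names made up):

    #include <cstddef>
    #include <cstdint>
    #include <x86intrin.h>   // __rdtsc (GCC/Clang; MSVC has it in <intrin.h>)

    // One fixed-size event: the cycle counter plus a static string label.
    struct Event {
        uint64_t    tsc;
        const char* label;   // must be a string literal so the pointer stays valid
    };

    constexpr std::size_t kRingSize = 1 << 16;   // power of two for cheap wrap-around
    Event       g_ring[kRingSize];
    std::size_t g_head = 0;   // real code would use per-thread rings or atomics

    inline void trace(const char* label) {
        Event& e = g_ring[g_head++ & (kRingSize - 1)];
        e.tsc   = __rdtsc();
        e.label = label;
    }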

This system has served me quite well, but I need to rework the UI because zooming to microsecond resolution on a 30 second trace overflows the range values in the Qt scrollbars.


We use a custom format and an offline process to transform it into Google Catapult's format for visualization. Catapult visualization of a 30-second trace should be pretty reasonable.


Recently built similar stuff. Even using the same __rdtsc() for time. I didn't require any preprocessing. Instead, I require that the strings come from the read-only section of a module, and I log the pointer values. Also, I log into a circular buffer in shared memory instead of a file, as I was only interested in knowing what the app did immediately before a rare crash which takes hours to reproduce.


I did exactly the same in my two loggers: as a C literal is always read-only and unique, passing a printf-like string is just passing a pointer.

Then it's just a matter of adding validation and passing the arguments in a compact way.


How do you deal with ASLR (logging from shared libraries or from position-independent executables)?


In that particular case I've cheated. I write an out-of-band file which maps pointer addresses to strings. At runtime it's a CAtlMap<const char*, int>, really fast because the keys are just addresses. I only have ~200 unique messages (ignoring format arguments) so the file is only written for the first few milliseconds. I did that because it was the simplest thing to do. I could use debug symbols + crash dumps, which can resolve these pointers regardless of ASLR, it was just way more complex to implement. BTW, the only reason I was doing that is that the main executable is gta5.exe: it's encrypted, it resists debugging, and I don't have debug symbols. If it were my own app, I would probably do something much simpler instead.


From a quick reading of the paper, it sounds like Nanolog is internally translating logging messages into a compacted format, which must be further processed to become human readable. This further processing is not included in the benchmark.

section 4.3, "Decompressor/Aggregator"

> The final component of the NanoLog system is the decompressor/aggregator, which takes as input the compacted log file generated by the runtime and either outputs a human-readable log file or runs aggregations over the compacted log messages. The decompressor reads the dictionary information from the log header, then it processes each of the log messages in turn. For each message, it uses the log id embedded in the file to find the corresponding dictionary entry. It then decompresses the log data as indicated in the dictionary entry and combines that data with static information from the dictionary to generate a human-readable log message. If the decompressor is being used for aggregation, it skips the message formatting step and passes the decompressed log data, along with the dictionary information, to an aggregation function.

section 5.1.2, "Throughput":

> [...] NanoLog performs best when there is little dynamic information in the log message. This is reflected by staticString, a static message, in the throughput benchmark. Here, NanoLog only needs to output about 3-4 bytes per log message due to its compaction and static extraction techniques. Other systems require over an order of magnitude more bytes to represent the messages (41-90 bytes).

> [...] Overall, NanoLog is faster than all other logging systems tested. This is primarily due to NanoLog consistently outputting fewer bytes per message and secondarily because NanoLog defers the formatting and sorting of log messages.

section 5.1.3, "Latency":

> All of the other systems except ETW require the logging thread to either fully or partially materialize the human-readable log message before transferring control to the background thread, resulting in higher invocation latencies. NanoLog on the other hand, performs no formatting and simply pushes all arguments to the staging buffer. This means less computation and fewer bytes copied, resulting in a lower invocation latency.

So NanoLog performs less work and therefore has much higher throughput. This separation of logging and formatting may well be a great idea, but it should probably be mentioned more prominently wherever the benchmark table is posted.


Looks interesting but how does one reproduce the comparative benchmarks from the paper? I see the code that benchmarks nanolog but how do I reproduce the baseline for glog?


Author of mini-async-log and its C variant here.

Some time ago I wrote a benchmark for different loggers, but it uses an old version of Nanolog.

https://github.com/RafaGago/logger-bench


Neat. Your code actually contains the issue I was curious about. Why do you log at level ERROR? As I understand it this causes glog to synchronize the output stream after every message.


But worst-case latency, the most interesting property of an async logger, is still unaffected I guess...


Lack of knowledge on glog


You can't claim it is faster if you don't provide reproducible benchmarks.


Is this a new incarnation of this project?

https://github.com/Iyengar111/NanoLog

Otherwise it seems that the name is already taken.


The two projects are different C++ logger implementations; the first commits of the two repositories are unrelated. If I wanted to benchmark both, name clashes would be likely.

This is a legitimate question. Why the downvotes?


Cool stuff. Judging by his code, this guy's mind is probably going to get blown when he discovers mmap() and BCD arithmetic.



