Timeline – a high-resolution event log

mbrock · on April 7, 2016

I've been running into Luke Gorrie's cool projects in different languages since my early university days in 2006. Somehow he always does inspiring stuff, and gives off a really friendly vibe. Cheers!

amelius · on April 7, 2016

> The timeline is a high-resolution event log that stores entries in a shared-memory ring buffer

What problem is this solving exactly? Why can't one write multiple log files (i.e., without sharing anything). Also, shouldn't log entries be stored on disk in most cases?

jsnell · on April 7, 2016

This is for latency sensitive applications, where you're counting cycles and can't possibly afford to write nicely formatted log messages to a file. (Or can't really even afford system calls in the main loop).

So there's a main application process, which does its logging by adding messages to the shared memory ring buffer, with a message format that's designed to be compact and cheap to write. And then there's going to be a reader which reads the log messages, and actually does something with them (filter, format, write to log files, whatever).

amelius · on April 7, 2016

Can't the logging be handled by a different thread, which subsequently writes the log entries to a file? This is how most loggers work I suppose (keep the main thread free from logging-work as much as possible).

And because you use threads instead of processes, you get shared memory for free.

vardump · on April 7, 2016

> Can't the logging be handled by a different thread, which subsequently writes the log entries to a file?...

So what happens if the process crashes? All of the most recent log messages are lost. Exactly those messages that are the most useful to figure out what went wrong. Well, I guess you can dig them out of a core (crash) dump... if you're lucky enough to have one.

And what about I/O load. If you have multiple log files open, you're going to need to sync them separately too. Sync goes all the way to the physical device.

Then you have the problem of putting the log messages from multiple files in temporal order. Please don't tell me you tell me you just use TSC timestamps to order them. Different CPUs can have TSCs running at different rates and starting from different offsets. So the order can be lost unless all processes are locked to same single CPU core.

If you then plan to use a different timer than TSC, like ACPI or HPET, just forget about it. HPET timer takes a whole microsecond to read and ACPI timers are even slower. TSC can be read in a few nanoseconds.

> And because you use threads instead of processes, you get shared memory for free.

Actually the cost is exactly same. All memory is really "shared" memory, MMU just maps a page to some point in physical memory.

Shared memory just means two processes have virtual page(s) mapped to same physical memory page(s).

From the hardware point of view, both are exactly the same thing and thus exactly as fast.

jasonwatkinspdx · on April 7, 2016

This is not a replacement for an application logger or syslog or the like. This is for sampling traces of extremely high throughput and low latency networking systems built atop snabb.

dang · on April 7, 2016

You may be right, but please don't make the criticism personal ("you're just not getting it", "while failing to understand", etc.) It makes the thread worse and typically provokes worse yet.

jasonwatkinspdx · on April 7, 2016

I edited my comment to reflect your criticism, but I disagree with it. It's pretty clear the person I was replying to just didn't even read the original posting.

dang · on April 7, 2016

Assuming that's true, you still shouldn't be personally abrasive in HN comments. Being right doesn't make it ok; it often makes it worse.

frugalmail · on April 7, 2016

This is trying to replicate Java/Log4J2/Disruptor in Lua?

jsmthrowaway · on April 7, 2016

No. You could build dtrace with software like this, whereas the entire stack you mentioned only cares indirectly about per-message cycle overhead. This is for when your tight loop, like your routing engine or game loop, wants to kick out numbers at timestamps as fast as it can and other software deals with making them presentable. It's a great technique.

This is built for networking software/hardware, where every cycle matters, and is actually extremely useful to me in a couple domains. I'm impressed disabled messages only sink five cycles so I assume there was hand tuning done here.

Edit: And I'm right, it's a hand tuned logical compare (cool): https://github.com/lukego/snabb/blob/timeline/src/core/timel...

If you want reasons to learn assembly, there you go. Too few folks wielding it in typical Web operations, where it would be extraordinarily useful in any number of hot paths. Just requires care.

vardump · on April 7, 2016

  local memory      = shm.map(shmpath, ffi.typeof("char["..size.."]"))

mmapp'ed, so buffer memory is always 4096 aligned.

  -- Log entry format (ring item)

  ffi.cdef[[
    struct timeline_entry {
      uint64_t tsc;                    // CPU timestamp
      uint64_t ip;                     // Instruction pointer of logger
      uint32_t aux;                    // TSC_AUX: core (bits 0-7) + numa (12-15)
      uint32_t reserved;               // (pad to 64 bytes)
      uint64_t msg_id;                 // category:24, message:24, priority:16
      uint64_t msg0, msg1, msg2, msg3; // message arguments
    };
  ]]

And 64 bytes long. Just like a cache line, probably not a coincidence? No unnecessary slow memory traffic or cache pollution. Nice.

Such a shame TSC_AUX is always 0 on Windows and most Linux kernels as well. Oh well. It's not that much code or work to write CPU core index and NUMA group to an MSR when bringing up a CPU core.

jsmthrowaway · on April 7, 2016

Yeah, no kidding. The author did their homework and this is really good code to study. I'm with you, it's almost certainly not a coincidence.

It's aging a bit, but Ulrich Drepper's seminal paper on memory comes to mind: https://www.akkadia.org/drepper/cpumemory.pdf

Say what you want about Drepper's style and personality, but that paper told me (a) he's an incredibly knowledgable dude and (b) I'll always have more to learn, especially about cache.

vardump · on April 7, 2016

Looks like he's not writing to "reserved" field. So CPU will definitely need to read the cache line at current log entry before write (RFO, read for ownership).

I'm not sure whether current CPUs are smart enough, but in theory writing 64 bytes at once to cache aligned address could avoid RFO. If it's L1/L2/L3 miss, that could be 100-200 cycles saving.

Might have also been better to do RDTSCP first. Otherwise it'll also avoid reordering instructions. And sabotage attempt to avoid RFO.

Anyways, not sure, didn't profile. (And by profiling I mean using those countless CPU performance counters to figure out what's going on in the mysterious black box.)

frugalmail · on April 7, 2016

Not sure why a legitimate question gets downvoted here.

In reality with the intelligent response by jsmthrowaway, and me spending way more time than I wanted to looking at their repo, it clearly is using the same strategy that I mentioned above. The difference is that it's at a lower level for a different platform (not lua).