
Lock-free does not guarantee that all participants get fair treatment: e.g. one thread can keep losing the CAS (compare-and-swap / load-linked/store-conditional / etc.) and effectively spin, or have to yield and release the CPU.
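
A minimal sketch of that failure mode with C11 atomics (my own illustration, not from any particular codebase): the loop is lock-free because some thread always makes progress, but nothing stops one particular thread from losing the race indefinitely.

    #include <stdatomic.h>

    static atomic_long counter;

    /* Lock-free increment: system-wide progress is guaranteed (some thread's
       CAS always succeeds), but THIS thread may retry an unbounded number of
       times if it keeps losing -- lock-free, not wait-free. */
    static void lockfree_add(long delta)
    {
        long old = atomic_load_explicit(&counter, memory_order_relaxed);
        while (!atomic_compare_exchange_weak_explicit(
                   &counter, &old, old + delta,
                   memory_order_acq_rel, memory_order_relaxed)) {
            /* on failure 'old' has been reloaded with the current value;
               retry (spin), or sched_yield() here to release the CPU */
        }
    }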

A non-blocking write should just copy the data to the socket buffer, definitely not taking milliseconds.

Allocate buffers in separate threads - ok, that's the crux of it - the memory is shared amongst the threads within a process, so the allocator has to be non-blocking as well or the large allocation would prevent allocations in other threads. Next, releasing memory to the OS (or unmapping memory-mapped files): munmap requires a TLB flush on all cores the process runs on. There is a lot that can 'block' sometimes and void the "real-time" properties.

> interrupted by the GC doing OS interaction

Normally GCs trigger only at 'safe points' and unless they need to allocate more memory (should never be the case for a real-time application), GCs should have no OS interaction. Copying and moving memory within a process involves no OS functionality. Concurrent GCs with read barriers exist as well, so there is no need for stop-the-world pauses. (Edit: creative use of memory-mapping hardware to avoid copying may need OS calls.)

It seems quite common to see on Hacker News: "no-GC <language> for real-time". Writing a good, multi-threaded real-time application is a lot harder than just avoiding the GC's dreaded stop-the-world.




> A non-blocking write should just copy the data to the socket buffer, definitely not taking milliseconds.

That's the theory, right? It could be I measured something wrong, but sometimes dozens of ms is what I got, and I concluded that non-blocking I/O avoids indeterminate blocking (such as reading from a TCP socket until the sender has sent N bytes) - but does it completely avoid taking in-kernel locks etc.? Probably not.

I didn't find any other measurements on the internet, please point me to them if you find them.

Thinking back again, another explanation could be false sharing effects that I didn't know well at the time.

> Allocate buffers in separate threads - ok, that's the crux of it - the memory is shared amongst the threads within a process, so the allocator has to be non-blocking as well or the large allocation would prevent allocations in other threads.

I've written an SRSW lock-free ringbuffer, for example. The ringbuffer memory is allocated at startup, and the cost to allocate from it is very small. It's probably not super important, but the ringbuffer even caches the read or write pointer from the other thread, so it only needs a cache-line transfer when the cached value is not sufficient to accommodate the read or write.
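
Something along these lines, as a sketch with C11 atomics (my own code, not the actual implementation; the capacity and the acquire/release choices are assumptions) -- each side only touches the other side's index when its cached copy says it has to:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define CAP 4096                      /* power of two */

    typedef struct {
        _Alignas(64) atomic_size_t head;  /* written only by the writer */
        _Alignas(64) atomic_size_t tail;  /* written only by the reader */
        _Alignas(64) size_t cached_tail;  /* writer's cache of tail     */
        _Alignas(64) size_t cached_head;  /* reader's cache of head     */
        unsigned char data[CAP];
    } ring_t;

    /* Writer side; returns false when there is not enough free space. */
    static bool ring_write(ring_t *r, const unsigned char *src, size_t n)
    {
        size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        if (CAP - (head - r->cached_tail) < n) {
            /* Cached value is stale (or the buffer really is full):
               refresh it -- this is the only cache-line transfer. */
            r->cached_tail = atomic_load_explicit(&r->tail, memory_order_acquire);
            if (CAP - (head - r->cached_tail) < n)
                return false;
        }
        for (size_t i = 0; i < n; i++)
            r->data[(head + i) & (CAP - 1)] = src[i];
        atomic_store_explicit(&r->head, head + n, memory_order_release);
        return true;
    }

    /* The reader side is the mirror image: it keeps cached_head and
       publishes tail with a release store after copying the data out. */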

This is way outside my experience, but I think it should be possible to do some similar stuff with multiple readers and/or writers? I think you should still be able to allocate from a ringbuffer without live-locking (unless the buffer is full of course, in which case events are dropped anyway).

> Next, releasing memory to the OS (or unmapping memory-mapped files): munmap requires a TLB flush on all cores the process runs on.

I think most such programs do only "static" allocation at program startup. There's no OS interaction for memory management from then on.

> Normally GCs trigger only at 'safe points' and unless they need to allocate more memory (should never be the case for a real-time application), GCs should have no OS interaction.

OK, if the GC is configured to never return memory to the OS, it makes sense that the GC in an event-handling system probably won't need to get more memory from the OS, or only rarely, once it is "warm".

Still, GCs are generic systems that have to be less efficient than specialized ones. And what are 'safe points'? I'd rather decide that myself. I think a GC trace of <1 ms, as some new GCs allegedly achieve (is this widely acknowledged? I would assume it depends a lot on the allocation patterns / granularity etc.), could be enough, but I'd rather control this myself (for the somewhat real-time app that I've been working on, I need << 10 ms latency).


> It could be I measured something wrong, but sometimes dozens of ms is what I got, and I concluded that non-blocking I/O avoids indeterminate blocking (such as reading from a TCP socket until the sender has sent N bytes) - but does it completely avoid taking in-kernel locks etc.?

Perhaps what you were seeing were occasions where, during your system call, the kernel decided that your process's time slice had expired, or some I/O that a higher-priority process was waiting for completed, and it gave the CPU to some other process?

When timing things on preemptive multitasking systems it is often best to make a lot of measurements and then look for clusters. The cluster with the smallest times should be the time the operation itself takes.
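
Roughly this kind of harness, as a sketch (CLOCK_MONOTONIC via clock_gettime; measured_op() is a placeholder for whatever is being timed, e.g. a non-blocking send):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define SAMPLES 100000

    static void measured_op(void) { /* placeholder: e.g. non-blocking send() */ }

    static int cmp_long(const void *a, const void *b)
    {
        long x = *(const long *)a, y = *(const long *)b;
        return (x > y) - (x < y);
    }

    int main(void)
    {
        static long ns[SAMPLES];
        struct timespec t0, t1;
        for (int i = 0; i < SAMPLES; i++) {
            clock_gettime(CLOCK_MONOTONIC, &t0);
            measured_op();
            clock_gettime(CLOCK_MONOTONIC, &t1);
            ns[i] = (t1.tv_sec - t0.tv_sec) * 1000000000L
                  + (t1.tv_nsec - t0.tv_nsec);
        }
        qsort(ns, SAMPLES, sizeof ns[0], cmp_long);
        /* The low percentiles approximate the operation's own cost; the
           tail shows preemption, interrupts, page faults, etc. */
        printf("min %ldns  p50 %ldns  p99 %ldns  max %ldns\n",
               ns[0], ns[SAMPLES / 2], ns[(SAMPLES * 99) / 100], ns[SAMPLES - 1]);
        return 0;
    }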


Preemption can happen even outside of system calls of course, but I guess my takeaway has been to just avoid calling into the system where a lot of things can happen that I don't really understand or control. That, plus setting my threads to high-priority and possibly pinning them to the right cores.


>That, plus setting my threads to high-priority and possibly pinning them to the right cores.

What language was that? [Normally] Java does not respect thread priorities at all, for instance. Running with higher priority may require root as well.


Language was C, running as root on some Linux 3.x on some ARM/FPGA development board. By setting threads to high priority I mean something like pthread_setschedparam(... SCHED_FIFO ...).
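
For reference, a sketch of that plus core pinning on Linux/glibc (the priority and core number here are arbitrary, not the actual configuration):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Give the calling thread real-time FIFO scheduling and pin it to one
       core. The priority change needs root or CAP_SYS_NICE. */
    static int make_realtime(int priority, int core)
    {
        struct sched_param sp = { .sched_priority = priority };
        if (pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp) != 0)
            return -1;

        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    }

    /* e.g. make_realtime(80, 2); called from the latency-critical thread. */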


>I didn't find any other measurements on the internet, please point me to them if you find them.

It has been years since I wrote servers handling thousands of sockets (used in forex), yet even a couple of milliseconds per write would have made the entire operation useless.

>another explanation could be false sharing effects that I didn't know well at the time.

False sharing sucks, of course, but milliseconds again seems way too large a penalty. You'd need all cores updating the same cache line, and even then I doubt it'd be that bad.

Could it be the virtualization layer? (again I have run 'that' on bare metal only)

>I think it should be possible to do some similar stuff with multiple readers and/or writers

Multiple writers should be avoided entirely; contention on writing is what prevents scaling - false sharing is pretty much that. A writer honoring the readers is pretty simple: each reader has its own pointer, and the write should fail (or spin) when the buffer is full. I think that's quite a classic structure, and indeed it requires no extra memory to communicate/hand-off.
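
A sketch of that single-writer, multiple-reader variant (again my own illustration; event_t, the reader count and the capacity are placeholders): the writer checks the slowest reader's index and fails when the slot it would reuse has not been consumed by everyone yet.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define N_READERS 4
    #define CAP 1024                                 /* power of two */

    typedef struct { long payload; } event_t;        /* placeholder event */

    typedef struct {
        _Alignas(64) atomic_size_t head;             /* writer's position */
        _Alignas(64) atomic_size_t tails[N_READERS]; /* one per reader;
                                          ideally each on its own line */
        event_t slots[CAP];
    } bcast_ring_t;

    static bool bcast_write(bcast_ring_t *r, const event_t *ev)
    {
        size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        size_t min_tail = SIZE_MAX;
        for (int i = 0; i < N_READERS; i++) {    /* find the slowest reader */
            size_t t = atomic_load_explicit(&r->tails[i], memory_order_acquire);
            if (t < min_tail)
                min_tail = t;
        }
        if (head - min_tail == CAP)
            return false;                        /* full: fail (or spin)   */
        r->slots[head & (CAP - 1)] = *ev;
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return true;
    }

    /* Each reader i consumes slots[tails[i] & (CAP - 1)] while tails[i] < head,
       then advances its own tails[i] with a release store. */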

About latency: this is Java's current GC, which still has a compaction phase. https://malloc.se/blog/zgc-jdk16


Correction: one of HotSpot's GC implementations. There are plenty of Java implementations to choose from.

Including ones with soft real-time GC for performance-critical deployments, like PTC and Aicas.

https://www.ptc.com/en/products/developer-tools/perc

https://www.aicas.com/wp/products-services/jamaicavm

Anyone picking a traditional JVM for such workloads is doing it wrong.


What I meant is that low-latency GC is sort of a commodity/mainstream now. I'm not advising to use Java or a specific JIT + collector.


Ah, fair enough, got it wrong.



