Understanding /proc (fredrb.github.io)
234 points by fredrb on Oct 5, 2016 | 42 comments



This is a great post. Recently I have been trying to re-learn and understand Linux (specifically Ubuntu) using monitoring tools. In my opinion htop and Facebook's osquery are the two best available tools for understanding how an operating system and its processes work. The osquery approach of recording all OS data in the form of relational tables (with PIDs as keys etc.) is very useful.

https://hisham.hm/htop/

https://osquery.io/

The osquery query packs are especially useful: https://osquery.io/docs/packs/

Here is an incomplete draft of a similar post: https://github.com/AKSHAYUBHAT/TopDownGuideToLinux


Isn't htop just top with colours?

For a quick global view I like atop; then if I need to drill into a subsystem, iftop, vmstat, free, etc...


Not quite. You can access lsof and strace from inside htop. There is also a process tree view, and you can select the process you want to manipulate via arrow keys.


And you can click on things in htop with your mouse. That feature never gets old. I wish more command line tools supported that...


This reminds me that mc (Midnight Commander) supports mouse interaction.

And playing around with it, I find that it is more elaborate than I first anticipated. The menus can even be operated via the scroll wheel, and it is sensitive to where the mouse is hovering.

All in all, I find myself pondering whether a more up-to-date console web browser is possible. Perhaps it could use the framebuffer rather than X to display sites (or go all out and implement it using Sixel).


I feel stupid now, I've been using htop for years and never noticed that...


Same here! I picked up htop ~2011 and learned about mouse support in ~2014. :D

Also: if you're on a Mac and you use tmux, I just started trying out the tmux integration with iTerm2. There are pros and cons, but it's interesting to see first-class OS windows for tmux.

Edit: add missing iterm2 reference. https://gitlab.com/gnachman/iterm2/wikis/TmuxIntegration


Oh! Thanks. I never knew...


Very nice write-up, and a great way to dive deep into an interesting system! But if you plan on maintaining a project like this long term I would recommend using one of the many existing libraries like https://github.com/prometheus/procfs or http://pythonhosted.org/psutil/

There can be a lot of edge cases, and inevitably things will change in the future. Centralising the work of parsing /proc files goes a long way and helps keep things sane for maintenance.


It's worth noting that `man proc` has fairly thorough, though not complete, documentation of the various files. It's a great read to learn about some of the files available.


Best place to learn about proc IMHO... and possibly the Linux header files...


I wrote something to do similar parsing of process state recently. It seems nuts to me that you can't get this all in one call. The naïve way of `fopen()`ing the files you need has a race condition if the PID is reused between two calls to `fopen()`.

Admittedly, probably rare. But why route through calls to `fopen()` and `read()` when you could just provide a function that returns OS-defined structs?


You can use openat() with a file descriptor corresponding to the /proc/pid directory to avoid the race condition.
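A minimal sketch of that approach in Python, assuming a Linux system (`os.open()`'s `dir_fd` parameter is implemented with `openat()` under the hood):

```python
import os

pid = os.getpid()  # demo on our own process

# Pin the process down by opening its /proc directory once.  Every
# later open relative to this fd refers to the same process, even if
# the numeric PID is recycled in the meantime.
dirfd = os.open(f"/proc/{pid}", os.O_RDONLY | os.O_DIRECTORY)
try:
    # Equivalent to openat(dirfd, "status", O_RDONLY)
    fd = os.open("status", os.O_RDONLY, dir_fd=dirfd)
    try:
        data = os.read(fd, 65536).decode()
    finally:
        os.close(fd)
finally:
    os.close(dirfd)

name_line = next(l for l in data.splitlines() if l.startswith("Name:"))
print(name_line)
```

Any further reads (`stat`, `cmdline`, ...) done via the same `dirfd` are guaranteed to describe the same process, which closes the `fopen()` reuse race.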


Huh. Does this mean that I can open() all the numerically-named directories in /proc, in a loop, and eventually prevent the system from being able to create new processes?

Or will operations on a file descriptor corresponding to an exited process fail?


According to my reading of the code, an open process file just holds a reference to a relatively lightweight handle to that process (struct pid), which means 1) resource exhaustion by holding onto full process structures doesn't happen 2) numerical process ids can get recycled even if someone is holding a long-dead /proc/PID directory open for that pid, but the old opened directory keeps referring to the dead process.
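Point 2 is easy to observe from userspace. A hypothetical Python sketch (assuming Linux): hold the child's /proc directory open, reap the child, and the stale fd keeps referring to the dead process, so further opens fail:

```python
import os

pid = os.fork()
if pid == 0:
    os._exit(0)          # child exits immediately

# Open the child's /proc directory while it is still (at least) a zombie.
dirfd = os.open(f"/proc/{pid}", os.O_RDONLY | os.O_DIRECTORY)

os.waitpid(pid, 0)       # reap the child: the process is now fully gone

# dirfd still refers to the dead process, not to whoever may later
# reuse the numeric PID, so opening its files fails.
try:
    fd = os.open("status", os.O_RDONLY, dir_fd=dirfd)
    os.close(fd)
    result = "open succeeded"
except OSError:
    result = "open failed"
os.close(dirfd)
print(result)
```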


Thank you so much for working that out!


Yep, just saying the "obvious" approach is wrong, which is (IMHO) always a crappy way to design an API.


Based on what I learned a few months ago trying to deal with all these problems, the conclusion I've come to is that basically the entire "standard" set of POSIX calls is wrong. Correct handling requires an entire parallel set of calls like "openat". However, this parallel set of calls postdates the more conventional calls by a couple of decades and, in some cases we found, may still not be entirely complete. (I'd guess they probably are by now; we were a couple of versions back on the kernel.)

The problem is that those old calls have immense, immense inertia. They are how people think of files. They are how almost all, if not actually all, higher level languages interact with files by default, saving the "correct" calls for external modules, if you even get that. In fact, many higher-level languages are actively inimical to correct file handling by trying to abstract away the "file handle" so you only have to deal with file names, but for correct handling you really need to consider the file handle the real file and the file name merely a transient method for obtaining a file handle, which is to be never used again once you have the handle.

Bear in mind the "crappy" API in question is probably older than you are, so it's not really that surprising that it has needed some work as our world has changed.


Even if you fix the PID reuse problem, you'll still have difficulties parsing things like /proc/PID/maps. Unless your pseudo-file consists of fixed-size records - and as I recall, /proc/PID/maps doesn't, as each line has a variable-width path in it - your best option is just to read the entire file in one go and keep your fingers crossed that it ended up being atomic. (Obviously the system can't block operations that affect the maps file...)

signalfd gets this bit right.

It didn't take much POSIX programming before I started to look at Windows in a whole new light...


> signalfd gets this bit right.

I'd be careful saying that any API based around signals is "right". In general, signals are just horrible and I really wish that the UNIX history had played out differently.


I agree that parsing /proc/pid/maps is a complete nightmare, but I'm pretty sure that the way pseudo-files on linux work, it is guaranteed the contents won't change out from under you while you are reading it.
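A sketch of the read-it-all-at-once approach in Python (assuming Linux; whether a single read() of a pseudo-file is truly consistent is exactly the question here):

```python
import os

# One large read() syscall, rather than buffered line-by-line reads
# that could interleave with the target's own mmap()/munmap() calls.
fd = os.open("/proc/self/maps", os.O_RDONLY)
try:
    data = os.read(fd, 4 << 20).decode()  # 4 MiB covers most maps files
finally:
    os.close(fd)

entries = []
for line in data.splitlines():
    # Fields: address perms offset dev inode [pathname]
    # The pathname is optional and may itself contain spaces.
    fields = line.split(maxsplit=5)
    addr, perms, offset, dev, inode = fields[:5]
    path = fields[5] if len(fields) > 5 else ""
    entries.append((addr, perms, path))

print(len(entries), "mappings")
```

The `maxsplit=5` is the important part: splitting naively on whitespace would mangle mapped paths that contain spaces.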


Not all pseudo-files are like that. For example, /sys/fs/cgroup/.../cgroup.procs will only have consistent content if you read everything in a single page. Which is kinda dumb IMO.


Interesting... good to know.


Why would the OS recycle its process ids anyway? I mean, if you'd use 64 bits for the id, there's no way the counter could cycle in the lifetime of the hardware.


http://man7.org/linux/man-pages/man5/proc.5.html - "On 32-bit platforms, 32768 is the maximum value for pid_max. On 64-bit systems, pid_max can be set to any value up to 2^22 (PID_MAX_LIMIT, approximately 4 million)."
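The quoted limit is easy to check on a live system (a small Python sketch, assuming Linux):

```python
# /proc/sys/kernel/pid_max holds the value at which PIDs wrap around;
# it can never exceed PID_MAX_LIMIT, which is 2**22 on 64-bit kernels.
with open("/proc/sys/kernel/pid_max") as f:
    pid_max = int(f.read())

print(pid_max, pid_max <= 2**22)
```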

And this is Unix, so you can't do anything without running a process ;) - and I think PIDs and threads share a namespace on Linux, too, so the chance of wraparound is even higher. (I also don't think there's any guarantee that PIDs will be a simple incrementing counter in the first place! Though that's the most obvious thing to do, so you can probably indeed be pretty certain that's exactly what will happen...)

OS X has a 64-bit thread ID that promises to be unique across all threads for the uptime of the system. What a good idea! - no prizes for guessing whether this comes from the BSD part, or the Mach part...


I think it is a bigger problem that you can't get a snapshot of all processes atomically. It is easy enough to solve the case of atomically accessing a single process, though annoying.


It isn't clear to me what an "atomic" snapshot of processes would even be in a multicore world, though. Even if process creation and destruction is strictly linearizable in how the kernel handles it, which I do not assert to be true, there is certainly a lot of other stuff in that snapshot which is not. And without that, there isn't anything like an "atomic" snapshot that makes sense to me.

(Even in a single core world I suspect you'd hit a lot of resistance in any method you could use to take an atomic state, for performance reasons. Even just telling the kernel to "stop doing everything else and give me a copy of all this information" is not going to go over well.)


Sure it means something. Both Windows[1] and OS X[2] have easy ways to do it, and the task list in the Linux kernel is just a linked list that could be easily atomically copied. Obviously a lot of the metadata associated with the processes couldn't be easily accessed in a consistent way, but the number of running processes, their pid, basic info like the command line and process image is totally knowable to the kernel at any given instant.

The fact that you can't get that info easily on Linux makes a lot of tools harder to write than they should be, and often gives misleading or incorrect information.

1. https://msdn.microsoft.com/en-us/library/windows/desktop/ms6... 2. https://developer.apple.com/legacy/library/documentation/Dar...


At least the Windows one is the function I wasn't willing to assert exists. However, it does not appear to include the "lot of other stuff in that snapshot which is not [linearizable]". The resulting structs[1] are missing many things that ps may want to display on Linux, which you will have to fetch nonatomically.

The OSX one, I don't know; I scanned over the docs and it went over my threshold for what I'm willing to poke through for a HN post.

In other words, this is exactly what I left myself an out for, and I see nothing that contradicts what I said for the Windows case.

[1]: https://msdn.microsoft.com/en-us/library/windows/desktop/ms6...


I don't disagree that there is a lot of information you cannot include in such a snapshot, and neither the Windows nor the OS X equivalent (it's the KERN_PROC argument to sysctl, btw) gives you much detail beyond all PIDs and their command line/exe path. (You can get the PIDs with a directory listing of /proc on Linux, though that is not completely atomic either, even with getdents(2)!) Those latter two are completely unavailable to get reliably with any method I know of on Linux, and that results in inconsistent and/or incomplete info for certain use cases, especially when combined with the fact that PIDs, though randomized, can be reused on Linux.
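The /proc directory-listing enumeration mentioned above is just a scan for numeric names. A Python sketch (assuming Linux; note that processes can appear and disappear while the scan runs, so the result is inherently not atomic):

```python
import os

# Every running process shows up as a numeric directory entry in /proc;
# everything else (sys, meminfo, ...) is non-numeric and skipped.
pids = sorted(int(name) for name in os.listdir("/proc") if name.isdigit())

print(len(pids), "pids; includes self:", os.getpid() in pids)
```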


Minor nitpick: calling it a clone of the Unix 'ps' wouldn't be exactly right, since I understand /proc is Linux-specific. On the other hand, how did the Unix 'ps' or the 'ps' in other Unix clones work? Is there an alternative method to expose the process data to userspace instead of using a VFS like procfs?


/proc is not Linux-specific; my late, great colleague Roger Faulkner did a lot of work on it in Solaris:

https://www.usenix.org/memoriam-roger-faulkner

...and there's a general history here:

https://blogs.oracle.com/eschrock/entry/the_power_of_proc

The part that may be specific to Linux is that Linux provides a text-based interface, whereas systems like Solaris provide a binary interface.


That link is really interesting! Thanks for that. I should have probably said procfs is Linux-specific. Anyway, learned (rather unlearned) something new today.

Also interesting is that early Unix systems (before v8) used `ptrace()` for gathering process information - the same system call programs like strace/ltrace use today.


No, procfs is also not Linux-specific ;-) /proc is a filesystem, so most of us refer to it as 'procfs' for short. In fact, the header file you include in a C program on Solaris to use it is <procfs.h>.

As I said before, the only thing that's really Linux-specific is Linux chose to represent it as text instead of something machine-parsable.


Which other OS uses procfs?


Here is the history: https://en.wikipedia.org/wiki/Procfs I think the Linux one was inspired by Plan 9's implementation.


Thank you!


There are a couple of options. A good number of other UNIXes do in fact have a procfs. Some others, like macOS, have a system call that gives you the information you want (on macOS this is a sysctl, CTL_KERN / KERN_PROC / KERN_PROC_ALL, which returns a set of structs). And finally, especially on older UNIXes, the ps command is setuid root, and goes and opens /dev/kmem or similar and looks around in kernel memory and parses the live kernel's process table directly.


You may be thinking of sysfs.


I wrote a small ps clone as a side project, and "man proc" was invaluable in understanding what everything meant.
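For anyone attempting the same, the classic gotcha in /proc/PID/stat is field 2: the comm is parenthesized and may itself contain spaces and even ')'. A hypothetical sketch of the parse in Python (assuming Linux), following man proc's advice to split on the last ')':

```python
import os

def read_stat(pid):
    """Parse pid, comm and state from /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # comm may contain spaces and ')', so anchor on the LAST ')' -
    # everything after it is whitespace-separated and safe to split.
    before, _, after = data.rpartition(")")
    pid_str, _, comm = before.partition("(")
    state = after.split()[0]  # field 3: R, S, D, Z, T, ...
    return int(pid_str), comm, state

pid, comm, state = read_stat(os.getpid())
print(pid, comm, state)
```

A process named `my (evil) proc` would break any naive whitespace split, which is why the reverse-partition matters.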

There was interesting work happening on a proposed newer API though, "task_diag": https://lwn.net/Articles/685791/ https://criu.org/Task-diag


Thank you for writing this. This is a wonderfully insightful post :)


Is there something similar for /sys and /run?



