I saw the Linux kernel was recommended many times here, but how many people actually read it? Where do you even start? The Linux kernel has around 60,000 files and 25 million lines of code...
Most of the kernel code is in the drivers; the general-purpose subsystems (VFS, the I/O scheduler, the task schedulers, memory management, etc.) are a small fraction of those 25 million LOC and largely independent of each other, so it is not that hard to build some understanding of them.
(understanding how the I/O and networking system calls work is quite helpful for application developers, even if you work in node.js, python or another high level language)
> (understanding how the I/O and networking system calls work is quite helpful for application developers, even if you work in node.js, python or another high level language)
Being able to semi-quickly figure out how the kernel actually performs some I/O operations and what the exact semantics are is tremendously useful when working with low-level I/O applications (e.g. database-like applications).
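For example, one quick way in is to grep the tree for SYSCALL_DEFINE. Roughly (this is a sketch, and details shift between kernel versions), the read(2) entry point in fs/read_write.c looks like this:

    /* read(2) as defined in fs/read_write.c; the real work happens in
     * ksys_read() and then in the VFS layer below it. */
    SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
    {
            return ksys_read(fd, buf, count);
    }

Following ksys_read() down into vfs_read() is a manageable afternoon project and shows how the VFS dispatches to individual filesystems.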
Nobody ever wrote a 25MLOC program from start to finish, so I don't think it makes much sense to read it that way.
I'd read it the way it was written: from the beginning. What's Linux 0.01 look like? What's the next changeset after that release look like? What was necessary to add the driver for your favorite device? What changes were made for your particular CPU?
Programs are not static works (except maybe TeX and Metafont). They exist in the form they do in order to be amenable to changes. So look at the changes that drove it.
Read the early versions; that's exactly what I am doing too!
I read Bitcoin 0.1.5 [1], which has only 15,000 LOC and is the first tagged version on GitHub. Compared with the current Bitcoin codebase at 320,000 LOC, it's much less daunting!
Not for Linux, but it's how I approach new programs I have to work on.
I can't decipher this 1000-line function, but it came from somewhere. What did it start out as? That's what the author originally intended it to be. What caused it to grow? That's what features someone else thought it needed.
In my view, "reading the kernel" is not really a useful exercise. Source code isn't a book, and most code isn't really a pleasure to read by itself.
Instead, I would suggest that you try to debug a problem, or try to understand what a particular syscall does, rather than starting from init/main.c. If you have a goal in mind (trying to solve a problem or understand a specific aspect of the kernel) you are far more likely to get useful information out of the exercise.
Another useful hint is that you should just ignore the majority of code that looks alien. Linux uses a lot of macros and synchronisation primitives that you probably aren't familiar with (RCU, for instance). It's much more useful to take note of the things you don't understand and just move on to the code that you do understand. Most of the macros and synchronisation primitives are used all over the kernel, so you're very likely to get to grips with them over time.
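To give a flavour, here's a minimal sketch of the RCU reader-side pattern you'll run into constantly (gp and the field being read are made-up names here):

    rcu_read_lock();                 /* marks a read-side critical section */
    p = rcu_dereference(gp);         /* gp: some RCU-protected pointer */
    if (p)
            use(p->field);           /* p stays valid until the unlock */
    rcu_read_unlock();

You don't need to know how RCU works internally to read code like this; it's enough to know that readers are cheap and that writers defer freeing until all readers are done.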
I really hope that bpftrace will make it easier for people to get into kernel debugging and thus have a nicer "in" to kernel development.
Where you start depends on what you're after. You should start by defining questions, then by searching for existing documentation, then reading the header files and data structures, and then perhaps some code. Also read the userspace counterparts of the interface you're studying, if applicable.
That worked fine for me when getting into the media subsystem to write camera and camera SoC interface drivers.
The Linux kernel has, conservatively, thousands upon thousands of defects. The only reason to read it is if you really need to know how it actually works: because the documentation for some syscall is wrong, or your systems aren't working right in practice, or you need to know how some undocumented hardware works. Otherwise I'd say it's best left unread.
Another list of reads worth mentioning is the Architecture of Open Source Applications book [1]. Each application is small (< 500 LOC) and contains design rationale in an approachable literate programming style.
Also, I found a blog post which links to a C version of the exercise in the spirit of Knuth's approach, and a Haskell version in the spirit of McIlroy's:
So, is wiki.c2.com something that people still write on, or is it really read-only nowadays? Is there a way to become a member? I just mistakenly signed up for hypothes.is from the right sidebar, thinking it was c2.com.
Ward Cunningham decided to close c2 to new material in part because he got tired of battling a determined vandal who was all bent out of shape over grammer and spailing, and second to try out his new "Federated Wiki" concept. So far, Federated Wiki is a flop in my opinion.
C2 was interesting in that software developers and IT professionals tried to categorize debates and disagreements over software engineering issues. Almost nothing was ever settled, but you got different perspectives. For example, is the "best" software about maximizing economics & psychology (human grokking), or symbolic parsimony? Is it better to hire 5 geniuses or 10 ordinary engineers?
> Ward Cunningham decided to close c2 to new material in part because he got tired of battling a determined vandal who was all bent out of shape over grammer and spailing
It's honestly pretty upsetting that one vandal is enough to ruin one of the oldest and most significant institutions on the web for everybody.
The reworked version doesn't show the original indentation of discussion trees, making it harder to read. All bullet points have been flattened to one level, compared to the original, which showed many levels.
Seems like it hasn't been updated in a while. Each page has a "last updated" timestamp. The homepage says: "Last edit December 19, 2014". There's also a GitHub project, but it hasn't been updated in 2 years: https://github.com/WardCunningham/remodeling
Hi, for some reason you're shadowbanned? For a new account that's really unusual. You might need to verify your email or something. I've vouched for this comment, which is why it appears.
It seems that he's stuck with some files that are mixed-encoding and can't repair them; the wiki will likely remain read-only until that problem is solved, unfortunately.
It's probably because your first comment contains a link, and automated account creation for spamming purposes is unfortunately a thing. The site relies on users with high enough karma clicking a button to vouch for legitimate users to unban accounts (or draw the mods' attention), like what happened to you here.
Hacker News uses shadow-banning, so the person banned for problematic behaviour can't tell and doesn't just switch to a new account.
I went through the list, and the comment about Apache Lucene really piqued my interest. I found the Lucene code on GitHub, but some of the sources have more boilerplate Java comments than code! Anyway, I remembered the Javadocs are often excellent, so here are the Lucene Javadocs:
One claim that caught my eye: TeX being unbuggy because it's written in the literate programming style. I like LP, but that's a big claim. Occam's razor suggests a more likely answer: it was written by Knuth! A great programmer, and also one who grew up programming non-interactively: http://ed-thelen.org/comp-hist/B5000-AlgolRWaychoff.html#7
On the one hand, I see your point. But on the other hand, if a master of some skill says "This is how I achieve the results I achieve." — perhaps imitating that might just be valuable. Does a dancer not get better by learning to imitate (at least at first) a master? (Though perhaps eventually, a deeper understanding of the artform might take hold, at which point experimentation and improvisation allow the dancer to explore undiscovered variations of the artform.)
And I don't think it's just the fact that Knuth does it that catches my attention, either. Ideally (if I understand the premise of literate programming), it would force the programmer to take the time to explain the code being written and, ideally, the rationale for why that code solves the problem at hand. And to do that, the programmer must first understand the code being written; upon attempting that, I think they'll often find they have not sufficiently thought the problem through.
I fear too many of the coders I know would not honestly attempt it; they are in too much of a hurry to get the code written.
Now, I don't think literate programming is the only way to accomplish that; simply writing good docstrings and comments might suffice to a good degree (and good ones: I too often review code where the docstring is a basic repetition of the function name and arguments, or where the comments explain what is being done, but not why). But far too often I see no comments, no docstrings, no tests, just miles and miles of code, and so it is no wonder that Knuth is what he is.
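To make the what-versus-why distinction concrete, here's a contrived C fragment (the protocol detail is invented purely for illustration):

    /* A "what" comment that just restates the code: */
    len += 2;   /* add 2 to len */

    /* A "why" comment that records reasoning the reader can't recover
     * from the code alone: */
    len += 2;   /* the wire format counts the 2-byte CRC trailer as part
                 * of the payload, so account for it up front */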
And though you didn't ask me, I read it (having been annoyed by your LP post for a few years now) and it strongly resonates with me — the mission, and everything until the start of the “Constraints on a solution” section. IMO the way to give more people more understanding of programs is not to write an entire new programming language / operating system and hope that enough people switch to it, but to work on delivering that “understanding” for existing programs in existing languages. It may be harder, but is more likely to be useful, and you also get feedback on what kind of understanding people want and lack.
Many thanks for the correction, and for the comment! You may well be right that delivering the understanding for existing programs is the way to go. Unfortunately, I just don't know enough for that yet. It's a hard chicken-and-egg problem: to understand the current stack I need precisely the concepts and tools that I'm trying to work out.
Over the last few years I've switched back and forth between the two extremes. I spent a couple of years off and on learning how operating systems (OpenBSD, Sortix, a little bit of Linux) work. I've gained some hard-won facility for poking around inside GNU package sources. So it's yet possible that I'll find a way to make progress on an existing stack.
It also depends on what the criterion for 'success' is when we consider this "strike out vs. work within" tradeoff. If the goal is some level of adoption, then it's a no-brainer that working with existing platforms is the way to go. But I may be content just to figure out the right answer for myself.
Working within an existing platform requires a time commitment to a single one (or a tiny subset) of the many projects that our platforms have balkanized into. After all that time commitment, effecting change from within requires a level of politics that is definitely not my strong suit. These projects have real users and justifiably shouldn't be giving me the time of day anytime soon.
One alternative I've considered is forking a mature platform. It remains on my radar, but at the moment I think the drop-off in benefits the instant I hit 'fork' is way too great. Consider a platform like OpenBSD, which frankly was created by way smarter people than me and is way more mature. The level of adoption it gets from being POSIX compliant is so minuscule; would making incompatible changes really be that counter-productive? It's worth asking if you're baiting big to catch small.
On some level my real target is to change the customs that influence how open source projects are governed. Even if I managed to overcome all the previous hurdles, it still seems impossible within the existing framework to do things like encourage more forks in projects, or convince more people to cull their dependencies, or read their sources. There's just too much baggage. Starting afresh may paradoxically make a new way easier to see.
This is the "idea maze" as I see it. I'd love to hear more about what you think of it.
Well it's a noble goal and I certainly don't want to discourage you, however you go about it! Perhaps you will learn something new and useful.
But just to make my meaning clear, by “existing programs in existing languages”, I meant doing it in a way that does not require effecting change at all. What understanding is it possible to deliver for an existing, complicated, messy codebase? For example, you correctly noted that early versions tend to be easier to understand, and code tends to accumulate complexity that makes the global structure less clear. This is true and something I use often: use "blame" to look at where a particular chunk of code was introduced, and look at the corresponding change, along with its description/commit message, which is often simpler. (And I see you've written a tool (http://akkartik.name/post/wart-layers) for those who choose to stay conscious of this and write code in a particular way.)

But most software today is available with version history. So, what if this were easier? E.g. imagine if, when you view code, there's a slider that you can move back and forth to see older or newer versions, while the changes fade out. Or imagine highlighting the “base” of the code versus the less important changes. Or something; experimentation will reveal what tends to be useful for existing codebases. (And if people find the tools useful, that may even effect change in how code is written, as authors get feedback on what the tool thinks versus their mental understanding, and tweak until there's a match. I've seen TypeScript being sold not for some putative benefits of typing on code correctness but simply for enabling IDE autocomplete, for instance.)
The broader point is that, to me, it seems that your writing and efforts have the implied assumption that understanding of global structure is hard to acquire because everyone is making mistakes, and that if everyone is just careful to do things differently, the difficulties will disappear and understandable programs will magically emerge. That is something worth investigating, but I think there's a good chance that perhaps not everyone is making mistakes (as even programs written by the best programmers tend to become hard to understand eventually), and/or that it's not feasible for everyone to be super careful when trying to get things done. (Rather, they're making tradeoffs, and are likely to make similar tradeoffs in future.) Not all the accumulated complexity may be accidental; some is inherent in the fact that the problem in the real world does have messy corner cases (as Spolsky said: https://www.joelonsoftware.com/2000/04/06/things-you-should-...). Similarly, most of the causes you identified (backwards compatibility considerations, churn in personnel, vestigial features) (and those identified elsewhere, e.g. in “out of the tar pit”) can be real and unavoidable: there may be features that are needed only for (say) users of old systems but still cannot simply be removed. The best that can be hoped for is to make this fact clearer, not to make them go away.
Finally, there's also the fact that “understanding” is not a property inherent in the system (code, program, whatever) itself, but something that grows in the head of the reader. (Perhaps trying to influence the writer is not the best way...) And different readers come with different questions and goals, and at least as far as the first paragraph of your http://akkartik.name/about goes, may need different sorts of help in different contexts. It's unlikely a fixed organization of the program is going to satisfy everybody.
An example: error handling. We've all seen functions that spend only a few lines doing their “main” job and many more lines checking for errors and dealing with them. This can obscure what the main job of the function is, and make it appear as though error-handling is the main part. (Aside: Knuth observes this causes a psychological barrier against writing too much error handling, while with his literate programming one shunts off the error-handling to a different section/module, and one tends to write better error-handling there.) But consider https://danluu.com/postmortem-lessons/ which says “If you care about building robust systems, the error checking code is the main code!” So depending what kind of understanding a reader is looking for at a certain time, sometimes they may want to understand the “happy” path and sometimes the error handling.
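As a contrived, kernel-flavoured C sketch of that shape (frobnicate(), device_do_work() and device_commit() are hypothetical; the kmalloc()/goto-unwind scaffolding is the familiar part):

    #include <linux/device.h>
    #include <linux/slab.h>

    static int frobnicate(struct device *dev)
    {
            void *buf;
            int err;

            buf = kmalloc(64, GFP_KERNEL);     /* error handling... */
            if (!buf)
                    return -ENOMEM;

            err = device_do_work(dev, buf);    /* ...the "main job"... */
            if (err)
                    goto out;

            err = device_commit(dev);          /* ...is just these two calls */
    out:
            kfree(buf);                        /* ...and more unwinding */
            return err;
    }

Whether those two calls or the checks around them count as "the main code" really does depend on what you opened the file to learn.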
Similarly in general: given a program, sometimes we want to understand roughly how it's organized / what its major components are, sometimes we want to understand the precise boundaries/interfaces between these components, sometimes we want to understand the sequence of operations the program performs and sometimes the frequency, sometimes we want to understand what it does in the “steady state” and sometimes what it does at startup or shutdown or some corner case. And in fact not always the global structure of a program but sometimes only enough to understand how it solves a specific problem — if I want to make a change to (say) Firefox in an afternoon, I may just want to know how it does (say) font fallback.
All that said, yours is an interesting project for the reasons you mentioned, and I look forward to what comes out of it. Apologies for a verbose comment; I'll stop here :-)
I really appreciate the detailed comment! I love talking about this stuff.
---
You're absolutely right that code reading is a non-linear activity and that won't ever change.[1] I'm not trying to make all code reading a linear activity; that would be truly quixotic. I want to keep the code organized for the convenience of the writer but still provide an initial orientation for newcomers, when they aren't concerned with error handling or precise interface boundaries.
---
Finding the right version to look at is part of the solution, as you noticed. But there's a second half: making sure the information newcomers need is actually in the repo. Somewhere, anywhere. My sense is that existing codebases don't actually contain all the information needed to truly comprehend them: the context the system runs in, and all the precise issues it guards against. Tests are a huge help here, but I'm constantly making changes to code that I tested in some other terminal or browser window with some complex workflow. Then I often save the code in one window and close the other window. That's a huge loss of information, and it compounds over and over again in current platforms. All because there are manual tests we can do that aren't easily encoded as automated tests.
It's certainly possible to port my ideas to an existing stack so that more tests can be represented. But how do we recover all the knowledge that has been lost so far?
---
I don't think the problem is that authors make mistakes[2]. They understand the domain far better than an outsider like me. No, the problem is that authors don't capture everything that is in their heads, and that means that I make mistakes when I try to build on their work.
One selfish reason I care about making the global structure of my codebase comprehensible is the very slight hope that others can take it in new directions and add expertise I won't ever gain by myself, in a way that I and others can learn from.
---
"...most of the causes you identified (backwards compatibility considerations, churn in personnel, vestigial features) can be real and unavoidable: there may be features that are needed only for (say) users of old systems but still cannot simply be removed. The best that can be hoped for is to make this fact clearer, not to make them go away."
If the codebase is more comprehensible and captures all compatibility and other concerns as tests, then it becomes easy to fork. Users of old systems could stay on one fork and others could be on a simpler fork that deletes the compatibility tests and the code that implements them. That way they aren't paying a complexity penalty for what they don't use. The tests would make it tractable to exchange code between the two forks, even if they diverge pretty far over time. Or so I hope.
---
Thank you again for your detailed feedback. Now I have a sense of a few more issues to avoid in my writing.
[1] That's partly my complaint with typography in LP: we end up polishing one happy path to death.
[2] If you notice places in my writing that strengthen this impression that I'm trying to reduce mistakes, I'd really appreciate hearing about them. I'm explicitly trying to avoid the failure mode that I think projects like TUNES fall into, of trying to come up with the one perfect architecture.
Reading the original TeX source is easier said than done... I think I managed to get the PDF some time ago, but it took me like 30 minutes of browsing. Not to mention actually running it...
It's available as a book (TeX: The Program, volume B of Computers and Typesetting), and Knuth considers the book as the canonical thing to read. You almost surely don't have a PDF of the entire book, but it's easily available in good libraries: http://www.worldcat.org/title/tex-the-program/oclc/826569131 (or you can buy it: as I was typing this comment I just finally bought a used copy from Amazon for $9 which is pretty good for a hardcover book).
If you have a typical TeX distribution installed, you can see an approximation by running `texdoc tex`, which brings up the PDF that is also available online e.g. here: http://texdoc.net/texmf-dist/doc/generic/knuth/tex/tex.pdf -- but it is a bit hard to make sense of if you're not familiar with the conventions of WEB (the book starts with a "How to read a WEB") -- as a workaround maybe you can read webman (http://texdoc.net/texmf-dist/doc/generic/knuth/web/webman.pd...) though that's meant for the programmer (working with the raw .web files) rather than the reader (working with the typeset output).
And it helps to be familiar with Pascal, and the typical conventions of programming style in the 1970s (e.g. that of the P4 (Pascal) Compiler and Interpreter, also published as a book and also available online: https://homepages.cwi.nl/~steven/pascal/ )
I've been reading the TeX program in bits and pieces for over a year now, and only now do I have a rough sense of it all :-)
The bad thing about a print book is that you would have to type the code to run it... How would you go about running TeX? I find the most effective way to read source code is to run it (to answer questions like: "what's the output if the input is this thing", etc.)
I wish someone who's familiar with the moving pieces would put together a version that runs on a browser.
TeX is written in a literate programming system called WEB, which Knuth developed especially for (re-)writing TeX and METAFONT, knowing that the source code was going to be published as print books (people had asked for it, so he kept this in mind when rewriting them). To overcome the limitations of Pascal (which was the most widely available language at the time at the places where TeX was being used: universities and the like), the WEB system adds quite a few features as a preprocessing step.
So, in WEB, the programmer types into a .web file, which runs through a program called TANGLE to produce a .pas Pascal file, and independently through another program called WEAVE to produce a .tex file (which is typeset into what you read). See Figure 1 here: http://www.literateprogramming.com/knuthweb.pdf#page=2
So to run TeX, the idea was that you have to pass the code through TANGLE, then run the resulting Pascal code through a Pascal compiler to get a binary, that you run. Of course Pascal compilers are less common today (and in any case Knuth was writing in the dialect of Pascal available to him at the time, and the language has diverged since) so in practice most TeX distributions pass the WEB (or tangled) code through something that converts (translates) the WEB/Pascal to C, then compiles that C code, etc. So running it is not exactly trivial. (See also: "LuaTeX says goodbye to Pascal" https://www.tug.org/TUGboat/tb30-3/tb96hoekwater-pascal.pdf and the Introduction to Martin Ruckert's "web2w" project: http://mirrors.ctan.org/web/web2w/web2w.pdf#page=13) This also makes running TeX under a debugger quite hard (https://tex.stackexchange.com/a/384881/48 , http://www.readytext.co.uk/?p=3570)
Nevertheless, putting together a version that runs on a browser is precisely one of my dreams :-) Maybe in a few years! (There is already a version that runs in a browser: http://manuels.github.io/texlive.js/ but that's not designed for readability / understandability.)
Note that, even if you had the entire program running in a browser, understanding it interactively would not exactly be trivial. The program itself (if viewed at the "raw" Pascal code level, i.e. without the conveniences afforded by the literate programming system) is rather far from modular; it's monolithic, with lots of global state, lots of data structures with the same fields used for different things in different contexts, abundant goto-s, etc. (All of these were the right choices at the time, given the constraints.) It's described by someone at http://wiki.c2.com/?TexTheProgram as “the last hurrah of procedural/functional structured (read: non-OO) programming. [...] achieved without some of the modern code-structuring tools we have today. I don't think the modern programmer raised in an OO culture could do the same”. But what literate programming gives here is a way to still understand the program despite these impediments -- one could say that, since the 1970s, many programming languages/communities have evolved some solutions for managing complexity, and Knuth's literate programming is an independently evolved different solution to the problem. Of course, one could combine the two...
Question: where does one start when planning to read the Linux kernel? There is so much code. I have read it, but randomly. I have read the contents of net/ in the kernel.
I have read the main function where it attempts to launch PID 1 (/bin/bash, etc.).
Is there a really good place to start reading? For example how does Linux talk to the hard drive?
What's the first thing that happens in the kernel?
Start by skimming through the included documentation files. You don't have to do a thorough read, but the Documentation folder is structured in a way that mirrors the organization of the kernel. It will help you understand the structure, and you can get a deeper understanding of the essential parts that way.
Once you have a mental map of the major areas of the kernel, it will be easier to relate to the source tree.
There used to be a number of great books on fairly esoteric parts of the kernel (like comprehensive explanations of the network stack). However, most of them seem to be stuck at 2.6 and are no longer very up to date.
Well, some familiarity with the hardware doesn't hurt. Years ago, when I studied the Linux kernel myself, I started with QEMU and its GDB stub, and walked line by line through the boot process.
But as others suggest, taking an older version, or even simpler Unices (xv6, NetBSD), might be easier.
Indeed! It took me a while to appreciate that (because it can be painful). Reading bad programs can be very educational, especially when you try to infer the decisions or processes that caused them to be bad in a particular way.
Many years ago I enjoyed Code Reading [0], a whole book that basically just discusses snippets from open source codebases.
It might be a bit old (I honestly don't remember the specifics) but most of the code was already "mature" at the time so I think it would still be valid.
If you like this kind of code-reading list, I run a newsletter at https://betterdev.link/ where we have a small section that includes one or two interesting GitHub repos per language per issue.
Agree on nginx - I wanted to make a crazy change and found it extremely easy to understand how I could have done it and why doing it would be stupid :)
On the topic of C, I'd also strongly recommend mandoc [1]. It solves a number of hard problems (indexing and searching of man pages, parsing markup and rendering it to HTML, PDF, and tty output), and the code remains fairly accessible. Definitely one of the codebases I regularly refer to for style and practices.
I think smaller projects are better for learning purposes. If you are interested in reading some smaller projects, check out my project here: https://github.com/CodeReaderMe/awesome-code-reading.