A million-line program isn't readable by anybody, no matter how readable the language is. If the equivalent program can be written in, say, a thousand lines in some more concise language, that's more than worth the learning curve, even if the language is strange and off-putting.
I doubt that K can reduce line count by a factor of 1000, though. The examples in the article are mostly about shorter identifiers, like "!" instead of "range", and a compact notation. That is perhaps a factor of 5, not a factor of 1000.
The example with a for-loop for summing a range is a blatant strawman. In which modern language would that be idiomatic?
Furthermore, the examples only show a particular use case: processing lists of numbers. How do the benefits measure up for all the other stuff a million-line program does?
That said, if you really have a million-line program doing mostly numerical processing, then I'm sure it would be a massive benefit to switch to a programming language optimized for this task.
OTOH, a more idiomatic version would be (and I would consider using the loop facility to be idiomatic):
(defun count (pair-list)
  (let ((hash (make-hash-table)))
    (loop for (val key) in pair-list
          do (incf (gethash key hash 0) val))
    (loop for key being the hash-key of hash
          collect (list (gethash key hash) key))))
This drops us from 10 to 7 lines, which is clearly shorter. But, it's also relatively simple to change it from summing a list of (<count> <key>) to a list of (<key> <count>), something that I genuinely can't comment on for the K version. I suspect it would be as simple as first doing a permutation on the input, and (if needed) un-permute the result.
It's a little funny comparing a language with another language plus its entire ecosystem, but k fares remarkably well, I think.
Counter could be #:'=: (count each group)
c:#:'=:
Values would just be dot.
Counter[x]&Counter[y] is a bit tricky to write, because while the documentation says set intersection, what they really mean is the set intersection of the keys with the lower of the values.
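To make that concrete, here is a quick Python illustration (just for reference, not from the article) of what that "intersection" actually does:

from collections import Counter

a = Counter("aabbbc")   # a: 2, b: 3, c: 1
b = Counter("bccd")     # b: 1, c: 2, d: 1
print(a & b)            # keeps only keys present in both, each with the lower of the two counts: b: 1, c: 1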
This is entirely clear in k. First, understand that I have a common "inter" function that I use frequently (it's actually called "inter" in q):
n:{x@&x in y}
and from there, I can implement this weird function with this:
i:{(?,/n[!x;!y])#x&y}
I've never needed this particular function, but I can read out the characters: distinct raze intersection of keys-of-x and keys-of-y, taking [from] x and y. It sounds very similar to the definition of the function I gave above (if you realise the relationship between "and" and "lower of the values"), and this was painless to write.
most_common could be: {k!x k:y#>x} but if I'm going to want the sort (in q) I might write y#desc x which is shorter. In this I can save a character by reversing the arguments:
m:{k!y k:x#>y}
so our finished program (once we've defined all our "helper functions" in a module) is:
m[3] i . c'(x;y)
So we're looking at 13 lexemes, versus 19 - a minor victory unless you count the cost of:
from collections import Counter
and start counting by bytes! But there's more value here:
All of these functions operate on data, so whether the data lives on the disk or not makes no difference to k.
That's really powerful.
In Python I need more code to get it off the disk if that's where it is.
I also may have to deal with the possibility that the data "lives" in another object (like a numpy array), so either I trust the (in)efficiency of the numpy-to-list conversion, or I have to do something else (in which case really understanding how Counter[a]&Counter[b] works will be important!).
I might also struggle to make this work on a big data set in python, but k will do just fine even if you give it millions of values to eat.
These things are valuable. Now if you're interested in playing games, let's try something hard:
"j"$`:x?`f
It creates (if necessary) a new enumeration in a file called "x" which persists the unique index of f, so:
> It's a little funny comparing a language with another language plus its entire ecosystem
This is pretty much my point: any ‘reimplement this weird built-in thing in $OTHER_LANG’ is going to involve overheads unless $OTHER_LANG also has that thing. I don't think keeping most types out of prelude should be a downside; you can always ‘from myprelude import * ’ if you disagree.
> How would you do this in Python?
I don't really know what "persists the unique index of f" means, but this seems similar to shelve. I can misuse that to give the same effect as what you showed.
>>> import shelve
>>> with shelve.open('x') as db: db.setdefault("f", len(db))
0
>>> with shelve.open('x') as db: db.setdefault("g", len(db))
1
>>> with shelve.open('x') as db: db.setdefault("h", len(db))
2
>>> with shelve.open('x') as db: db.setdefault("f", len(db))
0
> I don't think keeping most types out of prelude should be a downside
I don't understand what that means.
> I don't really know what "persists the unique index of f" means, but this seems similar to shelve. I can misuse that to give the same effect as what you showed.
I think you got it, but it seems like a lot of typing!
How many characters do you have to change if you want it purely in memory? In k I just write:
‘Prelude’ is the set of standard functions and types that you don't have to import to use. Python comes with a rich standard library, but most of it requires an import before you can use it.
> How many characters do you have to change if you want it purely in memory?
When the question is whether ‘K can reduce line count by a factor of 1000’, shorter keywords hardly cut it. ‘setdefault’ is wordier than ‘?’, but as something you're only going to use on occasion, so what? And d.setdefault(_, len(d)) is something you'll use maybe once a year.
shelve acts like a dict, so you can do the setdefault dance the same way on in-memory dicts.
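For example, the in-memory version is just a plain dict:

# same setdefault trick, no shelve, no file on disk
d = {}
d.setdefault("f", len(d))   # 0
d.setdefault("g", len(d))   # 1
d.setdefault("f", len(d))   # still 0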
> ‘Prelude’ is the set of standard functions and types that you don't have to import to use. Python comes with a rich standard library, but most of it requires an import before you can use it.
I might not understand this about python.
$ python3
Python 3.7.6 (default, Dec 30 2019, 19:38:26)
[Clang 11.0.0 (clang-1100.0.33.16)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> Counter
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'Counter' is not defined
Should that have worked? Is there something wrong with my Python installation?
> but as something you're only going to use on occasion, so what? And d.setdefault(_, len(d)) is something you'll use maybe once a year.
If it takes that many keystrokes, you're definitely only going to use it once per year! This is the normal way you make enumerations in q/k.
The low number of source-code bytes is the major feature of k. I've written some on why this is valuable, but at the end of the day, I don't really have anything more compelling than "try it, you might like it", and it works for me. This is something I'm working on, though.
Counter is not in Python's prelude. Therefore you have to import it in order to use it.
My comment about prelude was in response to you saying I was using Python's “language plus its entire ecosystem”. This is not true; Counter is part of the standard library that comes with the language. It is not from a third-party package.
> If it takes that many keystrokes, you're definitely only going to use it once per year! This is the normal way you make enumerations in q/k.
You don't use this frequently in other languages because you almost never care about the index of a value in a hash table (because it's ridiculous), and you almost never want to brute-force unique a list (because it's a footgun).
> you saying I was using Python's “language plus its entire ecosystem”. This is not true; Counter is part of the standard library that comes with the language. It is not from a third-party package.
I'm not sure I agree "third-party package" is a good/useful place to draw the line, but I don't think it's terribly important. Sorry.
> you almost never care about the index of a value in a hash table (because it's ridiculous)
So I want to query all of the referrer urls I've seen recently. I only see a few of them, so I want to have a table of all the URLs, and another table with one row per event (landing on my page). In SQL, you might have a foreign-key operation, but in q, I can also make the enumeration directly. I don't really care about the value of the index, just that I have one.
I don’t understand any of this. I don’t know what a footgun is. I have no idea what a “URL” object is or how you stick it in a table of events (on disk? in memory? in a remote process called a “database”). I don’t know what you mean by temporary calculations.
I have fifty billion of these events to handle every day, and I do this on one server (that is actually doing other things as well). It is not an option to “just” do anything at this scale if I want to keep costs under control.
Sorry, it's a bit weird not sharing a lexicon. I get the impression Q is a whole different genealogy of programmers.
In Python, although you can just use shelve to store data on disk, in practice this is considered a bad idea beyond very simple cases. Valuable data wants the guarantees that real databases provide, like ACID. shelve doesn't provide this, and IIUC nor does kdb+.
So if you're handling 50 billion events a day, live, and you need these to persist, you'd use SQL or something similar. That would then ultimately determine how you add and manipulate records.
If you don't care that much if you lose the data on a crash, that's when we're talking about temporary calculations. In Python, rather than having two tables like the ones you describe, you would make a custom type, aka ‘Url’, containing ‘url’ and ‘count’ as object fields, and then store your requests as a list containing references to those ‘Url’s.
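Roughly like this, say (a loose sketch; ‘Url’, ‘url’ and ‘count’ are just the names mentioned above):

# Illustrative sketch of the custom-type approach described above.
from dataclasses import dataclass

@dataclass
class Url:
    url: str
    count: int = 0

urls = {}        # url string -> its Url object
requests = []    # one entry per request, referencing a Url object

def record(u):
    obj = urls.setdefault(u, Url(u))
    obj.count += 1
    requests.append(obj)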
> ... like ACID. shelve doesn't provide this, and IIUC nor does kdb+. So if you're handling 50 billion events a day, live, and you need these to persist, you'd use SQL or something similar. That would then ultimately determine how you add and manipulate records. …
ACID is overrated. You can get atomicity, consistency, isolation and durability easily with kdb as I'll illustrate. I appreciate you won't understand everything I am saying though, so I hope you'll be able to ask a few questions and get the gist.
First, I write my program in g.q and start a logged process:
q g -L
This process receives every event in a function like this:
upd:{r[`u?y`url;y`metric]+:1}
There's my enumeration, saved in the variable "u". "r" is a keyed table where the keys are that enumeration, and the metric is whatever metric I'm tracking.
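If it helps, a very loose Python rendering of that upd (only an approximation, not equivalent code) looks like:

# Loose Python rendering of the q `upd` above, for illustration only.
from collections import defaultdict

u = {}                    # the enumeration: url -> index, grown on first sight
r = defaultdict(int)      # the keyed table: (url index, metric name) -> running count

def upd(event):
    i = u.setdefault(event["url"], len(u))   # `u?y`url : enumerate the url
    r[(i, event["metric"])] += 1             # r[...;...]+:1 : bump that metric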
I checkpoint daily:
eod:{.Q.dd[p:.Q.dd[`:.;.z.d];`r] set r;.Q.dd[p;`u] set u;r::0#r;system"l"}
This creates a directory structure where I have one directory per date, e.g. 2020.03.11, which has files (u and r) holding the snapshots I took. I truncate my keyed table (since it's a new day), and then I tell q the logfile can be truncated, and processing continues! To look at an (emptyish) tree right after a forced checkpoint:
geocar@gcmba a % ls -l
total 24
drwxr-xr-x 4 geocar staff 128 11 Mar 16:59 2020.03.11
-rw-r--r-- 1 geocar staff 8 11 Mar 16:59 g.log
-rw-r--r-- 1 geocar staff 206 11 Mar 16:55 g.q
-rw-r--r-- 1 geocar staff 130 11 Mar 16:59 g.qdb
geocar@gcmba a % ls -l 2020.03.11
total 16
-rw-r--r-- 1 geocar staff 120 11 Mar 16:59 r
-rw-r--r-- 1 geocar staff 31 11 Mar 16:59 u
The g.q file is the source code we've been exploring, but the rest are binary files in q's "native format" (it's basically the same as in memory; that's why q can get this data with mmap).
If I've made a mistake and something crashes, I can edit g.q and restart it, the log replays, no data is lost. If I want to do some testing, I can copy g.log off the server, and load it into my local process running on my laptop. This can be really helpful!
I can kill the process, turn off the server, add more disks in it, turn it back on, and resume the process from the last checkpoint.
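If it helps to see the shape of it outside q, the whole pattern is roughly this (an illustrative Python sketch, nothing like the real kdb machinery): append every event to a log, apply it to in-memory state, and on restart reload the last checkpoint and replay the log.

# Illustrative-only sketch of the checkpoint-plus-log-replay idea; file names are hypothetical.
import json, os, pickle

LOG, CHECKPOINT = "g.log", "state.pickle"

def upd(state, event):                      # pure function of state + event
    state[event["url"]] = state.get(event["url"], 0) + 1

def record(state, event):                   # log first, then apply
    with open(LOG, "a") as f:
        f.write(json.dumps(event) + "\n")
    upd(state, event)

def load_state():                           # restart: load checkpoint, then replay the log
    state = pickle.load(open(CHECKPOINT, "rb")) if os.path.exists(CHECKPOINT) else {}
    if os.path.exists(LOG):
        with open(LOG) as f:
            for line in f:
                upd(state, json.loads(line))
    return state

def checkpoint(state):                      # the "eod" step: snapshot, then truncate the log
    pickle.dump(state, open(CHECKPOINT, "wb"))
    open(LOG, "w").close()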
You can see some of these qualities are things only databases seem to have, and it's for that reason that kdb is marketed as a database. But that's just because management has a hard time thinking of SQL as a programming language (even though it's there in the name! I can't fault them, it is a pretty horrible language), and while nobody wants to write stored procedures in SQL, that's one way to think about how your entire application is built in q.
That's basically it. There's a little bit more code to load state from the checkpoint and set up the initial day's schema for r, but there's no material difference between "temporary calculations" and ones that will later become permanent: all of my input was in the event log, I just have to decide how to process it.
I mean, sure, if your problem is such that a strategy like that works for you, I'm not going to tell you otherwise. You can log incoming messages and dump data out to files easily with Python too. I wouldn't want to call that a ‘database’, though, since it's no more than a daily archive.
Yes! "databases" are all overrated too. Slow expensive pieces of shit. No way you could do 50 billion inserts from Python to SQL server on a single core in a day!
It's a bit unfair to compare the speed of wildly inequivalent things. RocksDB would be more comparable, but even there it offers much stronger resilience guarantees and multicore support, and gives you access to all your data at once.
Calling them expensive is ironic AF. Most of them are free and open source.
You're mistaken. There is no resilience guarantee offered by rocksdb. In q I can backup the checkpoints and the logs independently. It is trivial to get whatever level of resilience I want out of q just by copying regular files around. RocksDB requires more programming.
> gives you access to all your data at once
You're mistaken. This is no problem in q. All of the data is mmap'd as soon as I access it (if it isn't mmap'd already).
> Calling them expensive is ironic AF. Most of them are free and open source.
If they require 4x the servers, they're at least 4x as expensive. If it takes 20 days to implement instead of 5 minutes, then it's over 5000x as expensive.
No, calling that "free" is what's ironic, and believing it is moronic.
> How do you figure that? RocksDB is not a programming language.
I'm comparing to the code you showed. You're using the file system to dump static rows of data. All your data munging is on memory-sized blocks at program-level. Key-value stores are the comparable database for that.
> You're mistaken. This is no problem in q. All of the data is mmap'd as soon as I access it (if it isn't mmap'd already).
Yes, because you're working on the tail end of small, immutable data tables, rather than an actually sizable database with elements of heterogeneous sizes.
> In q I can backup the checkpoints and the logs independently. It is trivial to get whatever level of resilience I want out of q just by copying regular files around.
Yes, because you don't want much resilience.
---
What you're doing here is incredibly simplistic. It's not proper resiliency, it's not scalable to more complex problems, and it's not scalable to larger workloads. An mmap'ed table and an actual database are different things.
It works fine for you, but for many other people it wouldn't.
> You're using the file system to dump static rows of data
That's what MySQL, PostgreSQL, SQL Server, and Oracle all do. They write to a logfile (called the "write ahead log") then periodically (and concurrently) process it into working sets that are checkpointed (checked) in much the same way. It's a lot slower because they don't know what is actually important in the statement except what they can deduce from analysis. Whilst that analysis is slow, they do this so that structural concerns can be handed off to a data expert (often called a DBA), since most programmers have no fucking clue how to work with data.
That can work for small data, but it doesn't scale past around the 5bn inserts/day mark currently, without some very special processing strategies, and even then, you don't get close to 50bn.
> All your data munging is on memory-sized blocks at program-level.
That is literally all a computer can do. If you think otherwise, I think you need a more remedial education than the one I've been providing.
> What you're doing here is incredibly simplistic. It's not proper resiliency, it's not scalable to more complex problems, and it's not scalable to larger workloads. An mmap'ed table and an actual database are different things.
Yes, except that nothing you said is true in the way that you meant it.
Google.com does not query a "database" that looks any different from the one I'm describing; Bigtable was based on Arthur's work. So were Apache's Kafka and Amazon's Kinesis. Stream processing is undoubtedly the future, but it started here:
Not only does this strategy get used for some of the hardest problems and biggest workloads, it's quite possibly the only strategy that can be used for some of these problems.
Resiliency? Simplistic? I'm not even sure you know what those words mean. Putting "proper" in front of it is just weasel words...
Matter of taste; I cannot imagine having to program Java (give me Clojure/Scala any time if the JVM is required) for a living anymore: deciphering a lovely 200,000-class codebase where every call in every 30 lines sends you into a deep spelunk trying to figure out where/how/what, hoping it's not in some ancient undocumented .jar, etc. So I have very much the opposite of what you have. Then again, I have been doing k/j/apl for long enough that I would not call reading it deciphering, most of the time. Java (& C#), on the other hand, is always deciphering, even after 20+ years (which is how long ago I became a professional Java programmer), because, unlike with k, you simply cannot know all the libraries and files, or the brilliant imagination of people who like to use design patterns 'just a bit' wrong for everything.
Well that would be an interesting test; it is not that easy to do that while it is absolutely trivial (I would say run-of-the-mill) to abuse design patterns.
It has recently been discovered that human speech has a constant bitrate of 39 bits per second[1]. People speak faster in some languages but convey less information per word, and the converse holds true in languages where people speak slower.
In a 'code golf' language where you reduce a 1 million line program to 1000 lines, the 1000 line program is going to be just as hard to read. More concise syntax doesn't help as the same number and complexity of concepts are going to exist in both programs, and that's what the human has to understand.
There are languages where more powerful concepts are available for less code. Erlang for instance, with concurrency. In those cases, yes, readability is aided and maintainability is improved.
Claim B: The intelligibility of a computer program is proportional to the complexity of the problem it solves, not to the number of characters in its source code.
First off, I don't see how B follows from A.
Secondly, even if there does seem to be some similarity, one of the reasons any comparison breaks down is that human languages are evolutionarily adapted to our cognitive capacities, whereas programming languages are designed, for the most part, without a very rigorous understanding of how they interact with our cognitive capacities (this is an active area of research, but not a very advanced one, I think).
Finally, I reject claim B. The intelligibility of a computer program is heavily affected by how clearly the underlying concepts that define a solution to the given problem are mapped into the structures available in the given programming language, and that is obviously heavily affected by the choice of programming language. And I believe that, in general, more concise languages are more concise precisely because they make available more, and more straightforward, mappings from problem space to language-structure space. So more concise languages make it possible to write more easily comprehensible programs once you are over the barrier of holding all the mappings they make available in your head.
Human speech having a constant bitrate seems to indicate that there's a limit on the bandwidth we have when receiving or giving information, where that bandwidth is measured in terms of the actual amount of information, independent of the medium. Therefore, in a computer program, what matters is not the number of characters but the actual ideas behind the code. In other words, the more complex the ideas are, the more time we need to process them, independently of the number of characters.
> The intelligibility of a computer program is heavily affected by how clearly the underlying concepts that define a solution to the given problem are mapped into the structures available in the given programming language, and that is obviously heavily affected by the choice of programming language.
I don't think so. Nobody uses "just" a programming language. They use a generic programming language, plus libraries (or DSLs), plus their own code that maps the concepts of their problem to code. Unless the problem to be solved is really generic and/or simple, a programming language alone is very, very far from having direct mappings from problem to language.
And you can't escape the complexity of the problem at hand. You can hide it, but in this regard there's no difference between hiding it behind a compiler, behind libraries or behind your own code. You still need to understand the problem and the underlying concepts.
At some point you run into entropy. You can only compress code until it covers the requirements 100% and nothing else. You can't go past that point without losing features.
When my assumptions about the underlying problem are wrong a million line program is much easier to fix than a thousand line one.
Or rather, a thousand line concise program in a quirky language will become a million line program in a quirky language when exposed to changing business requirements.
If the language allows you to simplify the program sufficiently it might be easier to just rewrite it from scratch when the requirements change. I doubt that this is the case for any language, but you know, in principle it could happen.
>Languages that are often derided as write-only include APL, Dynamic debugging technique (DDT), Perl,[2] Forth, Text Editor and Corrector (TECO),[3] Mathematica, IGOR Pro and regular expression syntax used in various languages.
You've got me wondering about the utility of a genuinely write-only language.
As a thought experiment, would there be benefit to letting functions only be written once?
No-one could come along and break code by changing a function. If you wanted to fix a bug in a function you would have to write a new copy and explicitly update callers to use the new version.
A lot of maintenance overhead? Possibly, but tooling would take care of the majority of it. It would be useful to know explicitly not just when a function has changed but when the functions it calls have changed.
You would need a naming convention, instead of Main you would need something like Main#23145. It would tick over very regularly.
Library functions like CalculateTax#12 would tick over less frequently.
By referencing the old revision you could keep calling the old code if you didn't want to update a module or function.
Perhaps you could extend that name to be something like Namespace.Name#IndirectRevision.DirectRevision.Hash which would allow tools to more quickly extract when the change was made in the logic of the function itself or when the change was in dependents.
This would be an alternative maintenance strategy to IoC. Instead of treating dependents as if they don't affect the code being executed, we would instead control which version of the code gets called.
By forcing a code change in all callers, you build up an explicit picture of where bugs happen and, more importantly, of the impact of those bugs on other places.
It would need great tooling to get over the paradigm shift of moving away from IoC but I think it would be interesting.
This is exactly the main idea behind the Unison language.[1] All functions are immutable and identified by a hash rather than name. When you make any change you are creating a new function with a new hash. It's definitely a very interesting idea.
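The core trick (a toy sketch, not how Unison actually represents code) is content-addressing: definitions are keyed by a hash of their contents, so a "change" is really a new definition.

# Toy illustration of content-addressed, never-overwritten function definitions.
import hashlib

defs = {}   # hash -> source text; entries are never overwritten

def define(source):
    h = hashlib.sha256(source.encode()).hexdigest()[:12]
    defs.setdefault(h, source)               # an edited function simply gets a new hash
    return h

v1 = define("def inc(x): return x + 1")
v2 = define("def inc(x): return x + 2")      # the "bugfix" is a new, separate definition
assert v1 != v2 and defs[v1] != defs[v2]     # both versions remain addressable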
What an absolutely fascinating idea! As the field of software engineering matures, how million line repos get maintained is going to become a subject of academic study.
> If you wanted to fix a bug in a function you would have to write a new copy and explicitly update callers to use the new version.
Finding all callers is the hard part though! And in modern languages, that needs to be addressed both at compile (or "compile") and run time.
Assuming you've found all the call sites, within whatever constraints you were given and have imposed, the second part, where you call a specific version of a function, is where it gets tricky.
Is the bug in the calling code or the called code? Maintenance of huge software is obtuse: the original programmers have all long since left, so what we're left with is a smattering of calls to random specific versions, with only the tiniest tidbits of history, locked away in someone's email or saved in the previous ticketing system.
So it sounds interesting, but I worry (though I worry about a great many things) that the programmer three years later ends up coming across calls to three different versions of some code, with no guidance on which can be upgraded, which should be upgraded, and which must not be upgraded. I can't say which approach would involve digging up more ancient history, though.
> explicitly update callers to use the new version
In a genuinely write-only language, how do you update those callers? You can't, you'd have to write new ones to call your new bugfixed function. And then you'd have to write new callers of those callers...
You'd have to rewrite the whole program every time you wanted to make a change. Which would make bug-fixing a much more cerebral activity since you'd want to work out as many as possible before the rewrite to save time.
The idea is interesting. However, it's worth noting that you would have to increment main() with literally every change, because when you fix a bug in foo#10 (by rewriting it as foo#11), you'll need to update the affected call sites to use foo#11 instead, which means they get incremented too, so you need to do the same to their callers, all the way up to main.
You'd absolutely need tooling to automate some of this (present a list of call sites; you select which should use the new function). It would also increase the size of your code base by a lot, although I'm not sure that's as much of a problem since we may be spending less time reading code and more just writing a new function.
Only if it's written in a style which affords reading. K programmers seem to want to write their code in a mathematical style, but without the natural language prose which makes mathematical papers readable. In a paper by a mature mathematician, the equations only perform some of the work in expressing the idea. The rest of the work is done by prose written with an intent to be lucidly expository, to allow the notation to be terse because the high-level concepts are being expressed in a natural language.
We have this style in the programming world. It's called Literate Programming. How many K programmers write in that style?
> but without the natural language prose which makes mathematical papers readable
A mathematical paper introducing some meaningful worker function uses English because mathematical notation is insufficient for describing computation. That's one of the important goals of Iverson's notation.
> We have this style in the programming world. It's called Literate Programming. How many K programmers write in that style?
Quite a few! I would actually suggest there are more literate K programmers than there are literate C programmers! Every 4-5 lines of code probably has 2-3 lines of prose in a large application that I work on, but in another language that might be something like 500 lines of code for 2-3 lines of prose which puts the prose offscreen for most of the code it's describing.
Some C programs will try to keep the prose-to-code density higher than that, but even amongst Knuth's code that's rare.
I find that in these discussions people think that complexity can be escaped. A program cannot be less complex than the problem it solves. Of course you can add even more complexity if the programmers are not good, but that lower bound cannot be surpassed.
In your example, assuming that a million-line program is well done in its language, you'd have exactly the same complexity but now in a thousand lines, meaning that now each line does more, is more complex and changing it/understanding it is more difficult and prone to errors. It would be equally (if not more) impossible to understand.
If it were me, I'd choose the million-line program and keep the ability to understand parts of it and perform maintenance with just the program reference by my side, not the language reference too.
> In your example, assuming that a million-line program is well done in its language, you'd have exactly the same complexity but now in a thousand lines, meaning that now each line does more, is more complex and changing it/understanding it is more difficult and prone to errors. It would be equally (if not more) impossible to understand.
Disagreed. Reductio ad absurdum: It's not a good idea to force everyone to program in machine code, writing sequences of 1s and 0s. Why? Because the right abstractions (and associated notation) do make complexity more manageable.
Ok, you're right. It's not so clear-cut, but there's definitely a balance to find.
Of course, assembly is far better than machine code. C is almost always better than assembly, because it maps mental ideas to code better than assembly does. After that, you have higher-level languages, and then the difference is not that clear. High-level languages tend to hide the complexity of dealing with the computer: in Python, JS, Haskell or K you don't need to worry about memory allocations, or how objects are defined, or how to pass arguments to functions. Sometimes you do need to worry about that, and having a language that does it for you makes your program harder to understand (how does JS handle parallelism? When is an object an object and not a copy? Suddenly you need to know how JS does certain things, and that complexity reappears in your program).
Now, for the example, when I said that the million-line program was well done, I was thinking of some JS, Python, C++, or Java application without too much unnecessary cruft, one that maps problem concepts to code concepts properly. If the 1000x line reduction (if that even exists) is achieved through clever one-liners and unreadable, language-specific constructs, it's not worth it. Say your application is a physics simulator of a certain situation. Most people would want the million lines of Python code where each function and each part is understandable and lets you focus on the problem at hand, instead of a thousand lines where you not only have a difficult problem but also difficult code.
Chrome is well north of a million lines. Anyone familiar with C/C++ can at least meaningfully explore the codebase with the use of mechanisms like ctags and understand parts of it. If it had 1000x fewer lines but were in K instead, I don't really think that would make grokking it any easier.
I would hope that someone familiar with K could probably figure it out, but the benefit of C++ is that someone familiar with Java or Python might have a hope of doing pretty well too.
I think this is something of a strawman: there are plenty of million line (plus) systems and software suites, but I think million line programs are relatively uncommon by comparison.
Especially so if you don't count library code, and I don't because otherwise you end up doing things like counting the number of lines of code that back the Win32 APIs in your program line count, which doesn't seem sensible. Most of the time we treat libraries as black boxes, except on rare occasions, so don't tend to think of them as part of "our" codebase[1].
Most million line plus systems are broken down into subsystems (applications, services, libraries, etc.) that in isolation are smaller, and in many cases much smaller.
Even with those rare million line of code monoliths, for the most part there's some level of organisation in the codebase: layers, subsystems, components, or a mixture of these and whatever else is appropriate.
This organisation, and particularly when coupled with modern tooling, facilitates the understanding of even very large systems.
I don't say that it's necessarily easy but, then again, neither is understanding a large, complex system implemented in K.
[1] Except when thinking about security because third party code has been shown to be a rich source of vulnerabilities.
Alan Kay's VPRI STEPS project used PEG parsers to create math-like DSLs in their attempt to go "from the desktop to the metal in 20,000 lines of code."