Another useful datatype is a rope[0]. It lets you build longer compositions out of immutable/shared pieces. In particular, an operation such as concatenation can be O(1), since that matches the representation of a rope.
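For illustration, a minimal rope sketch in Rust (not the article's type, just the idea): leaves are immutable shared strings, and concatenation only allocates a new interior node, so it never copies text.

```rust
use std::rc::Rc;

enum Rope {
    Leaf(Rc<str>),
    Concat { left: Rc<Rope>, right: Rc<Rope>, len: usize },
}

impl Rope {
    fn len(&self) -> usize {
        match self {
            Rope::Leaf(s) => s.len(),
            Rope::Concat { len, .. } => *len,
        }
    }

    /// O(1): concatenation allocates one node and shares both halves.
    fn concat(left: Rc<Rope>, right: Rc<Rope>) -> Rope {
        let len = left.len() + right.len();
        Rope::Concat { left, right, len }
    }

    /// O(n): only needed when the flat text is actually required.
    fn flatten_into(&self, out: &mut String) {
        match self {
            Rope::Leaf(s) => out.push_str(s),
            Rope::Concat { left, right, .. } => {
                left.flatten_into(out);
                right.flatten_into(out);
            }
        }
    }
}
```

A real rope also keeps per-node lengths balanced for O(log n) indexing and slicing; that part is omitted here.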
The Nim compiler used ropes internally for years, until recently[0]: removing them reduced memory consumption and decreased compile times, so YMMV.
This is a big tell that maybe we don't want to pin down the exact type, but to have a way to let the compiler/benchmarks pick the best-performing data structure for the task at hand. But I think that in most programming languages neither the language nor the tooling helps you much here.*
*Saying this mostly with the intent of being proved wrong ;) xkcd://386
Nice library with many features. But I do not always understand the focus on memory usage. I guess the reason behind this is that fewer memory allocations have a positive effect on execution times. In a parser, where you often have to compare identifiers, it is a good idea to map every identifier string to a unique pointer with the help of a hash table.
In my interpreting parser [1] I use a hexa hash tree [2] for storing identifiers. It is not very memory efficient, but very fast. It turns every string (from the input buffer) into a unique pointer for that string, pointing to a copy of the string. In this way comparing strings (identifiers) is equivalent to comparing pointers.
The idea of the hexa hash tree is that it is a tree where each node has sixteen child nodes. Which node is selected is based on a step-wise evaluated hash function that first takes the lower four bits of successive characters in the string and, after reaching the end of the string, the higher four bits of the characters. The nodes often take up more memory space than the strings themselves.
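As a rough Rust sketch of how I read that description (the names and the index-based arena layout are mine, not the original implementation):

```rust
// A 16-ary trie walked nibble by nibble: first the low nibble of each byte,
// then, after the end of the string, the high nibbles. Interning returns a
// stable index, so identifier comparison becomes an integer comparison.
#[derive(Clone)]
struct Node {
    children: [Option<usize>; 16], // indices into `nodes`
    interned: Option<usize>,       // index into `strings`, set once this exact string is seen
}

struct HexaInterner {
    nodes: Vec<Node>,
    strings: Vec<String>,
}

impl HexaInterner {
    fn new() -> Self {
        HexaInterner {
            nodes: vec![Node { children: [None; 16], interned: None }],
            strings: Vec::new(),
        }
    }

    fn intern(&mut self, s: &str) -> usize {
        // Low nibbles of every byte first, then the high nibbles.
        let path = s.bytes().map(|b| (b & 0x0F) as usize)
            .chain(s.bytes().map(|b| (b >> 4) as usize));
        let mut cur = 0;
        for nibble in path {
            cur = match self.nodes[cur].children[nibble] {
                Some(next) => next,
                None => {
                    let next = self.nodes.len();
                    self.nodes.push(Node { children: [None; 16], interned: None });
                    self.nodes[cur].children[nibble] = Some(next);
                    next
                }
            };
        }
        if self.nodes[cur].interned.is_none() {
            self.strings.push(s.to_owned());
            self.nodes[cur].interned = Some(self.strings.len() - 1);
        }
        self.nodes[cur].interned.unwrap()
    }
}

// usage:
// let mut interner = HexaInterner::new();
// assert_eq!(interner.intern("foo"), interner.intern("foo"));
// assert_ne!(interner.intern("foo"), interner.intern("bar"));
```

A real implementation would hand out pointers rather than indices, but the effect is the same: after interning, identifier equality is a single comparison.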
Not just allocations, but critically – especially on modern machines – cache coherency. Indirections are an enemy of performance, and indirections to many different parts of memory in particular. A custom allocator can be a great help though, especially in a compiler where you can pretty much just use a fast slab/bump allocator and not worry about freeing memory until the end.
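Something like this string arena captures the "never free until the end" idea (a sketch under my own assumptions, not the article's allocator; crates like bumpalo or typed-arena do this properly):

```rust
// Append everything into one growing buffer and hand out (start, len)
// handles instead of pointers, since the buffer may reallocate as it grows.
// Nothing is freed individually; the whole arena is dropped at the end.
#[derive(Clone, Copy)]
struct Span {
    start: usize,
    len: usize,
}

#[derive(Default)]
struct StrArena {
    buf: String,
}

impl StrArena {
    fn alloc(&mut self, s: &str) -> Span {
        let start = self.buf.len();
        self.buf.push_str(s);
        Span { start, len: s.len() }
    }

    fn get(&self, sp: Span) -> &str {
        &self.buf[sp.start..sp.start + sp.len]
    }
}
```

All the allocated strings end up contiguous in one buffer, which also helps the cache-locality point above.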
We have terms and languages to express computational complexity, algorithms, temporal logic etc.
But what about data logistics and CPU mechanics?
How good are we at expressing things like memory fragmentation, cache locality, throughput and latency of operations etc? All of these things affect how the machine operates in very real ways, to a degree that often trumps the higher level mathematical concepts above.
I only ever see these things explained in high-level, heuristic terms, observed as a black box or spelled out literally as code.
Is there a language based on how the CPU and memory work that I'm not aware of?
I do not think that such a language has been developed. The complexity of how a CPU interacts with several levels of caching, RAM, hard drives (with their own caching), and network storage (with its own caching) is simply mind-boggling.
For that reason, I think that this route is not often taken. I guess that whenever modeling of execution is performed it is done to analyse the behaviour of an existing implementation in order to find ways to improve it.
If there is someone who has made an attempt to do something in this direction, I think it is Donald E. Knuth in his 'The Art of Computer Programming'. I think, for example, of Section 5.4 'External Sorting' in Volume 3, 'Sorting and Searching' [1].
The article would have been stronger if it had benchmarks showing the performance improvements of this technique. My intuition is that the memory overhead of strings is not the main performance issue with most compilers.
Yeah, for compilers, I would like to see a benchmark of small string optimization (which should be a huge win for most workloads) and string interning (also should be huge, not in the post though)
A pet peeve is that benchmark suites often favor measuring performance with fibonacci or mandelbrot-type things, but string/graph/hash table workloads are much different, in multiple dimensions
Compilers and network servers are more like the latter
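To make the small-string-optimization point above concrete, here is a rough sketch of the idea (the 22-byte threshold is an arbitrary choice of mine; crates like compact_str or smartstring implement this for real):

```rust
// Short strings live inline in the value itself; only longer ones heap-allocate.
// For identifier-heavy compiler workloads, most strings take the inline path.
enum SsoString {
    Inline { len: u8, buf: [u8; 22] },
    Heap(String),
}

impl SsoString {
    fn new(s: &str) -> Self {
        if s.len() <= 22 {
            let mut buf = [0u8; 22];
            buf[..s.len()].copy_from_slice(s.as_bytes());
            SsoString::Inline { len: s.len() as u8, buf }
        } else {
            SsoString::Heap(s.to_owned())
        }
    }

    fn as_str(&self) -> &str {
        match self {
            SsoString::Inline { len, buf } => std::str::from_utf8(&buf[..*len as usize]).unwrap(),
            SsoString::Heap(s) => s,
        }
    }
}
```

Real implementations pack the discriminant into spare bits so the whole thing stays as small as a plain heap string; this sketch only shows the two representations.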
I’m genuinely disappointed that it’s not done with -moz-element() on Firefox, but instead duplicates all the content and just scales it down. Using -moz-element() has the potentially useful side effect that things like text selection are visible in the minimap as well, while also avoiding the problems of the duplication approach, such as the browser's find functionality searching the minimap too. (The minimap needs `user-select: none`, which fixes some of the problems, but not all.)
I still hope that other browsers will eventually implement element(), along with cross-fade() which is also in the CSS Images Module Level 4 draft <https://w3c.github.io/csswg-drafts/css-images-4/>.
And even things which are now relatively simple and the same in every language and hardware platform become complex if you look through the lens of history and see the various alternatives that have been explored, e.g. signed integers.
On footnote #4: the change from two shifts to movabs/and is only possible for unsigned values. If the size were signed, this wouldn't be the correct behaviour (it requires sar instead of shr). It's not surprising that RISC-V sticks to using two shifts, because it takes a bunch of instructions to load a 64-bit constant into a register. Can't even load a 32-bit constant into a register with a single instruction on RV.
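For concreteness, here are the two equivalent formulations for an unsigned value (my illustration of the footnote, not the article's code); a compiler may lower the shift pair to movabs+and on x86-64, while on RISC-V the two shifts are usually kept because materializing the 62-bit mask is expensive:

```rust
// Clear the top two bits of an unsigned 64-bit length.
fn clear_top_two_shifts(len: u64) -> u64 {
    (len << 2) >> 2 // shl 2; shr 2 -- logical shifts, so the top bits become zero
}

fn clear_top_two_mask(len: u64) -> u64 {
    len & 0x3FFF_FFFF_FFFF_FFFF // movabs + and on x86-64
}

// If the field were signed, the shift pair would be shl + sar (sign
// extension), which is NOT equivalent to the mask -- hence the rewrite only
// applies to unsigned values.
```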
> Can't even load a 32-bit constant into a register with a single instruction on RV.
Just do it in the standard way, and don't concern yourself with hardware implementation details. RISC-V assembly language even has standardized pseudoinstructions to do just that.
Some implementations will see the multiple opcodes as one instruction. Others won't. It's not for the programmer to worry about.
The standard does not actually specify what sequence of instructions to use for the load pseudo-instruction, so it is left up to the assembler to decide.
A 32-bit load still requires 2 instructions (the purpose of the lui instruction is to make this efficient). A 64-bit load requires loading 2 32-bit values, shifting one 32 bits left and then performing a bitwise or. Alternatively you can hold the 64-bit value in .rodata and use a single instruction to load it.
The point is, it takes several more instruction bits and/or cycles than a `shl 2; shr 2` for clearing the two MSBs, so the optimisation given for x86 for this sequence is not useful here and the trivial solution is better.
A solution that may be better elsewhere is `btr 63; btr 62`. Since RV can specify a different destination operand you don't need an extra mov instruction as the x86 version would require.
The downside is that you lose the original and can't refer to it anymore. If (like OP) you are writing a compiler, you may need to spit out error messages, and will want to be able to say "line X, column Y", or even just print out the original line itself.
And no, you can't re-escape your unescaped text, as you can't do so losslessly in all cases. For some, the unescaped text might not obviously need re-escaping (e.g. the original text had a unicode escape for an emoji; you'll "forget" that after you unescape it).
Well, that's a trade-off, as with most things in parser design. You can go the way of the compilers of yore and report only line numbers. Or you can maintain column numbers only for a single line, and everything in the previous lines gets only a line number. Or you can store the column numbers in the tokens, which works better if your lexer emits them one at a time and the parser is fine with that.
If you save the offset (and you do, if you use a borrowed string) you can retrieve the original source code by re-reading the input file on the fly. This is expensive (due to IO, but especially because you need to re-lex and recompute all offsets up to the point of the error), but it only needs to happen when a parse error is generated, so the cost is likely negligible.
I’ve done this in a parser where I wanted to store only a single byte offset instead of line and column numbers, but still wanted to provide the more user-friendly line and column numbers in the error message.
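The recomputation itself is cheap to write; a minimal sketch of turning a byte offset back into line/column only when an error is actually reported (illustration, not my parser's actual code):

```rust
// Scan the source once, counting newlines, up to the stored byte offset.
// Only runs on the error path, so the cost is paid rarely.
fn line_and_column(source: &str, offset: usize) -> (usize, usize) {
    let mut line = 1;
    let mut col = 1;
    for (i, ch) in source.char_indices() {
        if i >= offset {
            break;
        }
        if ch == '\n' {
            line += 1;
            col = 1;
        } else {
            col += 1;
        }
    }
    (line, col)
}
```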
Storing column numbers for every token just in case there might be an error likely outweighs the string allocation reduction. But for error messages it might suffice to output just the line number together with a snippet of the line at the error location. Most people probably don't care about the exact column number, though it might be bad for other tools.
> Storing column numbers for every token just in case there might be an error likely outweighs the string allocation reduction.
Only if you store all of the tokens before starting the parsing, which would arguably be the bigger fish to fry in terms of conserving memory.
But it actually slows down the lexer: by an insignificant amount if the lexer is a traditionally written one, or by quite a lot if it's one of those new-fangled ones that use SIMD instructions to speed up finding the token boundaries... but such lexers also benefit from the ability to re-use unneeded parts of the source input (like quotes or spaces) for storing auxiliary data.
If you still have access to the original source in the filesystem you could just re-read the snippet and recover the original escaped form. If your language offers only a single way to escape characters then you may even just re-escape the string when emitting the error (although most languages have some sort of codepoint escape sequence)
You can also be lazy about it -- store the quoted string with 1 escape as 3 separate (fixed-size) tokens/spans, and don't actually allocate or decode strings until you need to.
In an interpreter there may be many string constants that aren't used (e.g. docstrings in Python).
And probably in a compiler too, since you may import a ton of library code and only use 1 function.
Storing escapes as tokens also lets you point to them in warnings. I think most compilers warn about bad octal escapes like \777 these days.
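A rough sketch of that lazy span representation (the names and the escape set are my own assumptions): a literal like "Hello\nWorld" is kept as three fixed-size spans into the source, and decoding into an owned string happens only when something actually needs the contents.

```rust
#[derive(Clone, Copy)]
enum StrPiece {
    /// Raw source bytes that need no decoding: [start, end) into the source.
    Literal { start: usize, end: usize },
    /// An escape sequence such as \n: [start, end) into the source.
    Escape { start: usize, end: usize },
}

// Decode on demand; until this is called, the string literal costs three
// fixed-size spans and zero allocations.
fn decode(source: &str, pieces: &[StrPiece]) -> String {
    let mut out = String::new();
    for p in pieces {
        match *p {
            StrPiece::Literal { start, end } => out.push_str(&source[start..end]),
            StrPiece::Escape { start, end } => match &source[start..end] {
                "\\n" => out.push('\n'),
                "\\t" => out.push('\t'),
                "\\\\" => out.push('\\'),
                "\\\"" => out.push('"'),
                other => panic!("unhandled escape {other}"), // a real lexer would report a diagnostic with this span
            },
        }
    }
    out
}
```

Because each escape keeps its own span, pointing at it in a warning (the \777 case above) falls out for free.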
You can, but see, the actual contents of the strings are really only used at the code-gen stage, where they are put verbatim into the .rodata section. Arguably, they could have been discarded by the dead-code-elimination pass by that time, but that's unlikely: most of the string constants do actually see use in the program.
Pushing that part of the lexer's task so late, on the off-chance it won't be needed, IMHO only makes sense if you're writing a parser for an IDE/LSP, in which case you don't need to unescape the string at all.
If the string contents are only used at the code-gen stage, then I see that as an argument FOR being lazy.
On the other hand, the principle of validating your input up front suggests that you should at least recognize the escapes, if not store their decoded form.
That is, you can be lazy by only parsing \\ and \" to find the closing quote, and producing 1 token. But that's nearly the same thing as doing a full decoding. The downsides are that it allows invalid input deeper into the compiler, and it doesn't alert the user to errors as quickly as possible.
> most of the string constants do actually see use in the program
[citation needed] :) I don't see how that can possibly hold across languages (Go, Java, Swift, etc.), and I'm not even sure that's true for C/C++.
> only makes sense if you're writing a parser for an IDE/LSP, in which case you don't need to unescape the string at all.
Clang's front end is used both for LSP and for code gen, so that distinction doesn't apply in all cases. It seems like most projects want to reuse their front ends these days.
And again the LSP will want to warn about invalid escapes like \777 and \u{9999999}.
----
EDIT: Although I guess in C, emitting all the errors up front, and then storing the decoded form in place, is probably about the same cost. If there's no allocation, then I might treat eager decoding as "free".
I think the problem is that you lose info, and most languages have more structure in their string literals, e.g. Swift has \(var) interpolation, and IDEs in Rust want to do things with {} format! strings, etc.
I was wondering the same thing; you can probably mess a bit with the input if you can deal with the drawbacks (which seem mild: pre-processed input might be shorter and column numbers might be off).
What about \n, which decodes to 0x0A - two bytes to one byte?
I can't think of any others like that, but newlines in different OSes get represented in various different ways (like DOS's famous CRLF, which adds a 0x0D to the recipe). Operating systems that use 0x0A alone could very well be the only example.
Reread the suggestion, you’ve got it back to front:
> no escape sequence is shorter than the text it represents
The idea is that you unescape immediately, and if necessary shuffle the remainder back, leaving a blank space at the end, but able to reuse the initial allocation.
Also, I don’t even know of anything that actually turns a \n into {CARRIAGE RETURN, LINE FEED}; if you want that, you’ll need \r\n. (A very few situations are a bit fuzzy about it, e.g. \r and \n are a bit wonky in Vim’s :substitute, but that’s all I can think of).
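A minimal sketch of that in-place unescape (assuming a C-like escape set; unknown escapes are just passed through here rather than reported):

```rust
// Because no escape sequence is shorter than what it decodes to, the write
// cursor never overtakes the read cursor, so the original buffer can be
// reused; the leftover tail is the "blank space at the end".
fn unescape_in_place(buf: &mut Vec<u8>) {
    let mut read = 0;
    let mut write = 0;
    while read < buf.len() {
        if buf[read] == b'\\' && read + 1 < buf.len() {
            buf[write] = match buf[read + 1] {
                b'n' => b'\n',
                b't' => b'\t',
                b'\\' => b'\\',
                b'"' => b'"',
                other => other, // unknown escape: keep the character (a real lexer would diagnose it)
            };
            read += 2;
        } else {
            buf[write] = buf[read];
            read += 1;
        }
        write += 1;
    }
    buf.truncate(write); // shrinks the length, keeps the allocation
}
```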
Note that in your compiler you don't really care about the representation that the OS expects in real files. It's OK for your intermediate, in-memory representation to be meaningless to the OS.
Also, maybe it's not a literal character-for-character replacement. Since lengths are probably changing, your pre-processed input might be shorter (or longer, if you truly need to expand some escaped chars (and you have a program that just wants to break your compiler by abusing that escape sequence :P)).
Yes, but some OSes use 0x0A alone, which is half the bytes of the \n representation. People from the 90s will easily remember the ^M (which is the 0x0D) showing up at the end of every line of a DOS text file when viewing it under a Linux text editor.
This is a different problem around how to interpret new lines from a Windows file (\r\n) or maybe an old Mac version (\r alone, to keep things incompatible I guess). If you are not happy with the OS's choice for new lines, you can pre-process that when reading the input as the very first stage. This is what I meant when I said that your compiler doesn't really care about the OS's semantics once the file is in memory; you only need to respect them if you are writing intermediate state to files (like some IR).
The article is talking about how to handle escaped characters in the source, like the literal `\n` you can see in `let hi = "Hello\nWorld\nfrom my compiler!"` using your average text editor (0x5C + 0x6E in ASCII), and how to build an efficient string type that mostly borrows views from the in-memory version of the source file (which could be a pre-processed version of the raw file if you hate odd line breaks).
Before UNIX had the bright idea to represent newlines as single LFs in disk files and then write a bunch of code in the kernel's tty driver to convert LF to and from CRLF pairs (which is what ttys actually used), everybody else used a CRLF pair to indicate a line break. That's why almost all network protocols use CRLF, and not because Microsoft invented a time machine and travelled back to the seventies to hinder UNIX adoption.
Is it not a better approach, when already assuming semi-niche usage, to rely on more information about the string? e.g.
* Have some of your strings be guaranteed-memoized, then just pass an index into the global/thread-local array of memoized strings.
* In tight loops working on a few strings, ensure they are all short, then forego pointers altogether and just use the values.
I feel most code in Rust falls into the premature optimisation bucket. To prototype/hack something, TypeScript is a much better choice. To then carefully optimise something, Rust is not high-level enough for me, and too restrictive in what you can do. The kind of high-level environment I am looking for here does not exist yet, so I guess people go for Rust because there is no better choice, but it feels like a dead end to me.
How 'high-level' are you hoping for here? Haskell? Rust has very high abstractions, but I guess the fundamentals are based around giving you control over low level stuff, which seems like a good balance to me. I'm not sure what specifically you'd hope to do which you can't do in Rust.
There is a lot of stuff you cannot do in Rust. For example reasoning about your code, which is the minimum I expect from my language. If you cannot do that, it's not high-level enough. Rust is like all other type-based languages: it restricts what you can do, offering you some guarantees in return. For me, the deal Rust is offering is just not that good. I don't like to live in somebody else's idea of allowed code.
You can't..."reason about your code" in Rust? Look, maybe there's a reasonable take in there somewhere, but the way you put it just makes it sound like gibberish.
It sounds like you're just hoping for either a dynamically typed language (eg. Python) or for a language which is fully type-inferred (eg. Haskell). Which way is it?
I don't see my programs limited to a CPU. Thanks for pointing out another limitation of Rust.
I am sorry that you cannot understand my answer. I am working on remedying this, unfortunately now is not the time for me to educate one person at a time. Hopefully later.
Are you ok? If you think you’re programming on the fabric of the universe and not a chip, then the likelihood is you’re experiencing a major psychiatric issue, which given the tone of the replies is a definite possibility.
If you’re not, then you’re programming for some kind of chip, which inherently and obviously limits your ability to express programs to those that are supported on that chip, be that via the specific instructions, architecture or performance of that chip.
Thanks for asking! Yes, I am ok, I am just moving on a higher plane of reasoning than pretty much anyone else currently. Trying to give you access to that plane as well, hopefully soon.
If you express your programs in logic, first, your programs are not just programs anymore, they are mathematics. Mathematics is not bound to a particular CPU or chip. Sure, you can make it run on one (or not, that would depend on your program).
Now, this has a lot to do with Rust. Rust is a particularly restrictive form of expressing yourself. It basically prescribes a very particular form of mathematics, and in return, promises you certain theorems for free. Much like other type based programming languages, and even logics (although type-based logics are of course much more expressive than type-based programming languages).
Instead, it would be better to express your ideas freely in a logic, just constrained by consistency, and not much else. This is very much the idea of programming as theory building. Where can you build a theory better, than in a logic?
Then of course implementation details are important, such as speed, parallelism, memory use, etc. Rust's idea is to put these details front and center, especially memory use. That can work well, depending on your use case, but usually, it's just premature optimisation. Of course, if your thinking is restricted to Rust, you would not agree with that.
> Yes, I am ok, I am just moving on a higher plane of reasoning than pretty much anyone else currently
Real talk: anyone who talks like this isn’t in a good way. Please take that seriously, and take care of yourself.
I’m not sure engaging any further would be fruitful, but reason about the likelihood of yourself “operating on a higher plane of existence/seeing the inner truth/understanding that everyone else is blind/cracking the code” against simply having a manic episode of some kind.
Because what you’ve described above is just… programming, just stripped down to the abstract and ignoring all practicalities.
There isn’t a great truth in that, or a higher plane of existence, and the lack of your ability to clearly articulate the concepts combined with classic delusional catchphrases is really, really indicative of something else going on.
And the inability for others to understand your great discoveries is not evidence of your higher plane of existence, it’s because you’re spouting semi-delusional abstract nonsense without a clear train of thought or concise argument.
Again, thank you for your concern, but it is not me who is not properly engaging with arguments here. Sometimes it takes a clear mind to receive clear (and arguably pretty simple) arguments, and yours doesn't seem particularly clear.
Here’s what’s likely to happen: you’ve got this big grand exciting idea that you just can’t express to everyone in words. The thoughts are there, but they just don’t come out into words how you want them to. People don’t get it.
But you need to share it with the world. You need to. So you produce something about your grand idea that you think is cohesive. Something to teach people!
Except it doesn’t. It’s equivalent to a mess of paper, handwritten notes and impossible to follow logic, presented terribly. And it’s likely long. Too long.
And so… nobody cares. Nobody reads it. But your grand idea is there! Don’t write programs, write logic! Duh! Why can’t people understand that? Why can’t they see?
So obviously those people who don’t get it have foggy brains. But your thoughts are clear as possible. You’re elevated. They don’t get it because they cant.
This superiority complex impacts your relationships, and you grow more introverted, isolate yourself more etc. This fuels itself into a nasty cycle.
You don’t have a grand idea and you’re not in a higher plane of existence. Coming to terms with that is going to suck, but it’s important to do so.
Honestly, really do take a moment to breathe and step back. And it’s ok. This all happens surprisingly often in our field, and you sound like a cookie cutter case. My email is in my profile, reach out if you need.
> On the other hand, I don't really care to address any of your personalities at this point
You’re completely unable to not reply to this thread. You need the last word because it makes you superior, and you don’t want to concede to a foggy brained person who can’t hope to grasp your big idea.
So your next reply will probably just be something nonsensical and short, because you have to put something but there isn’t really anything to put. And if you don’t, it’s going to really irk you more than it should.
Best of luck. Email is in my profile if you wanna talk, otherwise I’m looking forward to seeing some evidence of your higher plane of existence posted here when it’s ready :)
Programming languages are restrictions. A 'program' is information passed to an interpreter, where the language describes what that interpreter will do with your program. That interpreter has restrictions, upon which the language may add further restrictions.
The only language with zero restrictions relative to the interpreter is the language that interpreter acts on, e.g. aarch64 machine code for a specific processor instance. And even then, the restrictions of that CPU are very much present.
What do you have in mind for an unrestricted language?
See the other thread (oops, it got flagged). Mainly, the right level of abstraction is that of a logic. Current logics are somewhat lacking (first-order logic doesn't have general binding, typed logics don't have a single mathematical universe), but I think I found the one to rule them all. Programming is then just an (arguably very important) special case of expressing yourself in this logic. Hopefully soon(ish) on a website near you.
[0] https://en.wikipedia.org/wiki/Rope_(data_structure)