I wrote a string type (mcyoung.xyz)
188 points by todsacerdoti 8 months ago | 75 comments



Another useful datatype is a rope[0]. It is useful for composing immutable/shared pieces into longer strings. In particular, an operation such as concatenation can be O(1), since that matches the representation of a rope.

[0] https://en.wikipedia.org/wiki/Rope_(data_structure)
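
A minimal sketch of the idea in Rust (illustrative only, not from the linked post): concatenation just builds a new node that shares both halves, so no bytes are copied until the rope is flattened.

    use std::rc::Rc;

    // Leaves hold immutable pieces; Concat nodes share their children.
    enum Rope {
        Leaf(Rc<str>),
        Concat { left: Rc<Rope>, right: Rc<Rope>, len: usize },
    }

    impl Rope {
        fn leaf(s: &str) -> Rc<Rope> {
            Rc::new(Rope::Leaf(Rc::from(s)))
        }

        fn len(&self) -> usize {
            match self {
                Rope::Leaf(s) => s.len(),
                Rope::Concat { len, .. } => *len,
            }
        }

        // O(1): no bytes are copied, the two ropes are simply shared.
        fn concat(left: Rc<Rope>, right: Rc<Rope>) -> Rc<Rope> {
            let len = left.len() + right.len();
            Rc::new(Rope::Concat { left, right, len })
        }

        // Flattening is O(n) and only done when a contiguous string is needed.
        fn flatten(&self) -> String {
            match self {
                Rope::Leaf(s) => s.to_string(),
                Rope::Concat { left, right, .. } => {
                    let mut out = left.flatten();
                    out.push_str(&right.flatten());
                    out
                }
            }
        }
    }

    fn main() {
        let greeting = Rope::concat(Rope::leaf("Hello, "), Rope::leaf("world!"));
        assert_eq!(greeting.flatten(), "Hello, world!");
    }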


Also worth noting that just about every JavaScript engine uses ropes internally to represent strings (most of the time).

https://docs.google.com/document/d/1o-MJPAddpfBfDZCkIHNKbMiM...


The Nim compiler used ropes internally for years, until recently[0]: removing them reduced memory consumption and decreased compile times, so YMMV.

[0] https://github.com/nim-lang/Nim/pull/20433


This is a big tell that maybe we don't want to know the exact type, but to have a way to let the compiler/benchmarks pick the best-performing data structure for the task at hand. But I think that in most programming languages, neither the language nor the tooling helps you much here.*

*Saying this mostly with the intent of being proved wrong ;) xkcd://386


> most of the time

Looking at the reference[0] given at the very end of the document, wouldn't the word "sometimes" be more appropriate?

[0]: https://mrale.ph/blog/2016/11/23/making-less-dart-faster.htm...


Interesting, seems like things are more complicated than I thought.


A company had me implement that in Rust during an interview. It was pretty hard.


Nice library with many features. But I do not always understand the focus on memory usage. I guess the reason behind this is that fewer memory allocations have a positive effect on execution times. In a parser, where you often have to compare identifiers, it is a good idea to map every identifier string to a unique pointer with the help of a hash table.

In my interpreting parser [1] I use a hexa hash tree [2] for storing identifiers. It is not very memory efficient, but it is very fast. It turns every string (from the input buffer) into a unique pointer to a copy of that string. In this way, comparing strings (identifiers) is equivalent to comparing pointers.

The idea of the hexa hash tree is that it is a tree where each node has sixteen child nodes. Which node is selected is based on a stepwise evaluated hash function that first takes the lower four bits of successive characters in the string and, after reaching the end of the string, the higher four bits of the characters. The nodes often take up more memory space than the strings themselves.

[1] https://github.com/FransFaase/IParse/

[2] https://github.com/FransFaase/IParse/blob/master/software/Id...
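
A simplified Rust sketch of the idea described above (not the actual IParse code, which is C++): a sixteen-way trie walked by the low nibbles of the string's bytes and then the high nibbles, where each distinct string ends up owning exactly one heap copy, so identifier comparison becomes pointer comparison.

    // One node per step of the nibble path; most slots stay empty,
    // which is why the nodes can outweigh the strings themselves.
    #[derive(Default)]
    struct Node {
        children: [Option<Box<Node>>; 16],
        interned: Option<Box<str>>, // the unique copy owned by this node
    }

    struct Interner {
        root: Node,
    }

    impl Interner {
        fn new() -> Self {
            Interner { root: Node::default() }
        }

        // Returns a stable pointer to the unique copy of `s` (valid for as
        // long as the interner is alive, since the copy lives in a Box).
        fn intern(&mut self, s: &str) -> *const str {
            // Low nibble of every byte first, then the high nibble of every byte.
            let path = s
                .bytes()
                .map(|b| (b & 0x0F) as usize)
                .chain(s.bytes().map(|b| (b >> 4) as usize));

            let mut node = &mut self.root;
            for nibble in path {
                node = &mut **node.children[nibble]
                    .get_or_insert_with(|| Box::new(Node::default()));
            }
            let interned = node.interned.get_or_insert_with(|| s.into());
            &**interned as *const str
        }
    }

    fn main() {
        let mut interner = Interner::new();
        let a = interner.intern("identifier");
        let b = interner.intern("identifier");
        let c = interner.intern("other");
        assert!(std::ptr::eq(a, b)); // same string, same pointer
        assert!(!std::ptr::eq(a, c));
    }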


Not just allocations, but critically – especially on modern machines – cache coherency. Indirections are an enemy of performance, and indirections to many different parts of memory in particular. A custom allocator can be a great help though, especially in a compiler where you can pretty much just use a fast slab/bump allocator and not worry about freeing memory until the end.


Do you mean cache locality? Coherence is not something ordinary single-threaded code has to be concerned with.


Oops, indeed, brain fart.


We have terms and languages to express computational complexity, algorithms, temporal logic etc.

But what about data logistics and CPU mechanics?

How good are we at expressing things like memory fragmentation, cache locality, throughput and latency of operations etc? All of these things affect how the machine operates in very real ways, to a degree that often trumps the higher level mathematical concepts above.

I only ever see these things explained in high-level, heuristic terms, observed as a black box or spelled out literally as code.

Is there a language based on how the CPU and memory work that I'm not aware of?


I do not think that such a language has been developed. The complexity of how a CPU interacts with several levels of caching, RAM, hard drives (with their own caching), and network storage (with its own caching) is simply mind-boggling.

For that reason, I think that this route is not often taken. I guess that whenever modeling of execution is performed it is done to analyse the behaviour of an existing implementation in order to find ways to improve it.

If there is someone who has made an attempt to do something in this direction, I think it is Donald E. Knuth in his 'The Art of Computer Programming'. I think, for example, of section 5.4 'External sorting' in Volume 3 'Sorting and Searching' [1].

[1] https://seriouscomputerist.atariverse.com/media/pdf/book/Art...


Heap memory allocation can be very slow. Reducing the number of allocations is often a fantastic way to get very significant speed ups.

Using an integer to identify strings is very common in parsers - it's called "string interning".
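
A minimal interning sketch in Rust (the names here are illustrative, not from the article): each distinct string gets a small integer ID, so later identifier comparisons are plain integer comparisons.

    use std::collections::HashMap;

    #[derive(Default)]
    struct StringInterner {
        ids: HashMap<String, u32>,
        strings: Vec<String>, // id -> original text, e.g. for error messages
    }

    impl StringInterner {
        fn intern(&mut self, s: &str) -> u32 {
            if let Some(&id) = self.ids.get(s) {
                return id;
            }
            let id = self.strings.len() as u32;
            self.ids.insert(s.to_owned(), id);
            self.strings.push(s.to_owned());
            id
        }

        fn resolve(&self, id: u32) -> &str {
            &self.strings[id as usize]
        }
    }

    fn main() {
        let mut interner = StringInterner::default();
        let a = interner.intern("foo");
        let b = interner.intern("foo");
        let c = interner.intern("bar");
        assert_eq!(a, b); // same identifier, same ID
        assert_ne!(a, c);
        assert_eq!(interner.resolve(c), "bar");
    }

(This sketch stores each string twice for brevity; real interners share a single copy between the map and the ID table.)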


The article would have been stronger if it had benchmarks showing the performance improvements of this technique. My intuition is that the memory overhead of strings is not the main performance issue with most compilers.


Yeah, for compilers, I would like to see a benchmark of small string optimization (which should be a huge win for most workloads) and string interning (also should be huge, not in the post though)

A pet peeve is that benchmark suites often favor measuring performance with fibonacci or mandelbrot-type things, but string/graph/hash table workloads are much different, in multiple dimensions

Compilers and network servers are more like the latter


I've never seen an article with a minimap on the side before, that's kinda awesome


I’m genuinely disappointed that it’s not done with -moz-element() on Firefox, but instead duplicates all the content and just scales it down. Using -moz-element() has the potentially useful side effect that things like text selection would be visible in the minimap as well, and it avoids the problems of the duplication approach, such as the browser's find functionality searching the minimap too. (The minimap needs `user-select: none`, which fixes some of the problems, but not all.)

I still hope that other browsers will eventually implement element(), along with cross-fade() which is also in the CSS Images Module Level 4 draft <https://w3c.github.io/csswg-drafts/css-images-4/>.


> I’m genuinely disappointed that it’s not done with -moz-element() on Firefox

Given how few of us Firefox users there are out there, I'm sadly not surprised.


Firefox till I die. (Or until they turn into a shit company)


Exactly the same for me! I've been using Firefox forever, and always will.


In this case you might be interested in this recent post: https://news.ycombinator.com/item?id=36757542.


Agreed, it's a joy to scroll through this article using the minimap. IMO a stark reminder how useless modern scrollbars have become.


it doesn't even appear for me on default browser scaling

if you hadn't commented that, I wouldn't have known


Just goes to show how much work goes into the illusion that strings are a simple data type.


Everything is more complex than it first seems in this industry.


And even things which are now relatively simple and the same in every language and hardware platform become complex if you look through the lens of history and see the various alternatives that have been explored. E.g. signed integers


Everything is more complex than it seems.


On footnote #4: the change from two shifts to movabs/and is only possible for unsigned values. If the size were signed this wouldn't be the correct behavior (it would require sar instead of shr). It's not surprising that RISC-V sticks to using two shifts, because it takes a bunch of instructions to load a 64-bit constant into a register. You can't even load a 32-bit constant into a register with a single instruction on RV.
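
A quick Rust illustration of that point (a sketch, not the article's code): for an unsigned length the two shifts and the 62-bit mask are interchangeable, but for a signed value `>>` is an arithmetic shift (sar) that sign-extends from bit 61 and gives a different result.

    fn clear_top_two_unsigned(len: u64) -> u64 {
        let shifted = (len << 2) >> 2;              // shl + shr (logical)
        let masked = len & 0x3FFF_FFFF_FFFF_FFFF;   // 62-bit mask (movabs + and on x86-64)
        debug_assert_eq!(shifted, masked);          // equivalent for unsigned values
        masked
    }

    fn clear_top_two_signed(len: i64) -> i64 {
        // `>>` on a signed value is sar: it sign-extends from bit 61,
        // so this is NOT the same as masking off the top two bits.
        (len << 2) >> 2
    }

    fn main() {
        assert_eq!(clear_top_two_unsigned(u64::MAX), 0x3FFF_FFFF_FFFF_FFFF);
        // With bit 61 set, the signed version sign-extends to a negative number.
        assert_eq!(
            clear_top_two_signed(0x2000_0000_0000_0000),
            -0x2000_0000_0000_0000
        );
    }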


>Can't even load a 32-bit constant into register with a single instruction on RV.

Just do it in the standard way, and don't concern yourself with hardware implementation details. RISC-V assembly language even has standardized pseudoinstructions to do just that.

Some implementations will see the multiple opcodes as one instruction. Others won't. It's not for the programmer to worry about.


The standard does not actually specify what sequence of instructions to use for the load pseudo-instruction, so it is left up to the assembler to decide.

A 32-bit load still requires 2 instructions (the purpose of the lui instruction is to make this efficient). A 64-bit load requires loading two 32-bit values, shifting one left by 32 bits, then performing a bitwise OR. Alternatively you can hold the 64-bit value in .rodata and use a single instruction to load it.

The point is, it takes several more instruction bits and/or cycles than a `shl 2; shr 2` for clearing the two MSBs, so the optimisation given for x86 for this sequence is not useful here and the trivial solution is better.

A solution that may be better elsewhere is `btr 63; btr 62`. Since RV can specify a different destination operand you don't need an extra mov instruction as the x86 version would require.


Or you can replace the escape sequences right in the source text: it works since no escape sequence is shorter than the text it represents.
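
A sketch of that in Rust (illustrative; only \n and \\ are handled, to keep it short): the decoded bytes are written over the front of the same buffer, which works precisely because no escape sequence decodes to something longer than itself.

    fn unescape_in_place(buf: &mut Vec<u8>) {
        let mut read = 0;
        let mut write = 0;
        while read < buf.len() {
            let b = buf[read];
            if b == b'\\' && read + 1 < buf.len() {
                let decoded = match buf[read + 1] {
                    b'n' => b'\n',
                    b'\\' => b'\\',
                    other => other, // unknown escape: keep the character as-is
                };
                buf[write] = decoded;
                read += 2;
            } else {
                buf[write] = b;
                read += 1;
            }
            write += 1;
        }
        buf.truncate(write); // the decoded text is never longer than the original
    }

    fn main() {
        let mut buf = b"Hello\\nWorld".to_vec();
        unescape_in_place(&mut buf);
        assert_eq!(buf, b"Hello\nWorld".to_vec());
    }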


The downside is that you lose the original and can't refer to it anymore. If (like OP) you are writing a compiler, you may need to spit out error messages, and will want to be able to say "line X, column Y", or even just print out the original line itself.

And no, you can't re-escape your unescaped text, as you can't do so losslessly in all cases. For some, the unescaped text might not obviously need re-escaping (e.g. the original text had a unicode escape for an emoji; you'll "forget" that after you unescape it).


Well, that's a trade-off, as most things in parser design are. You can go the way of the compilers of yore and report only the line numbers. Or you can maintain column numbers only for a single line, so everything in the previous lines gets only a line number. Or you can store the column numbers in the tokens, which works better if your lexer emits them one at a time and the parser is fine with that.


If you save the offset (and you are, if you use a borrowed string) you can retrieve the original source code by re-reading the input file on the fly. This is expensive (due to IO, but especially also since you need to re-lex and recompute all offsets up to the point of the error), but it only needs to happen in case a parse error is generated, so the cost is likely negligible.

I’ve done this in a parser where I wanted to store only a single byte offset instead of line and column numbers, but still wanted to provide the more user-friendly line and column numbers in the error message.
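
Roughly like this in Rust (an illustrative sketch; the file name and offset are made up): tokens carry only a byte offset, and line/column are recomputed from the source only when an error is actually reported.

    use std::fs;

    fn line_and_column(path: &str, offset: usize) -> std::io::Result<(usize, usize)> {
        let source = fs::read_to_string(path)?; // re-read the input on the fly
        let mut line = 1;
        let mut column = 1;
        for (i, ch) in source.char_indices() {
            if i >= offset {
                break;
            }
            if ch == '\n' {
                line += 1;
                column = 1;
            } else {
                column += 1;
            }
        }
        Ok((line, column))
    }

    fn main() -> std::io::Result<()> {
        // Hypothetical usage: report a parse error stored as a raw byte offset.
        let (line, column) = line_and_column("input.src", 42)?;
        eprintln!("parse error at line {line}, column {column}");
        Ok(())
    }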


Storing column numbers for every token just in case there might be an error likely outweighs the string allocation reduction. But for error messages it might suffice to output just the line number together with a snippet of the line at the error location. Most people probably don't care about the exact column number, though it might be bad for other tools.


> Storing column numbers for every token just in case there might be an error likely outweighs the string allocation reduction.

Only if you store all of the tokens before starting the parsing, which would arguably be the bigger fish to fry in terms of conserving memory.

But it actually slows down the lexer: by an insignificant amount if the lexer is a traditionally written one, or by quite a lot if it's one of those new-fangled ones that use SIMD instructions to speed up finding the token boundaries... but such lexers also benefit from the ability to re-use unneeded parts of the source input (like quotes or spaces) for storing auxiliary data.


If you still have access to the original source in the filesystem you could just re-read the snippet and recover the original escaped form. If your language offers only a single way to escape characters then you may even just re-escape the string when emitting the error (although most languages have some sort of codepoint escape sequence)


You can also be lazy about it -- store the quoted string with 1 escape as 3 separate (fixed-size) tokens/spans, and don't actually allocate or decode strings until you need to.

In an interpreter there may be many string constants that aren't used (e.g. docstrings in Python).

And probably in a compiler too, since you may import a ton of library code and only use 1 function.

Storing escapes as tokens also lets you point to them in warnings. I think most compilers warn about bad octal escapes like \777 these days.
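
Something like this in Rust (an illustrative sketch; only \n and \\ are handled): the lexer records spans into the source buffer, and decoding only happens if the value is ever needed.

    use std::ops::Range;

    enum StrPiece {
        Verbatim(Range<usize>), // raw bytes, used as-is
        Escape(Range<usize>),   // an escape sequence such as \n
    }

    fn decode(source: &str, pieces: &[StrPiece]) -> String {
        let mut out = String::new();
        for piece in pieces {
            match piece {
                StrPiece::Verbatim(span) => out.push_str(&source[span.clone()]),
                StrPiece::Escape(span) => match &source[span.clone()] {
                    "\\n" => out.push('\n'),
                    "\\\\" => out.push('\\'),
                    other => out.push_str(other), // unknown escape: warn or keep
                },
            }
        }
        out
    }

    fn main() {
        // The source text `"Hello\nWorld"`, with one escape in the middle.
        let source = r#""Hello\nWorld""#;
        let pieces = [
            StrPiece::Verbatim(1..6),  // Hello
            StrPiece::Escape(6..8),    // \n
            StrPiece::Verbatim(8..13), // World
        ];
        assert_eq!(decode(source, &pieces), "Hello\nWorld");
    }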


You can, but see, the actual contents of the strings are really only used at the code-gen stage, where they're put verbatim in the .rodata section. Arguably, they could have been discarded by the dead-code elimination pass by that time, but that's unlikely: most of the string constants do actually see use in the program.

Pushing that part of the lexer's task so late on the off-chance it probably won't be needed IMHO only makes sense if you're writing a parser for an IDE/LSP in which case you don't need to unescape the string at all.


If the string contents are only used at the code-gen stage, then I see that as an argument FOR being lazy.

On the other hand, the principle of validating your input up front suggests that you should at least recognize the escapes, if not store their decoded form.

That is, you can be lazy by only parsing \\ and \" to find the closing quote, and producing 1 token. But that's nearly the same thing as doing a full decoding. The downsides are that it allows invalid input deeper into the compiler, and it doesn't alert the user to errors as quickly as possible.

> most of the string constants do actually see use in the program

[citation needed] :) I don't see how that can possibly hold across languages (Go, Java, Swift, etc.), and I'm not even sure that's true for C/C++.

> only makes sense if you're writing a parser for an IDE/LSP in which case you don't need to unescape the string at all.

Clang's front end is used both for LSP and for code gen, so that distinction doesn't apply in all cases. It seems like most projects want to reuse their front ends these days.

And again the LSP will want to warn about invalid escapes like \777 and \u{9999999}.

----

EDIT: Although I guess in C, emitting all the errors up front, and then storing the decoded form in place, is probably about the same cost. If there's no allocation, then I might treat eager decoding as "free".

I think the problem is that you lose info, and most languages have more structure in their string literals, e.g. Swift has \(var) interpolation, and IDEs in Rust want to do things with {} format! strings, etc.


I was wondering the same thing; you can probably mess a bit with the input if you can deal with the drawbacks (which seem mild: pre-processed input might be shorter and column numbers might be off).


What about \n, which decodes to 0x0ah - two bytes to one byte?

I can't think of any others like that, but newline in different OS's get represented in various different ways (like DOS's famous CRLF, which adds an 0x0dh to the recipe). Operating systems that use 0x0ah alone could very well be the only example.


Reread the suggestion, you’ve got it back to front:

> no escape sequence is shorter than the text it represents

The idea is that you unescape immediately, and if necessary shuffle the remainder back, leaving a blank space at the end, but able to reuse the initial allocation.

Also, I don’t even know of anything that actually turns a \n into {CARRIAGE RETURN, LINE FEED}; if you want that, you’ll need \r\n. (A very few situations are a bit fuzzy about it, e.g. \r and \n are a bit wonky in Vim’s :substitute, but that’s all I can think of).


Note that in your compiler you don't really care about the representation that the OS expects in real files. It's OK for your intermediate, in-memory representation to be meaningless to the OS.

Also, maybe it's not a literal character for character replacement. Since lengths are probably changing, your pre-processed input might be shorter (or longer if you truly need to expand some escaped chars (and you have a program that just wants to break your compiler by abusing that escape sequence :P))


Yes, but some OS's use 0x0ah alone, which is half the bytes of the \n representation. People from the 90's will easily remember the ^M (which is the 0x0dh showing up) at the end of every line of a DOS text file when viewing it under a Linux text editor.


This is a different problem around how to interpret new lines from a Windows file (\r\n) or maybe an old Mac version (\n\r to keep things incompatible I guess). If you are not happy with the OS's choice for new lines, you can pre-process that when reading the input as the very first stage. This is what I meant when I said that your compiler doesn't really care about the OS's semantics once the file is in memory, you only need to respect them if you are writing intermediate state to files (like some IR).

The article is talking about how to handle escaped characters in the source, like the literal `\n` you can see in `let hi = "Hello\nWorld\nfrom my compiler!"` using your average text editor (0x5c+0x6e in ASCII), and how to build an efficient string type that mostly borrows views of the in-memory version of the source file (which could be a pre-processed version of the raw file if you hate odd line breaks).


FWIW the Classic Mac line ending is \r. \n\r was used by Acorn, apparently.


Before UNIX had the bright idea to represent newlines as single LFs in disk files and then write a bunch of code in the kernel tty driver to convert LF to and from CRLF pairs (which is what ttys actually used), everybody else used a CRLF pair to indicate a line break. That's why almost all network protocols use CRLF, and not because Microsoft invented a time machine and travelled into the seventies to hinder UNIX adoption.


> everybody else used CRLF pair to indicate a linebreak

Almost everybody... https://en.wikipedia.org/wiki/Newline#Representation mentions a couple that apparently used LFCR.


Isn't the two byte escape sequence not shorter than the one byte newline?

I do inline replacement of escaped values all the time, because the sequence is always longer than the replacement.


And exponentially shorter with nested escapes.


Is it not a better approach, when already assuming semi-niche usage, to rely on more information about the string? e.g.

* Have some of your strings be guaranteed-memoized, then just pass an index into the global/thread-local array of memoized strings.

* In tight loops working on a few strings, ensure they are all short, then forego pointers altogether and just use the values.

etc.


I love how Rust removed SSO from its String and now every library reimplements it over and over
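
For reference, the idea those libraries reimplement, as a rough Rust sketch (real crates such as compact_str or smartstring pack the representation much more tightly): short strings live inline in the value, long ones spill to the heap.

    enum SmallString {
        Inline { len: u8, buf: [u8; 22] }, // short strings: no heap allocation
        Heap(String),                      // long strings: spill to the heap
    }

    impl SmallString {
        fn new(s: &str) -> Self {
            if s.len() <= 22 {
                let mut buf = [0u8; 22];
                buf[..s.len()].copy_from_slice(s.as_bytes());
                SmallString::Inline { len: s.len() as u8, buf }
            } else {
                SmallString::Heap(s.to_owned())
            }
        }

        fn as_str(&self) -> &str {
            match self {
                // The inline bytes are a full copy of a valid &str, so this never fails.
                SmallString::Inline { len, buf } => {
                    std::str::from_utf8(&buf[..*len as usize]).unwrap()
                }
                SmallString::Heap(s) => s,
            }
        }
    }

    fn main() {
        let short = SmallString::new("hi");            // stays inline
        let long = SmallString::new(&"x".repeat(100)); // heap allocated
        assert_eq!(short.as_str(), "hi");
        assert_eq!(long.as_str().len(), 100);
    }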


I feel most code in Rust falls into the premature optimisation bucket. To prototype/hack something, TypeScript is a much better choice. To then carefully optimise something, Rust is not high-level enough for me, and too restrictive in what you can do. The kind of high-level environment I am looking for here does not exist yet, so I guess people go for Rust because there is no better choice, but it feels like a dead end to me.


How 'high-level' are you hoping for here? Haskell? Rust has very high abstractions, but I guess the fundamentals are based around giving you control over low level stuff, which seems like a good balance to me. I'm not sure what specifically you'd hope to do which you can't do in Rust.


There is a lot of stuff you cannot do in Rust. For example reasoning about your code, which is the minimum I expect from my language. If you cannot do that, it's not high-level enough. Rust is like all other type-based languages: it restricts what you can do, offering you some guarantees in return. For me, the deal Rust is offering is just not that good. I don't like to live in somebody else's idea of allowed code.


You can't..."reason about your code" in Rust? Look, maybe there's a reasonable take in there somewhere, but the way you put it just makes it sound like gibberish.

It sounds like you're just hoping for either a dynamically typed language (eg. Python) or for a language which is fully type-inferred (eg. Haskell). Which way is it?


[flagged]


It doesn’t, and it’s still a pretty unintelligible answer.

> I don't like to live in somebody else's idea of allowed code.

Your program runs on a CPU; that is the definition of someone else’s idea of allowed code.


I don't see my programs limited to a CPU. Thanks for pointing out another limitation of Rust.

I am sorry that you cannot understand my answer. I am working on remedying this, unfortunately now is not the time for me to educate one person at a time. Hopefully later.


Are you ok? If you think you’re programming on the fabric of the universe and not a chip, then the likelihood is you’re experiencing a major psychiatric issue, which given the tone of the replies is a definite possibility.

If you’re not, then you’re programming for some kind of chip, which inherently and obviously limits your ability to express programs to those that are supported on that chip, be that via the specific instructions, architecture or performance of that chip.

Which has nothing to do with rust.


Thanks for asking! Yes, I am ok, I am just moving on a higher plane of reasoning than pretty much anyone else currently. Trying to give you access to that plane as well, hopefully soon.

If you express your programs in logic, first, your programs are not just programs anymore, they are mathematics. Mathematics is not bound to a particular CPU or chip. Sure, you can make it run on one (or not, that would depend on your program).

Now, this has a lot to do with Rust. Rust is a particularly restrictive form of expressing yourself. It basically prescribes a very particular form of mathematics, and in return, promises you certain theorems for free. Much like other type based programming languages, and even logics (although type-based logics are of course much more expressive than type-based programming languages).

Instead, it would be better to express your ideas freely in a logic, just constrained by consistency, and not much else. This is very much the idea of programming as theory building. Where can you build a theory better, than in a logic?

Then of course implementation details are important, such as speed, parallelism, memory use, etc. Rust's idea is to put these details front and center, especially memory use. That can work well, depending on your use case, but usually, it's just premature optimisation. Of course, if your thinking is restricted to Rust, you would not agree with that.


> Yes, I am ok, I am just moving on a higher plane of reasoning than pretty much anyone else currently

Real talk: anyone who talks like this isn’t in a good way. Please take that seriously, and take care of yourself.

I’m not sure engaging any further would be fruitful, but reason about the likelihood of yourself “operating on a higher plane of existence/seeing the inner truth/understanding that everyone else is blind/cracking the code” against simply having a manic episode of some kind.

Because what you’ve described above is just… programming, just stripped down to the abstract and ignoring all practicalities.

There isn’t a great truth in that, or a higher plane of existence, and the lack of your ability to clearly articulate the concepts combined with classic delusional catchphrases is really, really indicative of something else going on.

And the inability for others to understand your great discoveries is not evidence of your higher plane of existence, it’s because you’re spouting semi-delusional abstract nonsense without a clear train of thought or concise argument.


Again, thank you for your concern, but it is not me who is not properly engaging with arguments here. Sometimes it takes a clear mind to receive clear (and arguably pretty simple) arguments, and yours doesn't seem particularly clear.


> and yours doesn't seem particularly clear

How often do you find this to be the case, either online or in the real world?


Don't forget, the online world is part of the real world.


Are you able to answer the question, or are we to take avoidance as answer enough?


Not sure who "we" is? Are you thinking of yourself as multiple personalities? Which one shall I address?

On the other hand, I don't really care to address any of your personalities at this point.


I’d wager pretty often.

Here’s what’s likely to happen: you’ve got this big grand exciting idea that you just can’t express to everyone in words. The thoughts are there, but they just don’t come out into words how you want them to. People don’t get it.

But you need to share it with the world. You need to. So you produce something about your grand idea that you think is cohesive. Something to teach people!

Except it doesn’t. It’s equivalent to a mess of paper, handwritten notes and impossible to follow logic, presented terribly. And it’s likely long. Too long.

And so… nobody cares. Nobody reads it. But your grand idea is there! Don’t write programs, write logic! Duh! Why can’t people understand that? Why can’t they see?

So obviously those people who don’t get it have foggy brains. But your thoughts are as clear as possible. You’re elevated. They don’t get it because they can’t.

This superiority complex impacts your relationships, and you grow more introverted, isolate yourself more etc. This fuels itself into a nasty cycle.

You don’t have a grand idea and you’re not in a higher plane of existence. Coming to terms with that is going to suck, but it’s important to do so.

Honestly, really do take a moment to breathe and step back. And it’s ok. This all happens surprisingly often in our field, and you sound like a cookie cutter case. My email is in my profile, reach out if you need.

Good luck.


Mate, you are intense. I thought you were a troll, but it seems you've just had some bad experiences.


I’ve seen it happen to a few people and it is a shame.

If you’re just roleplaying a mentally ill person online then shame on you. If you’re not, best of luck.


Sorry to hear that. But you should probably just accept that the answer is not contained in either of the two options you offer.


> On the other hand, I don't really care to address any of your personalities at this point

You’re completely unable to not reply to this thread. You need the last word because it makes you superior, and you don’t want to concede to a foggy brained person who can’t hope to grasp your big idea.

So your next reply will probably just be something nonsensical and short, because you have to put something but there isn’t really anything to put. And if you don’t, it’s going to really irk you more than it should.

Best of luck. Email is in my profile if you wanna talk, otherwise I’m looking forward to seeing some evidence of your higher plane of existence posted here when it’s ready :)


Programming languages are restrictions. A 'program' is information passed to an interpreter, where the language describes what that interpreter will do with your program. That interpreter has restrictions, upon which the language may add further restrictions.

The only language with zero restrictions relative to the interpreter is the language that interpreter acts on, e.g. aarch64 machine code for a specific processor instance. And even then, the restrictions of that CPU are very much present.

What do you have in mind for an unrestricted language?


See the other thread (oops, it got flagged). Mainly, the right level of abstraction is that of a logic. Current logics are somewhat lacking (first-order logic doesn't have general binding, typed logics don't have a single mathematical universe), but I think I found the one to rule them all. Programming is then just a (arguably very important) special case of expressing yourself in this logic. Hopefully soon(ish) on a website near to you.



