Zig 0.9.0 (ziglang.org)
316 points by von_lohengramm on Dec 20, 2021 | 240 comments



Congratulations! And welcome, Isaac; I loved his Zig+Wayland screencast [1], where he explained very engagingly how Zig is able to embrace and extend a pretty hairy C library and make it more memory-safe, without forcing the implementor to rewrite it or painfully bend it to the strict or idiosyncratic rules of the language, as can happen with Rust or even Go.

I can't wait for 1.0! Or at least until the docs are a bit more accessible... :)

1: https://youtu.be/mwrA5IRGpfU


I don't know if string handling has been improved, but it's one of the few things stopping me from using Zig. I look forward to 1.0, which will hopefully have proper strings. [0]

[0]: https://github.com/ziglang/zig/issues/234


IMHO Zig's built-in string handling (or rather, the lack of it) is exactly right for a systems programming language. Zig avoids the biggest problem of C strings and treats strings as ptr/length slices, not as zero-terminated. UTF-8 for string literals is also fine.

That's all that's needed (and should be implemented) at the language level; everything else should go into the standard library and into additional specialized string-processing libraries (because string handling can never be a "one-size-fits-all" solution, it's too complex for that).


They can put it in the standard library or the core language or wherever they want, but they absolutely need to provide good string handling.

And regardless of where they put the code, this is something that needs to be done by the core team. Otherwise you will end up with too many string libraries, all of them trying to solve a particular problem and doing badly at everything else, with bad documentation and different API styles, making it difficult to mix code that uses two different libraries.


A lot of people reach for string handling when the actual correct thing to do is intentionally avoid string handling, and only handle strings as opaque encoded UTF-8 bytes, that cannot be reasoned about in terms of human language.

I would even argue that having string handling in a standard library (or language) has the potential to cause a net increase in bugs, because of people thinking they are handling strings when actually they are just screwing around with codepoints. Go's string handling is completely broken, for example. As a result of strings in the language, Go programs tend to be more broken than C programs in terms of string handling.


> A lot of people reach for string handling when the actual correct thing to do is intentionally avoid string handling, and only handle strings as opaque encoded UTF-8 bytes, that cannot be reasoned about in terms of human language.

I have lost count of how many times I have wanted to find substrings, transform cases, concatenate strings, find patterns, substitute patterns. I'd be happy to do that in a language that didn't permit me to index the underlying characters or the bytes. Keep 'em opaque, sure. But I think it would be a mistake for a language not to have an idiom with a favorite library to perform these operations on encoded text. If resolving library dependencies is easy enough, then it doesn't need to be "standard" but it should be "the de facto standard." And if it turns out the de facto standard stagnates and doesn't keep up with the needs of developers, a new one can come take its place.


Maybe Java or Python are better choices for your problems. The point of a systems programming language is to allow implementation of specific solutions to specific problems. There are many, many ways to do the things you mention here (and that starts with the strings' storage format and allocation strategy), so it's the right choice to not include _anything_ like that in the core language.


IMO, Swift is a language that gets it right, by exposing an interface that very closely matches how "humans" understand text: the default String interface is a collection of grapheme clusters, while `.utf8`, `.utf16`, and `.unicodeScalars` are optional views which expose the data explicitly encoded as needed.

Swift is pretty pedantically strict about Unicode correctness, and avoids some pitfalls which lead to incorrect handling (e.g., integer-based indexing). This means that the traditional way many developers might be used to interacting with strings is cumbersome and annoying — but once you get past the initial hurdle*, most operations are actually (1) really easy, especially when expressed in terms of generic Sequence/Collection operations, and (2) much more difficult to get wrong.

*I think the largest part of that hurdle is overcoming what you may have gotten used to from other languages, i.e., treating strings as an array of "characters", for some language/library definition of "character" (whether bytes, code points, etc.). It's relatively rare that you actually care about indexing into an arbitrary spot in a string: instead, combinations of slicing operations (including `prefix(_:)`, `dropFirst(_:)`, `prefix(while:)`, etc.) and generic Collection operations will get you what you want. Things like `.reversed()`, `.sorted()`, and `.shuffled()` all work trivially correctly too (since you're not operating on a bag of bytes), and it's exceedingly rare that user input will confound the operations you might need to perform. (Exception: operations like case folding and collation, which are locale-specific, need special handling through a framework like Foundation.)

To be clear: not everything is sunshine and roses, but an amazing amount of functionality "falls out" of basic protocol conformances on String, and its exposure as a Collection of grapheme clusters.

Given a specific string manipulation task, I'd be happy to provide an example of what it might look like in Swift!


> Things like `.reversed()`, `.sorted()`, and `.shuffled()` all work trivially correctly too (since you're not operating on a bag of bytes), and it's exceedingly rare that user input will confound the operations you might need to perform.

One thing to note: aside from programming interviews, these operations are fairly rare. And that's a good thing, because none of these produce results that are very intuitive, because they are not very well defined on strings in general (I don't fault Swift for this, it's just a general problem with text). Using any of these to create a new String may cause entirely new characters to show up, or the length of the text to change. So Swift actually doesn't expose these as "string" operations, but as operations on the characters themselves; in each case returning a new collection of characters that is not a String. Now, you can reconstitute them into a String pretty easily, but you should keep this in mind when doing so.


I use these operations on a daily basis, not in interviews. If you're ingesting third party data somehow (even if it's structured like JSON or XML) there are going to be cases where you need to hand roll a simple parser.


I’m genuinely very interested in the kind of data you’re dealing with. It’s common to need to reorder a list of strings, but I’ve personally never had any use case that requires me to reverse or sort _characters in a string_ to consume any kind of data (some kind of decoding, perhaps, but those tend to work better implemented at the byte array level, not string).


> Given a specific string manipulation task, I'd be happy to provide an example of what it might look like in Swift!

How would you safely get the nth index of a string, clamped to the valid indexes? So if the nth index is out-of-bounds you get the first/last index instead?


I don't think Swift has a clamping function in the standard library, unfortunately, so we'll have to roll one ourselves. One question remains of what should be done for the empty string; I've chosen to return nil for this case.

  extension String {
      func character(atClampedIndex index: Int) -> Character? {
          guard !self.isEmpty else {
              return nil
          }
          let clamped = max(0, min(index, count - 1))
          return self[self.index(startIndex, offsetBy: clamped)]
      }
  }


Or you get an Optional and you have to check whether there's an error or you got the character within the range? Sounds perfectly safe to me...


> the actual correct thing to do is intentionally avoid string handling,

That sounds nice and all, but we have 50+ years of protocols and formats and APIs built up around strings. Unless you're just writing code to run on a small microcontroller, you need to be able to parse and generate strings. So it's going to be pretty frustrating not to have good support for them, or to have every codebase use its own libraries and idioms for working with strings.


"Stringly typed" programming gave us SQL injections, shellshock, and the log4j debacle. Not to mention probably 99% of processor cycles being wasted on parsing and re-encoding again and again from/to text-based formats that are in no way actually human-readable without the use of specialized tools.

When have you last browsed the web using telnet? It's all plain text, so you should be able to, right? Press Ctrl-U right now and get a taste of how readable it is, and even that is with the benefit of syntax highlighting.

On the machine level, processing binary data is effortless. At most, you have to swap the bytes around to conform to wrong-endian network byte order. Check that the length of the data fits into your buffer. Not a problem with Zig, or any language that doesn't silently ignore integer overflow as C does!

Scripting languages make handling strings seem easy, because that's what they were built for. And of course in most languages there are "mature" libraries for parsing JSON or XML. But all the fundamental complexity is still there. Each layer of abstraction may introduce some bug that can be exploited. Compared to classical buffer overflows, it may take several more steps to gain arbitrary code execution, but every bit of extra code you depend on still increases the attack surface.

Ideally, only user interface code should be dealing with strings at all, but legacy protocols may unfortunately require it too. This should be isolated as much as possible from the rest of any application, and not dictate what features are "first class" in a programming language.


That's right, my suggestion was clearly that we should replace all datatypes with strings. This comment brought to you by TCL/Tk.


Protocols and formats absolutely should not require decoding strings. I think you are mistaken. Can you name any well-established protocol or format that does not treat strings as opaque encoded bytes?

Edit: so far these examples have been given:

* HTTP: wrong. the spec does not tell you to decode any strings

* CSV: wrong. the spec does not tell you to decode any strings, nor is it necessary to have any unicode awareness in order to properly read and parse the data or deal with the delimiters.

Additional request: if you attempt to provide a counter-example, please also point to the place in the spec where it tells you to decode a string.


JSON[1], for instance, specifies:

1. "JSON syntax describes a sequence of Unicode code points. JSON also depends on Unicode in the hex numbers used in the \u escapement notation."

2. "A JSON text is a sequence of tokens formed from Unicode code points that conforms to the JSON value grammar."

3. "A string is a sequence of Unicode code points wrapped with quotation marks (U+0022). All code points may be placed within the quotation marks except for the code points that must be escaped: quotation mark (U+0022), reverse solidus (U+005C), and the control characters U+0000 to U+001F. There are two-character escape sequence representations of some characters."

4. "Any code point may be represented as a hexadecimal escape sequence. The meaning of such a hexadecimal number is determined by ISO/IEC 10646. If the code point is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the code point."

5. "Note that the JSON grammar permits code points for which Unicode does not currently provide character assignments."

JSON does require Unicode awareness, both for general parsing, and for correctly interpreting strings. Backslashes are allowed for special escape characters, which means that you must be aware of the format (and the encoding of the text) in order to be able to decode.

Note that JSON also doesn't specify a required encoding, only Unicode correctness, so a parser may need to be able to handle multiple Unicode encodings and differentiate between them.

The spec doesn't specify what to do with code points which are not understood as Unicode (given especially the allowance for unassigned characters), but explicitly-invalid Unicode should be rejected.

[1] https://www.ecma-international.org/wp-content/uploads/ECMA-4...


Parsing this out of utf-8 encoding requires no knowledge of unicode or even utf-8. All of the relevant characters (reverse solidus, quotation mark, and control characters) are single byte characters in the ascii subset. These characters cannot be found inside multi-byte characters in utf-8 due to the design of the encoding. Converting the unicode character escape codes to utf-8 would require knowledge of utf-8 encoding, but this unescaping is not a feature that would be provided by the language regardless.
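To make that concrete, here is a rough Go sketch (the function name and test input are made up for illustration) of finding the end of a JSON string by looking at single bytes only; '"' and '\' can never appear inside a multi-byte UTF-8 sequence, so no decoding is needed:

  package main

  import "fmt"

  // endOfJSONString returns the index of the closing quote of the JSON
  // string whose opening quote is at buf[start], or -1 if there is none.
  // Only single bytes are inspected: '"' (0x22) and '\' (0x5C) never occur
  // inside a multi-byte UTF-8 sequence.
  func endOfJSONString(buf []byte, start int) int {
      for i := start + 1; i < len(buf); i++ {
          switch buf[i] {
          case '\\':
              i++ // skip the escaped byte
          case '"':
              return i
          }
      }
      return -1
  }

  func main() {
      data := []byte(`{"name":"naïve 😀","x":1}`)
      fmt.Println(endOfJSONString(data, 8)) // byte index of the quote closing the value
  }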


> Parsing this out of utf-8 encoding requires no knowledge of unicode or even utf-8.

If you have valid UTF-8 already, then yes, the task is a lot easier. But depending on the level at which you're parsing, this might not be the case — i.e., if you're writing a JSON parser from the ground up, you do need to know what UTF-8 and Unicode are, and will need to validate the input data.

> Converting the unicode character escape codes to utf-8 would require knowledge of utf-8 encoding

Agreed. Even if you're not working at the "array-of-bytes" level, you will need to be able to parse and translate "\u..."-style strings into the appropriate output character encoding.

> but this unescaping is not a feature that would be provided by the language regardless.

I'm not sure we're talking about this being handled at the language level. This translation is something that would likely be offered at the parser level (working with the features offered by the standard library), but the parser does need to know about it — and does need to be able to work with strings at a granular level to be able to parse it out. By definition, it cannot leave the input data as an undecoded bag of bytes.

Note, too, that the JSON spec does not specifically require UTF-8. UTF-16 is a completely valid encoding for JSON (though much less common than UTF-8), in which case none of these characters are an ASCII subset, and greater awareness is needed to be able to handle this.
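As an illustration of that translation step, here is a minimal Go sketch (the helper name is made up, and surrogate pairs, which a real parser must also handle, are deliberately ignored) that turns the four hex digits of a "\uXXXX" escape into UTF-8 bytes:

  package main

  import (
      "fmt"
      "strconv"
      "unicode/utf8"
  )

  // decodeUnicodeEscape converts the four hex digits of a JSON "\uXXXX"
  // escape into UTF-8 bytes. Surrogate pairs (code points above U+FFFF)
  // are left out to keep the sketch short.
  func decodeUnicodeEscape(hex string) ([]byte, error) {
      n, err := strconv.ParseUint(hex, 16, 32)
      if err != nil {
          return nil, err
      }
      buf := make([]byte, 4)
      size := utf8.EncodeRune(buf, rune(n))
      return buf[:size], nil
  }

  func main() {
      b, _ := decodeUnicodeEscape("00e9") // "\u00e9"
      fmt.Printf("% x\n", b)              // c3 a9, i.e. "é" in UTF-8
  }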


> it cannot leave the input data as an undecoded bag of bytes

But all it's doing here is taking a hex string (which is entirely ASCII) and converting it into the code point it represents. Since ASCII translates unambiguously to bytes, it doesn't really matter if `str[0]` is operating on a byte stream, codepoint stream or grapheme stream, because in utf8, they're all the same thing as long as we're within the ASCII range.

Where things get hairy is stuff like `str.reverse()` over arbitrary strings that may or may not be in ASCII. This repo[0] talks about some of the challenges associated with conflating characters with either bytes or codepoints. The problem is that programming languages often approach strings from the wrong angle: you can't just tack on handling of multi-byte codepoints on top of ascii handling; you lose O(1) random access and you don't actually model the linguistic domain properly by doing so, because in the first place, humans think of characters not in terms of bytes or codepoints, but in terms of grapheme clusters. Clustering correctness falls deep in the realm of linguistics, and is therefore arguably more suitable to be handled by a library than a programming language.

[0] https://github.com/mathiasbynens/esrever


I agree entirely with your second paragraph, but regarding this:

> hex string (which is entirely ASCII)

My point is that JSON doesn't need to be UTF-8 or a superset of ASCII to be valid. It can be any representation of Unicode, including UTF-16, UTF-32, GB 18030, etc.; so long as the text is composed of Unicode code points in some Unicode transformation format, the JSON is valid.

As I said in the parent comment: if you are working within UTF-8 exclusively, and can assume valid UTF-8, then great! But this isn't necessarily true, and in some cases, you will still need to care about the encoding.

(Either way, this starts straying slightly from the more general discussion at hand: regardless of the encoding of the string, you will still need an ergonomic way of interacting with the contents of the data in order to meaningfully parse the contents — even past the hurdle of decoding from arbitrary bytes, you still need to manipulate the data reasonably. In some cases, this means working with a buffer of bytes; in others, it makes sense to manipulate the data as a string... In which case, you may run into some of the string manipulation ergonomic considerations being discussed around these comments.)


> JSON doesn't need to be UTF-8 or a superset of ASCII to be valid. It can be any representation of Unicode, including UTF-16, UTF-32, GB 18030, etc

Sure, it can also be gzipped, encrypted, etc but that goes back to the point that there's nothing inherently special about JSON as it relates to encoding to a byte stream. All there is to it is that somewhere in a program there's an encode/decode contract to extract meaning out of the byte stream, and in a protocol one most likely only looks at byte streams as sequences of bytes (because performance-wise, it doesn't make sense to look at payload size in terms of number of codepoints/graphemes at a protocol level)


Anything with case insensitivity, like domain names and email addresses, requires knowledge of strings as strings.

Case sensitivity, ordering, and case-changing rules are reasons not only to have strings, but to have extensive culture support built into them. String types also let you compare utf16 bytes to utf8 bytes, for example.

To think that strings are opaque bytes is massively naive. It presumes that one encoding exists and that the input is always valid, in addition to what is said above.

An example of what you're setting yourself up for is an exploit based around differences in handling invalid utf8.


> * CSV: wrong. the spec does not tell you to decode any strings, nor is it necessary to have any unicode awareness in order to properly read and parse the data or deal with the delimiters.

CSV is defined in terms of characters, not bytes, so CSV parsing does require you to be encoding-aware: if your CSV file is encoded as UTF-16, bytewise parsing will destroy the data.


That's not really true at all. The single byte / single character comma separators are all that matters there. As long as you directly acknowledge / exactly replicate whatever blob of data happens to be in between each set of commas (even if it is nonsense garbage text) then you're correctly parsing the CSV.


Well let's try :grinningface:,:grinningface: which is d83d de00 002c d83d de00 in UTF-16BE. If you extract the first column by naively cutting the blob bytewise before the comma, you end up with d83d de00 00 with an extra NUL byte, which is a problem. With UTF-16LE you'd prepend a NUL, which is even worse.
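A short Go sketch of exactly that failure mode (assuming UTF-16BE as in the example; the helper name is just for illustration): splitting the encoded bytes on the single byte 0x2c leaves a stray NUL on the first field.

  package main

  import (
      "bytes"
      "encoding/binary"
      "fmt"
      "unicode/utf16"
  )

  // utf16be encodes a string as UTF-16 big-endian bytes.
  func utf16be(s string) []byte {
      units := utf16.Encode([]rune(s))
      out := make([]byte, 2*len(units))
      for i, u := range units {
          binary.BigEndian.PutUint16(out[2*i:], u)
      }
      return out
  }

  func main() {
      row := utf16be("😀,😀")
      fmt.Printf("% x\n", row) // d8 3d de 00 00 2c d8 3d de 00

      // Naive bytewise "CSV" split on 0x2c:
      fields := bytes.Split(row, []byte{','})
      fmt.Printf("% x\n", fields[0]) // d8 3d de 00 00 -- trailing NUL byte
  }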


> cutting the blob bytewise

That doesn't make sense. If you're working with utf16, why would you slice bytewise? That's like slicing a zip file bytewise and wondering why it got corrupted.

The whole point of the argument for string support at a library level (rather than assuming some sort of equivalence between stringness and its underlying byte buffer at the language level) is that fixed width bytes fundamentally cannot model human language characters unambiguously because what a "character" is depends on the encoding/decoding contract of the program manipulating the byte buffer.

Assuming equivalence between 0x2C and `,` stems from the ancient history of ASCII, English, and the usage of C `char` as a mechanism to squeeze performance out of string operations by not properly supporting the full gamut of valid human language characters.

For a low level language that might be used to implement protocols, it totally makes sense that foo.len is the length in bytes, because you're pretty much never going to want to know the number of grapheme clusters at a protocol level. It doesn't make sense for a language-level .len to be a length in terms of codepoint count, because that assumes an encoding, which is fundamentally a business-logic-level concern.


The entire matter in question is whether CSV is encoding-independent, operating on bytes (we’re addressing AndyKelley’s comment). The answer clearly demonstrated here is: no, CSV is operating upon characters, not bytes, so you need to decode the Unicode first and let the CSV operate on Unicode data, so that it’s splitting on U+002C, rather than 0x2C in the byte stream before Unicode decoding which destroys the data.


I think you somewhat misunderstood the manner I was suggesting you'd be approaching things there, though nothing you've said is incorrect.


Then I’m not sure what you were suggesting. The crux is that CSV operates upon strings, not bytes.


The problem is you've invented your own definition of the terms "string" and "string handling" which doesn't seem to agree with the generally accepted usage of these words. You're fixated on the word "decode" despite it not being the original focus of the discussion, nor being present in my comment or the original comment you're replying to.

I could create a standard library called "arrayofbytes" that lets you, for example:

- Search an array of bytes for a smaller array of bytes

- Split up an array of bytes based on an array of delimiter bytes

- Selectively convert an array of bytes in range 41-5A to bytes in the range 61-7A

But wouldn't it be more appropriate to describe these as "string handling" functions using the generally accepted terminology of our profession?


In Zig, those functions are in std.mem (distinct from std.ascii and std.unicode)


By "decode a string" do you mean that you must apply some sort of Unicode transformation to it? If so, the HFS+ file system, which .dmg files often are, is an example: Unicode NFD normalization must be performed on all filenames because of case insensitivity requirements [1].

[1]: https://eclecticlight.co/2021/05/08/explainer-unicode-normal...


The HTTP specification does tell you to behave differently depending on the presence or absence of specific values for case-insensitive headers. That requires decoding the header name strings.


>That requires decoding the header name strings.

Which should be ascii though no?


Maybe, but it's not "treating strings as opaque encoded bytes" which is what was asked.


ASCII is kinda special in that it's both a way of encoding strings and it has character-to-byte equivalence. You can specify an HTTP header as being any number of characters between `\r\n` sequences or any number of bytes between `0x0D 0x0A`, and the parsing code is going to compile to pretty much the same thing regardless of whether you know what `0x0D 0x0A` means or not.
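For illustration, a minimal Go sketch of that equivalence: ASCII-only case-insensitive comparison of header names done purely on bytes (header names are restricted to ASCII token characters, so no Unicode case folding is involved; the helper name is made up for the example).

  package main

  import "fmt"

  // asciiEqualFold reports whether a and b are equal under ASCII-only case
  // folding ('A'..'Z' treated as 'a'..'z'). No decoding of any kind happens.
  func asciiEqualFold(a, b []byte) bool {
      if len(a) != len(b) {
          return false
      }
      for i := range a {
          ca, cb := a[i], b[i]
          if 'A' <= ca && ca <= 'Z' {
              ca += 'a' - 'A'
          }
          if 'A' <= cb && cb <= 'Z' {
              cb += 'a' - 'A'
          }
          if ca != cb {
              return false
          }
      }
      return true
  }

  func main() {
      fmt.Println(asciiEqualFold([]byte("Content-Length"), []byte("content-length"))) // true
  }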


Right, but to do case-insensitive comparisons of header names you need to decode them as strings, not just byte sequences.


What???

> the spec does not tell you to decode any strings

Which "spec" is that? My user ask for csv, see csv, upload csv, do SEARCH on CSV, transform them, etc. Even do some scripting on them.

Nowhere, EVER "opaque encoded bytes" is on the spec of the users!


HTTP for one. It does something like assume ASCII-like until it encounters odd looking bytes and/or a meta encoding tag.


I think HTML tags like 'meta' are merely payload to HTTP, right? Presumably a markup language doesn't fit Andrew's criteria for "protocol"?


Protobuf string fields are expected to be in UTF-8. Although I'd expect a sane implementation of a protobuf decoder to throw an exception or otherwise indicate an error if it receives a malformed protobuf encoding.


Protobuf decoders are expected to validate UTF-8 strings for syntax="proto3" files, but not syntax="proto2". The behavior diverges mostly for historical reasons.

This is a validation pass only and it doesn't make any meaning of the code points, except to validate that none of them are surrogate code points (disallowed in UTF-8).


The tar / star format requires decoding strings of octal numbers.

https://en.wikipedia.org/wiki/Tar_(computing)#Header


CSV?


I’m not aware of any significant breakage due to Go’s handling of strings. Where would I read more about this?


Go presents strings as slices of bytes (chars). As long as it's ASCII, it's fine. However, when dealing with multi-byte UTF-8 characters (codepoints), the proper unit is the rune. So, before attempting to measure a string's length, or read its n-th character, one must remember to access the string as a slice of runes.

  s := "naïve"
  
  // bad
  fmt.Println(len(s)) // 6
  fmt.Println(string(s[2])) // Ã
  
  // good
  r := []rune(s)
  fmt.Println(len(r)) // 5
  fmt.Println(string(r[2])) // ï
https://go.dev/play/p/YbMo49wU7vu


Yes, but I think most Go developers are aware of Go's quirks, so I'm wondering what bugs happen despite that awareness.


Yeah strings are ugly and programmatically impure -- they are an extension of human language after all -- but virtually all human facing apps use tons of them for obvious reasons. Same goes for regex.

IMO they are a great example of how Golang is a pragmatic rather than "clever" or "pure" programming language.


By not treating strings specially, you get people screwing around with individual bytes inside codepoints which leads to at least as many bugs.


The UTF-8 encoding is designed so that this is usually not a problem. If you do a search in a utf-8 encoded byte array for an ascii character, for example, you can never get a false positive. Multi-byte UTF-8 sequences always have the most significant bit set in each component byte, and ascii characters always have it unset. Additionally, treating the string as an array of unicode codepoints doesn't solve the problem -- now you have people screwing around with individual codepoints inside grapheme clusters :P
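A tiny Go example of that property (the strings are arbitrary): searching UTF-8 bytes for an ASCII delimiter can only ever hit a real occurrence of that character, because every byte of a multi-byte sequence has its high bit set.

  package main

  import (
      "bytes"
      "fmt"
  )

  func main() {
      // "é" is c3 a9 and "😀" is f0 9f 98 80 in UTF-8; none of those bytes
      // can ever equal an ASCII byte such as ',' (0x2c).
      s := []byte("café,😀,end")
      fmt.Println(bytes.IndexByte(s, ','))            // 5, the first real comma
      fmt.Printf("%q\n", bytes.Split(s, []byte{','})) // ["café" "😀" "end"]
  }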


> Additionally, treating the string as an array of unicode codepoints

I suggested no such thing.

> individual codepoints inside grapheme clusters

That's less severe than invalid codepoints.

Perhaps the whole thing, whichever way it is represented, should not be mutable, given that there's no way to make it mutable in a sensible way?


I don't think I've ever had a problem with Python3's string handling, which is very robust.


    >>> "ñ"[0]
    'n'
    >>> "ñ"[1]
    '̃'


Huh, which version is that? Python 3.9 on my system:

  >>> "ñ"[0]
  'ñ'
  >>> "ñ"[1]
  IndexError: string index out of range
Which is what I would expect.


This is a normalization issue, not a version issue.

  >>> import unicodedata
  >>> list(unicodedata.normalize('NFD', "ñ"))
  ['n', '̃']
  >>> list(unicodedata.normalize('NFC', "ñ"))
  ['ñ']

Both are correct; the issue is that Unicode allows accented letters to be written as _accented_letter_ or as _letter_, _accent_. The idea of "character" in Unicode is not very useful; most of the time you will want graphemes, not codepoints. User-friendliness-wise, graphemes are what Python should use (another rant: strings should not have a length method, they should have byte_length, codepoint_length and grapheme_length).


Python 3.9.9 (main, Nov 20 2021, 21:30:06) [GCC 11.1.0] on linux


EDIT: I can't seem to replicate this on my Ubuntu system. Strange.

Ah, interesting, I suppose then there is a difference in the way Linux handles strings? Didn't realize that, very unfortunate if true. I am running MacOS.


Actually, the browser normalized the string, and it works for me also. Here is the original byte sequence:

    >>> b'n\xcc\x83'.decode()
    'ñ'
    >>> b'n\xcc\x83'.decode()[0]
    'n'
    >>> b'n\xcc\x83'.decode()[1]
    '̃'
But I agree, it's a rare case when you need to deal with non-normalized data.


As a counterpoint, coming primarily from experience with higher-level languages: not being able to distinguish strings from just any blob of bytes is something that's consistently put me off lower-level languages. Sure you can just treat it as a blob of bytes but… you can't do anything with it with certainty. That seems to be the class of bugs you're describing? But this isn't an issue in languages where a string is a distinct type (even if dynamic).


Can you elaborate on why you think Go's string handling is broken?


> Go's string handling is completely broken, for example.

Classic Andy, shitting on other languages with no references or examples. Go has some of the best string handling I've used. Seamless byte, rune, and string conversion. Simple iterating and slicing. Plus helpful tools like strings.Builder and strconv.AppendInt, while Zig has nothing.


Maybe Andrew is referring to this? I'm just guessing, but that is the only thing I can think of which could be considered broken. Go mostly just treats strings as a bag of bytes, and you have to use one of the unicode packages to actually do any significant work with them.

https://play.golang.com/p/Dla3sXciYXC

I also think string handling in Go is pretty sane. But range vs indexing on strings is something you need to be aware of.


I don't see anything wrong in that example. As you said, golang has unicode/utf8, as well as the utf8string package. Is anyone really suggesting that the default slice should be on graphemes?


That's one of relatively few things that language-level utf-8 support could get you over utf-8 support in the standard library.


Zig certainly has things like strings.Builder and strconv.AppendInt, which aren't really related to language-level encoding-aware string support.


No it doesn't


strings are called []u8, strings.Builder is called std.ArrayList(u8), strconv.AppendInt is called std.fmt.formatInt. Are you trolling?


Those don't do the same thing at all. strings.Builder can append a byte, rune, string, or byte slice:

https://godocs.io/strings#Builder

ArrayList can't do that. And strconv.AppendInt can convert a number to a byte slice, then it appends to an existing byte slice. formatInt can't do that.
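For reference, a small sketch of the Go usage being described here (illustrating only the Go side of the comparison):

  package main

  import (
      "fmt"
      "strconv"
      "strings"
  )

  func main() {
      // strings.Builder accepts bytes, runes, strings and byte slices.
      var b strings.Builder
      b.WriteByte('[')
      b.WriteRune('π')
      b.WriteString(", ")
      b.Write([]byte("etc"))
      b.WriteByte(']')
      fmt.Println(b.String()) // [π, etc]

      // strconv.AppendInt formats an integer and appends it to an
      // existing byte slice.
      buf := []byte("answer=")
      buf = strconv.AppendInt(buf, 42, 10)
      fmt.Println(string(buf)) // answer=42
  }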


ArrayList(u8) exposes a writer. You can write things to writers in various ways. Probably more than four ways!

"appends to an existing byte slice" is not a completely coherent thing to ask for. If I give you a byte slice, the byte after the end it might belong to something else. If you want to deallocate my byte slice and make a new, replacement byte slice, you'll need an allocator to do that. An ArrayList(u8) has a byte slice, and knows how much of it is unused, and has an allocator so it can make a new larger byte slice if needed, and exposes a writer so that you can call std.format.formatInt to write into the byte slice (or allocate a new byte slice, copy the entire contents, and write into the new one, if appropriate).

> strconv.AppendInt can convert a number to a byte slice, then it appends to an existing byte slice. formatInt can't do that.

As we have learned just now, we actually can use formatInt to append a formatted int to an existing byte slice (to the extent that that's a meaningful thing to ask for), by passing it the writer exposed by an ArrayList(u8)!

Is your complaint that you're not aware of any function in Zig that takes a u32 representing a Unicode codepoint and encodes it to UTF-8 and writes it to a writer (which is what "appending a rune" seems to mean)?

Is your complaint that people do not, by convention, allocate new, larger byte slices without an explicit reference to an allocator?

Is your complaint that Zig does not have a garbage collector?

Can you share a Go program that uses the Go stdlib functions and cannot be trivially ported to use the Zig stdlib functions instead, for some reason other than that Zig programs must decide where the bytes will live, and Go programs need not do that?


> Can you share a Go program that uses the Go stdlib functions and cannot be trivially ported to use the Zig stdlib functions

Heres an easy one:

https://go.dev/play/p/JFIvQYTF3R8


Please let me know the ways this does not do the right thing. Thanks.

https://www.godbolt.org/z/Y3z8rhzrq


I'm not sure how a 50-line program is better than a 5-line program, but OK. Also it seems you had to write functions to get the same result.

My original comment was that Andrew has a habit of shitting on other languages without proper references or examples. Nothing you can really say is going to change Andrew's behavior, so maybe you should stop, unless you can justify Andrew's comments.


Your original program, which was claimed to be not possible to port, is 14 LOC. This comparable program is 18 LOC: https://www.godbolt.org/z/xf4Tdr1Ps

Most of the difference comes from the demand that we represent unicode codepoints as integers at some point in the program, which is a nonsensical thing to do, because unicode codepoints don't correspond to anything useful in the actual text being represented.

You seem to have a habit of making false claims about the standard libraries of languages you dislike, and when pressed on the matter ask other people to do your homework. I certainly should stop doing other people's homework.


You seem to have a habit of getting off topic. This thread was never about Zig, it was about Andrew shitting on other languages without justification. I think you can agree that for the basic example I gave, the Go language is easier, more streamlined, more comfortable to work with. It allows you to get the task done quickly.

So for Andrew to shit on Go string handling with not a single example is rude, and frankly just wrong as I have demonstrated.


Alright, I guess your objection is 100% about the function that accepts a 32-bit integer and encodes it as UTF-8 then.


It seems like every language must have some dumb decisions made because the authors don't care/think they know better/are stubborn/etc or because it painted itself into a hole from the start, without which it would be close to perfect for its use cases.

Python has the GIL.

Go itself has a few such cases (usually revolving around "NIH" and misguided simplicity).

Zig has the prejudice about proper string handling.


> Python has the GIL.

You are welcome to show a fast single-core Python interpreter without the GIL. The core team will gladly accept your patches.


That's how I understand "painted itself into a hole from the start" from the GP - removing GIL at this point is very hard because virtually all of Python language and code has been created in a world with the GIL.


It's nice to have the atomicity guarantees of a GIL'd implementation. For me, the current CPython semantics are a golden middle: you have a pretty fast interpreter without race conditions (for a huge part of user code), due to the GIL.

I always like to ask for what kind of task one needs GILless python?

CPU bound? If you are using native python code for CPU bound tasks you are already in a bad place. C extensions can release GIL. For example numpy.

What else?


> I always like to ask for what kind of task one needs GILless python?

> CPU bound? If you are using native python code for CPU bound tasks you are already in a bad place.

While I think you may have misread my comment as a value judgement on Python's GIL, I don't see it as particularly useful to dismiss major potential (multithreaded) performance gains for a slow language just because there are faster languages. Languages that are better in some way - like speed - should stand as a benchmark and a goal, not a reason to give up on improvements.


I'm not against a faster interpreter either. It's history now, but IMHO a super-slow GIL-less Python with Java-esque data races in the 1990s would not have gained such a broad community and ecosystem.


Anything with a hard or soft real-time constraint (audio, video, industrial automation, etc). You'd be shocked how many algorithmic prototypes are written in numpy then hand converted to C++ for realization.


NumPy releases the GIL. I love NumPy myself and have created complex interactive realtime synths. It was always more convenient to slap in Cython or C code to achieve the desired 100x performance instead of getting a potential 10x (4x on my laptop) with multithreading.


u8 slices strike me as exactly the correct thing on the language level as well. In particular, they allow very straightforward zero-copy parsing/tokenization of strings. Most parsing of strings doesn't really need to know about UTF-8 and works correctly in the presence of non-ASCII codepoints.

String literals are not really UTF-8. Rather, zig source files are UTF-8 (by definition), and string literals are u8 literals; putting some UTF-8 between quotes just puts the literal UTF-8-encoded text into the literal, because those are the bytes that are in the source file. In particular, string literals can contain arbitrary binary data and null bytes by way of \x00 escapes.

For Unicode handling there's some basic stuff in std.unicode (conversion between different UTF encodings, checking validity, decoding to codepoints etc.). This is used e.g. on Windows for checking filesystem paths. I don't think general libraries of encodings are really that important today; iso-8859 and Shift JIS might be useful sometimes, but everything else probably doesn't need to bloat up a standard library (iirc Python's codecs package, which contains dozens upon dozens of encodings, is like a third of the standard library by size).


Zig already ships a bunch of libcs so I don't think there's much concern about bloating the standard library.


The issue with treating strings as slices/arrays is that it exposes (writably!) the underlying representation, which is almost never what you want and leads to subtle bugs.

The downside is of course the fact that you have to account for encodings in your language now, but picking the one and only sensible encoding really shouldn't be a problem in 2021.


It's not that simple. Obviously, you'd choose UTF-8, but that exposes another issue: a codepoint is not a character. There's no simple & elegant solution to this at the language level.


> a codepoint is not a character

Precisely. That's why I think representing strings as character slices and letting third party libraries handle them is not a good solution.

Deciding what features to put in the language vs. the stdlib vs. third party libraries is one of the hardest parts of language design. Personally I believe strings are important and frequent enough they deserve special treatment at the language level.

Edit (late addition):

I think not treating strings specially is mostly fine in C, but Zig seems to aim at being a little less low-level.


> representing strings as character slices

Zig does not do this. It represents strings as byte slices. A UTF-8 character could be multiple codepoints with each codepoint being multiple bytes.

A big thing about Zig is not hiding complexity. If UTF-8 were implemented at a language level (whatever that means), then "language level" string operations would be non-linear, which would be very non-Ziggy. I could see value in a standard library UTF-8 implementation, but a LOT of forethought would need to be put into it. I think keeping UTF-8 string manipulation at the third-party library level is a good choice for now. Maybe once the language is finalized, the ecosystem is more developed, and lessons have been learned from the third-party libraries, then the standard library can implement this.


> character slices

Sorry, byte slices is what I had in mind.

I'm not talking about language level string operations as in concatenation with + or something like that at all, because that certainly wouldn't make sense in language like Zig.

I'm not advocating for string functionality in the language, I'm advocating for a way to not allow byte slice functionality on a thing that is clearly not a byte slice.


> a way to not allow byte slice functionality on a thing that is clearly not a byte slice

This already exists in the form of structs or opaque types. Both of these approaches would end up being implemented in "userspace" anyways, whether that's standard library or third-party.

However, (UTF-8) strings are byte slices. You can do simple manipulation with them as byte slices safely and validly. Split on spaces? Sure. Tokenize? Sure. Find substring? Sure. You can't do things that depend on say UTF-8 graphemes, but you can safely do most things that depend on bytes. For most purposes, treating strings as byte slices is the safest and correct approach.
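A minimal Go illustration of those operations working on UTF-8 data as plain bytes (the strings are arbitrary): splitting on an ASCII space byte and byte-subsequence search both leave multi-byte characters intact.

  package main

  import (
      "bytes"
      "fmt"
  )

  func main() {
      s := []byte("zig naïve 😀 systems")

      // Splitting on the ASCII space byte leaves multi-byte characters intact.
      fmt.Printf("%q\n", bytes.Split(s, []byte(" "))) // ["zig" "naïve" "😀" "systems"]

      // Substring search is plain byte-subsequence search.
      fmt.Println(bytes.Contains(s, []byte("naïve"))) // true
  }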


Doing find substring by find byte subsequence won't behave correctly in many cases, where semantically equivalent strings have multiple different bytesequence representation. Treating strings as byte slices exposes a lot of footguns; it shouldn't be easy just as e.g. treating floating-point numbers as byte sequences shouldn't be easy.


Technically the shortest UTF-8 representation is _the_ representation and _correctly_normalized_ Unicode is uniquely represented, but fair enough. The unknown input may be slightly malformed. Complexities like this are why one shouldn't underestimate the nuances (and runtime costs!) of implementing proper Unicode. As for representing byte sequences as byte sequences, that is the most basic way to represent strings of text without placing any assumptions on them. It's the assumption of potentially incorrect invariants that's the issue. If you have the faculties to handle Unicode correctly (and very few languages do), then using something more opaque may be a better fit than a byte slice.


> Technically the shortest UTF-8 representation is _the_ representation and _correctly_normalized_ Unicode is uniquely represented

Not necessarily the shortest (NFC means not using composed characters from later revisions of the standard), and you only get a normalised representation if you've actually normalised it - if you've just accepted and maybe validated some UTF-8 from outside then it probably won't be in normalized form. IMO it's worth having separate types for unicode strings and normalized unicode strings, and maybe the latter should expose more of the codepoint sequence representation, but I don't know if any language implements that.


> it shouldn't be easy just as e.g. treating floating-point numbers as byte sequences shouldn't be easy.

That's a nice analogy.

> Doing find substring by find byte subsequence won't behave correctly in many cases, where semantically equivalent strings have multiple different bytesequence representation.

Unfortunately that's nearly impossible to do sanely in the general case, no matter how the string is represented.


I'm curious, what would be a good reason why treating floating point numbers as byte sequences should be any harder than what is required to make it obvious (provided their binary format is well defined)?


There are footguns in making that representation easy to access, e.g. if you try to hash the byte sequence to use floats as hash table keys then it will almost work but you'll get a very subtle error because 0 and -0 will hash differently. And frankly most of the things you'd do with the byte sequence are things that there are more semantically correct ways to do. There should be a way to access that representation but it shouldn't be something you'd stumble into doing accidentally, IMO.


You are talking about what stringy things can be done with byte slices and I'm talking about all the byteslicy things that shouldn't be done with strings.

Like subslicing. And accessing individual bytes in it.


>Zig does not do this. It represents strings as byte slices.

That's what the parent meant. Char(aracter) is a byte in C.


So let's have everybody write their own incompatible libs and partial solutions... It worked for C /s


It seems they don't want unicode strings as part of the language, at best as a library. But unless there's a single sanctioned one, this will not end well.

And this is met with answers like "just avoid string handling" from the language designers...

It's probably because they don't work in any related domain, and even less so have to do with international strings (except as byte buckets they don't care about and don't have to do anything with).


No, I believe it's the right decision and I respect it. Encoding charsets are continuously evolving and shouldn't be baked into a language specification.


Evolving how? I'm not aware of any reason to move beyond UTF8 for encoding Unicode.


Which version of UTF8?


When adopting a new niche language, with hardly any following, and frequent changes, and not even an 1.0, like Zig, "which version of UTF8" (as if that's an issue) is the least of your worries...

"Which third-party strings lib of several half-complete incompatible libs" will be a much realer concern...


For systems programmers the answer to "which third-party strings lib" is probably "None, write your own that fits with the rest of the system". A ready-made lib will be a lot of work to fit in - consider choice of internal encoding, allocation, hashing, buffering, mutable operations, etc....

Assuming that you really want to use UTF-8 internally, which is probably a sensible choice, the reusable part of a string library is basically the UTF-8 encoder/decoder. A useful implementation of UTF-8 is about 100-200 lines; I could probably rewrite what I use in an hour or two without an internet connection. The rest of the work is integration stuff that doesn't make sense to put in a library, IMO. The idea of a string library fits much better with garbage-collected and scripting languages (which includes C++ with its RAII mechanism, but consider that std::string and similar often cause bad performance).

Many programs, in particular non-graphical programs don't need any UTF-8 code at all - UTF-8 handling is basically memcpy().
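Just to illustrate how small the reusable part is, in Go that part ships in the standard library's unicode/utf8 package, and a code-point iteration loop is a few lines (a sketch, not a recommendation to decode when you don't need to):

  package main

  import (
      "fmt"
      "unicode/utf8"
  )

  func main() {
      data := []byte("naïve 😀")

      // Decode code points one at a time. Invalid bytes come back as
      // utf8.RuneError with size 1, so the loop always makes progress.
      for i := 0; i < len(data); {
          r, size := utf8.DecodeRune(data[i:])
          fmt.Printf("U+%04X (%d byte(s))\n", r, size)
          i += size
      }
  }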


> Many programs, in particular non-graphical programs don't need any UTF-8 code at all - UTF-8 handling is basically memcpy().

argv to main is utf8 on my system.


On Unix/Linux, it is binary data without any restrictions except that each argument is zero-terminated (the typical argument is probably UTF-8 if you have set a UTF-8 locale). You'll see exactly the bytes that were put in as arguments to execlp() et. al. by the parent process.

On Windows, I believe it is Unicode converted to current codepage.

In any case I don't need to care about it since I can simply treat arguments as ASCII-extended opaque strings as described.


> argv to main is utf8 on my system.

That sounds totally compatible with programs that don't know anything about utf8. Do programs need to normalize the utf8 you pass in before using it as an argument to open(2) or something?


How feasible would it be to defer string processing to the operating system so that the behavior of all software running on it is the same? Perhaps a new OS interface could be defined for this purpose using syscalls on Linux. At the very least, there should be one canonical set of algorithms per operating system, rather than everyone downstream reinventing the wheel. Please forgive me if this sounds absurd, I am not a low-level programmer.


I certainly don't want to pay for a syscall to do string encoding.

Also, a lot of the time the problem isn't that people are using fundamentally incompatible string libraries, but that there isn't one correct answer to the question they're asking and they chose different ways to convert the question into code. A reasonable question to ask is "How many extended grapheme clusters are in this string?" The answer is "It depends on what font you plan to use to render it." Not great!

Some programmers would still like to e.g. write a reverse() function that returns ":regional_indicator_f::regional_indicator_r:" unmodified (because it is the French flag emoji) and returns ":regional_indicator_i::regional_indicator_h:" when given ":regional_indicator_h::regional_indicator_i:". If such people want to avoid having nonsensical behavior in their programs, the only solution available is to decide what domain the program should work on and actually deal with the complexity of the domain chosen.


Because string processing is not slow enough already? Making string operations eat the performance impact of a context switch in and out of the kernel is not a good idea. A library is better. This is not about strings, but I'm thinking of a language like Rust, where the "time" crate is not a language feature but is just about standard, and all the other libraries use it. This is possible if a library like that is good quality and exists early in the language's life.


Zig's C interop is pretty good though, and there must be some decent native Unicode library out there somewhere right? ;)

I've worked on a full-duplex file synchronization system that had to support cross-platform operating systems and file systems, across a variety of Unicode normalization schemes (and versions [1], which is why I introduced the question), and I'm personally satisfied that baking this into the language specification would be a mistake.

[1] For example, depending on the file system, there's simply no way to get the normalization right unless you reverse engineer the actual table they're using, or probe the file system to do the normalization for you.


Would Bellard's libunicode[1] work?

[1]: https://github.com/bellard/quickjs/blob/master/libunicode.h


Regular UTF8, not WTF-8 or any of those other variants (which are for encoding data that is not necessarily Unicode).


Also excluding Unicode normalization? Or should that also be baked in?


No need to drag Unicode normalization into it; don't require strings to be normalized. Normalization is only relevant in very specific contexts and you don't want to pay for it elsewhere.


Agreed, but I think that many people would consider Unicode normalization to be part of what they want from the std lib when they mean that UTF8 should be baked in... so that they can manipulate UTF8 as they want, including in various normal forms according to platform. It's hard to imagine people being satisfied without having access to Unicode normalization.

For example, consider JS' introduction of String.normalize(). This is a slippery slope. It had a huge impact on Node's build process and binary sizes because now all the tables had to be shipped. But it's still broken in JS, because no matter the Unicode normalization support provided, it will never match the exact tables used e.g. in Apple's HFS.

I feel that by the time it gets to String.normalize(), it's too far gone.


It seems way nicer to have a language that treats strings as byte arrays and use libraries to handle encodings than to have a language that treats strings as UCS-2 and use libraries to handle UTF-8 strings that live inside of UCS-2 strings.


I don't know, it has been a pain to work with strings in any language that does the above, and has seldom (if ever) been a problem with Java, Go, Swift, or even modern Python 3, and so on...


I don't get what's so hard about it. Most of my programs deal with UTF-8 simply by doing memcpy(). Parsing code just loops over the bytes and compares to ASCII characters (0-9, A-Z, a-z, \n, \r, \t ...). That's how UTF-8 was designed to be used.


I assume you never had to deal with unicode normalization ?

When you send your unicode string to an external system (for example a storage server with a database) and later retrieve the string, you may find that it has been normalized differently, so it no longer matches byte-for-byte what is stored in your program, and all of a sudden strcmp no longer works.
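A small Go sketch of that mismatch: two strings that render identically ("é" precomposed vs. "e" plus a combining accent) are not byte-equal, so a bytewise comparison fails unless both sides are normalized first (normalization itself lives outside the Go standard library, e.g. in golang.org/x/text/unicode/norm).

  package main

  import "fmt"

  func main() {
      nfc := "caf\u00e9"  // "café" with precomposed é (NFC form)
      nfd := "cafe\u0301" // "café" with e + combining acute accent (NFD form)

      fmt.Println(nfc == nfd)         // false: different byte sequences
      fmt.Println(len(nfc), len(nfd)) // 5 6: different lengths, too
  }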

Or all kind of weirdness like that because every system outside of your program will handle unicode differently, and you will need to adapt to them, and having a string library to do most of the heavy-lifting will avoid the need for every user to rewrite a library from scratch.

Not to mention that for every developer who rewrites unicode handling functions, you will probably end up with a function whose behavior differs subtly from the others, which aggravates the problem for others when they try to communicate with your system.


> I assume you never had to deal with unicode normalization ?

I hadn't, and as long as I control the data I'm displaying, I won't have to.

> Or all kind of weirdness like that because every system outside of your program will handle unicode differently

Blame those systems, not me.

What you suggest is surrendering to the state of affairs, which we collectively self-inflicted.

When I have to deal with normalization issues and have to interface with external systems, I can still go looking for a library if I don't feel like implementing it on my own (which is likely).

But unless I need to do normalization, I'm way worse off with a complicated library than with just doing memcpy() or using a simple decode_codepoint() routine.


> Blame those systems, not me.

Yes, I completely agree with you, and if you don't need it, any unicode handling library is overkill and adds more headaches than simply handling utf-8 strings as byte arrays.

I just wanted to insist on the fact that some people will have to deal with these kinds of issues. And these issues are self-inflicted, but it gets worse every time someone tries to reinvent the wheel or relies on byte arrays when they shouldn't.

Having a standard library in the language makes the issue less bad: the core of the language still handles only byte arrays, and for the cases where that's not enough you have only one library, so you don't add your own subtly different mishandling of the standard by implementing it yourself.

So memcpy is fine, but that's about it: for example, please don't use strcmp when you need to sort data alphabetically, and please don't try to reimplement the standard algorithms designed for that, otherwise you will be part of the problem.


Can you really fix anything by changing the native string type? You'll inevitably need to exchange bytes with different systems that demand different encodings and different normalization forms.


I'm not sure how to explain the difference in experience here. Python 3's built-in string encoding support has been a source of endless pain to me, and Lua's belief that strings are byte arrays has been much easier to deal with.


thread tl;dr:

it doesn't look like there will be language support for things like codepoints or grapheme indexing or treatment of strings as anything but byte arrays, so ddevault is sad.

there is intention from andrewrk and jecolon to provide such features in the standard library before 1.0 release.

downside to library vs lang support is that you can expect a good chunk of programmers to ignore the less ergonomic library features, use the language features at hand, and handle strings incorrectly as byte arrays.


Does this boil to the semantic question of if the stdlib is part of the language?


One difference is that there won't be a built-in string type with operator[] defined as something other than returning the byte at the indexed position in the array of bytes. There could be most other things one would expect from language-level string support though.


i don't think so. the answer is presumed no, because everyone in the relevant thread, including language authors and stdlib authors (with plenty of overlap), agrees that it is not, and agrees there are functional distinctions about what is possible on either side of that dividing line (mostly due to zig not supporting custom operators).


Is Zig going strong in the community, or is it likely to remain a niche / fans' language? I like the idea, but does anybody know how the community / companies react to it in a wider "looking to use it in production" environment?


We currently don't advise using the Zig language in production. The compiler has known bugs and even some miscompilations, and we sometimes make breaking changes to the std lib and even the language. However, `zig cc` and `zig c++` are currently considered stable enough for production use.


I don't see anyone seriously using it in production, outside of the Zig ecosystem itself, so long as this remains true (from the posted link):

> The Zig standard library is still unstable and mainly serves as a testbed for the language. After the Self-Hosted Compiler is completed, the language stabilized, and Package Manager completed, then it will be time to start working on stabilizing the standard library. Until then, experimentation and breakage without warning is allowed.


I believe there is someone using it in production for a field deployed embedded system that is tied to revenue.

IIRC Forwards compatibility is not an issue because those deployments are one-and-done.


That'd be my company. We've had Zig in production for more than a year now, and it has gone remarkably smoothly - if you're ever using public transport in northern Europe or Germany there's a non-negligible chance that there's some embedded device running Zig code in it that's logging GPS and route data.


I loved your talk on this ("Zig in Production"): https://www.youtube.com/watch?v=124wdTckHNY


>I believe there is someone using it in production for a field deployed embedded system that is tied to revenue.

Sure. But people do all kinds of not advisable things, even if revenue is at risk.

In fact, it might even be fine for their use cases. They build it once, it has the libs they want covered, it works like they want, that's it.

That can work with any early release language/lib.

The problem is with someone casually putting it into production, and then not expecting breakage, missing libs they will need, finding that there's not much tooling, and so on.

In other words "can it be put into production?" is another question compared to "is it ergonomic, stable, full featured enough to be a good and easy production choice?"...

Almost everything can be put into production (and even work reasonably well), even a 1000-liner Perl script written by someone who first tried Perl that same week.


Zig's a little different here.

Zig's C interop means it can leverage some of the most battle-tested libraries out there, with no shortage of libraries given C's massive ecosystem. And with Zig, it's not like you have to rewrite your whole system in a new language, you can incrementally rewrite the parts that make sense, test, and repeat.

Zig's tooling around compilation is also arguably ahead of most languages, and Zig's progress here is flowing back and adding value to many communities. For example, this 0.9.0 release can now build native Node addons without requiring node-gyp.


I had no idea it wasn't in production yet: is there a story for why it consumes so much space on HN? Is the story strong enough already that is a clear alternative to Rust for post-C++ projects?


Zig is full of good ideas and seems to be a truly serious attempt, by very talented developers, at improving the “C level” of the stack. I put my money where my mouth is and have a recurring donation to the Zig Software Foundation every month. Lots of people are excited about it. Dunno why you thought it was “ready for production” at a 0.x version though. They’re actively working on the self-hosted compiler and there are breaking changes every release still fyi.


I can't comment for all the other people who are posting and voting for those posts, but at least for me Zig has quickly become my language of choice for side projects. Its cross compilation features alone are enough for it to replace the system C/C++ compiler toolchains I used to use, and the language itself is everything I'm looking for. Readable (IMO) syntax, proper namespaces, order independent declarations, powerful metaprogramming, and an unmatched level of internal consistency all make it stand out to me. It feels like a massively simplified C++, a native language that improves significantly on C without introducing a massive set of unrelated features.


It seems strange to compare it to C++ when it has none of the features that most define C++, like RAII, OOP & templates. It's not really "simplified" - it's something totally different.


Presumably the idea is that lots of things that you basically have to use C++ for today because C alone is weak could be served by Zig instead, not that Zig and C++ are comparable languages.


I really like Zig, but the lack of RAII means we're back to malloc/free style programming, and I would never opt into that unless Rust could not be used (e.g. binary size too large). Having said that, it's way better than C and I hope it does well.


Unfortunately malloc-free is unavoidable if you're writing code that needs to allocate memory rather carefully (e.g. guaranteeing no OOMs—Rust has many situations where it'll silently heap alloc and then panic on OOM). Looks like Zig has accepted that it'll be used for those situations and decided to make that experience really good, instead of deciding to be a rustalike or insist on RAII. I think that's an appropriate choice! It makes me excited to use C less.
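
For what it's worth, the failure mode is also explicit. A rough sketch, assuming the 0.9 allocgate-style allocator() methods, of how an allocation failure comes back as error.OutOfMemory rather than an abort:

    const std = @import("std");

    test "allocation failure is an error value, not a panic" {
        var buf: [16]u8 = undefined;
        var fba = std.heap.FixedBufferAllocator.init(&buf);
        const allocator = fba.allocator();

        // 8 bytes fits in the 16-byte buffer...
        const small = try allocator.alloc(u8, 8);
        defer allocator.free(small);

        // ...but another 64 bytes does not: we get error.OutOfMemory back
        // and can decide how to handle it.
        try std.testing.expectError(error.OutOfMemory, allocator.alloc(u8, 64));
    }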


RAII and explicit allocators are independent concerns. That Rust chose a bad default in having 1) a static global allocator, and 2) an expectation of infallible allocations from said global allocator that was then made pervasive in its libstd and third-party ecosystem, has nothing to do with the fact that it has lifetime-based destructors. Having explicit allocators does not mean you need to `free` manually instead of via an automatically-called destructor. A type that allocates needs to ensure a corresponding free in its dtor, and then every other code that uses the type gets the lifetime-based cleanup for free.


I imagine that Zig has a lot more focus on "I can't use X in my environment" types of situation. It seems that for many such situations it might be a better fit than Rust.


There is, for practical purposes, no place where one can't use C++, except where gatekeepers exercise power to keep it out. Typically it would be only a day's work to get those building with a C++ compiler, whence they could begin modernizing.

There are plenty of loadable modules in C++ for Linux and BSDs, and plugins for Postgres, in places where there is no expectation of upstreaming them.

Zig is in a similar boat.


Perhaps you're correct about C++, but I was more referring to the Zig vs. Rust situation.


Since the Zig tool chain can also compile C, and Zig can use C headers without translation, the cases are more similar than one might otherwise suspect.

But of course it would be less of an upgrade, and the Zig parts would have to stay clearly segregated.
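
The header story looks roughly like this (a minimal sketch; you'd link libc when building, e.g. `zig test foo.zig -lc`):

    // Pull a C header straight in; no hand-written bindings or translation step.
    const c = @cImport({
        @cInclude("string.h");
    });
    const std = @import("std");

    test "calling libc directly from Zig" {
        try std.testing.expect(c.strlen("hello") == 5);
    }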


Not really back to that. Zig just chooses a different way to deal with memory, namely different compilation modes and faster write-compile-test cycles.

It’s a different tradeoff which may be worth it for some use-cases. But it’s definitely not obvious whether it is worse, as the alternative comes with a huge complexity on the language side.


It is possible they will add a type level resource release obligation at some point. It would not be anything quite like the Rust borrow checker but I think would be a big help. https://github.com/ziglang/zig/issues/782


Well, Zig’s compile-time metaprogramming capabilities do rival those of C++ in a much simpler way, imo.


It does have comptime, which covers a lot of the same ground as templates (and generics, in other languages).
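
Roughly, a "generic" is just a function that takes a type as a comptime parameter and returns a value or another type; a small sketch:

    const std = @import("std");

    // Types are ordinary comptime values, so generics need no extra syntax.
    fn max(comptime T: type, a: T, b: T) T {
        return if (a > b) a else b;
    }

    fn Pair(comptime T: type) type {
        return struct { first: T, second: T };
    }

    test "comptime covering the templates/generics ground" {
        try std.testing.expect(max(i32, 3, 7) == 7);
        const p = Pair(u8){ .first = 1, .second = 2 };
        try std.testing.expect(p.first + p.second == 3);
    }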


comptime seems like a superset of templates. The target audience was probably not using OOP. People do come by the discord to chat about how nice RAII is though.


> is there a story for why it consumes so much space on HN?

People on HN find it interesting.


It's in production somewhere, but the authors currently discourage it.


My guess is that Zig's design choices are hitting "a sweeter sweet spot" for systems programming that resonates with many engineers reading HN.

At least for TigerBeetle [1], a distributed database, the story was strong enough even last year that we were prepared to paddle out and wait for the swell to break, rather than be saddled for years with undefined behavior and broken/slow tooling, or else a steep learning curve for the rest of the project's lifetime. We realized that as a new project, our stability roadmaps would probably coincide, and that Zig makes a huge amount of sense for greenfield projects starting out.

The simplicity and readability of Zig is remarkable, which comes down to the emphasis on orthogonality, and this is important when it comes to writing distributed systems.

Appropriating Conway's Law a little loosely, I think it's more difficult (though certainly possible) to arrive at a super simple design for a distributed consensus protocol like Viewstamped Replication, Paxos or Raft, if the language's design is not itself also encouraging simplicity, readability and explicitness in the first place, not to mention a minimum of necessary and excellent abstractions. Because every abstraction carries with it some probability of leaking and undermining the foundations of the system, I feel that whether we make them zero-cost or not is almost beside the point compared to getting the number of abstractions and composition of the system just right.

For example, Zig's comptime encouraged a distributed consensus design where we didn't leak networking/locking/threading throughout the consensus [2] as is commonly the case in many implementations I've read, even in high-level languages like Go. It made things like deterministic fuzzing [3] really the natural solution. People who've worked on some major distributed systems in C++ have commented how refreshing it is to read consensus written in Zig!
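
Very roughly, the shape of that pattern is a protocol core that takes its transport as a comptime parameter, so the simulator can inject a deterministic in-memory fake (hypothetical names below, not our actual code):

    const std = @import("std");

    // Hypothetical sketch: the consensus core is comptime-generic over its transport.
    fn Replica(comptime MessageBus: type) type {
        return struct {
            const Self = @This();
            bus: MessageBus,

            pub fn commit(self: *Self, op: u64) void {
                // protocol logic only; no sockets, locks or threads in sight
                self.bus.send(op);
            }
        };
    }

    // In simulation and fuzzing, the "bus" is a deterministic in-memory fake.
    const TestBus = struct {
        last: u64 = 0,
        pub fn send(self: *TestBus, op: u64) void {
            self.last = op;
        }
    };

    test "consensus core runs against a fake transport" {
        var replica = Replica(TestBus){ .bus = .{} };
        replica.commit(42);
        try std.testing.expect(replica.bus.last == 42);
    }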

Zig also has a different/balanced/all-encompassing approach to safety that resonates more with how I feel about writing safe systems overall: all axes of safety as a spectrum rather than as an extreme (this helps to prevent pursuing one axis of safety at the expense of others), safety also including things like NASA's "The Power of 10: Rules for Developing Safety-Critical Code" [4], assertions, checked arithmetic (this should be enabled by default in safe builds, which it is in Zig), static memory allocation, and compiler checked syscall error handling, the latter of which is really the number one thing by far that makes distributed databases unsafe according to the findings in "An Analysis of Production Failures in Distributed Data-Intensive Systems" [5].
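
On the checked arithmetic point, for example, overflow traps in safe builds unless you explicitly opt into wrapping with the % operators (a tiny sketch):

    const std = @import("std");

    test "checked vs. explicitly wrapping arithmetic" {
        var x: u8 = 255;
        // x += 1;   // would trap with "integer overflow" in Debug/ReleaseSafe
        x +%= 1;     // wrapping must be spelled out
        try std.testing.expect(x == 0);
    }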

While we could certainly benefit from the muscle of Rust's borrow checker in places, it makes less sense since TigerBeetle's design actively avoids the cost of multi-threading, with a single-threaded control plane for more efficient use of io_uring (zero-copy when moving memory in the hot path), plus static memory allocation and never freeing anything in the lifetime of the system. The new IO APIs like io_uring also encourage a future of single-threaded control planes (outsourcing to the kernel thread pool where threads are cheaper) since context switches are rapidly becoming relatively more expensive. Multi-threading for the sake of I/O is less of a necessary evil these days than it was say 5 years ago.

At some point, the benefits didn't outweigh the costs, and we had to weigh this up. In the end, it came down to simplicity, readability and state-of-the-art tooling.

[1] https://www.tigerbeetle.com

[2] https://github.com/coilhq/tigerbeetle/blob/main/src/vsr/repl...

[3] https://github.com/coilhq/tigerbeetle#simulation-tests

[4] https://web.cecs.pdx.edu/~kimchris/cs201/handouts/The%20Powe...

[5] https://www.usenix.org/system/files/conference/osdi14/osdi14...


> Introduced arbitrary code execution via ${jndi:ldap://... inside any logged string.

hehe


Is there a good "Why Zig" writeup motivating the language?

I've been looking around their web site and the only thing I could find was pretty generic:

Zig is a general-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.


In addition to the other recommendations, the Zig Zen[0] and The Road to Zig 1.0[1] are pretty convincing.

[0] https://ziglang.org/documentation/master/#Zen

[1] https://www.youtube.com/watch?v=Gv2I7qTux7g


That Road to Zig talk is a must watch. I keep coming back to it. More than the language, I love how much conviction Andrew has in his vision for Zig.


Yes, there is a "why Zig when..." page in the "learn" section:

https://ziglang.org/learn/why_zig_rust_d_cpp/



We're quite happy about the allocgate changes. It seemed way too easy to write buggy code that would mangle the wrong bits of the stack prior to this change, and this sort of error was not caught by debug builds.


Would be nice to get more clarity on what exactly the scope of the self-hosted compiler is. From previous discussions, I was under the impression that people would use the self-hosted compiler for development, but still be expected to use the bootstrap compiler for production (because LLVM optimizations are in the latter but not the former).

This page makes it sound like the self-hosted compiler will be the compiler, but various parts of the compiler infrastructure will be either interchangeable with not-quite-self-hosted modules via flags, or these LLVM-leveraging modules (e.g. for C/C++ compilation) will always be hard dependencies even in the self-hosted compiler.


The self-hosted compiler is the compiler. It has an LLVM backend. The bootstrap compiler will be used only to bootstrap the self-hosted compiler.

The self-hosted LLVM backend is not done. Zig 0.9.0 is the self-hosted compiler, you are using the self-hosted compiler when you use Zig today. But it relies on the bootstrap compiler for the LLVM backend.

I know it is weird to say "self-hosted LLVM backend" but I don't know what else to call it. It's the .zig code that lowers Zig IR to LLVM IR.

It's possible to build Zig without an LLVM dependency by not passing `-Denable-llvm` to `zig build`. In this case the LLVM backend is not available (neither the self-hosted one nor the bootstrap one).


Does this mean that y'all are open to the self-hosted compiler supporting CPU architectures unlikely to ever have LLVM support? I know that's one of the blockers for the oft-asked "Will OpenBSD consider adopting Rust/Zig/Go/whatever?", as one example of a project that targets platforms which LLVM does not.

My guess from the preliminary Plan 9 target work mentioned in the release notes is that something like a SuperH backend (for example) would indeed be in scope (provided someone's willing to contribute one, of course), but a confirmation would be neat. I suppose the C backend would do the job in these cases, too, but I'm sure it'd be nice to not have to include GCC (or some other C compiler besides zig cc) in the mix (especially for projects like OpenBSD that try to minimize copyleft code).

Speaking of: how is zig cc anticipated to work with a self-hosted Zig? Will there be a dependency on clang (as suggested by the punt_to_clang(...) call in the current main.zig)? Will it be possible to swap that out with something else that could turn C into ZIR (or something else the self-hosted compiler could then punt to whatever backend)?

Relatedly, would zig cc support the planned C backend? If so, would the resulting C output be equivalent to the input (notwithstanding the current limitations re: macros, struct bitfields, etc.)?


> Does this mean that y'all are open to the self-hosted compiler supporting CPU architectures unlikely to ever have LLVM support?

Yes! We won't block 1.0 on the quality of the less mainstream targets, but that's what the tier system is for - to ship a compiler that has varying levels of quality for various targets, while communicating clearly to users what kind of experience they can expect for each one.

SuperH patches are absolutely welcome.

> how is zig cc anticipated to work with a self-hosted Zig? Will there be a dependency on clang [...]?

The main distribution of Zig will be LLVM/Clang-enabled. However it is already possible to build a version of Zig that does not have these features enabled. In such case, compiling C, C++, and Objective-C code will result in an error.

However, the arocc project[1] is emerging, which, depending on a combination of how much funding ZSF gets and how much enthusiasm the unpaid contributors working in their spare time have, is looking like a promising C frontend that would be available even without LLVM/Clang. It is C only, however, with no intention of compiling C++ or Objective-C.

> would zig cc support the planned C backend?

As it is currently implemented: no. Zig invokes clang to turn C source code into object files.

However, with the arocc frontend mentioned above, this would be converting the C source code into ZIR (or perhaps AIR), which could then be lowered with any of the backends, including the (partially complete) C backend. In such case, the C output would look drastically different than the input. It would look more like a machine-generated IR than natural C code that a human would write.

[1]: https://github.com/Vexu/arocc


To clarify, is there any plan to develop a pure-Zig backend, so people who want to escape from LLVM can use that backend for compiling Zig code instead?


Yes. If you look at the infographic here:

https://ziglang.org/download/0.9.0/release-notes.html#Self-H...

Notice the bubble that says "LLVM Codegen" (44% done). This is the LLVM backend. All the other bubbles do not depend on LLVM at all.

I suggest checking back in with the next release of Zig to see where we are at. I suspect we will have at least the x86 backend fully operational by then.


Is there a story behind the cartoon lizards? Are they like the Zig mascot or something? Seems like they only started appearing since the last release, but they're not explained (unless I missed it).


They are ziguanas :)


Okay thanks! One thing I like best about programming languages are the particularities of the cultures that spring up around them.



The one in the jetpack has been around for a couple years, iirc.


I love Zig. But unfortunately there are no use cases for web developers yet that Go and JavaScript don't already cover … unless WebAssembly replaces JS for cooler, more futuristic UIs.


I say this as a web developer: there's far more to programming than the web.


Wait until Bun is officially released, which would be useful for the vast majority of web developers.

https://bun.sh


Why would you want to use Zig for webdev?


So I can use zig and have someone pay me to use it :)


Compiler error for unused variables :( [0]. Possibly my most hated feature of Go.

[0] https://ziglang.org/download/0.9.0/release-notes.html#Compil...

Edit: To be clear, love enforcing the idea for production code, but wish they had embraced a '-dev' mode or equivalent flag that made it easier to experiment.


One problem I see with this decision is that code will now be littered with:

    _ = bla;
    _ = blub;
...which have been forgotten during development.

So the next thing that's needed is an error if 'bla' or 'blub' are actually used elsewhere ;)


I've been 99% a Go developer since 2015-ish, and this is not something I've ever encountered, in general people just don't leave unused variables lying around. The compile time check does highlight logic errors frequently enough though, so I'm very glad it's there.


It's mostly a problem when you're in the middle of writing something and you're trying to figure out what's going wrong, so you start commenting stuff out... And now you have unused variables that you have to rename to _ for no good reason. It just adds busywork.


It would be nice if Zig had statement- and expression-level comments, similar to how most Lisps let you comment out whole S-exprs (which still have to be valid). Then it could allow variables that are referenced only from such comments.


More often than not I end up creating variables for a bunch of chained function calls just to get the debugger output right, I might be holding it wrong though.


Please note there is an accepted proposal[1] to add an unused keyword, making it possible to detect conflicts between things being marked unused and actually being used, and improving tooling integration.

[1]: https://github.com/ziglang/zig/issues/10245


I think there should definitely be a global option to disable this during development. This makes temporarily trying something out (e.g. commenting out some code to run only part of it) a massive pain, as you have to go up and edit the variable definition as well as the code using it.


Fully agree. I run into it mostly in TypeScript's linter, but having to recursively comment out things is just terrible. Of course it is great for prod, but for debugging it is very questionable.


agreed. Rust has a warning for unused items, which is indeed useful for catching logic errors (variables that should be used, but aren't). No need to fail compilation for this. It is then on you to deny warnings in CI to ensure that they are all handled before the code actually lands.


I like this a lot. Though many PLs let you start the identifier with _ to mark it as intentionally unused as well.


Sure, but this is an easy thing to audit for (could even be automated as part of some CI system, if you wanted).


I've genuinely never understood this one, especially as a flat-out error instead of a warning. When could not using a variable ever be a bug?


That warning has caught bugs for me when I was copy and pasting code, or writing code on "autopilot". Usually it's pretty obvious mistakes though.

Unused variable warnings are a good idea, but making them hard errors is a language design mistake. Warnings are good because they allow programmers to quickly make changes and test things, while providing a reminder to clean things up in the end (which is why the "just use tooling that removes unused variables" response misses the point--when making quick temporary changes for debugging, you want the warnings as reminders to go back and undo the change). Additionally, warnings allow for adding helpful static analyses to the compiler over time without breaking existing code like introducing new errors does. As I recall, there were some cases in which Rust 1.0 accidentally didn't flag unused variables, which was fixable post 1.0 without breaking existing code precisely because it was a warning, not an error.


Pretty bold to call it a language design mistake so confidently, when you have Rust users committing warning-emitting code to their source control.

Does Rust emit warnings for cached compilation units?


Cargo doesn't enable warnings for crate dependencies, by design. In fact, it won't even emit them if those crates say #[deny(warnings)]--there's a special rustc flag called --cap-lints that it uses for this (RFC at [1]). The reason is that a lot of crates say #[deny(warnings)], and this was creating no end of backwards compatibility problems when new warnings were added.

There is an interesting thread with community consensus against the use of #[deny(warnings)] at [2]. The most important takeaway for me is that the right place to deny warnings is in CI. You don't want end users who compile your crate to have their builds fail due to warnings, because they might be using a newer version of the compiler than you were. You don't want to fail developers' builds due to warnings while hacking on code, because of the overhead warnings-as-errors adds to the edit/compile/debug cycle. CI is the right place to deny warnings, because it prevents warnings from getting to the repository while avoiding the other downsides of warnings-as-errors.

[1]: https://rust-lang.github.io/rfcs/1193-cap-lints.html

[2]: https://www.reddit.com/r/rust/comments/f5xpib/psa_denywarnin...


Yes, users can misuse warnings by ignoring them in CI, but the ergonomic cost of forcing unused variables as errors is disproportionate in regard to this. This affects all users regardless of if they would deny warnings in CI, and leads to either removing/adding some code repeatedly so it compiles at all, or worse adding fake uses, defeating the purpose of the warning. I am using C++ with -Werror during development because it is the only way to keep my sanity (functions not returning a value in C++ is a warning), but it is an ergonomic disaster that I am happy to avoid when using Rust (where the right place to deny warnings is in CI).

I would agree that forcing uninitialised variables as errors is a design mistake.


I think it's reasonable for users to want a workflow that includes some mode where they can temporarily compile code that has unused variables, then check in code that does not have unused variables. The trouble is that if there's such a mode, people will just leave it on permanently. I don't have solutions, but I think it's worth trying to save people's workflows.


    foo := someDefaultValue
    if someCondition {
        foo := someMoreSpecificValue
    }
Whoops, accidentally created a fresh, unused variable in the nested scope instead of changing the value of the original variable.

I do that frighteningly often.


Seems to be more an indication that Go's variable declaration syntax is just not good.


Yeah, var/type name makes much more sense than ambiguity.


It happens. Hell, it even happens in Rust; I had one happen a few weeks ago that the warning lint caught. I had a rename and then added an inner-scoped var with the old, previous name - but neglected to use that new var in that inner scope. Compiled fine, but was very much a bug.

Luckily though my Rust setup doesn't fail to compile with unused stuff, it just warns - and then on CI i have it reject all warnings.

I agree it's very frustrating not having a -dev flag or something less restrictive.


When you mistakenly used the wrong variable name instead, I suppose.


It is only useful in the very niche case that the incorrect variable name you're using is defined (otherwise you'd get an unknown identifier error) and the correct name you should have been using isn't used anywhere.

Would catch this:

  var a, b
  c = a + a // whoops, should have been a + b, compiler complains about unused b
Doesn't catch this:

  var a, b
  foo(a, b)
  c = a + a // whoops, should have been a + b, but still compiles
All in all, unused variables being errors is an awful feature that isn't very helpful in practice, at the cost of making experimentation a pain in the arse.

There is nothing that kills my state of flow more than having to comment a piece of code that is unreferenced, because the compiler complains, while I'm trying to hack and explore some idea.


In which case a warning would also do the trick, unless the compiler produces so many warnings that people stop caring.


I like this feature. It has saved me from annoying bugs multiple times in Go during refactoring.


I love the idea of the compiler checking for unused variables in production, and that it's strict and not a warning that can be ignored merrily merrily, gently down the stream! :)

We found a few cases in TigerBeetle where some nasty bugs would have been detected and prevented by this. Some of these were at the entry point to a function, where the function failed to check all the pre-conditions and was ignoring some of the API interface, and others were where we were calculating some of the derived buffers we would need, but then never used them, using the original buffer incorrectly instead.


I tend to prefer a warning rather than a hard error. In CI or a production build, you can make warnings hard errors, but when trying to debug something, you can make temporaries or comment things out as needed


If a local is intentionally unused, it can be discarded, like this:

test "example" {

    var x: i32 = 1234;

    _ = x;
}

Do discards not fulfill your use case?


That is still more movement for when I am hacking out a feature. When I am trying to figure out an idea, I am liberally creating new variables and adding imports without care. In other languages, I can freely comment in/out large swaths of code as I understand the problem.

Not a deal-breaker (honestly, I write Python most days), but a huge annoyance.


Yeah, it's a fair point. Per the zig writeup, the plan / hope for this use case is an editor command that will quickly add in all the "_ = unused" statements, which should take it down to very-small annoyance.

(From my use in Go, I'm definitely overall a fan of the feature. Tho of course how much personal value vs annoyance you get out of it is going to depend on your own code style and behaviors.)


In JavaScript we have a lint for this, so I can run code with unused variables, but I can’t commit code with unused variables (and CI will enforce that for everyone on the team if they bypass the precommit hook). That seems like the best of both worlds to me.


It might be nicer if the formatter automatically commented out unused stuff, instead of removing it. (As long as it isn’t checked in.)


Great, we can also use Xmake to build Zig programs. https://xmake.io/#/guide/project_examples?id=zig-program


Exciting stuff. How far away is Zig 1.0, roughly?


Hard to say, 1.0 for us comes when everything is fully stable in both the language and standard library, and we believe that it can last at least 20 years without modification. My estimate is at least 5 years from now, possibly more.


Will there be a Zig 0.10 release?



Congrats.

Here's to hoping Zig gets explicit SIMD intrinsics in 1.0. It is one of the few things holding me back. That and the pedantic compiler errors around unused variables.


I'm pretty new to Zig, but after taking another look recently I was pleasantly surprised by the progress on SIMD operations via builtin Vector types [1],[2]. For an application of "I want to speed up my math-intensive code with SIMD while supporting x86-64, aarch64, and non-SIMD fallbacks" it might be suitable for you now, and Zig certainly exposes more readable SIMD syntax than with C intrinsics imho [3]. But if you have a particular CPU instruction in mind, you're still a little bit at the mercy of the compiler as operations on Vector types aren't explicitly tied to particular CPU instructions.

1: https://github.com/ziglang/zig/issues/903#issuecomment-45950...

2: https://ziglang.org/documentation/master/#Vectors

3: https://godbolt.org/z/P73MPncGP
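
For a feel of the syntax, a minimal sketch (assuming the 0.9-era std.meta.Vector API):

    const std = @import("std");

    test "element-wise math on builtin vectors" {
        const V = std.meta.Vector(4, f32);
        // The compiler lowers this to SSE/NEON/etc. where available,
        // or to scalar code on targets without SIMD.
        const a: V = [4]f32{ 1.0, 2.0, 3.0, 4.0 };
        const b: V = [4]f32{ 10.0, 20.0, 30.0, 40.0 };
        const c = a + b;
        try std.testing.expect(c[3] == 44.0);
    }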


It's planned to have feature parity with all the intrinsics that are available in the wild, for any CPU architecture. The difference with C is that we want to make the intrinsics part of the language; not a compiler extension. So, while you could do compile-time CPU feature detection to check if an intrinsic will lower to machine code and choose a different implementation, you could also just express the code with the intrinsics and let the compiler decide how to lower it. It will use shims for any machine that does not have native instructions.

Anyway, please file feature requests for missing SIMD intrinsics!


Maybe have a look at Java’s new Vector API. It may be strange to get inspiration from a managed language, but I do think that it is a very sane API with safe fallbacks on non-supported CPUs (though it is possible only due to the JIT that way)


And where the CPU doesn't support some SIMD intrinsics—SWAR for the win! I've been digging into Zig's Vectors the past week and really like the decisions. Thanks!


Love this zigling art.


Heh. And just yesterday, I checked if there wasn't a new release to check out! Now there is, I guess.


Zig is such a cool looking language, and I really just love the idea of comptime, but every time it comes up, all I can think about is how it mandates spaces for indentation. I know it seems pedantic, but a programming language saying "I don't care about accessibility for the visually impaired" when it is just text comes across as overbearing and insensitive.

https://www.reddit.com/r/javascript/comments/c8drjo/nobody_t...


IIRC the self hosting compiler will allow both spaces and tabs, but they didn't feel it was worthwhile to backport this to the bootstrap compiler.


That's... slightly better, but afaict, "allows both spaces and tabs" seems to be "convert tabs to spaces"

    zig fmt accepts and converts tabs to spaces, \r\n to \n, as well as many other transformations of non-canonical to canonical style.
Which is... still saying "spaces instead of tabs"


"zig fmt" is the formatter not the aforementioned self hosted compiler.


It's a part of the ecosystem, and is, AFAICT, saying "tabs are not welcome here". Happy to be proven wrong.


See https://github.com/ziglang/zig/wiki/FAQ#why-does-zig-force-m...

In a nutshell, what you're currently using is the stage 1 compiler, aka the bootstrapping compiler to compile the official zig compiler going forward. zig fmt is opinionated because it was only meant to enforce formatting for the zig project. At that stage of the project's life, they felt it was more important to get shit done in a consistent manner w/ the people that are actually contributing than to cater to a hypothetical accessibility-impaired developer that isn't.

Zig is a very ambitious project. Prioritizing pragmatism over ideology is a fairly common theme with it currently. Another example: they repeatedly break stdlib APIs because catering to a larger audience is currently less important than getting other things nailed down first.


I stand (partially) corrected, though everything there does still seem to point towards "tabs tolerated, but the formatter will convert to spaces". I also find it highly amusing that `zig fmt` converts tabs to spaces, and yet they point at gofmt, which, as far as I'm concerned, is the gold standard of "tabs for indentation, spaces for alignment." It could be that that is where zig wants to land eventually, and if so, I'm going to be quite happy and excited.


I wouldn't read too much into what `zig fmt` means in terms of what is supposed to be idiomatic. Like many things in zig land, `zig fmt` isn't really mature yet - e.g. it still mis-formats long expression chains, among other things. They are still planning to do a polish pass over all of these DX-related aspects of the project (fmt, stdlib). I'd expect this bikeshed to be resurrected in full force once they start looking into zig fmt more seriously.


Using the built-in formatter is optional; you'll be free to run your own formatter.

Right now you might worry more about how the current compiler can't even compile Zig correctly than about what stylistic choices it can handle. I'm a tabs guy myself, but it's really quite irrelevant at the moment - the focus is still on figuring out how Zig should work and making that happen, not on the day-to-day usability of the current toolsets for end users.


Please see https://github.com/ziglang/zig/issues/544#issuecomment-36396... and https://github.com/ziglang/zig/issues/663. In particular, `For every codepoint of zig source code, it is an error for the codepoint to be one of U+0000-U+0009, U+000b-U+001f, U+007f, U+0085, U+2028, U+2029`. Hard tab is U+0009, which is called out as specifically an error. This could just be stale docs, but it's linked to from the wiki as well: https://github.com/ziglang/zig/wiki/FAQ#why-does-zig-fmt-hav.... Again, if this is all old, please ignore (and maybe make it a priority to update?). But the general tone is: "No tabs, only spaces"


The first issue is closed with a nicely written message of how hard tabs will be supported and why they aren't now. The second issue links to the first issue when the point of hard tabs is brought up. The FAQ you link at the end contains, directly above the section you linked, a version of the explanation in the first issue, while the exact section you linked also links back to the first issue anyway.

So yes, all of the quotes and conclusions you found are either old or wrong according to the very pages you linked. No, it doesn't need to be updated (or prioritized?), as they all conclude with supporting tabs; you just hadn't read all of what you're citing.

.

If there is still any doubt for some reason, rather than trawling through GitHub issues from 2017 I'd recommend sticking to the current FAQ from 2021, which has a section dedicated to this question: https://github.com/ziglang/zig/wiki/FAQ#why-does-zig-force-m... as this is the fully current and authoritative answer on the status of tab support, regardless of where bits of past discussion lie on the matter.



