
Why not this?

  fn main() {
      // "ab早" is 5 bytes in UTF-8: 'a' and 'b' are 1 byte each, '早' is 3.
      let a = "ab早".as_bytes();
      // This slices the raw bytes at index 3, cutting '早' in half.
      let a = &a[..3];
      println!("{:?}", a); // prints [97, 98, 230]
  }



It's not clear to me what you're suggesting; is it that String shouldn't have supported indexing in the first place? That code does work, but you have a &[u8], not a &str.
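
To make that concrete: converting such a byte slice back to a &str goes through UTF-8 validation, and it fails here because the slice ends in the middle of '早' (a small sketch reusing the string from the example above):

  use std::str;

  fn main() {
      // Same bytes as above: [..3] stops partway through '早'.
      let bytes = &"ab早".as_bytes()[..3];
      // from_utf8 re-validates the bytes and rejects the incomplete sequence.
      assert!(str::from_utf8(bytes).is_err());
  }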


But neither is AsciiString. It has an as_str method, but it's still a kludge.

This example was basically a suggestion to throw0u1t: if they want to cut in the middle of a UTF-8 sequence for whatever reason, they can [edit:] do it without extra crates.

What I don't understand is why slices are indexed in bytes and not in objects. If String has the ability to check that we're cutting in the middle of a character sequence, why doesn't it provide the ability to take 3 fully formed characters?


I think the Rust designers want to keep the implicit contract that indexing into a string is fast and O(1).

If you want to find the one millionth codepoint of a UTF-8-encoded string, you have to more or less (1) visit every byte of the string.

If, on the other hand, you want to find the codepoint that covers the millionth byte, you have to read at most four bytes: read the millionth byte, and there are three cases:

- it's a full codepoint. If so, you're done.

- it is the first byte of a multi-byte codepoint. If so, read forwards in the string for up to 3 continuation bytes.

- it is a continuation byte. If so, search backwards in the string for the first byte of the codepoint, then, if necessary, read forwards to find any remaining continuation bytes.

So, that is O(1).

(1) You can skip continuation bytes, but these are typically rare.
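
A rough sketch of that lookup in Rust (just an illustration of the three cases above, not how the standard library implements anything; char_at_byte is a made-up helper):

  fn char_at_byte(s: &str, i: usize) -> Option<char> {
      if i >= s.len() {
          return None;
      }
      // Case 3: step backwards over continuation bytes (at most 3 for valid
      // UTF-8) until we reach the first byte of the codepoint.
      let mut start = i;
      while !s.is_char_boundary(start) {
          start -= 1;
      }
      // Cases 1 and 2: decode the (possibly multi-byte) codepoint from here.
      s[start..].chars().next()
  }

  fn main() {
      let s = "ab早"; // '早' occupies bytes 2..5
      assert_eq!(char_at_byte(s, 3), Some('早')); // byte 3 is a continuation byte
      assert_eq!(char_at_byte(s, 0), Some('a'));
  }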


> What I don't understand is why slices are indexed in bytes and not in objects.

Slicing is an O(1) operation, and that would be an O(n) operation.


It does: `s.chars().take(3)`. It just does it with iterators rather than with indexes because that better communicates the performance characteristics.
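
A quick usage example (with a made-up string), collecting those three fully formed characters into an owned String:

  fn main() {
      let s = "ab早c";
      // chars() walks codepoints, so this takes three whole characters
      // regardless of how many bytes each one occupies.
      let first_three: String = s.chars().take(3).collect();
      assert_eq!(first_three, "ab早");
  }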


I think he's suggesting that slicing on strings should be by character, and if you want to slice on bytes, you should explicitly ask to treat the string as a byte array. It makes more sense semantically, and it's safe.


Slicing on characters is a linear time operation and indexing is meant to be cheap.
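
To illustrate the difference with a small sketch (made-up string): finding the byte offset of the nth character means walking the string, while slicing with a byte offset you already have is constant time.

  fn main() {
      let s = "ab早cd";
      // O(n): walk the string to find where the 4th character starts.
      if let Some((byte_idx, ch)) = s.char_indices().nth(3) {
          assert_eq!((byte_idx, ch), (5, 'c')); // 'a'(1) + 'b'(1) + '早'(3) = 5
          // O(1): slice with the byte offset we just computed.
          assert_eq!(&s[..byte_idx], "ab早");
      }
  }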


That seems like taking it too far. It's like using pointer arithmetic to index a linked list on the assumption that the nodes happen to be allocated contiguously in memory. I mean, I guess the thinking is, indexing a Unicode string isn't cheap, but indexing strings used to be cheap once upon a time, when strings were encoded in fixed one-byte-per-character representations, so let's pretend that's still the case and panic if it doesn't work out... That's weirdly antithetical to Rust's purported focus on safety.

Also, you can get the same performance from an operation that returns a byte array instead of a string. If that kind of performance is what you want, then a Unicode string is simply not the right type to use.


Indexing a Unicode string is cheap... if you have a byte index. If you want to count out some fixed number of codepoints, then of course you've just moved the cost to calculating the corresponding byte index. But counting codepoints is almost always the wrong thing to do anyway [1]. In practice, it's more common to obtain indices by inspecting the string itself, e.g. searching for a substring or regex match. In that case, it's faster for the search to just return a byte index; there's no benefit to having it return a codepoint index, and then having to do an O(n) lookup when you try to use the index. And byte indices obtained that way will always be valid character boundaries, so you can use [] without worrying about panics.
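
A small example of that pattern (made-up strings, plain str::find): the search returns a byte index that is guaranteed to sit on a character boundary, so slicing with it can't panic.

  fn main() {
      let s = "naïve café";
      // find() returns a byte index, and it always lands on a char boundary.
      if let Some(i) = s.find("café") {
          assert_eq!(&s[..i], "naïve "); // O(1) slice, no panic possible here
          assert_eq!(&s[i..], "café");
      }
  }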

You suggest just using a byte array instead, but then you'd lose the guarantee that what you're working with is valid Unicode. Contrary to your assertion, it is useful to have a type that provides that guarantee, yet which can still be operated on efficiently.

[1] https://manishearth.github.io/blog/2017/01/14/stop-ascribing...


Safety is about memory safety. Immediately exiting your program is about as memory safe as it gets.


Panics are not unsafe. Panics exist in Rust because they are safe. If you don't want a panic on index, just don't index.

Indexing into a UTF-8 string doesn't serve any reasonable consistent purpose anyway, because it is an abstraction of text that doesn't support the notion that a "character" is more fundamental than a word or a paragraph, etc. Rust's string slicing exists solely to make ASCII text easy to handle. If your text is not ASCII, then you shouldn't be slicing it at all. Thus the panic.
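
(On "just don't index": the standard library also has a checked form, str::get, which returns None instead of panicking when the range is out of bounds or not on a character boundary. A tiny illustration:)

  fn main() {
      let s = "ab早";
      assert_eq!(s.get(..2), Some("ab"));
      assert_eq!(s.get(..3), None); // byte 3 falls inside '早': no panic, just None
  }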


> Indexing into a UTF-8 string doesn't serve any reasonable consistent purpose anyway

If that's true, isn't it the job of a type system to help avoid such nonsensical operations? If "slice" only makes sense for byte arrays and ASCII strings, it could be provided on those types without being defined on UTF-8 strings.

> Panics are not unsafe. Panics exist in Rust because they are safe.

That's "safe" by a very limited definition of safety. It's one step up from undefined behavior, granted, but it's not a very high standard. In practice, in most programs, you'd want to ensure that such a panic would never happen, and personally I think the language's unhelpfulness in that regard is a wart.


> If that's true, isn't it the job of a type system to help avoid such nonsensical operations?

It's not strictly true, because there are situations where you want to slice UTF-8. For instance, if you already know where the code point boundaries for the newlines are. But if you know that, then you've already run something like a regex with worse-than-O(1) behavior, and you certainly wouldn't want string slicing to do redundant work.

> That's "safe" by a very limited definition of safety

That's the definition of safe that is used. Safety in the context of Rust means memory safety. (Division can panic, btw.) If you don't see why undefined behavior is so much worse than a panic, then do some research on it. If you want programs that never fail, you need a comprehensive plan that takes into account things like hardware failure. A programming language can't do that.


I think that's too extreme. There are many legitimate reasons to slice non-ASCII text - for example, to split it on newlines.


That's not trivial, and even different languages vary in how they handle newline characters: https://stackoverflow.com/questions/44995851/how-do-i-check-...


You can still split non-ASCII text on ASCII newlines, and quite often that's exactly what needs to be done.
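
A quick sketch of why that works (made-up text): in UTF-8, an ASCII byte like '\n' can never occur inside a multi-byte codepoint, so splitting on it always yields valid UTF-8 pieces.

  fn main() {
      let text = "早上好\nこんにちは\nhello";
      // split('\n') hands back subslices of the original &str: no copying,
      // no re-validation, and every piece is still valid UTF-8.
      let lines: Vec<&str> = text.split('\n').collect();
      assert_eq!(lines, ["早上好", "こんにちは", "hello"]);
  }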


And usually, you don't want it to cost O(n) on top of whatever parser you ran to find those newlines.



