If that's the case the winning move is get the bytes, and convert them to UTF8 o...

josefx · 2024-06-14T15:35:56

> If that's the case the winning move is get the bytes, and convert them to UTF8

That would require knowing the original encoding.

> or just process it as bytes

As long as the APIs you have to use take bytes and not u8 strings.

> Modern systems should be able to convert various encodings at GiB/s into UTF8.

They might even guess the correct source encoding a quarter of the time and it will break any legacy application that still has to interact with the data. I guess that would be a win-win situation for Rust.

Ygg2 · 2024-06-15T13:26:27

> That would require knowing the original encoding.

If you don't know, that's one more reason to get the bytes and try to figure out the encoding, usually with a lib like encodings.rs or WTF8.rs.

> As long as the APIs you have to use take bytes and not utf8 strings.

You can convert one into the other, albeit converting to str requires a check.
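A minimal sketch of that conversion using only the Rust standard library. The wrapper names (`to_bytes`, `to_str`, `to_str_lossy`) are made up for illustration; the calls inside them are the std ones. Going from `str` to bytes is free, while going the other way has to validate the bytes as UTF-8 and can fail.

```rust
use std::borrow::Cow;
use std::str::Utf8Error;

/// str -> bytes: no check needed, a &str is already guaranteed to be valid UTF-8.
/// (Hypothetical wrapper around `str::as_bytes`.)
fn to_bytes(s: &str) -> &[u8] {
    s.as_bytes()
}

/// bytes -> str: this is the checked direction. The whole slice is validated
/// and an error is returned if it is not well-formed UTF-8.
/// (Hypothetical wrapper around `std::str::from_utf8`.)
fn to_str(bytes: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(bytes)
}

/// bytes -> str, lossy: invalid sequences are replaced with U+FFFD instead of
/// failing. Borrows when the input is already valid, allocates otherwise.
/// (Hypothetical wrapper around `String::from_utf8_lossy`.)
fn to_str_lossy(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}
```

For example, `to_str(&[0xC3, 0xA9])` succeeds, while `to_str(&[0xE9])` (a lone Latin-1 "é") returns an error, and `to_str_lossy` on the same byte yields the replacement character instead.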

dralley · 2024-06-15T01:19:58

> That would require knowing the original encoding.

Or just use a library that can detect the encoding and spit out UTF-8. There are several of those.

josefx · 2024-06-15T13:16:20

Yes, you can try to automatically guess the wrong encoding based on statistics that only work some of the time when given large amounts of free-flowing text.
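To make the ambiguity concrete, here is a rough std-only sketch. `decode_latin1` and `guess_and_decode` are hypothetical helpers written for this illustration, and the "try UTF-8, else fall back to Latin-1" rule is a deliberately naive stand-in, not how any real detection library works. The point is only that the same bytes decode without error under more than one encoding, so nothing in the bytes themselves says which reading was intended.

```rust
/// Decode bytes as ISO-8859-1 (Latin-1). Every byte maps to the Unicode code
/// point with the same value, so this never fails, whatever the input was.
/// (Hypothetical helper.)
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Naive guesser, assumed for illustration only: accept the bytes as UTF-8 if
/// they validate, otherwise fall back to Latin-1. The fallback always
/// "succeeds", so a wrong guess is silent.
fn guess_and_decode(bytes: &[u8]) -> String {
    match std::str::from_utf8(bytes) {
        Ok(s) => s.to_owned(),
        Err(_) => decode_latin1(bytes),
    }
}
```

The two bytes `[0xC3, 0xA9]` come out as "é" through `guess_and_decode` but as "Ã©" through `decode_latin1`, and both are legal decodings. Likewise a lone `0xE9` falls back to Latin-1 "é" even if the file was really written in some other single-byte code page.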