Hacker News

Not actually the case. With UTF-8, using only SPECIFIC operations that never split a string mid-sequence or replace anything that isn't an exact byte-for-byte match for a given text, valid input will always produce valid output.
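A minimal sketch of that claim in Python, operating purely on bytes. UTF-8 is self-synchronizing: bytes 0x00–0x7F never appear inside a multi-byte sequence, so splitting on an ASCII delimiter or replacing an exact byte-for-byte match cannot corrupt the encoding. The sample strings are just illustrations.

```python
# Byte-level operations on UTF-8 that preserve validity, even though
# nothing here is Unicode-aware: we only split on an ASCII delimiter
# and replace exact byte-for-byte matches.
data = "naïve,café,日本語".encode("utf-8")

# Splitting on an ASCII comma: every piece is still valid UTF-8,
# because a comma byte can never occur inside a multi-byte sequence.
parts = data.split(b",")
for p in parts:
    p.decode("utf-8")  # would raise UnicodeDecodeError if invalid

# Replacing an exact match (itself valid UTF-8) is equally safe.
swapped = data.replace("café".encode("utf-8"), "tea".encode("utf-8"))
assert swapped.decode("utf-8") == "naïve,tea,日本語"
```

The same guarantee fails the moment an operation slices at an arbitrary byte offset, e.g. `data[:7]`, which can cut a multi-byte sequence in half.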

All interchanged Unicode text should be UTF-8; never use another encoding without a really compelling reason.*

*No, storing it as an array of Unicode characters isn't a compelling reason during interchange.

ALSO, never use a BOM; that will break things.
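A short illustration of how a UTF-8 BOM breaks things: the three bytes EF BB BF decode as U+FEFF, which silently pollutes the start of the text and defeats naive comparisons and parsing. (The `"hello"` payload is just an example.)

```python
# A file that starts with a UTF-8 BOM followed by ordinary text.
bom_file = b"\xef\xbb\xbfhello"

text = bom_file.decode("utf-8")   # the BOM survives as a character
assert text != "hello"            # naive comparison now fails
assert text == "\ufeffhello"      # U+FEFF is sitting at the front

# Only the dedicated 'utf-8-sig' codec strips it for you.
assert bom_file.decode("utf-8-sig") == "hello"
```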

The second answer (it should be anchor-linked) at the Stack Overflow question below covers MOST of the advantages of UTF-8, but it doesn't capture that some carefully chosen operations in otherwise completely Unicode //unaware// 'string' functions leave the string's validity unchanged.

The only potential issue is if recognizing the same text in different normalization forms is important. However, for nearly all quick and dirty tasks (where a short script is most likely) it usually doesn't matter. For everything else, a different paradigm than the one Python 3 picked would be better: one where adding decoding filters to a file you've read is OPTIONAL, and where they can be invoked on individual byte-strings as well.
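The normalization caveat, sketched with the standard `unicodedata` module: the same visible text can be stored precomposed (NFC) or decomposed (NFD), and byte-level matching will not see them as equal.

```python
import unicodedata

# The same visible word in two normalization forms.
nfc = "caf\u00e9"                            # é as one precomposed code point
nfd = unicodedata.normalize("NFD", nfc)      # e + U+0301 combining accent

assert nfc != nfd                            # different code point sequences
assert nfc.encode("utf-8") != nfd.encode("utf-8")  # different UTF-8 bytes

# Normalizing both sides to the same form makes matching work again.
assert unicodedata.normalize("NFC", nfd) == nfc
```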

https://stackoverflow.com/questions/6882301/what-is-the-best...



