That sounds like a completely correct way to look at it. I'd put "stream of bytes" and "seekable stream of bytes" above files, but that's just nitpicking.

For me the toxic idea about text files is that they're a thing at all. They're just binary files containing encoded text, and the lack of any encoding marker makes them an ideal trap. Is a UTF-16 file a text file? Is a Shift-JIS file a text file? Have fun guessing the edge cases. With Unicode we've already accepted that the "text", the letters, is something separate from the encoding.




Totally agree that everything should be a byte stream. Even with Python 3.x, text files are still confusing: if you open a UTF-8 file with a BOM at the front as a text file, should that BOM be part of the file contents, or transparently removed? By default, Python treats it as actual content, which can screw all sorts of things up. In my ideal world, every file is a binary file, and if you want it to be a text file, you just open it with whatever encoding you think appropriate (typically UTF-8).
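
A quick illustration of the difference (a minimal sketch; the file name is made up, and 'utf-8-sig' is the stdlib codec that strips the BOM):

    # Write a small UTF-8 file that starts with a BOM.
    with open('bom.txt', 'w', encoding='utf-8-sig') as f:
        f.write('hello')

    # Plain 'utf-8' keeps the BOM as a real character in the data...
    with open('bom.txt', 'r', encoding='utf-8') as f:
        print(repr(f.read()))   # '\ufeffhello'

    # ...while 'utf-8-sig' strips it transparently.
    with open('bom.txt', 'r', encoding='utf-8-sig') as f:
        print(repr(f.read()))   # 'hello'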

If you don't know the encoding? Just write a quick detect_bom function (it should be part of the standard library, no idea why it isn't) and then open the file with that encoding, e.g.:

    encoding = detect_bom(fn)
    with open(fn, 'r', encoding=encoding) as f:
        ...
That also has the benefit of keeping the BOM out of the text you read.
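
There is no detect_bom in the standard library, so here is a minimal sketch of what such a helper could look like, built on the BOM constants in the codecs module. The returned names are chosen so the decoder consumes the BOM, and the UTF-8 default at the end is an assumption:

    import codecs

    def detect_bom(fn, default='utf-8'):
        """Guess an encoding from a leading BOM; fall back to `default`."""
        with open(fn, 'rb') as f:
            head = f.read(4)
        # Check the longer BOMs first: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
        boms = [
            (codecs.BOM_UTF32_LE, 'utf-32'),     # 'utf-32'/'utf-16' read and drop the BOM
            (codecs.BOM_UTF32_BE, 'utf-32'),
            (codecs.BOM_UTF8,     'utf-8-sig'),  # '-sig' makes the decoder drop the BOM
            (codecs.BOM_UTF16_LE, 'utf-16'),
            (codecs.BOM_UTF16_BE, 'utf-16'),
        ]
        for bom, name in boms:
            if head.startswith(bom):
                return name
        return default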

Ultimately, putting the responsibility for determining the codec on the user at least makes it clear what they are doing: opening a binary file and decoding it. That mental model prepares them for the first time they run into, say, a cp587 file.

I understand why Python doesn't do this (it adds a bit of complexity), though you could have an "auto-detect" encoding scheme that tries to determine the encoding and defaults to UTF-8. Not perfect, since you can't absolutely determine the codec of a file just by reading it, but better than what we have today, where your code crashes the first time a BOM upsets the UTF-8 decoder.

I finally wrote a library function to guess codecs and read text files, inspired by https://stackoverflow.com/a/24370596/1637450, and I haven't been tripped up since.

But Python does not make it easy to open "text" files; I know data engineers who've been doing this for years and still get tripped up.


chardet, a Python port of Mozilla's encoding detector, already detects encodings if you need such a thing.
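
Rough usage, assuming the third-party chardet package is installed (fn and the UTF-8 fallback are just illustrative):

    import chardet

    with open(fn, 'rb') as f:
        raw = f.read()

    guess = chardet.detect(raw)   # e.g. {'encoding': 'SHIFT_JIS', 'confidence': 0.99, ...}
    text = raw.decode(guess['encoding'] or 'utf-8')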



