Totally agree that everything should be a byte stream. Even with Python 3.x, text files are still confusing: if you open a UTF-8 file with a BOM at the front as a text file, should that BOM be part of the file contents, or transparently removed? By default, Python treats it as actual content (the default 'utf-8' codec keeps it; you need 'utf-8-sig' to strip it), which can screw all sorts of things up. In my ideal world, every file is a binary file, and if you want it to be a text file, you just open it with whatever encoding scheme you think appropriate (typically UTF-8).
Don't know the encoding? Just write a quick detect_bom function (it should be part of the standard library; no idea why it isn't) and open the file with that, e.g.:
encoding = detect_bom(fn)
with open(fn, 'r', encoding=encoding) as f:
    ...
That also has the benefit of removing the BOM from what you read - provided detect_bom returns a BOM-stripping codec such as 'utf-8-sig' rather than plain 'utf-8'.
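Since detect_bom isn't actually in the standard library, here is a minimal sketch of one (the name and the default fallback are my own choices; the BOM constants come from the stdlib codecs module):

import codecs

# UTF-32 BOMs must be checked before UTF-16 ones, since the UTF-16
# BOMs are prefixes of the UTF-32 BOMs. The returned codec names are
# chosen so that decoding strips the BOM: 'utf-8-sig' drops the UTF-8
# BOM, and the endian-agnostic 'utf-16'/'utf-32' codecs consume the
# BOM to pick a byte order.
_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]

def detect_bom(fn, default='utf-8'):
    with open(fn, 'rb') as f:
        prefix = f.read(4)          # longest BOM is 4 bytes
    for bom, encoding in _BOMS:
        if prefix.startswith(bom):
            return encoding
    return default                  # no BOM: assume the default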
Ultimately, putting the responsibility for determining the codec on the user at least makes it clear to them what they are doing - opening a binary file and decoding it. That mental model prepares them for the first time they run into, say, a cp587 file.
I understand why Python doesn't do this - it adds a bit of complexity - though you could have an "auto-detect" encoding scheme that tried to determine the encoding and defaulted to UTF-8. Not perfect, since you can't always determine the codec of a file just by reading it, but better than what we have today, where your code crashes the first time a BOM upsets the UTF-8 decoder.
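With a default argument, the detect_bom sketch above already gives you that auto-detect-with-UTF-8-fallback behaviour:

# BOM present -> use it; no BOM -> assume UTF-8, as described above.
with open(fn, encoding=detect_bom(fn, default='utf-8')) as f:
    text = f.read()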
I finally wrote a library function to guess codecs and read text files, inspired by https://stackoverflow.com/a/24370596/1637450, and I haven't been tripped up since.
But Python does not make it easy to open "text" files correctly - I know data engineers who've been doing this for years and still get tripped up.
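For reference, a guess-and-read helper can be as simple as trying a short list of candidate codecs in order (a sketch only; the function name and candidate list are my assumptions, not the linked answer's exact code):

def read_text(fn, encodings=('utf-8-sig', 'utf-16', 'cp1252')):
    # Read raw bytes once, then try candidates strictest-first;
    # cp1252 goes last because it accepts almost any byte sequence.
    with open(fn, 'rb') as f:
        data = f.read()
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError('none of %r could decode %r' % (encodings, fn))

Trial decoding is inherently a guess - even-length binary junk will happily "decode" as UTF-16 - so keep the candidate list short and put the strictest codecs first.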