
You can easily distinguish them:

    if isinstance(msg, str)
So I don't think that's a good argument for not accepting iterables of strings in str methods. Things like replace() would benefit a lot, and it's not that hard to do; you can even accept regexes optionally: https://wonderful-wrappers.readthedocs.io/en/latest/string_w...
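
For instance, a rough sketch of a replace() that accepts either a single needle or an iterable of needles (multi_replace is a made-up helper, not an actual str method):

    def multi_replace(text, olds, new):
        if isinstance(olds, str):        # a lone string stays a single needle
            olds = [olds]
        for old in olds:
            text = text.replace(old, new)
        return text

    multi_replace("a-b_c", ["-", "_"], " ")   # -> 'a b c'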

I agree that iterating over a string is not proper design, however. It's not very useful in practice, and the O(1) access has other performance consequences for more important things.

Swift did it right IMO, but it's a much younger language.

I also wish we had stolen the file API concepts from Swift, and that open() would return a file-like object that always gives you bytes. No "b" mode. If you want text, you call open().as_text() and get a decoding wrapper.

The idea that there are text files and binary files has been toxic for a whole generation of coders.




The issue is that

    if isinstance(msg, str)
will clutter code that is otherwise clean. A single type has to be specially handled, which sticks out like a sore thumb.

As a second point, do you have more on your last sentence? ("The idea that there are text files and binary files has been toxic for a whole generation of coders."). I have been thoroughly confused about text vs. bytes when learning Python/programming.

The two types are treated as siblings, when text files are really a child of binary files. Binary files are simply regular files, and sit as the single parent, with no parent of their own, in the tree. Text files are just one of the many children, which happen to yield text when their byte patterns are interpreted using the correct encoding (or, in the spirit of Python, decoding when going from bytes to text), like UTF-8. This is just like, say, audio files yielding audio when interpreted with the correct encoding (say MP3).

Is this a valid way of seeing it? I have to ask very carefully because I have never seen it explained this way, so that is just what I put together as a mental model over time. In opposition to that model, resources like books always treat binary and text files as polar opposites/siblings.

This leads me to the initial question of whether you know of resources that would support the above model (assuming it is correct)?


That sounds like a completely correct way to look at it. I'd put "stream of bytes" and "seekable stream of bytes" above files, but that's just nitpicking.

For me the toxic idea about text files is that they're a thing at all. They're just binary files containing encoded text, and the lack of any encoding marker makes them an ideal trap. Is a UTF-16 file a text file? Is a Shift-JIS file a text file? Have fun guessing edge cases. We've already accepted with Unicode that the "text", or letters, are something separate from the encoding.


Totally agree that everything should be a byte stream. Even with Python 3.x, text files are still confusing - if you open a UTF-8 file with a BOM at the front as a text file, should that BOM be part of the file contents, or transparently removed? By default, Python treats it as actual content, which can screw all sorts of things up. In my ideal world, every file is a binary file, and if you want it to be a text file, you just open it with whatever encoding scheme you think appropriate (typically UTF-8).
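
For illustration, here is today's behaviour, assuming bom.txt is a hypothetical file whose first bytes are the UTF-8 BOM (EF BB BF) followed by "hello":

    with open("bom.txt", encoding="utf-8") as f:
        f.read()        # '\ufeffhello' - the BOM shows up as content
    with open("bom.txt", encoding="utf-8-sig") as f:
        f.read()        # 'hello'       - the -sig codec strips it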

If you don't know the encoding? Just write a quick detect_bom function (it should be part of the standard library, no idea why it isn't) and then open the file with that encoding, i.e.:

    encoding = detect_bom(fn)
    with open(fn, 'r', encoding=encoding) as f:
        ...
That also has the benefit of removing the BOM from your file.
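
Something like this rough sketch would do, assuming you only care about the standard Unicode BOMs (again, detect_bom is a hypothetical helper, not part of the standard library):

    import codecs

    def detect_bom(path, default="utf-8"):
        with open(path, "rb") as f:
            head = f.read(4)
        if head.startswith(codecs.BOM_UTF8):
            return "utf-8-sig"    # the -sig codec strips the BOM on read
        if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
            return "utf-32"       # check UTF-32 before UTF-16: its LE BOM starts with FF FE
        if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
            return "utf-16"       # the BOM-aware codec consumes the BOM itself
        return default            # no BOM: fall back to a sensible default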

Ultimately, putting the responsibility for determining the CODEC on the user at least makes it clear to them what they are doing - opening a binary file and decoding it. That mental model prepares them for the first time they run into, say, a cp587 file.

I understand why Python doesn't do this - it adds a bit of complexity - though you could have an "auto-detect" encoding scheme that tries to determine the encoding and defaults to UTF-8. Not perfect, as you can't absolutely determine the CODEC of a file just by reading it, but better than what we have today, where your code crashes when a BOM upsets the UTF-8 decoder.

I finally wrote a library function to guess codecs and read text files, inspired by https://stackoverflow.com/a/24370596/1637450 and haven't been tripped up since.

But Python does not make it easy to open "text" files - and I know data engineers who've been doing this for years who are still tripped up.


Chardet, based on Mozilla's detection code, already detects the encoding if you need such a thing.
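
A minimal sketch of how that looks (chardet is a third-party package; detect() only returns a best guess with a confidence score):

    import chardet  # pip install chardet

    with open("filename", "rb") as f:
        raw = f.read()
    guess = chardet.detect(raw)   # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
    text = raw.decode(guess["encoding"] or "utf-8")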


The open() API is inherited from the C way of doing things, where the world is divided between text files and binary files. So you open a file in either "text" mode or "binary" mode, "text" being the default behavior.

This is, of course, utterly BS.

All files are binary files.

Some contain sound data, some image data, some zip data, some pdf data, and some raw encoded text data.

But we don't have a "jpg" mode for open(). We do have higher-level APIs we pass file objects to in order to decode their content as jpg, which is what we should be doing with text. Text is not an exceptional case.

VSCode does a lot of work to turn those bytes into pretty words, just like VLC does to turn them into videos. They are not like that in the file. It's all a representation for human consumption.

The reasoning behind this confusing API is that reading text from a file is a common use case, which is true, especially on Unix, where C comes from. But a "mode" is the wrong abstraction to offer for it.

In fact, Python 3 does it partially right. It has an io.FileIO object that just takes care of opening the file, and an io.BufferedReader that wraps FileIO to offer practical methods for accessing its content.

This is what open(mode="rb") returns.

If you do open(mode="rt"), which is the default, it wraps the BufferedReader in an io.TextIOWrapper that does the decoding part transparently for you, and returns that.
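
Spelled out by hand, the layering looks roughly like this (open() normally builds the stack for you; buffering details omitted):

    import io

    raw = io.FileIO("text.txt", "r")                     # unbuffered bytes
    buffered = io.BufferedReader(raw)                    # what open("text.txt", "rb") gives you
    text = io.TextIOWrapper(buffered, encoding="utf-8")  # what open("text.txt", "r") gives you
    print(text.read())
    text.close()                                         # closing the top closes the whole stack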

There is a great explanation of this by the always excellent David Beazley: http://www.dabeaz.com/python3io_2010/MasteringIO.pdf

What it should do is offer something like this:

    with open('text.txt').as_text():
open() would always return a BufferedReader, and as_text() would always return a TextIOWrapper.

This completely separates I/O from decoding, removing confusion in the minds of all those coders who would otherwise live by the illusory binary/text model. It also makes the API much less error prone: you can easily see where the file-related arguments go (in open()) and where the text-related arguments go (in as_text()).

You can keep the mode, but only for "read", "write" and "append", removing the weird mix with "text" and "bytes" which are really related to a different set of operations.
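
A rough sketch of that shape, emulated with today's io module (as_text is hypothetical here; io.TextIOWrapper does the actual decoding and encoding):

    import io

    def as_text(buffered, encoding="utf-8"):
        return io.TextIOWrapper(buffered, encoding=encoding)

    with as_text(open("text.txt", "rb")) as f:   # reading: decode on the way out
        print(f.read())

    with as_text(open("out.txt", "wb")) as f:    # writing: encode on the way in
        f.write("text")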


Let’s be clear here that the fault is not with Python but with Windows.

Python uses text mode by default to avoid surprising beginners on Windows. If you only use Unix-like OSs you will never have this problem.
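
Concretely, the surprise being avoided is newline translation (a small sketch of the difference):

    # In text mode, Python translates newlines: '\n' is written as os.linesep
    # (i.e. '\r\n' on Windows) and '\r\n' is read back as '\n'.
    with open("demo.txt", "w") as f:    # text mode: newline translation applies
        f.write("line\n")
    with open("demo.bin", "wb") as f:   # binary mode: exactly these bytes hit the disk
        f.write(b"line\n")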


The problem is not "text mode by default". The problem is that the API offers a text mode at all.

Opening a file should return an object that gives you bytes, and that's it.

This "mode" thing is idiotic, and leak a low level API that makes no sense in a high level language with a strong abstraction for text like Python.

Text should be decoded by a wrapping object. See my other comments.


Splitting it into two parts like that would make seek() kind of funky, but I suppose it is already.


Sadly, there is no possible migration path. Because text is the default "mode".


How would this work?

    with open('text.txt', 'w').as_text():


    with open('text.txt','w').as_text() as f:
       f.write("text")


It's just too weird and open-ended.

The next thing will be a bunch of "open" functions:

   with open_binary("filename") as f:
       ...


    with open_text("filename") as f:
        ...
How do I open these files in writeable mode?

    with open_text("filename").writeable() as f:
        ...
This is getting absurd.


Ultimately this 'frustration' is always caused by loose typing and a nonexistent data model, not by the iterability of strings itself.


If that bothers you, use single dispatch.
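
Roughly like this, with functools.singledispatch (normalize is just an illustrative name):

    from functools import singledispatch

    @singledispatch
    def normalize(value):
        return list(value)    # generic iterable: explode into items

    @normalize.register
    def _(value: str):
        return [value]        # a lone string is not exploded into characters

    normalize(["a", "b"])   # ['a', 'b']
    normalize("ab")         # ['ab']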


> The idea that there are text files and binary files has been toxic for a whole generation of coders.

It really is nonsense, isn't it? It's like asking a low-level API to open files as a .doc, or as a PDF. Why would that be part of the file I/O layer?


Well, I guess it’s easy to argue that it’s so common that beginners would expect to open files as text. You can see how it would evolve that way.

Now that I'm more familiar with it, I'm careful to be explicit about the decoding when using text, to make it super obvious what's going on.


I suspect it’s also to do with Python’s history as a scripting language. Because of Perl’s obvious strengths in this area, any scripting language pretty much has to make it very easy to work with text files. Ruby does something similar for instance.

Even languages like Java now recognise the need to provide convenient access to text files as part of the standard API, with Files.readAllLines() in 7, Files.lines() in 8, and Files.readString() in 11.


You can make it easy to deal with text files without lying to your API users. open("foo", mode="t") could become open("foo").as_text().

Besides, Python has pathlib now, which allows you to do Path("foo").read_text() for quick and dirty text handling already.
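
For example (a quick sketch; the encoding argument is optional but worth being explicit about):

    from pathlib import Path

    text = Path("foo").read_text(encoding="utf-8")   # explicit decode, no "mode" in sight
    data = Path("foo").read_bytes()                  # or just stay in bytes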


The first mistake I made as a beginner was dumping a bunch of binary data as text. Something would get mangled along the way and not all of the data would be written, because I was writing it in text mode.

It just never occurred to me that the default mode for writing the file would _not_ write the array I was passing it.

It's much more important for beginners to be able to learn clear recipes than to deal with double standards and a bunch of edge cases.


I've done worse: using MySQL from PHP and not having the encoding right somewhere along the way, so all my content was being mojibaked on the way in and un-mojibaked on the way out, and I didn't notice until deep into a project, when I needed to extract it to another system.

EDIT: thanks, I knew that didn't look quite right. "Mojibake" - such a great term.


The term is actually "Mojibake", not "emoji baked". https://en.wikipedia.org/wiki/Mojibake#Etymology


More to the point, it's so common that it ought to be supported out of the box by any decent programming language, the same way you'd expect any language to support IEEE floats. That doesn't mean the mechanism for it shouldn't be (effectively) textfile(file("foo.txt")), though.



