
You can easily distinguish them:

    if isinstance(msg, str)
So I don't think that's a good argument for not accepting iterables of strings in str methods. Things like replace() would benefit a lot, and it's not that hard to do; you can even accept regexes optionally: https://wonderful-wrappers.readthedocs.io/en/latest/string_w...
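
For instance, a rough sketch of a replace() that accepts either a single needle or an iterable of needles (multi_replace is a made-up helper, not an actual str method):

    def multi_replace(text, olds, new):
        if isinstance(olds, str):        # a lone string stays a single needle
            olds = [olds]
        for old in olds:
            text = text.replace(old, new)
        return text

    multi_replace("a-b_c", ["-", "_"], " ")   # -> 'a b c'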

I agree that iterating over a string is not proper design, however. It's not very useful in practice, and the O(1) access has other performance consequences for more important things.

Swift did it right IMO, but it's a much younger language.

I also wish we had stolen the file API concepts from Swift, and that open() would return a file-like object that always gives you bytes. No "b" mode. If you want text, you call open().as_text() and get a decoding wrapper.

The idea that there are text files and binary files has been toxic for a whole generation of coders.




The issue is that

    if isinstance(msg, str)
will clutter code that is otherwise clean. A single type has to be specially handled, which sticks out like a sore thumb.

As a second point, do you have more on your last sentence? ("The idea that there are text files and binary files has been toxic for a whole generation of coders."). I have been thoroughly confused about text vs. bytes when learning Python/programming.

The two types are treated as siblings, when text files are really a child of binary files. Binary files are simply regular files, and sit as the single parent, with no parent of their own, in the tree. Text files are just one of the many children, which happen to yield text when their byte patterns are interpreted using the correct encoding (or, in the spirit of Python, decoding when going from bytes to text), like UTF-8. This is just like, say, audio files yielding audio when interpreted with the correct encoding (say MP3).

Is this a valid way of seeing it? I have to ask very carefully because I have never seen it explained this way, so that is just what I put together as a mental model over time. In opposition to that model, resources like books always treat binary and text files as polar opposites/siblings.

This leads me to the initial question of whether you know of resources that would support the above model (assuming it is correct)?


That sounds like a completely correct way to look at it. I'd put "stream of bytes" and "seekable stream of bytes" above files, but that's just nitpicking.

For me the toxic idea about text files is that they're a thing at all. They're just binary files containing encoded text, and the lack of any encoding marker makes them an ideal trap. Is a UTF-16 file a text file? Is a Shift-JIS file a text file? Have fun guessing edge cases. We've already accepted with Unicode that the "text", or letters, are something separate from the encoding.


Totally agree that everything should be a byte stream. Even with Python 3.x, text files are still confusing - if you open a UTF-8 file with a BOM at the front as a text file, should that BOM be part of the file contents, or transparently removed? By default, Python treats it as actual content, which can screw all sorts of things up. In my ideal world, every file is a binary file, and if you want it to be a text file, you just open it with whatever encoding scheme you think appropriate (typically UTF-8).
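
For illustration, here is today's behaviour, assuming bom.txt is a hypothetical file whose first bytes are the UTF-8 BOM (EF BB BF) followed by "hello":

    with open("bom.txt", encoding="utf-8") as f:
        f.read()        # '\ufeffhello' - the BOM shows up as content
    with open("bom.txt", encoding="utf-8-sig") as f:
        f.read()        # 'hello'       - the -sig codec strips it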

If you don't know the encoding? Just write a quick detect_bom function (it should be part of the standard library, no idea why it isn't) and then open the file with that encoding, i.e.:

    encoding = detect_bom(fn)
    with open(fn, 'r', encoding=encoding) as f:
        ...
That also has the benefit of removing the BOM from your file.
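
Something like this rough sketch would do, assuming you only care about the standard Unicode BOMs (again, detect_bom is a hypothetical helper, not part of the standard library):

    import codecs

    def detect_bom(path, default="utf-8"):
        with open(path, "rb") as f:
            head = f.read(4)
        if head.startswith(codecs.BOM_UTF8):
            return "utf-8-sig"    # the -sig codec strips the BOM on read
        if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
            return "utf-32"       # check UTF-32 before UTF-16: its LE BOM starts with FF FE
        if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
            return "utf-16"       # the BOM-aware codec consumes the BOM itself
        return default            # no BOM: fall back to a sensible default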

Ultimately, putting the responsibility for determining the CODEC on the user at least makes it clear to them what they are doing - opening a binary file and decoding it. That mental model prepares them for the first time they run into, say, a cp587 file.

I understand why Python doesn't do this - it adds a bit of complexity - though you could have an "auto-detect" encoding scheme that tries to determine the encoding and defaults to UTF-8. Not perfect, as you can't absolutely determine the CODEC of a file just by reading it, but better than what we have today, where your code crashes when a BOM upsets the UTF-8 decoder.

I finally wrote a library function to guess codecs and read text files, inspired by https://stackoverflow.com/a/24370596/1637450 and haven't been tripped up since.

But Python does not make it easy to open "text" files - and I know data engineers who've been doing this for years who are still tripped up.


Chardet, based on Mozilla's detection code, already detects the encoding if you need such a thing.
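
A minimal sketch of how that looks (chardet is a third-party package; detect() only returns a best guess with a confidence score):

    import chardet  # pip install chardet

    with open("filename", "rb") as f:
        raw = f.read()
    guess = chardet.detect(raw)   # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
    text = raw.decode(guess["encoding"] or "utf-8")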


The open() API is inherited from the C way of doing things, where the world is divided between text files and binary files. So you open a file in either "text" mode or "binary" mode, "text" being the default behavior.

This is, of course, utterly BS.

All files are binary files.

Some contain sound data, some image data, some zip data, some pdf data, and some raw encoded text data.

But we don't have a "jpg" mode for open(). We do have higher-level APIs we pass file objects to in order to decode their content as jpg, which is what we should be doing with text. Text is not an exceptional case.

VSCode does a lot of work to turn those bytes into pretty words, just like VLC does to turn them into videos. They are not like that in the file. It's all a representation for human consumption.

The reasoning behind this confusing API is that reading text from a file is a common use case, which is true, especially on Unix, where C comes from. But a "mode" is the wrong abstraction to offer for it.

In fact, Python 3 does it partially right. It has an io.FileIO object that just takes care of opening the file, and an io.BufferedReader that wraps FileIO to offer practical methods for accessing its content.

This is what open(mode="rb") returns.

If you do open(mode="rt"), which is the default, it wraps the BufferedReader in an io.TextIOWrapper that does the decoding part transparently for you, and returns that.
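
Spelled out by hand, the layering looks roughly like this (open() normally builds the stack for you; buffering details omitted):

    import io

    raw = io.FileIO("text.txt", "r")                     # unbuffered bytes
    buffered = io.BufferedReader(raw)                    # what open("text.txt", "rb") gives you
    text = io.TextIOWrapper(buffered, encoding="utf-8")  # what open("text.txt", "r") gives you
    print(text.read())
    text.close()                                         # closing the top closes the whole stack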

There is a great explanation of this by the always excellent David Beazley: http://www.dabeaz.com/python3io_2010/MasteringIO.pdf

What it should do is offer something like this:

    with open('text.txt').as_text():
open() would always return a BufferedReader, and as_text() would always return a TextIOWrapper.

This completely separates I/O from decoding, removing confusion in the minds of all those coders who would otherwise live by the illusory binary/text model. It also makes the API much less error prone: you can easily see where the file-related arguments go (in open()) and where the text-related arguments go (in as_text()).

You can keep the mode, but only for "read", "write" and "append", removing the weird mix with "text" and "bytes" which are really related to a different set of operations.
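
A rough sketch of that shape, emulated with today's io module (as_text is hypothetical here; io.TextIOWrapper does the actual decoding and encoding):

    import io

    def as_text(buffered, encoding="utf-8"):
        return io.TextIOWrapper(buffered, encoding=encoding)

    with as_text(open("text.txt", "rb")) as f:   # reading: decode on the way out
        print(f.read())

    with as_text(open("out.txt", "wb")) as f:    # writing: encode on the way in
        f.write("text")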


Let’s be clear here that the fault is not with Python but with Windows.

Python uses text mode by default to avoid surprising beginners on Windows. If you only use Unix-like OSs you will never have this problem.
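
Concretely, the surprise being avoided is newline translation (a small sketch of the difference):

    # In text mode, Python translates newlines: '\n' is written as os.linesep
    # (i.e. '\r\n' on Windows) and '\r\n' is read back as '\n'.
    with open("demo.txt", "w") as f:    # text mode: newline translation applies
        f.write("line\n")
    with open("demo.bin", "wb") as f:   # binary mode: exactly these bytes hit the disk
        f.write(b"line\n")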


The problem is not "text mode by default". The problem is that the API offers a text mode at all.

Opening a file should return an object that gives you bytes, and that's it.

This "mode" thing is idiotic, and leak a low level API that makes no sense in a high level language with a strong abstraction for text like Python.

Text should be decoded by a wrapping object. See my other comments.


Splitting it into two parts like that would make seek() kind of funky, but I suppose it is already.


Sadly, there is no possible migration path. Because text is the default "mode".


How would this work?

    with open('text.txt', 'w').as_text():


    with open('text.txt','w').as_text() as f:
       f.write("text")


It's just too weird and open-ended.

The next thing will be a bunch of "open" functions:

   with open_binary("filename") as f:
       ...


    with open_text("filename") as f:
        ...
How do I open these files in writeable mode?

    with open_text("filename").writeable() as f:
        ...
This is getting absurd.


Ultimately this 'frustration' is always caused by loose typing and a nonexistent data model, not by the iterability of strings itself.


If that bothers you, use single dispatch.
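
Roughly like this, with functools.singledispatch (normalize is just an illustrative name):

    from functools import singledispatch

    @singledispatch
    def normalize(value):
        return list(value)    # generic iterable: explode into items

    @normalize.register
    def _(value: str):
        return [value]        # a lone string is not exploded into characters

    normalize(["a", "b"])   # ['a', 'b']
    normalize("ab")         # ['ab']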


> The idea that there are text files and binary files has been toxic for a whole generation of coders.

It really is nonsense, isn't it? It's like asking a low-level API to open files as a .doc, or as a PDF. Why would that be part of the file I/O layer?


Well, I guess it’s easy to argue that it’s so common that beginners would expect to open files as text. You can see how it would evolve that way.

Now that I'm more familiar with it, I'm careful to be explicit about the decoding when using text, to make it super obvious what's going on.


I suspect it’s also to do with Python’s history as a scripting language. Because of Perl’s obvious strengths in this area, any scripting language pretty much has to make it very easy to work with text files. Ruby does something similar for instance.

Even languages like Java now recognise the need to provide convenient access to text files as part of the standard API, with Files.readAllLines() in 7, Files.lines() in 8, and Files.readString() in 11.


You can make it easy to deal with text files without lying to your API users. open("foo", mode="t") could become open("foo").as_text().

Besides, Python has pathlib now, which allows you to do Path("foo").read_text() for quick and dirty text handling already.
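
For example (a quick sketch; the encoding argument is optional but worth being explicit about):

    from pathlib import Path

    text = Path("foo").read_text(encoding="utf-8")   # explicit decode, no "mode" in sight
    data = Path("foo").read_bytes()                  # or just stay in bytes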


The first mistake I made as a beginner was dumping a bunch of binary data as text. Something would get mangled along the way and not all of the data would be written, because I was writing it in text mode.

It just never occurred to me that the default mode for writing the file would _not_ write the array I was passing it.

It's much more important for beginners to be able to learn clear recipes than to deal with double standards and a bunch of edge cases.


I've done worse: using MySQL from PHP and not having the encoding right somewhere along the way, so all my content was being mojibaked on the way in and un-mojibaked on the way out, and I didn't notice until deep into a project, when I needed to extract it to another system.

EDIT: thanks, I knew that didn't look quite right. "Mojibake" - such a great term.


The term is actually "Mojibake", not "emoji baked". https://en.wikipedia.org/wiki/Mojibake#Etymology


More to the point, it's so common that it ought to be supported out of the box by any decent programming language, the same way you'd expect any language to support IEEE floats. That doesn't mean the mechanism for it shouldn't be (effectively) textfile(file("foo.txt")), though.



