So I don't think that's a good argument for not accepting iterables of strings in str methods. Things like replace() would benefit a lot, and it's not that hard to do; you could even accept regexes optionally: https://wonderful-wrappers.readthedocs.io/en/latest/string_w...
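Something like this hypothetical multi_replace helper (the name and signature are mine, not a real str method) is all it would take:

    def multi_replace(s, old, new):
        # Accept a single substring or any iterable of substrings
        if isinstance(old, str):
            old = (old,)
        for pattern in old:
            s = s.replace(pattern, new)
        return s

    multi_replace("a-b_c", ["-", "_"], " ")   # 'a b c'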
I agree that iterating over a string is not proper design, however. It's not very useful in practice, and the O(1) access has other performance consequences for more important things.
Swift did it right IMO, but it's a much younger language.
I also wish we had stolen the file API concepts from Swift, and that open() would return a file-like object that always gives you bytes. No "b" mode. If you want text, you call open().as_text() and get a decoding wrapper.
The idea that there are text files and binary files has been toxic for a whole generation of coders.
… will clutter code that is otherwise clean. A single type has to be specially handled, which sticks out like a sore thumb.
As a second point, do you have more on your last sentence? ("The idea that there are text files and binary files has been toxic for a whole generation of coders.").
I have been thoroughly confused about text vs. bytes when learning Python/programming.
The two types are treated as siblings, when text files are really a child of binary files. Binary files are simply regular files, and sit as the single parent, without any parent of their own, at the root of the tree. Text files are just one of many children that happen to yield text when their byte patterns are interpreted using the correct encoding (or, in the spirit of Python, decoded when going from bytes to text), like UTF-8. This is just like, say, audio files yielding audio when interpreted with the correct encoding (say, MP3).
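In code, that mental model would read something like this (a toy example, assuming a UTF-8 file):

    raw = open('notes.txt', 'rb').read()   # bytes: the one parent representation
    text = raw.decode('utf-8')             # one of many possible interpretations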
Is this a valid way of seeing it? I ask carefully because I have never seen it explained this way; it is just the mental model I put together over time. In opposition to that model, resources like books always treat binary and text files as polar opposites/siblings.
This leads me to the initial question of whether you know of resources that would support the above model (assuming it is correct)?
That sounds like a completely correct way to look at it. I'd put "stream of bytes" and "seekable stream of bytes" above files, but that's just nitpicking.
For me the toxic idea about text files is that they're a thing at all. They're just binary files containing encoded text, and the lack of any encoding marker makes them an ideal trap. Is a UTF-16 file a text file? Is a Shift-JIS file a text file? Have fun guessing the edge cases. With Unicode we've already accepted that the "text", the letters, is something separate from the encoding.
Totally agree that everything should be a byte stream. Even with Python 3.x text files are still confusing - if you open a UTF-8 file with a BOM in the front as a text file - should that BOM be part of the file contents, or transparently removed? By default, Python treats it as actual content, which can screw all sorts of things up. In my ideal world, every file is a binary file, and that if you want it to be a text file - just open it with whatever encoding scheme you think appropriate (typically UTF-8).
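For what it's worth, the stdlib does ship a codec that does the transparent removal: 'utf-8-sig' strips a leading BOM on read (and writes one on write), while plain 'utf-8' hands the BOM back to you as '\ufeff':

    with open('data.txt', encoding='utf-8-sig') as f:
        text = f.read()   # no leading '\ufeff', even if the file has a BOM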
Don't know the encoding at all? Just write a quick detect_bom function (it should be part of the standard library; no idea why it isn't) and then open the file with that encoding, i.e.:
    encoding = detect_bom(fn)
    with open(fn, 'r', encoding=encoding) as f:
        ...
That also has the benefit of removing the BOM from the text you read.
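A rough sketch of what detect_bom could look like, built on the BOM constants the codecs module already exposes (illustrative only - it returns the BOM-aware codec names, which is what makes the decoder strip the BOM, as noted above):

    import codecs

    def detect_bom(fn, default='utf-8'):
        with open(fn, 'rb') as f:
            head = f.read(4)   # the longest BOM is 4 bytes
        # Check the 4-byte BOMs first: the UTF-32-LE BOM starts with
        # the UTF-16-LE BOM, so order matters here.
        for bom, encoding in [
            (codecs.BOM_UTF32_LE, 'utf-32'),
            (codecs.BOM_UTF32_BE, 'utf-32'),
            (codecs.BOM_UTF8, 'utf-8-sig'),
            (codecs.BOM_UTF16_LE, 'utf-16'),
            (codecs.BOM_UTF16_BE, 'utf-16'),
        ]:
            if head.startswith(bom):
                return encoding
        return default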
Ultimately, putting the responsibility for determining the codec on the user at least makes it clear to them what they are doing - opening a binary file and decoding it. That mental model prepares them for the first time they run into, say, a cp587 file.
I understand why Python doesn't do this - it adds a bit of complexity - though you could have an "auto-detect" encoding scheme that tries to determine the encoding and defaults to UTF-8. Not perfect, since you can't absolutely determine the codec of a file just by reading it, but better than what we have today, where your code crashes on a BOM that upsets the UTF-8 decoder.
The open() API is inherited from the C way of doing things, where the world is divided between text files and binary files. So you open a file in "text" mode or in "binary" mode, "text" being the default behavior.
This is, of course, utter BS.
All files are binary files.
Some contain sound data, some image data, some zip data, some PDF data, and some raw encoded text data.
But we don't have a "jpg" mode for open(). We do have higher-level APIs that we pass file objects to in order to decode their content as JPEG, and that is what we should be doing with text. Text is not an exceptional case.
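For instance, with Pillow (just as an illustration of the pattern):

    from PIL import Image

    with open('photo.jpg', 'rb') as f:   # bytes in...
        img = Image.open(f)              # ...a higher-level API does the decoding
        img.load()                       # Image.open is lazy; read while f is open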
VSCode does a lot of work to turn those bytes into pretty words, just like VLC does to turn bytes into video. They are not like that in the file. It's all a representation for human consumption.
The reasoning for this confusing API is that reading text from a file is a common use case, which is true, especially on Unix, where C comes from. But a "mode" is the wrong abstraction to offer for it.
In fact, Python 3 does it partially right. It has an io.FileIO object that just takes care of the raw file access, and an io.BufferedReader that wraps FileIO to offer practical methods for accessing its content.
This is what open(mode="rb") returns.
If you do open(mode="rt"), which is the default, it wraps the BufferedReader in an io.TextIOWrapper that does the decoding transparently for you, and returns that.
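You can see the layering directly; the pieces compose like this:

    import io

    raw = io.FileIO('notes.txt', 'r')                    # raw binary file access
    buffered = io.BufferedReader(raw)                    # ~ what open('notes.txt', 'rb') returns
    text = io.TextIOWrapper(buffered, encoding='utf-8')  # ~ what open('notes.txt') returns
    print(type(text.read()))                             # <class 'str'>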
Under the API I'd want, open() would always return a BufferedReader, and as_text() would always return a TextIOWrapper.
This completely separates I/O from decoding, removing the confusion in the minds of all the coders who would otherwise live by the illusory binary/text model. It also makes the API much less error-prone: you can easily see where the file-related arguments go (to open()) and where the text-related arguments go (to as_text()).
You could keep the mode, but only for "read", "write" and "append", removing the weird mix with "text" and "bytes", which really belong to a different set of operations.
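A minimal sketch of that proposal (the File class and as_text() names are invented for illustration; the decoding layer is just io.TextIOWrapper):

    import io

    class File:
        """Always binary; mode is only 'r', 'w' or 'a'."""
        def __init__(self, path, mode='r'):
            self._buffered = open(path, mode + 'b')   # no 't'/'b' flags exposed

        def read(self, size=-1):
            return self._buffered.read(size)          # always bytes

        def as_text(self, encoding='utf-8', **kwargs):
            # Decoding is a separate, explicit layer, like any other codec
            return io.TextIOWrapper(self._buffered, encoding=encoding, **kwargs)

    # data = File('image.jpg').read()            # bytes, no mode games
    # text = File('notes.txt').as_text().read()  # explicit decoding step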
I suspect it’s also to do with Python’s history as a scripting language. Because of Perl’s obvious strengths in this area, any scripting language pretty much has to make it very easy to work with text files. Ruby does something similar for instance.
Even languages like Java now recognise the need to provide convenient access to text files as part of the standard API, with Files.readAllLines() in 7, Files.lines() in 8, and Files.readString() in 11.
The first mistake I made as a beginner was dumping a bunch of binary data as text. Something would happen along the way and not all the data would be written, because I was writing it in text mode.
It just never occurred to me that the default mode of writing the file would _not_ write the array I was passing it.
It’s much more important for beginners to be able to learn clear recipes than to navigate double standards with a bunch of edge cases.
I’ve done worse: using MySQL from PHP and not having the encoding right somewhere along the way, so all my content was being mojibaked on the way in and un-mojibaked on the way out, and I didn’t notice until deep into a project, when I needed to extract it to another system.
EDIT: thanks, I knew that didn't look quite right. "Mojibaked" - such a great term.
More to the point, it's so common that it ought to be supported out of the box by any decent programming language, the same way you'd expect any language to support IEEE floats. That doesn't mean the mechanism for it shouldn't be (effectively) textfile(file("foo.txt")), though.