I found the link to Stroustrup's "Learning Standard C++ as a New Language" [edit: from 1998, which explains why I found it kind of familiar to my late 90s run-in with C++...] interesting[2,1]. But then I tried to find out what's the current state of the art for idiomatic, cross-platform text processing (in Unicode, probably UTF-8) -- and got a little sad.
There's still no way to write a ten(ish) line C++ program that reads and writes text, and that works both on the Windows console and under OS X/*bsd and Linux?
(Let's go crazy: say you have to implement a minimal cat and tac (reverses stdin on stdout) -- and also a corresponding echo and ohce (I made that up: something that takes string(s) as input and outputs the characters reversed, i.e. "ohce olé there" outputs "ereht élo").)
Writing tac in C++ is trivial:

    std::deque<std::string> lines;
    for (std::string line; std::getline(std::cin, line); )
        lines.push_front(line);
    for (const auto &line : lines)
        std::cout << line << std::endl;

Note that I typed this quickly off the top of my head: I'm willing to believe there is a trivial typo the compiler would catch, but the overall implementation should be fine ;P.
As for ohce, you have defined a very very hard problem, one that it does not seem you realize is quite as hard as it actually is: if you have a sequence of Unicode codepoints and reverse their order you do not end up with a string of reversed characters, not in the general case, and not even for some reasonable encodings of seemingly-simple cases like an accented letter e.
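To make the accented-e case concrete, here is a small sketch using std::u32string, so that each element is exactly one code point. The function name reverse_code_points and the two sample strings are mine, purely for illustration. With é stored as the single code point U+00E9 the naive reversal looks fine; with é stored as 'e' followed by the combining acute accent U+0301, the accent ends up at the front of the result, detached from its letter.

```cpp
#include <algorithm>
#include <string>

// Naive approach: reverse the sequence of code points and nothing else.
std::u32string reverse_code_points(std::u32string s) {
    std::reverse(s.begin(), s.end());
    return s;
}

// "olé" with é as the single precomposed code point U+00E9.
const std::u32string precomposed = U"ol\u00E9";
// The same text with é written as 'e' + U+0301 (combining acute accent).
const std::u32string decomposed = U"ole\u0301";
```

Reversing precomposed gives U+00E9, 'l', 'o', which is what you would want. Reversing decomposed gives U+0301, 'e', 'l', 'o': the combining mark now comes before the 'e' instead of after it.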
Like, I challenge you to provide a working version of ohce in Python (2 or 3: your choice). Virtually no language actually provides a string type that makes this problem reasonable. It simply isn't fair to pick on C++ in this regard when no language "gets this right": at least C++ is being honest about the lack of guarantees it is making about string manipulation.
For more information, I recommend reading this article:
> Like, I challenge you to provide a working version of ohce in Python (2 or 3: your choice).
Not a full implementation, but wouldn't this approach actually work (note, doesn't work for python2):
$ python3
Python 3.2.3 (default, Feb 20 2013, 14:44:27)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> "abc"[::-1]
'cba'
>>> "øæåにほ言"[::-1]
'言ほにåæø'
[edit: with accents]:
>>> "eẽêëèøæåにほ言"[::-1]
'言ほにåæøèëêẽe'
[edit2: formatting, indentation]
There might very well be problems with this, but I'm not aware of any?
[edit4: This is indeed broken for the ligature (baffle) case in python3. I'm not sure that is an entirely fair test (but it is very interesting). I would argue that the ligature should probably be a replacement done for display/print, not stored in a text file, just like the reverse of "æ" isn't a reversed composition of "e" and "a" (even if "æ" might be seen as a composition of "a" and "e").
I'm not sure how it deals with changing direction (left-to-right, right-to-left) -- comments welcome.]
[edit3, sorry for the many edits]
To be clear, I do not wish to "pick on c++", nor do I think the example is trivial. I do think it probably should be trivial -- it is something that should be supported in a canonical way by a standard library/implementation.
Working with graphemes is a very fundamental part of working with text -- the fact that half(?) of developers have been able to hide behind ASCII isn't a good excuse for not fixing it. How would one implement an editor if you can't access graphemes in a reasonable way?
And more importantly, how would you test for palindromes? ;-)
For the "reversing Unicode text" problem the easiest C++ solution is probably to use an external library such as QtCore or ICU (Qt uses ICU internally).
Unfortunately even in UTF-16 grapheme clusters do not correspond 1:1 with Unicode code points so you wouldn't be able to just reverse a list. But Qt can split up a QString into its grapheme clusters (a quick example I had made a couple of months ago):
    #include <algorithm>
    #include <QString>
    #include <QTextBoundaryFinder>

    static QString reverse(QString src)
    {
        auto src_nfc = src.normalized(QString::NormalizationForm_C);
        QChar *start = src_nfc.data();
        int length = src_nfc.length();
        QTextBoundaryFinder finder(QTextBoundaryFinder::Grapheme, start, length);
        finder.toStart();

        // First pass: reverse the code units inside every grapheme cluster
        // that spans more than one code unit (a cluster can span several code
        // points, so this would be needed even in UCS-4!)
        while(finder.position() < src_nfc.length()) {
            int oldPos = finder.position();
            finder.toNextBoundary();
            int newPos = finder.position();
            if(newPos - oldPos > 1) {
                std::reverse(start + oldPos, start + newPos);
            }
        }

        // Second pass: reverse the whole buffer. Each cluster is flipped a
        // second time, so it ends up intact but in reversed position.
        std::reverse(start, start + length);
        return src_nfc;
    }
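The two-pass trick (reverse inside each cluster, then reverse the whole buffer) can also be tried without Qt, as long as you accept a deliberately crude notion of "cluster". The sketch below assumes a cluster is one code point followed by any number of code points from the Combining Diacritical Marks block (U+0300 to U+036F). That is nowhere near the real grapheme cluster rules that Qt or ICU implement, and the names is_combining_mark and reverse_clusters are made up for this example.

```cpp
#include <algorithm>
#include <string>

// Assumption for this sketch only: just U+0300..U+036F count as combining marks.
bool is_combining_mark(char32_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// Reverses the order of the (crudely defined) clusters in s.
std::u32string reverse_clusters(std::u32string s) {
    auto first = s.begin();
    while (first != s.end()) {
        // A cluster is one code point plus the combining marks that follow it.
        auto last = first + 1;
        while (last != s.end() && is_combining_mark(*last))
            ++last;
        std::reverse(first, last);  // first pass: flip inside the cluster
        first = last;
    }
    // Second pass: flip everything, which restores each cluster's inner order.
    std::reverse(s.begin(), s.end());
    return s;
}
```

With the input "ole" + U+0301 + " there" this keeps the accent attached to its 'e' while the letters around it are reversed. Anything outside that one block (emoji sequences, Hangul jamo, other combining ranges) would still break, which is the reason to reach for a library.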
This. I am very happy with how C++ has evolved and how expressive it has become today, but decent Unicode support, at least in the stdlib, is something any programmer would (hopefully) look for in a modern language. Thanks for the interesting links, though.
Any canonical examples to go with those two? As I understand it now, I can pretty much get away with utf8 and some locale code, as long as I stick to (possibly some subset of distributions of) Linux. Which really is fine for my use case, but it's not really a very nice stance to take (it's all fun and games until you need to work in an environment where you for some reason or other can't change the OS, and need that clever utility that wasn't quite as standard/cross-platform as it maybe should've been...).
[2][edit] The direct link to Stroustrup's paper is:
http://stroustrup.com/new_learning.pdf
code:
http://isocpp.org/wiki/faq/newbie#simple-program
[1]
The FAQ briefly touches on Unicode, but it doesn't seem very helpful (to me): http://isocpp.org/wiki/faq/cpp11-language-misc (search page for unicode)
Trying to look for a (simple, generally accepted) solution, I came across:
http://www.utf8everywhere.org/ (If I'm reading this right, it says assume std::string is utf8, but I'm not sure if there are std-lib functions for doing stuff like getting the index of a glyph, and reversing strings by glyph? And will they work on windows?)
http://stackoverflow.com/questions/2037765/what-is-the-optim...
Which points to: http://utfcpp.sourceforge.net/ which is the best I'm aware of so far.
http://stackoverflow.com/questions/8513249/handling-utf-8-in...
http://stackoverflow.com/questions/402283/stdwstring-vs-stds...
(Suggests using wstring on Windows and string on Linux -- for simple programs that would effectively mean writing two versions, one for each platform.)
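To show what "two versions" tends to look like in practice, here is a rough sketch of the usual workaround: pick the string type, the literal prefix and the output stream at compile time. All the names here (native_string, native_ostream, NATIVE_TEXT, native_out) are hypothetical and only for illustration; _WIN32 is the macro Windows compilers predefine. It does nothing to make the Windows console actually display the text, which needs separate setup.

```cpp
#include <iostream>
#include <string>

#ifdef _WIN32
using native_string = std::wstring;   // wide strings on Windows
#define NATIVE_TEXT(s) L##s           // turns "abc" into L"abc"
#else
using native_string = std::string;    // byte strings (assumed UTF-8) elsewhere
#define NATIVE_TEXT(s) s
#endif

// Output stream type whose character type matches native_string.
using native_ostream = std::basic_ostream<native_string::value_type>;

// Returns std::wcout on Windows and std::cout everywhere else.
inline native_ostream& native_out() {
#ifdef _WIN32
    return std::wcout;
#else
    return std::cout;
#endif
}
```

Every string literal then has to go through NATIVE_TEXT(...) and every write through native_out(), which is a fair bit of ceremony for a ten-line program.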