Swedish Exchange paralyzed by buy order for "-6 stocks" (translate.google.com)
80 points by ComputerGuru on Nov 29, 2012 | 50 comments



In case anyone is perplexed by the "-6" comment: the decimal number -6, represented as a 32-bit signed int in two's complement, is

    11111111111111111111111111111010
When interpreted as an unsigned int, this is 4,294,967,290 = 2^32 - 6.
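
To make the arithmetic concrete, here is a minimal C sketch of that reinterpretation. The function names are made up for illustration, and nothing here is taken from the exchange's actual system; it only shows that converting -6 to a 32-bit unsigned type yields 2^32 - 6.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a signed 32-bit quantity to unsigned, as an implicit
   conversion in C would. The result is the value modulo 2^32. */
static uint32_t as_unsigned(int32_t quantity) {
    return (uint32_t)quantity;
}

/* The same bit pattern written out directly: 1111...1010 is 0xFFFFFFFA. */
static uint32_t minus_six_bits(void) {
    uint32_t bits = 0xFFFFFFFAu;
    assert(bits == (uint32_t)-6); /* identical to signed -6 converted */
    return bits;
}
```

Calling `as_unsigned(-6)` gives 4294967290, the quantity that showed up in the order.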


As a native Swedish speaker, I was curious to see how well the auto-translator would handle this rather niche financial text.

The wording "The value of words corresponding to" in the second sentence confused me, but I first chalked it up to me not understanding finance-speak in English well enough.

Then I realized hovering shows the original text, and investigated. The original Swedish text has a typo! It says "orden" (="the words") where it should say "ordern" (="the [purchase] order").

I found that amusing. :)


Just curious, how accurate was the rest of the translation?


Well, for a translation about finance it is (very!) surprising that it seems to have just transformed "kronor" (="crowns", the everyday name for the SEK, the Swedish currency) into "dollars". It's not as if the two currencies are 1:1, or even close to it. Very strange.

Also there was a Swedish word that was just copied ("anullerades", meaning more or less "cancelled", "voided" or something along those lines) into the English text.

But, overall, it's still rather impressive.


I can confirm the translating-names-of-currencies issue, but something even worse is that the names of languages can also be similarly converted.

For example, the word "Eesti" will often get Google-Translated to "English" rather than "Estonian".

This means that a film at my local cinema that my web browser assures me is in "English" will in fact be in Estonian. And an interview with a Russian saying that he doesn't speak Estonian gets translated so that he appears to say that he doesn't speak English.

The product designers special-cased language names, doing extra work to produce what will almost always be the wrong result.

(And what they can do to place names is often patently ridiculous. For example, "Peterburi tee" should either be left alone or maybe translated to "St Petersburg Road" but actually somehow becomes "Hertford Road". And the ZIP + City name "13415 Tallinn" becomes "thirteen thousand four hundred and fifteen Tallinn".)


> The product designers special-cased language names, doing extra work to produce what will almost always be the wrong result.

They absolutely did not do this. It's an artefact of statistical translation. In the corpus there are a lot of English documents saying "This document is in English", whose translated versions in Afrikaans (because I know Afrikaans) say "Hierdie dokument is in Afrikaans". Thus the translator learns that "hierdie" is Afrikaans for "this", "dokument" is "document", ..., and "English" is "Afrikaans".

The street name issue probably comes from an organisation whose Estonian office is in Peterburi tee and whose English office is in Hertford Road.


> The product designers special-cased language names, doing extra work to produce what will almost always be the wrong result.

Are you sure about that? It seems quite likely to me that the word for "English" simply tends to appear in the same context (N-gram etc.) as the word for "Estonian". For example, the sentence "I speak English" would be common in English, while "I speak Estonian" would be common in Estonian, so it might associate the words with each other.


The machine learning behind Google's translations uses a corpus that consists of texts that have already been translated into many different languages. But if some of those documents start off with, say, "this is the <language> version of this press release" that's exactly the sort of thing that would confuse an algorithm that can't distinguish between translation and localization. Pure conjecture of course. I'm just not sure if anything is being special-cased.


One of the things Google Translator is horrible about is translating stuff like kroner into dollars, SEK into USD, etc. Granted, there are cases where translating kroner into dollars is ok, but I'm guessing this is the wrong translation most of the time.


Since it detects the value being qualified by the currency and localizes the unit in front of the value, I would understand that behavior if it also converted the cash value at the current rate (which would arguably be somewhat less wrong), but it does not [0]. Also, weirdness comes when it changes it to DKK for 100 [1], and $ for 1000 [0]. It just makes no sense.

[0]: http://translate.google.com/#auto/en/1000%20kroners

[1]: http://translate.google.com/#auto/en/100%20kroners


It makes no sense to the casual user, but you can understand why it does it when you learn that Google Translate uses statistical methods to learn. Google feeds Translate with articles and pages that have already been translated by a human, and the program learns the translations of words and sentences from that. The problem arises when the two documents it's taught with differ slightly. With translations, this often happens with currencies, country names (lots of examples of Translate screwing those up too) and numbers.


Are there common expressions that are equivalent in English and Swedish with the respective currencies? E.g. stuff like "dollar to dollar", "bet someone dollars to doughnuts"?

Anyway, plenty of text would probably match word-wise ("X bought Y for $AMOUNT $UNIT") in financial news, so the mapping of "dollar" over "kronor" seems a reasonable error.

Probably for the same reason, Google also translates Hungarian "1000 forint" to English "1000 HUF", going from the full word to a quasi-acronym for "HUngarian Forint".


I would guess "annulled" to be the most direct translation of "anullerades" ("annulled" carries the same meaning as "voided" or "invalidated" but tends to be used more in legal contexts).


...making "annulled" a relatively rare word in the English training corpus.


> it seems to have just transformed "kronor" (...) into "dollars".

I saw some article once where the name of a prime minister or some such was "translated" to the name of the US president. Weird.


Just an FYI, there is no way to tell how much of that article is machine translated since Google Translate allows anyone with a Google account to "Contribute to the translation".

Though I imagine it is mostly machine translated and it requires several agreeing "contributions" before it accepts them as accurate.


Does the contribution method actually change the document being edited or just add the parallel texts to the training data?


Alright, but it has its faults - it missed the word "error": "Instead, it is about a parsing [error] incurred in exchange system due to a technical error"


I was profoundly impressed. At first, I didn't realise it was a translation. Apart from a few odd passages, the text flows really well.


Well, that's obviously because English is a Scandinavian language... http://www.apollon.uio.no/english/articles/2012/4-english-sc...


The funny thing about translations (or automated understanding of text snippets) is that it gets easier when you know the subject domain, since that removes a lot of ambiguity in the individual expressions. So it might be feasible for a translator to make a good guess about the topic, but I don't know whether Google Translate attempts that.


"Sweden's gross domestic product, by comparison, amounted in 2011 to more than 3500 billion."

The irony.

Looks like a typo, it should be around 500bn$. 3500 would be Germany. Proof that "fat finger" human mistakes can do as much damage as algos gone wild.


As mentioned in another comment, it's not a typo, it's a translation error. It's actually 3500 billion SEK, which is roughly 500 billion USD.


Remember all those compiler warnings about mixing signed and unsigned integers? Now you know why.


In the stock market there are no sanity checks, else the whole thing would grind to a halt. :-)


And that's why you should always use unsigned ints

And put sensible limits and checks in processing untrusted data (that is, everything that comes from the outside)


Actually, I find you should always use signed ints.

With signed ints, you can notice that '2-3' is below 0, and act accordingly. If you do '2-3' with unsigned ints, there is not really any way of finding out something went wrong (other than looking for very large numbers).

Of course, particularly in a financial system, you should probably use an integer type which throws/aborts if an out-of-bounds error occurs.
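
C has no built-in integer type that aborts on overflow, so as a rough sketch of the idea, a checked subtraction can report failure instead of wrapping. The helper names `checked_sub` and `remaining_quantity` are hypothetical, not from any real trading system.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Stores a - b in *out and returns true, or returns false when the
   mathematical result does not fit in an int. */
static bool checked_sub(int a, int b, int *out) {
    assert(out != NULL);
    if ((b > 0 && a < INT_MIN + b) || (b < 0 && a > INT_MAX + b)) {
        return false; /* a - b would overflow */
    }
    *out = a - b;
    return true;
}

/* A quantity must also never go below zero, so '2-3' is rejected here. */
static bool remaining_quantity(int held, int sold, int *out) {
    int diff;
    assert(out != NULL);
    if (!checked_sub(held, sold, &diff) || diff < 0) {
        return false;
    }
    *out = diff;
    return true;
}
```

With signed ints the `diff < 0` test is what catches '2-3'; the caller decides whether to abort, log, or reject the order.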


And this is why I almost always never trust advice that has the words 'always' and 'never'.


Or trapping overflow/underflow.


How about using a programming language or number library that properly handles integer promotion or signals a condition when a number gets out of bounds? Obviously it's slower than unchecked processor integers, but as this story shows, in some circumstances it's well worth it.


Just validation of the input string would suffice. No need to use unsigned ints or promoted numerals.
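
A minimal sketch of what that validation might look like in C, assuming the quantity arrives as a decimal string. `parse_quantity` and `MAX_ORDER_QUANTITY` are hypothetical names, and the limit is an arbitrary placeholder rather than a real exchange rule.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_ORDER_QUANTITY 1000000L /* hypothetical upper limit */

/* Parses text as an order quantity. Returns true and stores the value
   in *out only if text is a complete decimal number in the range
   1..MAX_ORDER_QUANTITY. Note that strtol itself skips leading
   whitespace and accepts a leading '+'. */
static bool parse_quantity(const char *text, long *out) {
    char *end = NULL;
    long value;

    assert(text != NULL && out != NULL);
    errno = 0;
    value = strtol(text, &end, 10);
    if (end == text || *end != '\0') {
        return false; /* empty input or trailing junk */
    }
    if (errno == ERANGE) {
        return false; /* does not fit in a long */
    }
    if (value < 1 || value > MAX_ORDER_QUANTITY) {
        return false; /* negative, zero, or absurdly large */
    }
    *out = value;
    return true;
}
```

Both "-6" and "4294967290" are rejected here, before any signed/unsigned conversion can happen.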


Well, that's a start, but any subsequent math would involve numbers not directly entered by the user, so it's not sufficient by itself.


No. Either the value started as a signed value and was then converted to an unsigned one (something compiler warnings or Lint would have flagged), or it was unsigned all along. Either way the same result would have occurred.

Slightly better is 64-bit signed. There can always be overflow, though. The reason these values are PODs (plain old data types) is that tickers move around at a very high speed.


That's quite a controversial statement. Using unsigned ints may just clutter code without providing any safety. The value of 4e9 is still invalid, even though it's positive. As you say, checking your inputs is important, and in this case values can be too large, and should be rejected.

For instance: http://google-styleguide.googlecode.com/svn/trunk/cppguide.x...


Using "unsigned ints" for values that will never be negative allows you to perform one validation test instead of two. Use a typedef to avoid writing "unsigned" everywhere and you end up with less clutter in the code (for humans) and in the binary (for machines).
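
Roughly what I mean, as a sketch. `quantity_t` and `MAX_QUANTITY` are hypothetical names, and the limit is a placeholder.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t quantity_t; /* hypothetical typedef hiding "unsigned" */
#define MAX_QUANTITY ((quantity_t)1000000u) /* hypothetical limit */

static_assert(MAX_QUANTITY < UINT32_MAX, "limit must leave room to detect wrap-around");

/* Unsigned: a single upper-bound test is enough, because a wrapped
   negative such as (quantity_t)-6 lands far above MAX_QUANTITY. */
static bool quantity_is_valid(quantity_t q) {
    return q <= MAX_QUANTITY;
}

/* Signed: the same range needs two tests. */
static bool signed_quantity_is_valid(int32_t q) {
    return q >= 0 && q <= 1000000;
}
```

The single test only helps if it is actually performed: an unsigned quantity that is never range-checked accepts 4,294,967,290 without complaint.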

nanex


You should check before the operation that can produce result outside the valid range anyway.

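Wrong example:

    /* Too late: x - y has already wrapped around if y > x,
       and BIG_NUMBER is only a heuristic threshold. */
    unsigned int z = x - y;
    if (z > BIG_NUMBER) {
        return false;
    }

Good example:

    /* Reject before subtracting, so z can never wrap. */
    if (y > x) {
        return false;
    }
    unsigned int z = x - y;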


The issue is that 4e9 is invalid, but somewhere along the way (more often than not) it may be interpreted as a negative number.

Hence you need to check for value 'bigger than allowed' and 'smaller than allowed'

For unsigned ints, there is a natural smallest value allowed: zero.

You're free to not follow my advice though.

And the google style guide is subtle, it's not what you're implying. You can't use 'short' for example unless explicitly.


However, this was an unsigned int; otherwise it would have shown up as -6 (an int32 ranges from −(2^31) to 2^31 − 1). The problem here is just validating the input.


Before we jump into how the code might have been wrong, there are other issues that can impact mission-critical applications like this that have little to do with the high-level code. In financial applications it's somewhat common to build the hardware to tolerate single-event upsets (SEUs). For example, I know of one application where two identical digital subsystems are used, given the exact same input, and the outputs are compared. If the outputs ever differ, a logic error is assumed and operation is halted.

Logic errors are rare but not that rare. At sea level you can expect ~1 bit flip in 4GB of memory every 24 hours. Often it's in an invalid line or gets overwritten anyway, but for applications like this, where one logic error can cost you your shirt, hardware-level error checking (ECC, SEU detection and correction, etc.) is worth applying.


No, that's why you should always check your input data. It's a bug elsewhere, not a question of unsigned versus signed.


Using an unsigned int is what caused the problem...


where did the -6 come from?


I'm guessing because 4294967290 = 2^32 - 6


I guess someone wanted to buy 6 stocks, pressed wrong and made -6, and the program did not have any data checks and sent it to the market.


Well, they're going to have a lot of explaining to do when a dump truck shows up outside of their house with 4,294,967,290 stock certificates.


Like: What happened to the street when a dump truck loaded with 8500 tons of paper drove down it leaving deep trenches through the pavement.

4,294,967,290 at one sheet of paper per share is about one entire freight train of box cars. Depending on the paper.


Great way to sell shares at the price you like :)


How is that an explanation? Do you think someone wanted to buy 4 billion stocks?


thanks, that makes sense


Equivalence class testing, yo.



