The conclusions you draw from the results that are presented seem to be a bit of a stretch. From the accuracy table it is pretty clear that the LSTM model outperforms your DNC model in most classes (for both datasets the LSTM achieves the same or better accuracy in 10 out of the 14 classes).
You then argue that the number of "unacceptable errors" is a better measure of model performance, which seems reasonable. However, you don't really show any analysis of these errors other than a table with some hand-picked examples. I would spend some time trying to actually quantify these errors so you can analyze them and show a proper plot or table that summarizes the results.
I think the work is interesting but I would be careful about how you present the results. I would suggest adding more plots/tables to back up your claims in a more objective manner, or toning down the conclusions a bit. This is meant as constructive criticism, by the way, not an attack :-) I think with a bit more work you'll be ready for a proper peer review.
The paper by Sproat and Jaitly, which introduces the challenge, rightly notes that the acceptability of errors and the quality of the output matter more than raw accuracy for a real application. The number of instances in the critical semiotic classes is too low (1-2k for some, fewer than 100 for others) for a meaningful comparison in accuracy.
But you are right to point out that the 'unacceptability' of errors could be analyzed better. However, we could not think of a way to quantify these errors or form a metric that measures them. Such 'silly' errors are subjective by their very nature and depend on a human reading them. As you suggested, we are preparing a table that summarizes these errors and shows the link between the frequency of particular types of examples in the training set and the performance of our model on those types. Something along these lines, for example (a code sketch for automating this analysis follows the examples below):
* The training set had 17,712 examples in the DATE class of the form xx/yy/zzzz. Upon analyzing the mistakes in the DATE class, we did not find any errors on dates of this form.
* On the other hand, if we look into the mistakes made in the MEASURE class, we find that the DNC network made exactly 4 mistakes, all involving the units g/cm3, ch, and mA. Searching the training set for occurrences of these units, we found that 'mA' occurs 3 times, 'g/cm3' occurs 7 times, and 'ch' occurs 8 times, whereas common measurement units like 'kg' occur 296 times and 'cm' occurs 600+ times.
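As a first step toward quantifying this, the frequency-vs-error analysis could be automated. Below is a minimal sketch, assuming the training data is a CSV with "class", "before", and "after" columns (as in the released text normalization data) and that the model's dev-set errors are available as (class, token, prediction, truth) tuples; the file names, variable names, and error format here are illustrative assumptions, not part of our current code.

```python
# Sketch: relate each erroneous token to its frequency in the training set,
# so that rare inputs (e.g. 'mA', 'g/cm3') can be separated from frequent ones.
# File/column names and the error format are illustrative assumptions.
from collections import Counter
import csv


def token_counts(train_csv):
    """Count how often each raw token appears in the training set, per class."""
    counts = Counter()
    with open(train_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            counts[(row["class"], row["before"])] += 1
    return counts


def summarize_errors(errors, counts):
    """Build a summary table of errors, sorted so the rarest training-set
    tokens (the most likely under-trained cases) come first."""
    rows = []
    for cls, token, predicted, truth in errors:
        rows.append({
            "class": cls,
            "token": token,
            "train_freq": counts.get((cls, token), 0),
            "predicted": predicted,
            "truth": truth,
        })
    return sorted(rows, key=lambda r: r["train_freq"])


# Example usage (illustrative paths and variables):
# counts = token_counts("en_train.csv")
# table = summarize_errors(dnc_dev_errors, counts)
# for r in table:
#     print(r["class"], r["token"], r["train_freq"], r["predicted"], "->", r["truth"])
```

A table produced this way would cover every error rather than hand-picked examples, and could be aggregated per semiotic class for a plot of error rate against training-set frequency.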
If you have any other ideas on how to analyze and report the results, please let us know. We will be glad to improve the quality of our work (by the way, we are undergraduates and this is our very first research paper). Thanks again!