The conclusions you draw from the results that are presented seem to be a bit of a stretch. From the accuracy table it is pretty clear that the LSTM model outperforms your DNC model in most classes (for both datasets the LSTM achieves the same or better accuracy in 10 out of the 14 classes).
You then argue that the number of "unacceptable errors" is a better measure of model performance, which seems reasonable. However, you don't really show any analysis of these errors other than a table with some hand-picked examples. I would spend some time trying to actually quantify these errors so you can analyze them and show a proper plot or table that summarizes the results.
I think the work is interesting but I would be careful about how you present the results. I would suggest adding more plots/tables to back up your claims in a more objective manner, or toning down the conclusions a bit. This is meant as constructive criticism, by the way, not an attack :-) I think with a bit more work you'll be ready for a proper peer review.
The paper by Sproat and Jaitly, which introduces the challenge, rightly notes that the acceptability of errors and the quality of the output matter more than raw accuracy for a real application. The number of instances in the critical semiotic classes is too low (1-2k for some, fewer than 100 for others) for a meaningful comparison in accuracy.
But you are right to point out that the 'unacceptability' of errors could be analyzed better. However, we could not think of a way to quantify these errors or form a metric that measures them. Such 'silly' errors are subjective by their very nature and depend on a human reading them. As you suggested, we are preparing a table that summarizes these errors and shows the link between the frequency of particular types of examples in the training set and the performance of our model on those types. Something along these lines, for example (a code sketch for automating this analysis follows the examples below):
* The training set had 17,712 examples in the DATE class of the form xx/yy/zzzz. Upon analyzing the mistakes in the DATE class, we did not find any errors on dates of this form.
* On the other hand, if we look into the mistakes made in the MEASURE class, we find that the DNC network made exactly 4 mistakes, all involving the units g/cm3, ch, and mA. Searching the training set for occurrences of these units, we found that 'mA' occurs 3 times, 'g/cm3' occurs 7 times, and 'ch' occurs 8 times, whereas common measurement units like 'kg' occur 296 times and 'cm' occurs 600+ times.
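As a first step toward quantifying this, the frequency-vs-error analysis could be automated. Below is a minimal sketch, assuming the training data is a CSV with "class", "before", and "after" columns (as in the released text normalization data) and that the model's dev-set errors are available as (class, token, prediction, truth) tuples; the file names, variable names, and error format here are illustrative assumptions, not part of our current code.

```python
# Sketch: relate each erroneous token to its frequency in the training set,
# so that rare inputs (e.g. 'mA', 'g/cm3') can be separated from frequent ones.
# File/column names and the error format are illustrative assumptions.
from collections import Counter
import csv


def token_counts(train_csv):
    """Count how often each raw token appears in the training set, per class."""
    counts = Counter()
    with open(train_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            counts[(row["class"], row["before"])] += 1
    return counts


def summarize_errors(errors, counts):
    """Build a summary table of errors, sorted so the rarest training-set
    tokens (the most likely under-trained cases) come first."""
    rows = []
    for cls, token, predicted, truth in errors:
        rows.append({
            "class": cls,
            "token": token,
            "train_freq": counts.get((cls, token), 0),
            "predicted": predicted,
            "truth": truth,
        })
    return sorted(rows, key=lambda r: r["train_freq"])


# Example usage (illustrative paths and variables):
# counts = token_counts("en_train.csv")
# table = summarize_errors(dnc_dev_errors, counts)
# for r in table:
#     print(r["class"], r["token"], r["train_freq"], r["predicted"], "->", r["truth"])
```

A table produced this way would cover every error rather than hand-picked examples, and could be aggregated per semiotic class for a plot of error rate against training-set frequency.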
If you have any other ideas on how to analyze and report the results, please let us know. We will be glad to improve the quality of our work (by the way, we are undergraduates and this is our very first research paper). Thanks again!