The model specification used for the Kaggle competition was quite different from the one described in the paper. The paper compares against the same test set used by https://arxiv.org/abs/1611.00068. DNC showed a significant improvement over LSTM as the recurrent unit of a seq-to-seq model, making almost zero unacceptable mistakes in certain semiotic classes. The LSTM, on the other hand, is susceptible to these kinds of mistakes even when a lot of data is available.
I'm sorry for the misunderstanding. We added the sentence because the model used in the competition was also based on DNC, but changes were made when writing the paper; for instance, we did not use any attention mechanism at the seq-to-seq level in the competition. Besides, the paper concentrates more on comparing the kinds of errors made by the DNC network (avoiding unacceptable mistakes, not the overall accuracy), which shows an improvement over the LSTM model in the paper (https://arxiv.org/abs/1611.00068). For the Kaggle competition, on the other hand, overall accuracy was more important.
We modified the sentence to say, "An earlier version of the approach used here has secured the 6th position in the Kaggle Russian Text Normalization Challenge by Google's Text Normalization Research Group".
The conclusions you draw from the results that are presented seem to be a bit of a stretch. From the accuracy table it is pretty clear that the LSTM model outperforms your DNC model in most classes (for both datasets the LSTM achieves the same or better accuracy in 10 out of the 14 classes).
You then argue that the number of "unacceptable errors" is a better measure of model performance, which seems reasonable. However, you don't really show any analysis of these errors other than a table with some hand-picked examples. I would spend some time trying to actually quantify these errors so you can analyze them and show a proper plot or table that summarizes the results.
I think the work is interesting but I would be careful how you present the results. I would suggest adding more plots/tables to back up your claims in a more objective manner or tone down the conclusions a bit. This is meant to be constructive criticism btw, it's not an attack :-) I think with a bit more work you'll be ready for a proper peer review.
The paper by Sproat and Jaitly that introduces the challenge rightly notes that the acceptability of errors and the quality of the output matter more than raw accuracy for a real application. The number of instances in the critical semiotic classes is too low (1-2k for some, even fewer than 100 for others) for a meaningful comparison of accuracy.
But you are right to point out that the 'unacceptability' of errors could be analyzed better. However, we could not think of a way to quantify such errors or form a metric that measures them. These 'silly' errors are subjective by their very nature and depend on a human reading them. As you have suggested, we are working on a table that summarizes all these errors and shows a link between the availability/frequency of particular types of examples and the performance of our model on those types. Something of this sort, for example:
* The training set had 17,712 examples in DATE of the form xx/yy/zzzz. Upon analyzing the mistakes in the DATE class, we did not find any mistakes in dates of the above form.
* On the other hand, if we look into the mistakes made in the MEASURE class, we find that the DNC network made exactly 4 mistakes, all in the units (g/cm3, ch, mA). Searching for occurrences of these units in the training set, we found that 'mA' occurs 3 times, 'g/cm3' occurs 7 times and 'ch' occurs 8 times, whereas other measurement units like kg occur 296 times and cm occurs 600+ times (see the sketch below for one way to compute such counts).
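A rough sketch of how such counts could be computed from the Kaggle training file; the file name and the "class"/"before" column names follow the competition's data format and are assumptions here, not code from the paper:

```python
# Hypothetical counting script: tally how often each raw token appears per
# semiotic class in the training data, then look up the tokens the model
# got wrong. File name and column names are assumed from the Kaggle format.
from collections import Counter
import pandas as pd

train = pd.read_csv("en_train.csv")  # columns assumed to include "class" and "before"
counts_by_class = {
    cls: Counter(group["before"].astype(str))
    for cls, group in train.groupby("class")
}

# Tokens from the MEASURE errors mentioned above, plus two frequent units.
for unit in ["mA", "g/cm3", "ch", "kg", "cm"]:
    n = sum(count for token, count in counts_by_class["MEASURE"].items()
            if unit in token)
    print(unit, n)
```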
If you have any other ideas on how to analyze and report the results, please let us know. We will be glad to improve the quality of our work (By the way, we are undergrads and this is our very first research paper). Thanks again!
"...while avoiding the kind of silly errors made by the LSTM based recurrent neural architectures."
Only on arXiv could you get away with that kind of language :). Good paper though! Kudos.
"Another direction to go from here would be to increase the size of the context window during the data preprocessing stage to feed even more contextual information into the model."
Could you comment on how the training time would scale with increasing the size of the context window? Is there a sweet spot?
Thank you for the review! We will surely correct these mistakes before submitting for final publication.
The memory requirements of the DNC are quite high. We used a GTX 1060 for training. Increasing the context window to anything more than 3 increases the sequence length by a huge amount, causing memory problems. However, we also found that the DNC works quite well even with a small batch size; we used a batch size of 16 for all our experiments. The training time for a batch size of 16, a context window of size 3 and 200k steps is 48h on a GTX 1060 system.
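To illustrate the scaling issue, here is a toy sketch of how a context window could be applied during preprocessing; the `<norm>` markers and the exact way of joining tokens are illustrative, not our exact pipeline:

```python
# Toy sketch: the model input is the target token plus `window` neighbouring
# tokens on each side, so the (character-level) sequence length grows roughly
# linearly with the window size.
def make_input(tokens, i, window=3):
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    # illustrative markers around the token to be normalized
    return " ".join(left) + " <norm> " + tokens[i] + " </norm> " + " ".join(right)

tokens = "the meeting is on 12/01/2018 at 5 pm in room 101 b".split()
for w in (1, 3, 5):
    print(w, len(make_input(tokens, 4, w)))  # larger window -> longer sequence
```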
Good paper. I was wondering what the state of the art is for using neural networks for text segmentation, text lemmatisation and part-of-speech tagging. Morphological approaches are dominant in this space.
"text normalization" just refers to that sort of thing. it's a subcomponent of the "front-end" of a speech synthesizer but it can also be used for speech recognition (if your training data contains things like "145" you may want to convert it to read "one hundred and forty five", for various reasons) and information extraction (perhaps you want to treat "145" and "one hundred and forty five" as the same).
They use "an unmodified version
of the architecture as specified in the original paper", and it looks like they copy & pasted the core code.
This paper lacks some further explanation. Why do they use XGBoost for predicting whether some word is to be normalized? And why do they use DNC for the seq2seq model? I think a single shared model for both tasks might be a cleaner solution, e.g. an encoder with an output layer for the prediction, where the same encoder also feeds the decoder. The motivation for DNC is also not too clear, although I can guess they think this is too hard for an LSTM. But for the DNC to show its advantages, it should be given some time for internal calculations, which you could get by introducing internal computation steps. They don't do that. Also, in their results section, they do not compare to any other model, so it is not clear whether XGBoost is the best choice, nor whether the DNC really helps here.
Hi, we tried using a single model for the entire seq-to-seq task, but the number of examples in the PLAIN class is huge, which causes the model to perform worse on the other classes. We used XGBoost to separate two very different tasks: predicting whether a word needs to be normalized, and predicting the sequence of normalized tokens.
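For what it's worth, here is a minimal sketch of that two-stage setup; the character features, the toy data and the `seq2seq_model.translate` call are illustrative stand-ins, not our actual code:

```python
# Hypothetical two-stage pipeline: XGBoost flags tokens that need
# normalization, and only those are sent through the seq-to-seq model.
import numpy as np
import xgboost as xgb

def char_features(token, dim=20):
    """Toy per-token features: padded character codes."""
    codes = [ord(c) for c in token[:dim]]
    return np.array(codes + [0] * (dim - len(codes)), dtype=np.float32)

# toy training data: (token, needs_normalization)
train_tokens = [("hello", 0), ("145", 1), ("world", 0), ("12/01/2018", 1)]
X = np.stack([char_features(t) for t, _ in train_tokens])
y = np.array([label for _, label in train_tokens])

# Stage 1: XGBoost decides which tokens need normalization at all.
clf = xgb.XGBClassifier(n_estimators=50, max_depth=4)
clf.fit(X, y)

# Stage 2: only tokens flagged by the classifier go through the seq2seq model.
def normalize(tokens, seq2seq_model):
    flags = clf.predict(np.stack([char_features(t) for t in tokens]))
    return [seq2seq_model.translate(t) if flag else t
            for t, flag in zip(tokens, flags)]
```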
On the other hand, as mentioned earlier, when comparing text normalization systems it is more important to look at the exact kinds of errors made by the system, not only the overall accuracy. Our model showed an improvement over the baseline model in https://arxiv.org/abs/1611.00068: the DNC made zero unacceptable predictions in certain semiotic classes such as DATE, CARDINAL and TIME, whereas the LSTM was susceptible to these kinds of mistakes even when a lot of training data was available. You are right that we do not use internal computation steps; the model simply replaces the standard LSTM in a seq-to-seq model with a DNC. Thanks for the suggestion, though; it would be interesting to see how performance changes if internal computation steps are added.
The demo code used to produce the test results in the paper is available at https://github.com/cognibit/Text-Normalization-Demo. The model implementation is located in src/lib/seq2seq.py. We did not code the DNC cell from scratch; the official implementation was used.
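For reference, a minimal sketch of how the official implementation (https://github.com/deepmind/dnc) can be dropped in where an LSTM cell would normally go; the hyperparameters below are illustrative, not the exact settings from the paper:

```python
# TF1-style usage of the DeepMind DNC core as an RNN cell.
import tensorflow as tf
import dnc  # module from the deepmind/dnc repository (import path may vary)

access_config = {"memory_size": 256, "word_size": 64,
                 "num_reads": 4, "num_writes": 1}
controller_config = {"hidden_size": 256}

dnc_cell = dnc.DNC(access_config, controller_config,
                   output_size=256, clip_value=20)

# [time, batch, features] inputs, as in the repo's example
inputs = tf.placeholder(tf.float32, [None, 16, 128])
initial_state = dnc_cell.initial_state(batch_size=16)
outputs, _ = tf.nn.dynamic_rnn(cell=dnc_cell, inputs=inputs,
                               time_major=True, initial_state=initial_state)
```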
The write-up on that (from Google, who organized it and provided the data) was really interesting: http://blog.kaggle.com/2018/02/07/a-brief-summary-of-the-kag...