They use "an unmodified version of the architecture as specified in the original paper", and it looks like they copied and pasted the core code.
This paper lacks some further explanation. Why do they use XGBoost to predict whether a word is to be normalized? And why do they use a DNC for the seq2seq model? I think a single shared model for both tasks might be a cleaner solution, e.g. an encoder with an output layer for the prediction, whose output is also fed to the decoder. The motivation for the DNC is also not too clear, although I can guess that they think the task is too hard for an LSTM. But to get the advantages out of a DNC, it needs some time for internal calculations, which you could provide by introducing internal computation steps. They don't do that. Also, in their results section, they do not compare to any other model, so it is not clear whether XGBoost is the best choice, or whether the DNC really helps here.
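To make the "internal computation steps" suggestion concrete, here is a minimal sketch of one possible reading: after the real inputs have been consumed, the recurrent controller is stepped a few more times on a null input before decoding starts. The names `encode_with_pondering`, `extra_steps` and `null_input` are hypothetical, and `counting_step` is only a toy stand-in for a DNC or LSTM step; none of this is taken from the paper or from the DeepMind code.

```python
from typing import Callable, Sequence, Tuple

State = Tuple[float, ...]
StepFn = Callable[[State, Sequence[float]], State]


def encode_with_pondering(
    step: StepFn,
    inputs: Sequence[Sequence[float]],
    init_state: State,
    null_input: Sequence[float],
    extra_steps: int = 0,
) -> State:
    """Run a recurrent step over the inputs, then over `extra_steps` null inputs.

    The extra steps give the controller time to read and write its memory
    without consuming new input (hypothetical reading of the suggestion).
    """
    state = init_state
    for x in inputs:
        state = step(state, x)
    for _ in range(extra_steps):
        state = step(state, null_input)
    return state


def counting_step(state: State, x: Sequence[float]) -> State:
    """Toy stand-in for a real controller step: only counts how often it ran."""
    return (state[0] + 1,)


# Two real inputs plus three internal steps: the step function runs five times.
final_state = encode_with_pondering(
    counting_step, [[1.0], [2.0]], (0,), [0.0], extra_steps=3
)
```

With `extra_steps=0` this reduces to an ordinary encoder pass, which matches what the authors describe doing.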
Hi, we tried using a single model for the entire seq-to-seq task, but the number of examples in PLAIN is huge, which causes the model to perform worse on the other classes. We used XGBoost to separate the two very different tasks: predicting whether a word is to be normalized, and predicting the sequence of normalized tokens.
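The separation described above can be sketched as a two-stage pipeline: a classifier decides per token whether normalization is needed, and only flagged tokens are passed to the sequence model. This is an illustration under assumptions, not the actual implementation: `toy_classifier` stands in for the XGBoost model, `toy_normalizer` stands in for the DNC seq-to-seq model, and all function names are hypothetical.

```python
from typing import Callable, List

# Both stages see the full token list plus the index of the current token,
# so a real model could use the surrounding context.
Classifier = Callable[[List[str], int], bool]
Normalizer = Callable[[List[str], int], str]


def normalize_tokens(
    tokens: List[str],
    needs_normalization: Classifier,
    seq2seq_normalize: Normalizer,
) -> List[str]:
    """Stage 1 filters tokens; stage 2 rewrites only the flagged ones."""
    output = []
    for i, token in enumerate(tokens):
        if needs_normalization(tokens, i):
            output.append(seq2seq_normalize(tokens, i))
        else:
            # Unflagged tokens (the large PLAIN class) are copied through.
            output.append(token)
    return output


def toy_classifier(tokens: List[str], i: int) -> bool:
    """Stand-in for the XGBoost stage: flag any token containing a digit."""
    return any(ch.isdigit() for ch in tokens[i])


def toy_normalizer(tokens: List[str], i: int) -> str:
    """Stand-in for the seq-to-seq stage: a tiny lookup table."""
    return {"2": "two"}.get(tokens[i], tokens[i])


result = normalize_tokens(["I", "have", "2", "cats"], toy_classifier, toy_normalizer)
```

Because PLAIN tokens never reach the second stage, the sequence model is trained and evaluated only on the tokens that actually change, which is the class-imbalance argument made above.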
On the other hand, as mentioned, when comparing text normalization systems it is more important to look at the exact kinds of errors made by the system, not only the overall accuracy. Our model showed improvement over the baseline model in https://arxiv.org/abs/1611.00068. The DNC showed improvement in certain semiotic classes such as DATE, CARDINAL and TIME, making zero unacceptable predictions in these classes, whereas the LSTM was susceptible to these kinds of mistakes even when a lot of training data was available. Yes, we do not use internal computation steps; the model replaces the standard LSTM in a seq-to-seq model with a DNC. However, thanks for the suggestions. It would be interesting to see the performance improvements if the internal computation steps are increased.
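As a small illustration of why overall accuracy can hide class-specific problems, the sketch below computes exact-match accuracy per semiotic class. The function name and the example rows are made up for illustration and are not results from the paper; exact match is also a coarser criterion than the "unacceptable prediction" analysis mentioned above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Example = Tuple[str, str, str]  # (semiotic_class, predicted, reference)


def per_class_accuracy(examples: Iterable[Example]) -> Dict[str, float]:
    """Exact-match accuracy for each semiotic class."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for semiotic_class, predicted, reference in examples:
        total[semiotic_class] += 1
        correct[semiotic_class] += int(predicted == reference)
    return {cls: correct[cls] / total[cls] for cls in total}


# Made-up rows: 8 PLAIN tokens (all right) and 2 DATE tokens (one wrong).
toy_examples = [("PLAIN", "the", "the")] * 8 + [
    ("DATE", "may fifth", "may fifth"),
    ("DATE", "june fifth", "may sixth"),
]

accuracy_by_class = per_class_accuracy(toy_examples)
overall_accuracy = sum(p == r for _, p, r in toy_examples) / len(toy_examples)
```

In this toy data the overall accuracy is 0.9 while DATE accuracy is only 0.5, so the dominant PLAIN class masks the errors in the smaller class.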
Official DeepMind DNC code is here: https://github.com/deepmind/dnc