The speaker prompt is a sample of the speaker's voice reading some arbitrary text; it is one of the inputs the model uses. The second column is the human speaker reading the target text (the ground truth). The next two columns are the baseline and VALL-E respectively, each producing text-to-speech given the first column and only the text as input.
What's the study's control group? Comparing the number of companies founded by students across different time periods doesn't by itself say much about the effects of the "brain drain".
The argument the author uses is invalid. NNs are equally strong at modelling sparse signals, provided that they can be mapped into a continuous space, which is commonly referred to as an 'embedding'.
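To make the embedding point concrete, here's a minimal sketch (PyTorch, with made-up vocabulary and dimension sizes): discrete word IDs get mapped into a dense continuous space by a learned lookup table, which is all an 'embedding' really is.

```python
# Sketch only: map sparse, discrete token IDs into a continuous space (an "embedding").
# Vocabulary size and embedding dimension are arbitrary choices for illustration.
import torch
import torch.nn as nn

vocab_size, embed_dim = 10_000, 64           # hypothetical values
embedding = nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([[12, 7, 512, 3]])  # a "sentence" of discrete word IDs
vectors = embedding(token_ids)               # -> continuous tensor of shape (1, 4, 64)
print(vectors.shape)                         # torch.Size([1, 4, 64])

# These vectors are trained jointly with the downstream network, so the
# sparse symbolic input ends up living in a space NNs handle well.
```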
The premise of the article is valid, though, in that NLP is a hard problem. That is partly because NLP is ill-defined: how do you define language understanding?
NNs are very effective at learning mappings y = f(X), given enough examples. One of the reasons they're so effective at modelling speech, vision, translation, etc., is that such input-output pairs exist in large volumes. Because of the above-mentioned ambiguity of NLP, it's harder to come up with such pairs for 'understanding' a language. How do you come up with a dataset of sentences and their 'meaning'? Probably the best you could do is map them to some action. And critics will readily disregard such attempts as 'not really NLP'.
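For the y = f(X) point, a toy sketch (everything here is invented for illustration): given enough (X, y) pairs, a small network happily approximates the mapping; the catch for 'understanding' is that nobody knows how to write down the y column in the first place.

```python
# Toy illustration of learning y = f(X) from examples (all numbers are arbitrary).
import torch
import torch.nn as nn

# Pretend dataset: inputs X and targets y produced by some unknown mapping f.
X = torch.randn(1024, 8)
y = (X.sum(dim=1, keepdim=True) > 0).float()   # stand-in for the "true" f

model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(200):                        # given enough (X, y) pairs...
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
# ...the network approximates f. For "understanding", we have no comparable
# way to write down the targets y, which is the point above.
```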
I think he's arguing that the current NN approach to NLP is not going to produce embeddings that lead to revolutionary progress.
And there have been attempts to ascribe a semantics to natural language from text (e.g. see CCG grammars). The datasets are not as big as for vision, admittedly. But I'm not convinced that we need such explicit datasets to be able to solve this problem.
I wouldn't be so sure that mapping discrete linguistic objects to a continuous space is necessary. Why can't we handle the original space directly?
There are just a lot of things that still have to be figured out.
+ Different time scales. There is semantics at the sentence level, but there is also semantics at the plot level. To understand the plot, it helps to keep track of key elements from the start of a story. LSTMs are a perfect starting point (see the sketch after this list).
+ When to stop learning. The so-called stability-plasticity dilemma. Our ability to pay attention to what matters might be tightly linked to our ability to forget the vast bodies of text we have just read. Current NNs do not seem to forget in the right way. This was the rationale behind ART and ARTMAP (Grossberg) and might enter the AI mainstream again soon.
+ Grammar constructions. Some aspects of grammar seem simpler than computer vision, where we also have a lot of structure in the environment to model: things can be inside other things, be balanced on top of other things, be temporarily occluded by other things, etc. Other aspects seem more complicated, like the pleasantness of a poem. My gut feeling is that some of this spills over from (a) structure in other modalities and (b) idiosyncrasies of our generative system (vocal cords, etc.). In other words, our grammatical preferences might be sampled not only from listening and reading.
+ Emphasis.
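As a rough sketch of the "different time scales" item above (all sizes are arbitrary): an LSTM carries a hidden state across the whole sequence, so in principle key elements from the start of a story can still influence the plot-level state at the end.

```python
# Rough sketch of carrying sentence-level inputs through a plot-level state.
# Sizes are arbitrary; this only illustrates that the hidden state persists
# across the whole sequence, so early key elements can influence the end.
import torch
import torch.nn as nn

embed_dim, hidden_dim, story_len = 64, 128, 500
lstm = nn.LSTM(input_size=embed_dim, hidden_size=hidden_dim, batch_first=True)

story = torch.randn(1, story_len, embed_dim)   # e.g. one embedded "story"
outputs, (h_n, c_n) = lstm(story)

print(outputs.shape)  # (1, 500, 128): one hidden state per step (sentence level)
print(h_n.shape)      # (1, 1, 128): final state summarizing the whole sequence (plot level)
```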
Just a few things that might lead to interesting NNs. Contrary to the author, I think they are definitely in line with current research.
The problem with deep learning and language understanding is that the task is ill-defined end-to-end. For speech, image understanding, and translation, you can come up with large datasets of x->y and have deep learning learn a complex function to approximate the mapping. We don't have that luxury in language understanding, at least not yet.