I'd use a dictionary and a Bayesian net. You're given word lengths and keystroke signatures, but you're not told which key matches which signature. It essentially becomes a code-cracking problem, much like a substitution cipher: "e is the most common letter, so that's the vibration sig we'll see the most".
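To make the "most common signature is probably e" idea concrete, here's a rough sketch. It assumes the keystrokes have already been clustered into integer signature ids (that step isn't shown), and the `ENGLISH_BY_FREQ` ordering is an approximate, commonly quoted ranking, not something measured for this setup. `guess_mapping` is just a name I made up for illustration. In practice this would only be a first guess for the dictionary/Bayesian step to refine.

```python
from collections import Counter

# Assumption: approximate English letter ranking, most common first.
ENGLISH_BY_FREQ = "etaoinshrdlcumwfgypbvkjxqz"

def guess_mapping(signature_ids):
    """Pair each signature id with a letter by frequency rank.

    signature_ids: sequence of hashable cluster ids, one per keystroke.
    Returns a dict {signature_id: guessed_letter}.
    """
    counts = Counter(signature_ids)
    # most_common() sorts by count, ties broken by first appearance.
    ranked = [sig for sig, _ in counts.most_common()]
    # Only the 26 most frequent signatures get a letter guess.
    return {sig: ENGLISH_BY_FREQ[i]
            for i, sig in enumerate(ranked[:len(ENGLISH_BY_FREQ)])}

if __name__ == "__main__":
    observed = [3, 3, 3, 7, 7, 1]   # made-up signature ids
    print(guess_mapping(observed))
```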
I think the backspace key would be the best key to try to find a reference for. It is often pressed multiple times in a row and would allow for recognition of common typos.
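A quick sketch of how you might spot backspace from the "pressed multiple times in a row" property. It assumes keystrokes are already clustered into signature ids, and the `min_run=3` threshold is my own guess (double letters like "ll" are common, triples are not). `likely_backspace` is a hypothetical helper name.

```python
from collections import Counter
from itertools import groupby

def likely_backspace(signature_ids, min_run=3):
    """Return the signature id that most often appears in long runs.

    A run is a stretch of identical consecutive ids. Only runs of at
    least min_run presses are counted. Returns None if no such run exists.
    """
    runs = Counter()
    for sig, group in groupby(signature_ids):
        if len(list(group)) >= min_run:
            runs[sig] += 1
    return runs.most_common(1)[0][0] if runs else None

if __name__ == "__main__":
    stream = [1, 2, 9, 9, 9, 3, 4, 4, 9, 9, 9, 9, 5]  # made-up ids
    print(likely_backspace(stream))
```

This would also flag a held-down or hammered key of any kind, so treat the result as a candidate, not a certainty.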
You could also sketch a rough map of the keyboard from key-press volume (louder means closer): find backspace, model the phone's position relative to the keyboard, and take it from there.
And the spacebar, when hit, sounds different from the other keys. I'm not good with onomatopoeia, but normal keys generally go tack-tack-tack while the spacebar goes tchick-tchick-tchick. It's noticeably different. This could actually prove quite scary: if you know the length of each run of 'tack's between spaces, you know the word lengths, and you're basically doing a stupidly simple substitution-cipher attack.
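Here's a rough sketch of why a recognizable spacebar matters. Once the stream splits into words, each word's length and repeated-key pattern narrows the dictionary a lot. It assumes signature ids are already available and that the spacebar id is known; `split_words`, `pattern` and `candidates` are names I invented for the example.

```python
def split_words(signature_ids, space_id):
    """Split a keystroke stream into words at the spacebar signature."""
    words, current = [], []
    for sig in signature_ids:
        if sig == space_id:
            if current:
                words.append(current)
            current = []
        else:
            current.append(sig)
    if current:
        words.append(current)
    return words

def pattern(seq):
    """Repetition pattern of a sequence, e.g. 'hello' -> (0, 1, 2, 2, 3)."""
    first_seen = {}
    return tuple(first_seen.setdefault(x, len(first_seen)) for x in seq)

def candidates(word_sigs, dictionary):
    """Dictionary words with the same length and repetition pattern."""
    target = pattern(word_sigs)
    return [w for w in dictionary if pattern(w) == target]

if __name__ == "__main__":
    stream = [5, 8, 2, 2, 6, 0, 4, 6, 9, 2, 7]   # made-up ids, 0 = space
    words = split_words(stream, space_id=0)
    print(candidates(words[0], ["hello", "world", "jelly", "apple"]))
```

Signatures shared across words (here 2 and 6 appear in both) then let you cross-check candidates against each other.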
Hm. Not an expert on the subject, but I think by some readings of that you could consider it to be one. After all, if you duplicate the setup you're going to bug, that gives you exactly the same kind of advantage a template attack gives you: many more samples than you'd have if all you had to work with was the target itself. You could even parallelize this to determine sensitivity to conditions by creating a number of setups, all slight variations on the target. This should help establish how reliable the output of the actual side channel is.