WeChat does a subset of this and is extremely popular in China. Basically it's like text messages, except voice. This looks a lot more polished though, and obviously includes many more features. I can definitely see it being handy. It's often easier to speak a message than type it, and if you care about the presentation of your message, voice-to-text will often take longer than straight text. (Plus, text can't communicate everything voice does.) But I still rarely use WeChat, since voice clips aren't my default mode of communication and there's friction in using a completely separate app/service just for that. Including it in an all-encompassing messaging platform, though, sounds great.
It depends on what input system you use. If you draw characters with your finger, as many Chinese do, it's pretty slow. If you're on Android and use Google Pinyin, you can swipe across letters just like you can in English, and it predicts what you're trying to type from your swipe gestures pretty well. If you're a Chinese typing expert, you can use Wubi[1], which can reach typing speeds higher than those of most English typists.
Entry seems pretty fast to me. The pinyin system means that you can enter a few letters using the English alphabet and it will show you the most likely character. For common words, one or two letters per character is enough. I am not a native Chinese typist, but even by guessing at the proper spelling, the predictive entry system is good enough for me to type at a not-too-terrible speed.
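To make the "one or two letters per character" point concrete, here is a rough Python sketch of abbreviated pinyin lookup. It is only an illustration, not how any real IME works: the tiny `LEXICON`, its ordering (assumed to be most common first), and the names `matches` and `candidates` are all made up for this example.

```python
# Toy lexicon: (word, pinyin syllables). The order is an assumption standing in
# for a real frequency ranking; a real IME uses a far larger weighted dictionary.
LEXICON = [
    ("你好", ["ni", "hao"]),
    ("中国", ["zhong", "guo"]),
    ("火车", ["huo", "che"]),
    ("朋友", ["peng", "you"]),
    ("你", ["ni"]),
    ("那", ["na"]),
]


def matches(typed, syllables):
    """True if `typed` splits into a non-empty prefix of each syllable, in order."""
    if not syllables:
        return typed == ""
    first, rest = syllables[0], syllables[1:]
    for k in range(1, len(first) + 1):
        # Try consuming k letters as an abbreviation of the first syllable.
        if typed[:k] == first[:k] and matches(typed[k:], rest):
            return True
    return False


def candidates(typed):
    """Return matching words in lexicon order (assumed most likely first)."""
    return [word for word, syllables in LEXICON if matches(typed, syllables)]


print(candidates("nh"))     # abbreviated: one letter per character
print(candidates("nihao"))  # fully spelled out
print(candidates("n"))      # ambiguous single letter
```

With this toy data, "nh" and "nihao" both resolve to the same two-character word, which is roughly why abbreviated input feels fast for common words.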
Not a native speaker either, but I hate typing in Pinyin on a touchscreen device (iPhone in my case). You have to get almost every letter right, whereas if you type English and miss up to 50% of the letters, autocorrect can usually still save you. I agree that Pinyin feels pretty much as fast as typing English on a real computer, though.
(German is also incredibly frustrating to type on a phone, as there are more ways to compose words than autocorrect could ever possibly know.)
I find it very surprising that you say pinyin is slow. It has the downside that you need to look at what you're typing, but it's certainly not slow. With modern IMEs it's more akin to something like Swype in its usage.
In other regions of China where they do not use pinyin, I am appalled watching people spend over a minute trying to tell a friend that they just got on the train.
It is strange watching people have entire conversations using Voice Clips.
I disagreed with it more until she showed me the (hidden!!) option to make them not play back through speakerphone, which WeChat, of course, defaults to.