
As much as I agree with you in general... why this? "So, failing easy pattern recognition, we have to use frequency analysis, DFT, and tons of AI."

It's not like we really hear the waveform. We mainly hear pitch and volume (demonstrated nicely by playing simple impulses at rates above 10 Hz). So it's not that we fail at easy pattern recognition on the raw wave; I'd say the raw wave is something we shouldn't be looking at in the first place, especially since many waveforms that look different sound the same (they sound the same because they look very similar in the frequency domain).
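To make that concrete, here's a rough NumPy sketch (my own illustration, not from the article): the same harmonics with their phases shifted produce a visibly different waveform but an essentially identical magnitude spectrum, and to the ear they sound the same.

  import numpy as np

  fs = 16000                       # sample rate in Hz (assumed for the example)
  t = np.arange(fs) / fs           # one second of samples

  # A simple harmonic tone: 200 Hz fundamental plus two harmonics.
  a = (np.sin(2 * np.pi * 200 * t)
       + 0.5 * np.sin(2 * np.pi * 400 * t)
       + 0.25 * np.sin(2 * np.pi * 600 * t))

  # Same components with the harmonics' phases shifted: the time-domain shape
  # changes visibly, yet the magnitude spectrum stays the same.
  b = (np.sin(2 * np.pi * 200 * t + 1.0)
       + 0.5 * np.sin(2 * np.pi * 400 * t + 2.0)
       + 0.25 * np.sin(2 * np.pi * 600 * t + 0.5))

  mag_a = np.abs(np.fft.rfft(a))
  mag_b = np.abs(np.fft.rfft(b))
  print(np.max(np.abs(mag_a - mag_b)))   # ~0: spectra are essentially identical
  print(np.max(np.abs(a - b)))           # large: waveforms look quite different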

Using frequency analysis, the DFT, etc. is the right way to approach it.
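A minimal sketch of what that first step usually looks like (my own, assuming plain NumPy and 16 kHz mono audio): chop the signal into short overlapping frames and take the DFT magnitude of each.

  import numpy as np

  def magnitude_spectrogram(signal, fs=16000, frame_ms=25, hop_ms=10):
      """Split audio into overlapping windowed frames and take each frame's DFT magnitude."""
      frame_len = int(fs * frame_ms / 1000)
      hop_len = int(fs * hop_ms / 1000)
      window = np.hanning(frame_len)
      frames = []
      for start in range(0, len(signal) - frame_len + 1, hop_len):
          frame = signal[start:start + frame_len] * window
          frames.append(np.abs(np.fft.rfft(frame)))
      return np.array(frames)      # shape: (num_frames, frame_len // 2 + 1)

  # Example: one second of a 440 Hz tone becomes a sequence of short spectra.
  fs = 16000
  t = np.arange(fs) / fs
  spec = magnitude_spectrogram(np.sin(2 * np.pi * 440 * t), fs)
  print(spec.shape)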




> Using frequency analysis, the DFT, etc. is the right way to approach it.

Indeed. I meant to say there is no shortcut that immediately makes speech recognition easy to solve. Hence, we have to use a lot of advanced math that works pretty well but, as the article says, not as well as a typical human.

However, I don't think humans are that great either. The proof becomes evident when working on projects with people from around the world. Ask 10 people from around the world to dictate 10 different paragraphs that the other nine have to write down; I doubt they'll reach the 98% accuracy the article states, especially if there is no domain constraint. Understanding what another person says is hard; put a Texan and a South Indian in the same room and see what I mean. Of course, this doesn't mean it's not fun and interesting for computer scientists. It's just hard for most people to realize why the stupid automated voice attendant can't understand that they said "next" and not "six".
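For what it's worth, accuracy figures like that 98% are usually reported as word error rate, computed with an edit distance over words. A quick sketch of the metric (my own, just to show how a "next" vs. "six" mistake gets counted):

  def word_error_rate(reference, hypothesis):
      """Word error rate: Levenshtein distance over words, divided by reference length."""
      ref, hyp = reference.split(), hypothesis.split()
      d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
      for i in range(len(ref) + 1):
          d[i][0] = i
      for j in range(len(hyp) + 1):
          d[0][j] = j
      for i in range(1, len(ref) + 1):
          for j in range(1, len(hyp) + 1):
              cost = 0 if ref[i - 1] == hyp[j - 1] else 1
              d[i][j] = min(d[i - 1][j] + 1,         # deletion
                            d[i][j - 1] + 1,         # insertion
                            d[i - 1][j - 1] + cost)  # substitution
      return d[len(ref)][len(hyp)] / len(ref)

  print(word_error_rate("please press six to continue",
                        "please press next to continue"))   # 0.2, i.e. 80% word accuracy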


It works quite well if you don't have to care about accents, though.

Trivia: "Automatic computer speech recognition now works well when trained to recognize a single voice, and so since 2003, the BBC does live subtitling by having someone re-speak what is being broadcast." (http://en.wikipedia.org/wiki/Closed_captioning)

But I'm pretty sure they started applying it more broadly at some point, because you can see the recognition quality drop when there's an interview with someone from outside the UK.


"... Understanding what the other person says is hard. Put a Texan and South Indian in the same room and see what I mean. ..."

RP [0] [1] is an attempt to solve this "regional dialect" problem, and a clue may lie there.

[0] http://www.bbc.co.uk/dna/h2g2/classic/A657560 [1] http://en.wikipedia.org/wiki/Received_Pronunciation


It's not actually clear what we hear at this point; there is evidence that we respond to frequencies, pitch, volume, and other features, but the jury is still out on the low-level signal processing that occurs in the cochlea and the primary auditory cortex. What's happening further downstream in the brain is even less clear.



