Yes, this is how VideoMentions Search works. It scrapes the video page markup and pulls out the "baseUrl" for the English caption track. It converts that XML caption track into JSON, then searches it for keyword matches. Is that what you're asking about?
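For anyone following along, here is a rough Python sketch of those steps. It is not the actual VideoMentions Search code. The sample page markup, the key order inside the caption track object, the `<transcript>/<text start dur>` XML shape, and all function names are assumptions made up for illustration, and the real markup may differ or change.

```python
import html
import json
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for a fetched video page and its caption track.
SAMPLE_PAGE = r'{"captionTracks":[{"baseUrl":"https://example.com/timedtext?v=abc\u0026lang=en","languageCode":"en"}]}'
SAMPLE_XML = (
    '<transcript>'
    '<text start="0.5" dur="2.1">Welcome to the show</text>'
    '<text start="2.6" dur="3.0">Today it&amp;#39;s all about search</text>'
    '</transcript>'
)


def extract_english_base_url(page_markup):
    """Return the "baseUrl" of the first track marked languageCode "en", or None.

    Assumes "baseUrl" appears before "languageCode" inside the same object.
    """
    match = re.search(r'"baseUrl":"([^"]+)"[^}]*?"languageCode":"en"', page_markup)
    if match is None:
        return None
    # The URL is a JSON string literal, so let the JSON parser undo escapes like \u0026.
    return json.loads('"' + match.group(1) + '"')


def captions_to_json(xml_text):
    """Convert caption XML into a list of JSON-friendly dicts, one per cue."""
    root = ET.fromstring(xml_text)
    return [
        {
            "start": float(node.get("start", 0)),
            "dur": float(node.get("dur", 0)),
            # Caption text may carry HTML entities after XML decoding.
            "text": html.unescape(node.text or ""),
        }
        for node in root.iter("text")
    ]


def search_captions(cues, keyword):
    """Return the cues whose text contains the keyword, ignoring case."""
    needle = keyword.lower()
    return [cue for cue in cues if needle in cue["text"].lower()]


cues = captions_to_json(SAMPLE_XML)
print(extract_english_base_url(SAMPLE_PAGE))
print(json.dumps(search_captions(cues, "search"), indent=2))
```

In a real run you would fetch the page, fetch the XML from the extracted URL, and then pass it through the same two functions. Each match keeps its `start` time, so a result can point to the moment in the video where the word was spoken.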
Ah, got it. I thought there would be some API where the transcripts could be searched across videos. Maybe that requires way too many resources for Google to index.
Yeah, Google is of course the king of search, so they could certainly decide to revamp YouTube search to include spoken word/transcription matches. They have all the data required to make that happen. That would make VideoMentions Search irrelevant, and I'd be okay with that!
In the meantime, I think this is a useful tool for quickly locating videos based on spoken words.