Based purely on the speed of the results, I believe that the crawling is happening in real time.
The search is scoped by channel, so the closed-caption files for all the videos in the channel are downloaded and searched on the fly.
Edit: Wow, thanks to dev tools, I can see that the website is downloading the transcript and metadata for all the videos from the channel to the client. So the search is happening client-side!!
Hey @alex_smart, yep! You're exactly right. My app performs as much of the fetching and searching work as possible on the client, then calls serverless function API endpoints for the few things that can't be done in-browser.
I replied to @lewisjoe's comment above this one with some more details about tricks I'm employing to make searches fast.
Hey @lewisjoe! YouTube doesn't provide a way to search across all of YouTube based on transcriptions/spoken words. My app performs on-the-fly searches, as @alex_smart noticed in another comment.
My app does as much of the fetching and match-finding work as it can client-side. For the few things that can't be done client-side, my app calls serverless function API endpoints to fetch the YouTube channel and video data it needs. Here are the tricks I'm using to make it fast:
- As soon as the "Channel URL" field loses focus, I start fetching the most recent 30 videos on that channel in the background. This way, by the time the user enters the keyword and date range, I've already fetched some (maybe even all!) of the data ahead of time, which means less wait time for them.
- Once a specific video's data (title, description, transcript, etc.) has been fetched once, it is saved in memory. All other searches the user performs from that point on will pull the video data from the in-memory cache, if it's there. Otherwise, it will fall back to fetching the video data over the network. This in-memory caching makes subsequent searches within the same date range (or a shorter date range) take <1 second.
- Network requests to fetch video data are processed concurrently rather than one at a time. So the browser fires off as many as it can in parallel to get them all resolved as quickly as possible.
- As soon as any matches are found, the UI updates to show them to the user. This way, the user can start scrolling through the matches and reviewing them while the search is still in progress; they don't have to wait until it finishes to start interacting with the matches.
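The caching, concurrency, and show-matches-early tricks above could be combined roughly as follows. This is only a minimal sketch, not the app's actual code: `videoCache`, `getVideo`, `searchVideos`, the `fetchVideo` callback, and the `{ id, title, transcript }` data shape are all hypothetical names chosen for illustration.

```javascript
// Hypothetical sketch: names and data shapes are assumptions, not the app's real code.

// In-memory cache: video ID -> promise of that video's data.
// Caching the promise means overlapping searches share one network request.
const videoCache = new Map();

// fetchVideo is a caller-supplied async function: (id) => ({ id, title, transcript }).
function getVideo(id, fetchVideo) {
  if (!videoCache.has(id)) {
    const pending = Promise.resolve()
      .then(() => fetchVideo(id))
      .catch((err) => {
        videoCache.delete(id); // do not cache failures, so a later search can retry
        throw err;
      });
    videoCache.set(id, pending);
  }
  return videoCache.get(id);
}

// Starts all lookups concurrently and reports each match as soon as it is found,
// without waiting for the slower videos to finish.
async function searchVideos(ids, keyword, fetchVideo, onMatch) {
  const needle = keyword.toLowerCase();
  const results = await Promise.allSettled(
    ids.map(async (id) => {
      const video = await getVideo(id, fetchVideo);
      if (video.transcript.toLowerCase().includes(needle)) {
        onMatch(video); // e.g. append the match to the results list in the UI
        return video;
      }
      return null;
    })
  );
  // Keep only the lookups that succeeded and matched.
  return results
    .filter((r) => r.status === "fulfilled" && r.value !== null)
    .map((r) => r.value);
}
```

Under these assumptions, the background prefetch on blur would amount to calling `getVideo` for the channel's most recent video IDs from the field's `blur` handler, so the cache is already warm by the time `searchVideos` runs.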
Thanks for checking out my project, and for the kind words! I appreciate it.
I built something similar to this when I was first learning to program. Except mine lets you perform a one-time search for videos containing keywords, similar to how you would normally search YouTube. So there are no notifications or anything.
I don’t know if it even works anymore and I’m sure the code is atrocious. But I remember that I would just scrape YouTube pages for video IDs, and then use an API that returned video captions for a given ID [1]. I could see how OP would do something similar.
Nice job on PhraseFinder, @SteveDR! I am indeed doing something similar for my app. The user enters the YouTube Channel URL, the keywords and the date range, then I perform an on-the-fly search to find the matches.
I replied to @lewisjoe's comment above this one with some more details about tricks I'm employing to make searches fast.
Yep! You nailed it. My app performs an on-the-fly search on the client. I replied to @lewisjoe's comment above this one with some more details about tricks I'm employing to make searches fast.
Hey @jonplackett! As @alex_smart noticed in another comment, I am performing on-the-fly searches on the client.
Some requests can't be sent from the browser directly to YouTube, though: the browser's same-origin policy blocks them, because YouTube doesn't send the CORS headers that would allow other sites to read its responses.
For example, if you try to run this code to fetch a YouTube video page in the browser console from any non-YouTube site (like Hacker News or VideoMentions), you'll see that it errors out:
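A request of roughly this shape shows the failure. It is a minimal sketch rather than the snippet originally posted: `tryFetchVideoPage` is a hypothetical helper, `VIDEO_ID` is a placeholder, and the `fetchImpl` parameter exists only so the helper can be exercised with a stub outside a browser.

```javascript
// Hypothetical illustration; "VIDEO_ID" is a placeholder, not a real video.
// fetchImpl defaults to the global fetch so a stub can be swapped in for testing.
async function tryFetchVideoPage(videoId, fetchImpl = globalThis.fetch) {
  const url = `https://www.youtube.com/watch?v=${encodeURIComponent(videoId)}`;
  try {
    const response = await fetchImpl(url);
    return { ok: true, html: await response.text() };
  } catch (err) {
    // When the browser blocks a cross-origin request, fetch rejects
    // (typically with a TypeError) instead of returning an HTTP status.
    return { ok: false, error: String(err) };
  }
}

// In the console of a non-YouTube page:
// tryFetchVideoPage("VIDEO_ID").then(console.log);
```

Run from a non-YouTube page, the promise would be expected to resolve to the `ok: false` branch, which is why this kind of request has to go through a server-side endpoint instead.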
How exactly did you solve/approach the problem?
1. How did you crawl across those millions of videos from the platform?
2. How are you indexing stuff like that?