
Hi, great job. I was about to build something like this before I realized the crazy amount of crawling and transcription data that I would have to index.

How exactly did you solve/approach the problem?

1. How did you crawl across those millions of videos from the platform?

2. How are you indexing stuff like that?




Based purely on the speed of the results, I believe that the crawling is happening in real time.

The search is scoped by channel, so the closed-caption files for all the videos in the channel are downloaded and searched on the fly.

Edit: Wow, thanks to dev tools, I can see that the website is downloading the transcript and metadata for all the videos from the channel to the client. So the search is happening client-side!!
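To make the client-side idea concrete, here is a minimal sketch of keyword matching over already-downloaded captions. The data shape (`id`, `title`, `segments` with `start` and `text`) and the function name `findMentions` are assumptions for illustration, not what the site actually ships.

```javascript
// Hypothetical data shape: each video carries caption segments with a start
// time in seconds. This is an assumption, not the site's real format.
function findMentions(videos, keyword) {
  const needle = keyword.trim().toLowerCase();
  if (needle === '') return [];
  const matches = [];
  for (const video of videos) {
    for (const segment of video.segments) {
      // Case-insensitive substring match on each caption segment.
      if (segment.text.toLowerCase().includes(needle)) {
        matches.push({
          videoId: video.id,
          title: video.title,
          start: segment.start,
          text: segment.text,
        });
      }
    }
  }
  return matches;
}

// Made-up sample data so the sketch runs on its own.
const sampleVideos = [
  {
    id: 'vid-1',
    title: 'Browser tricks',
    segments: [
      { start: 0, text: 'Welcome back to the channel' },
      { start: 12.5, text: 'Today we talk about caching in the browser' },
    ],
  },
  {
    id: 'vid-2',
    title: 'Q&A',
    segments: [{ start: 3, text: 'No mention here' }],
  },
];

console.log(findMentions(sampleVideos, 'Caching'));
```

Since everything is already in memory, a search like this needs no server round trip, which would fit the speed I'm seeing.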


Hey @alex_smart, yep! You're exactly right. My app performs as much of the fetching and searching work as possible on the client, then calls serverless function API endpoints for the few things that can't be done in-browser.

I replied to @lewisjoe's comment above this one with some more details about tricks I'm employing to make searches fast.

Thanks for checking out my project!


Hey @lewisjoe! YouTube doesn't provide a way to search across all of YouTube based on transcriptions/spoken words. My app performs on-the-fly searches, as @alex_smart noticed in another comment.

My app does as much of the fetching and match-finding work as it can client-side. For the few things that can't be done client-side, my app calls serverless function API endpoints to fetch the YouTube channel and video data it needs. Here are the tricks I'm using to make it fast:

- As soon as the "Channel URL" field loses focus, I start fetching the most recent 30 videos on that channel in the background. This way, by the time the user enters the keyword and date range, I've already fetched some (maybe even all!) of the data ahead of time, which means less wait time for them.

- Once a specific video's data (title, description, transcript, etc.) has been fetched once, it is saved in memory. All other searches the user performs from that point on will pull the video data from the in-memory cache, if it's there. Otherwise, it will fall back to fetching the video data over the network. This in-memory caching makes subsequent searches within the same date range (or a shorter date range) take <1 second.

- Network requests to fetch video data are processed concurrently rather than one at a time. So the browser fires off as many as it can in parallel to get them all resolved as quickly as possible.

- As soon as any matches are found, the UI updates to show the user. This way, the user can start scrolling through the matches and reviewing them while the search is still in progress; they don't have to wait until it finishes to start interacting with the matches.
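The tricks above can be sketched roughly like this. It is a simplified illustration rather than the app's actual code: `getVideo`, `searchChannel`, `prefetch`, the `transcript` field, and the fake fetcher are all hypothetical names, and the network call is injected so the sketch runs without touching YouTube.

```javascript
// In-memory cache keyed by video ID; it lives only as long as the page.
const videoCache = new Map();

// fetchVideo is a caller-supplied async function (hypothetical). Caching the
// promise rather than the resolved value also de-duplicates in-flight requests.
function getVideo(videoId, fetchVideo) {
  if (!videoCache.has(videoId)) {
    const pending = fetchVideo(videoId).catch((err) => {
      videoCache.delete(videoId); // let a later search retry a failed fetch
      throw err;
    });
    videoCache.set(videoId, pending);
  }
  return videoCache.get(videoId);
}

// Start every request at once and report each match as soon as its video
// arrives, instead of waiting for the whole batch to finish.
async function searchChannel(videoIds, keyword, fetchVideo, onMatch) {
  const needle = keyword.toLowerCase();
  await Promise.all(
    videoIds.map(async (videoId) => {
      const video = await getVideo(videoId, fetchVideo);
      if (video.transcript.toLowerCase().includes(needle)) {
        onMatch(video); // e.g. re-render the results list
      }
    })
  );
}

// Prefetch: warm the cache in the background, e.g. from a blur handler.
function prefetch(videoIds, fetchVideo) {
  for (const videoId of videoIds) {
    getVideo(videoId, fetchVideo).catch(() => {}); // errors surface on a real search
  }
}

// Demo with a fake fetcher and made-up data.
let networkCalls = 0;
const fakeVideos = {
  a: { id: 'a', transcript: 'We talk about serverless functions' },
  b: { id: 'b', transcript: 'Nothing relevant' },
};
async function fakeFetchVideo(videoId) {
  networkCalls += 1;
  return fakeVideos[videoId];
}

prefetch(['a', 'b'], fakeFetchVideo);
searchChannel(['a', 'b'], 'serverless', fakeFetchVideo, (video) => {
  console.log('match:', video.id);
});
```

In the demo, the search that follows `prefetch` reuses the cached promises, so the fake fetcher is called only once per video.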

Thanks for checking out my project, and for the kind words! I appreciate it.


I built something similar to this when I was first learning to program. Except mine lets you perform a one-time search for videos containing keywords, similar to how you would normally search YouTube. So there are no notifications or anything.

https://phrasefinder.net

I don’t know if it even works anymore and I’m sure the code is atrocious. But I remember that I would just scrape YouTube pages for video IDs, and then use an API that returned video captions for a given ID [1]. I could see how OP would do something similar.

[1] https://pypi.org/project/youtube-transcript-api/
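For illustration only, the ID-scraping step could look something like the sketch below. This is not the original PhraseFinder code: `extractVideoIds` is a hypothetical helper, the sample HTML is made up, and the pattern assumes video IDs are 11 characters from the URL-safe alphabet, which matches IDs seen today but is not a documented guarantee. The caption lookup via the Python library in [1] is not shown.

```javascript
// Collect unique video IDs from watch links found in a page's HTML.
// Assumption: IDs are 11 characters drawn from A-Z, a-z, 0-9, "_" and "-".
function extractVideoIds(html) {
  const pattern = /watch\?v=([A-Za-z0-9_-]{11})/g;
  const ids = new Set(); // a Set drops repeated links to the same video
  for (const match of html.matchAll(pattern)) {
    ids.add(match[1]);
  }
  return [...ids];
}

// Made-up sample markup so the sketch runs on its own.
const sampleHtml = `
  <a href="/watch?v=irjc1nJ1eJs">First</a>
  <a href="/watch?v=irjc1nJ1eJs&t=10s">Same video, with a timestamp</a>
  <a href="/watch?v=abcdefghijk">Second</a>
`;

console.log(extractVideoIds(sampleHtml));
```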


Nice job on PhraseFinder, @SteveDR! I am indeed doing something similar for my app. The user enters the YouTube Channel URL, the keywords and the date range, then I perform an on-the-fly search to find the matches.

I replied to @lewisjoe's comment above this one with some more details about tricks I'm employing to make searches fast.

Thanks for checking out my project!


It's only for a single channel. That's the answer: it queries in real time.


Yep! You nailed it. My app performs an on-the-fly search on the client. I replied to @lewisjoe's comment above this one with some more details about tricks I'm employing to make searches fast.

Thanks for checking out my project!


Bump. Would like the lowdown too!


Hey @jonplackett! As @alex_smart noticed in another comment, I am performing on-the-fly searches on the client.

Some requests can't be sent from the browser directly to YouTube, though. YouTube's pages don't send the CORS headers that would let a script on another origin read the response, so the browser's same-origin policy blocks the request.

For example, if you try to run this code to fetch a YouTube video page in the browser console from any non-YouTube site (like Hacker News or VideoMentions), you'll see that it errors out:

  (async () => {
    const response = await fetch('https://www.youtube.com/watch?v=irjc1nJ1eJs');
    console.log(response);
  })();

That's why my app uses serverless API endpoints as a middleman. It works like this:

1. Browser fires off a request to the API endpoint.

2. The serverless function Node.js process spins up, fetches the data from YouTube, returns the response, then spins back down.

3. My app takes the video data in the response, saves it in memory, searches it for matches, and re-renders the UI to show the matches, if any.
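Step 2 of the flow above might look roughly like the following sketch. It is framework-neutral and hypothetical: `handleVideoRequest`, the response shape, and the status codes are assumptions for illustration, and a real serverless platform would wrap this in its own request/response signature. The fetch implementation is injectable so the sketch can run without network access.

```javascript
// Hypothetical handler core for the "middleman" endpoint.
// fetchImpl defaults to the global fetch (available in Node.js 18+).
async function handleVideoRequest(videoId, fetchImpl = fetch) {
  // Validate the input so the endpoint cannot be used to proxy arbitrary URLs.
  // Assumption: video IDs are 11 URL-safe characters.
  if (!/^[A-Za-z0-9_-]{11}$/.test(videoId)) {
    return { status: 400, body: { error: 'invalid video id' } };
  }
  const url = `https://www.youtube.com/watch?v=${videoId}`;
  try {
    // Server-side requests are not subject to the browser's cross-origin rules.
    const upstream = await fetchImpl(url);
    if (!upstream.ok) {
      return { status: 502, body: { error: `upstream status ${upstream.status}` } };
    }
    const html = await upstream.text();
    // Extracting the title, description, and transcript is omitted here.
    return { status: 200, body: { videoId, html } };
  } catch (err) {
    return { status: 502, body: { error: 'upstream request failed' } };
  }
}

// Demo with a fake fetch so nothing is actually requested from YouTube.
const fakeFetch = async (url) => ({
  ok: true,
  status: 200,
  text: async () => `<html>page for ${url}</html>`,
});

handleVideoRequest('irjc1nJ1eJs', fakeFetch).then((res) => {
  console.log(res.status, res.body.videoId);
});
```

The browser would then call this endpoint instead of YouTube directly and handle the response as in step 3.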

I built my app to do as much work as possible client-side, and to use serverless function API endpoints for anything that can't be done in-browser.

I replied to @lewisjoe's comment above this one with some more details about tricks I'm employing to make searches fast.

Thanks for checking out my project!



