To answer your first question:
We crawl the link and apply a proprietary semantic algorithm based on a statistical method: we trained it on a preset universe of web pages to obtain the relevant correlation matrices.
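As a rough illustration only, and not the proprietary algorithm itself, here is a minimal sketch of how a correlation matrix between terms could be derived from a set of pages. It assumes a simple Pearson correlation over per-page term counts; the function names (`tokenize`, `termCounts`, `pearson`, `correlationMatrix`), the toy corpus and the vocabulary are all hypothetical.

```javascript
// Hypothetical sketch: Pearson correlation between term-count vectors.
// The real algorithm is proprietary; every name and the toy corpus are assumptions.

// Split text into lowercase word tokens.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z]+/g) || [];
}

// Build a term-by-document count matrix: counts[termIndex][docIndex].
function termCounts(docs, vocabulary) {
  return vocabulary.map((term) =>
    docs.map((doc) => tokenize(doc).filter((t) => t === term).length)
  );
}

// Pearson correlation of two equal-length numeric vectors.
function pearson(x, y) {
  const n = x.length;
  const mean = (v) => v.reduce((a, b) => a + b, 0) / n;
  const mx = mean(x);
  const my = mean(y);
  let cov = 0;
  let vx = 0;
  let vy = 0;
  for (let i = 0; i < n; i++) {
    cov += (x[i] - mx) * (y[i] - my);
    vx += (x[i] - mx) ** 2;
    vy += (y[i] - my) ** 2;
  }
  // A constant vector has no defined correlation; this sketch reports 0.
  return vx === 0 || vy === 0 ? 0 : cov / Math.sqrt(vx * vy);
}

// Square matrix of pairwise term correlations.
function correlationMatrix(docs, vocabulary) {
  const counts = termCounts(docs, vocabulary);
  return counts.map((a) => counts.map((b) => pearson(a, b)));
}

// Toy "universe" of pages (assumed data, for illustration only).
const docs = [
  "node server javascript",
  "javascript server scaling",
  "semantic crawl pages",
  "crawl pages semantic algorithm",
];
const vocabulary = ["javascript", "server", "crawl"];
const matrix = correlationMatrix(docs, vocabulary);
console.log(matrix);
```

In this toy corpus, terms that always appear on the same pages correlate positively and terms that never share a page correlate negatively, which is the kind of signal a trained matrix could then apply to a newly crawled link.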
About scaling:
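We used Node.js (a server-side JavaScript runtime), which is well suited to this type of process: its event-driven, non-blocking I/O lets a single process keep many page requests in flight at once.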
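To make the scaling point concrete, here is a minimal sketch of how a Node.js process might crawl many links concurrently while capping the number of requests in flight. It is an assumption about the general pattern, not a description of the actual crawler: `crawlAll`, `stubFetch` and the example URLs are hypothetical, and the stub stands in for a real HTTP call so the snippet runs without network access.

```javascript
// Hypothetical sketch: crawl many URLs with a cap on in-flight requests.
// crawlAll, stubFetch and the URLs are illustrative names, not the real crawler.

async function crawlAll(urls, fetchPage, limit) {
  const results = new Array(urls.length);
  let next = 0;

  // Each worker claims the next unprocessed URL until none remain.
  async function worker() {
    while (next < urls.length) {
      const i = next++;
      try {
        results[i] = { url: urls[i], body: await fetchPage(urls[i]) };
      } catch (err) {
        // Record the failure and keep crawling the remaining URLs.
        results[i] = { url: urls[i], error: String(err) };
      }
    }
  }

  // Start at most `limit` workers; they share the `next` cursor.
  const workers = Array.from({ length: Math.min(limit, urls.length) }, worker);
  await Promise.all(workers);
  return results;
}

// Stub fetcher simulating network latency; it also tracks peak concurrency.
let inFlight = 0;
let peak = 0;
function stubFetch(url) {
  inFlight++;
  peak = Math.max(peak, inFlight);
  return new Promise((resolve) =>
    setTimeout(() => {
      inFlight--;
      resolve(`<html>${url}</html>`);
    }, 10)
  );
}

const urls = ["a", "b", "c", "d", "e"].map((p) => `https://example.com/${p}`);
const crawl = crawlAll(urls, stubFetch, 2);
crawl.then((pages) => console.log(pages.length, "pages, peak concurrency", peak));
```

Because the workers only yield while waiting on I/O, one process can overlap many slow page fetches without threads; in a real crawler `stubFetch` would be replaced by an actual HTTP request.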