
I have no idea what Google does, but I expect their parsers to be quite robust. I tried doing some web scraping, and so many pages are not even valid HTML (most often invalidly nested tags, like a table inside a span; missing closing tags even when they are required; stray closing tags that were never opened; ...). Not closing <p> and <td> tags is quite common, but I have not yet seen omitted <html>, <head> and <body> tags.
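To give a rough idea of the kind of mess I mean, here is a minimal sketch of a tag-balance check using Python's standard `html.parser`. The names `TagBalanceChecker` and `check` are just made up for illustration, and this is not how a real browser or crawler recovers from errors. It also flags every missing closing tag, including ones HTML allows you to omit (like `</p>` and `</td>`), so treat it as a crude counter rather than a validator.

```python
from html.parser import HTMLParser

# Void elements never take a closing tag, so they are never tracked.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Hypothetical helper: records unopened and unclosed tags."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags, innermost last
        self.unopened = []   # closing tags with no matching open tag
        self.unclosed = []   # open tags that never got a closing tag

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return  # e.g. the end half of <br/>
        if tag not in self.stack:
            self.unopened.append(tag)
            return
        # Pop back to the matching tag; anything popped on the way
        # was left open by the author.
        while self.stack:
            top = self.stack.pop()
            if top == tag:
                break
            self.unclosed.append(top)


def check(html):
    """Return (unopened, unclosed) tag lists for an HTML string."""
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    # Whatever is still open at end of input was never closed.
    checker.unclosed.extend(reversed(checker.stack))
    return checker.unopened, checker.unclosed
```

For example, `check("<table><tr><td>a<td>b</tr></table></span>")` reports the stray `</span>` as unopened and both `<td>` tags as unclosed.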


