Given the number of articles I guess you're processing each day, I think you should probably rewrite your parser in C. I used to run a service which basically consisted of a feed reader where every article was preprocessed by an algorithm similar to Readability. I wrote the parser using lxml and it seemed fast enough, but once I got into the 400K-500K pages per day territory I started having performance problems. Since parsing the pages is easily parallelizable across multiple machines, I could have just rented some more servers. But where's the fun in that?

So I sat down in front of the computer and 4 hours later I had a C implementation that passed the whole test suite and, according to Valgrind, had no memory leaks. As soon as I deployed it to production, CPU and memory usage dropped by something like 10x (I don't remember the exact number) and I was able to remove some servers and bring costs down. Sadly, I had to close that project because I was spending too much time on it compared to the revenue it was generating, but it was a lot of fun while it lasted.
Another anecdote: I was writing an HTML-to-text converter. The prototype used lxml and some custom DOM-traversal and formatting logic in Python. I got about a 17x speedup from porting the thing to use C and libxml2 (the parser that lxml uses). The port to C took most of an afternoon, and it's currently chewing through a lot of HTML without a problem.
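In case it helps anyone attempting the same port, here's roughly what the libxml2 side looks like in C. This is only a minimal sketch (the function names and the naive text dump are my own illustration, not the converter above), but it shows the htmlReadMemory entry point and the DOM walk that takes the place of lxml's tree traversal:

    /* Minimal HTML-to-text sketch using libxml2's HTML parser.
     * Build with: gcc html2text.c $(xml2-config --cflags --libs)
     */
    #include <stdio.h>
    #include <string.h>
    #include <libxml/HTMLparser.h>

    /* Recursively print the text content of every node. */
    static void dump_text(xmlNode *node)
    {
        for (xmlNode *cur = node; cur != NULL; cur = cur->next) {
            if (cur->type == XML_TEXT_NODE && cur->content != NULL)
                printf("%s", (const char *)cur->content);
            dump_text(cur->children);
        }
    }

    int main(void)
    {
        const char *html = "<html><body><p>Hello, <b>world</b>!</p></body></html>";

        /* htmlReadMemory is lenient about broken markup, much like lxml.html. */
        htmlDocPtr doc = htmlReadMemory(html, (int)strlen(html), NULL, NULL,
                                        HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
        if (doc == NULL)
            return 1;

        dump_text(xmlDocGetRootElement(doc));
        printf("\n");

        xmlFreeDoc(doc);
        xmlCleanupParser();
        return 0;
    }

The real converter obviously needs formatting logic on top (block vs. inline elements, whitespace collapsing, skipping script/style), but the parsing and traversal code is about this simple, and it's where most of the speedup over Python comes from.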