From the abstract: Contrary to what appears to be a widely-held intuition, our experimental results reveal that it is possible to train a face detector without having to label images as containing a face or not.
The cat detection result was just a by-product of learning to identify features in an unsupervised manner, but the news outlets latched onto it with headlines such as "How Many Computers to Identify a Cat? 16,000" in the NY Times.
Wasn't it amazing that the system could distill the concept of a cat from images with no help from external labels, i.e. no human intervention? By not understanding that, the news outlets missed the core of the discovery.
The deep learning method here is an unsupervised way to process raw input and transform it into usable features. This used to be done through a combination of domain knowledge and supervised training, but the researchers built an automated way to extract relevant features from images.
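To make the idea concrete, here is a minimal sketch of unsupervised feature learning with a small autoencoder in PyTorch. This is a toy illustration, not the paper's actual architecture (which was a much larger sparse deep autoencoder trained on YouTube frames); the dimensions and training loop are arbitrary choices for the example.

```python
import torch
import torch.nn as nn

# Toy autoencoder: learns a compressed representation of unlabeled images
# by trying to reconstruct them. The encoder output then serves as features
# for downstream tasks -- no labels are used at any point.
class AutoEncoder(nn.Module):
    def __init__(self, input_dim=32 * 32, feature_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, feature_dim), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(feature_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)           # learned features
        return self.decoder(z), z     # reconstruction + features

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Stand-in for a batch of unlabeled, flattened 32x32 grayscale images.
images = torch.rand(128, 32 * 32)

for step in range(100):
    reconstruction, features = model(images)
    loss = loss_fn(reconstruction, images)   # reconstruction error only
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After training, the encoder output ("features") is an automatically
# learned representation, replacing hand-engineered image features.
```

The point of the sketch is the training signal: the network is graded only on how well it reconstructs its own input, so the features it learns come entirely from the structure of the raw data.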
This opened the door to the hope that one day neural networks could be applied to any new domain, as long as there is enough raw data to train a deep network on it. In the past, this required a large investment both in human labeling of data and in figuring out how to extract the best features from raw data (work the same researchers described as voodoo magic: it was hard, domain-specific, and expensive).