I found some slides [1] explaining how this works.
A less poetic example of privileged information: if you're training on time-series information, you can include events from the future in the training examples, even though they won't be available while making predictions in production.
Apparently this helps the machine learning algorithm find the outlying data points when the data isn't linearly separable.
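To make the time-series case concrete, here is a minimal sketch (the window sizes and data are invented for illustration) of building training rows where a statistic computed from the future rides along as a privileged column:

```python
# Sketch: each training row carries ordinary features (the past window)
# plus a privileged feature (a summary of the future window) that will
# not exist at prediction time. All names and windows here are invented.

def make_training_rows(series, past=3, future=2):
    """Return (ordinary_features, privileged_feature, target) triples."""
    rows = []
    for t in range(past, len(series) - future):
        ordinary = series[t - past:t]                            # available in production
        privileged = sum(series[t + 1:t + 1 + future]) / future  # training time only
        target = series[t]
        rows.append((ordinary, privileged, target))
    return rows

rows = make_training_rows([1, 2, 3, 4, 5, 6, 7, 8])
# rows[0] == ([1, 2, 3], 5.5, 4): the learner may consult the privileged 5.5
# while training, but only [1, 2, 3] is available in production.
```

Whatever learning scheme consumes these rows has to be arranged so the privileged column influences training but is dropped from the deployed predictor.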
In particular, I was unsure after reading the original article whether the additional information (for example, the poetry) was available to the learner on test inputs. The above slides explicitly state that it is not.
Unfortunately, I don't find that reference very helpful - it's just pages of annotated equations.
What's an example of pseudocode that would actually implement this? Surely you don't load a natural language module in order to parse the pathologist's notes (in the example given in the reference about biopsies)?
(I should also note that the original article is devoid of any technical examples, making it completely opaque to me what it actually entails.)
It's worth pointing out that Vladimir Vapnik is the inventor of Support Vector Machines. The short version of what he's done here is he's come up with a way of formulating them that allows him to make use of extra information at training time (that is not available at test time).
That is an important point: the extra info is only available at learning time (otherwise you'd need a physician sitting next to the "cancer-scanning" computer, slowing everything down by doing the analysis themselves). This seems obvious once you say it, but it had not occurred to me before. Thanks!
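Vapnik's actual SVM+ machinery is a modified quadratic program (the privileged features parametrize the slack variables), which is hard to show briefly. A related, much simpler scheme that captures the same train-time-only constraint is distillation: train a "teacher" that sees the privileged features, then train the deployed "student" on the ordinary features against the teacher's soft outputs. Everything below (feature values, learning rate, step count) is invented for illustration:

```python
import math

def fit_logistic(xs, ys, steps=2000, lr=0.5):
    """Tiny logistic regression by gradient descent; ys may be soft labels in [0, 1]."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(steps):
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            g = p - y                                    # gradient of the log loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(model, x):
    w, b = model
    return 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

# Toy data: (ordinary feature, privileged feature, label).
train = [([0.1], [0.9], 1), ([0.2], [0.8], 1), ([0.8], [0.1], 0), ([0.9], [0.2], 0)]
X_ord  = [x for x, _, _ in train]
X_both = [x + xp for x, xp, _ in train]
Y      = [y for _, _, y in train]

teacher = fit_logistic(X_both, Y)               # sees the privileged info
soft = [predict(teacher, xb) for xb in X_both]  # teacher's soft labels
student = fit_logistic(X_ord, soft)             # ordinary features only
# Deployment calls predict(student, [x]) - no privileged column, no physician needed.
```

To be clear, this is the scheme known in the literature as generalized distillation, not Vapnik's SVM+ itself, but the shape is the same: the privileged channel is consulted only while fitting, and the deployed predictor never asks for it.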
Yes, I mostly don't understand the math either. But apparently with the poetry, they converted it into a vector somehow based on the appearance of keywords. Perhaps someone will find a friendlier example.
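For what that conversion might look like: a bag-of-keywords count is the simplest version. The vocabulary below is invented; a real system would pick terms from the domain (or, presumably, from the poems):

```python
import re

# Invented vocabulary for illustration - in the biopsy example these would
# be terms harvested from the pathologist's working vocabulary.
KEYWORDS = ["irregular", "dense", "calcified", "benign"]

def keyword_vector(text, keywords=KEYWORDS):
    """Reduce free text to a fixed-length vector of keyword counts."""
    words = re.findall(r"[a-z]+", text.lower())
    return [words.count(k) for k in keywords]

keyword_vector("Dense, irregular mass; margins irregular.")  # -> [2, 1, 0, 0]
```

No natural-language module required: the notes only need to be hit-tested against a fixed vocabulary, and the resulting vector becomes the privileged input.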
These different distributions are difficult to distinguish with statistics. However, if you can see the shape and therefore know the "rule" or the structure of the distribution, it's easy to design/train a network to recognise them.
(note that the content of that PDF is not about this problem per se)
I imagine a keyword decomposition would work very well with a pathologist's report too - essentially using the biological feature names as "tags" on the image would eventually allow a program to correlate an image of that feature with a name much faster than a general "good vs. bad" fitness function.
If this is how it works, it's a shame this isn't made clearer in the texts. You could get someone to tag images whimsically but consistently, rather than go to the whole trouble of writing poetry.
[1] http://web.mit.edu/zoya/www/SVM+.pdf