The lsm command for Latent Semantic Mapping (developer.apple.com)
74 points by lars512 on June 24, 2011 | 18 comments



Latent semantic mapping is a technique that takes a large number of text documents, maps them to term-frequency vectors (vector-space semantics), and applies dimensionality reduction to project them into a smaller semantic space. You can then measure how similar in meaning different documents are, which is useful for a variety of tasks.
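
A minimal sketch of that pipeline in Python, using scikit-learn rather than Apple's lsm tool (the toy documents are made up for illustration):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "the cat sat on the mat",
        "a kitten slept on the rug",
        "stock prices fell sharply today",
    ]

    # Map documents to term-frequency (here TF-IDF) vectors.
    tfidf = TfidfVectorizer().fit_transform(docs)

    # Reduce to a small latent semantic space via truncated SVD.
    latent = TruncatedSVD(n_components=2).fit_transform(tfidf)

    # Similarity in the latent space approximates similarity in meaning.
    print(cosine_similarity(latent))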

Wikipedia: Latent Semantic Mapping http://en.wikipedia.org/wiki/Latent_semantic_mapping

WWDC 2011 talk, now available: "Latent semantic mapping: exposing the meaning behind words and documents" https://developer.apple.com/videos/wwdc/2011/


I would really be interested in some use cases. The examples they give are fairly limited.


Classification (e.g. spam detection) and document categorization, as well as clustering similar documents.

You can do all these tasks in the original document space instead of in the latent space, but the advantage of the latent space is that it captures patterns across the entire corpus without needing any labels; learning the latent space this way is unsupervised learning.

In particular, if I have only 100 training examples (e.g. 10 examples of spam and 90 examples of ham), I will learn a better classifier if I first use LSM and then train my classifier, than if I train my classifier over the original documents. In the former case, unsupervised learning detects patterns over the entire corpus, which I use to discriminate between spam and ham. In the latter case, I can only use features from the 100 labeled documents, so it is more difficult to generalize.
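
A hedged sketch of that setup with scikit-learn (load_corpus and load_labels are hypothetical placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.linear_model import LogisticRegression

    all_docs = load_corpus()             # hypothetical: the full, mostly unlabeled corpus
    labeled_idx, labels = load_labels()  # hypothetical: indices and spam/ham labels for 100 docs

    # Unsupervised step: learn the latent space from the entire corpus.
    tfidf = TfidfVectorizer().fit_transform(all_docs)
    latent = TruncatedSVD(n_components=100).fit_transform(tfidf)

    # Supervised step: train only on the 100 labeled documents, but in the
    # latent space, so patterns from the unlabeled documents still help.
    clf = LogisticRegression().fit(latent[labeled_idx], labels)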

More examples:

* What language is this document?

* Is this document about sports?

* Is this news article similar to 50 news articles that I previously marked as "highly interesting"?


Well, for example, how about sorting out a lot of PDF documents I have in a folder called papers/? I use Mendeley now, but there are some leftovers from before that I really don't want to sit and sort through (not to mention that I probably have multiple copies of some of them).


Curriculum review committees could reduce redundancy and fill gaps by reviewing course documents.


We used this, when I was at Apple, to make the Parental Controls web content filter (which I worked on), among other things. It works surprisingly well.


I just can't see Microsoft shipping something like this to every user. This sort of quiet progress is why I like Apple. Sure, they highlight the glossy stuff, but below the surface there's so much blood-and-guts progress.


While Microsoft may not be shipping an LSM program to every user, they are doing a ton of scientific research, much more than Apple. For example, googling "latent semantic analysis microsoft research" turns up several research papers on the topic by Microsoft. They do cutting-edge research comparable to a good university on a large number of topics: programming language design, compilers, machine learning, distributed computing, graphics, automated theorem proving, etc.


Microsoft Research does some amazing things. But like Xerox PARC, they never seem to ship anything. Except for the Kinect; that's been a reasonable success.


Actually, MS SQL Server has shipped with a set of data mining algorithms for some time now:

http://msdn.microsoft.com/en-us/library/ms175595.aspx

http://msdn.microsoft.com/en-us/library/ms175382.aspx


AFAIK these are only in the paid versions.


No, MS doesn't ship this stuff with Windows.

But MS Research has some heavyweight chops on board, and IMO is an excellent institution.


So is anything like this available on other platforms? Because it's way faster than http://classifier.rubyforge.org/, even with rb-gsl installed. I'd love it for generating related posts on my Jekyll blog.


How have you used this? Looks pretty interesting.


I've been playing with some clustering stuff in my free time for the past few months.

What I've found is that the problem seems to get a lot more reasonable if you know how many clusters there are.

K-Means requires this information, but afaict agglomerative techniques don't. I wonder why this tool's agglomerative clustering method requires the number of clusters as an argument.


You're right that agglomerative clustering (unlike K-means) does not inherently need to know the # of clusters in advance. However, it still needs some sort of termination criterion, and # of clusters is one possible criterion.

Since lsm operates in a transformed space, other commonly used criteria like cluster distance may not be as convenient for the user to express.
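
For instance, with scikit-learn's AgglomerativeClustering (a stand-in for lsm's clusterer, with random vectors standing in for latent-space documents), the two stopping criteria look like:

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    Z = np.random.default_rng(0).normal(size=(20, 10))  # stand-in for latent document vectors

    # Criterion 1: merge until exactly 5 clusters remain.
    by_count = AgglomerativeClustering(n_clusters=5).fit(Z)

    # Criterion 2: merge until the remaining clusters are more than a given
    # distance apart; the number of clusters then falls out of the data.
    by_distance = AgglomerativeClustering(n_clusters=None,
                                          distance_threshold=2.0).fit(Z)

    print(by_count.n_clusters_, by_distance.n_clusters_)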


Is it available on Linux?


I've used this LDA code http://code.google.com/p/plda/ on multiple systems; however, it's not really as well packaged as the Apple code seems to be.
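
For the related-posts use case upthread, a rough cross-platform sketch with scikit-learn (not plda's API; load_posts is a hypothetical loader for the blog's post bodies):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    posts = load_posts()  # hypothetical: list of post bodies, e.g. from _posts/

    tfidf = TfidfVectorizer(stop_words="english").fit_transform(posts)
    latent = TruncatedSVD(n_components=50).fit_transform(tfidf)

    sims = cosine_similarity(latent)
    np.fill_diagonal(sims, -1.0)  # a post is not related to itself

    # Indices of the three posts most similar to post 0.
    print(np.argsort(sims[0])[::-1][:3])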



