PredictionIO – A machine learning server (github.com/predictionio)
212 points by Anon84 on Oct 18, 2013 | 47 comments



    https://github.com/PredictionIO/PredictionIO/blob/develop/process/engines/itemsim/evaluations/scala/topkitems/src/main/scala/io/prediction/evaluations/itemsim/topkitems/TopKItems.scala
Why is this structured like this? These directories seem ridiculous to me; it would be great if someone could explain.


Is the inconveniently deep tree data structure housing an OO framework really so important that it should be discussed front and center on HN? Isn't the function of the framework more interesting? Honest question.


People who might want to understand the code and make changes in order to take advantage of it will care about form. It matters. I often pick tools based not only on function, but on how easy it will be for me to make changes or maintain the code base if necessary, and I doubt I'm the only one.


It matters, but not as much as this: https://news.ycombinator.com/item?id=6574237


I can't explain the first 80% of it but the last bit:

    io/prediction/evaluations/itemsim/topkitems/
is the folder structure required by the package definition.

EDIT: mostly it seems to be the result of the project being constructed as dozens of separate modules, each with its own build process... yikes...


Well, after all, Java and Scala are like father and son...

sigh


While I also find the incredibly deep nesting on the excessive side, I can explain it.

* process/engines/itemsim/evaluations/scala/topkitems

They're using sbt's awesome multi-project feature here (http://www.scala-sbt.org/release/docs/Getting-Started/Multi-...), so basically every "sub-project" that makes up the whole can have its own dependencies, options, versions, etc., while also tracking which projects depend on each other. This really helps keep all of the logic separated, and sbt deals with all the compilation-order madness that ensues when you have a tangled nest of inter-dependencies.

Note: this isn't necessarily reflected in what is published, as most projects that do this will still publish one single jar file or project; it just helps with development and really helps with compilation speed (in my experience).

Again, I'm not really condoning what they're doing here, as they're really taking it to an extreme; for my "big" project I basically just have a top-level "modules" folder and each sub-project is one level below that. You can see how the hierarchy is defined here: https://github.com/PredictionIO/PredictionIO/blob/develop/bu... which I find to be quite human-readable, but I've been using sbt for years so ymmv. The customized settings for that particular project are here: https://github.com/PredictionIO/PredictionIO/blob/develop/pr...
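
For anyone who hasn't used sbt's multi-project builds, the hierarchy definition boils down to a tree of Project values whose `base` directories are exactly those deep paths. A minimal sketch of what such a Build.scala roughly looks like (module names here are made up for illustration, not PredictionIO's actual ones):

    import sbt._
    import Keys._

    object ExampleBuild extends Build {
      // root aggregates everything: "sbt compile" at the top builds all modules
      lazy val root = Project(id = "root", base = file("."))
        .aggregate(commons, topkitems)

      // a shared library that other modules depend on
      lazy val commons = Project(id = "commons", base = file("commons"))

      // a leaf module; its base directory is what produces the deep path
      lazy val topkitems = Project(
        id = "topkitems",
        base = file("process/engines/itemsim/evaluations/scala/topkitems"))
        .dependsOn(commons)
    }

Each of those sub-projects then gets its own build settings and dependencies, which is where the per-module build processes mentioned above come from.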

* src/main/scala

This is the basic structure of an sbt project. By default, you put all of your code/resources in the `src` directory. Then you have two directories, main and test (optional), which is how you separate code/resources that belong in the final artifact from those used only for testing. At the last level there are three (default) directories that are processed: java, scala, and resources. The first two should be pretty self-explanatory, and the last is where you put any files that you need to be packaged/available to your project. So if you have main/resources/aDir/logback.xml then you can reference that (via class resources, which is a Java thing) with "aDir/logback.xml" (I didn't include a leading slash because it's ~complicated); see the snippet after the layout below.

Example layout:

    src
     - main
       - java
       - resources
       - scala
     - test
       - java
       - resources
       - scala
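
To make the resources part above concrete: anything under main/resources gets copied onto the classpath, so that hypothetical aDir/logback.xml could be read at runtime roughly like this (a sketch only; the file name is just the example from above):

    object ResourceExample extends App {
      // files under src/main/resources end up on the classpath, so they can be
      // read as class resources (no leading slash when going via the class loader)
      val stream = getClass.getClassLoader.getResourceAsStream("aDir/logback.xml")
      val contents = scala.io.Source.fromInputStream(stream).mkString
      println(contents)
    }
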
* io/prediction/evaluations/itemsim/topkitems/TopKItems.scala

In Java it is mandatory that your package name be reflected in your directory structure. So here we can see that the TopKItems class is in the package "io.prediction.evaluations.itemsim.topkitems", if they followed that convention. As hinted at, Scala does not mandate this silly requirement, but it's considered best practice to follow along, as it keeps things separated and easy to follow. Scala projects mostly use short package names, so they usually aren't as nested as this.
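
Concretely, if they follow that convention, the top of TopKItems.scala would declare a package mirroring the directory path, something like this (a sketch of the shape, not the actual file):

    // package mirrors io/prediction/evaluations/itemsim/topkitems/
    package io.prediction.evaluations.itemsim.topkitems

    object TopKItems {
      // ... evaluation logic lives here ...
    }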

This might all make it seem that development would be a nightmare trying to manage everything, but all of this integrates beautifully with a good IDE such as IntelliJ (which is the recommended one for Scala -- Eclipse is just way too slow and freezes constantly, even on beefy machines). You just run a quick gen-idea command and the entire thing is recognized by IntelliJ, sub-projects and all. You never even see the crazy nesting of folders!

P.S. I mostly just lurk here so I'm sure I butchered the markdown. Sorry.


I'm not evaluating or judging, but this is quite a path:

  /develop/process/engines/itemrec/algorithms/hadoop/...

  cascading/popularrank/src/main/java/io/prediction/algorithms/...

  cascading/itemrec/popularrank/PopularRankAlgo.java
Most of these contain a single folder.


This seems like a cool project, but I'm really not appreciating the unsolicited email I got based on the fact that I starred the Play Framework repo on GitHub[1].

[1] https://dl.dropboxusercontent.com/u/2938195/predictionio-ema...


I got a similar message because I had a commit against the scala project :(


I got an unsolicited email to check out a Github project once and I didn't mind it. I ended up looking at the new project and it was neat. What made this one so irritating?


So have I - "I saw your project X, thought you might be interested in this." That's one thing--personalized, actually in the context of what I've done and detailing why I might be interested.

"You starred Play so come look at our thing" is not. It's an email blast. It's spam.


Same here, because they noticed I am "engaged in https://github.com/playframework/playframework". I've not even tried Play, let alone been "engaged" in it. Really unsolicited.


yes they spammed my email inbox today as well


Seriously - I want literally nothing to do with them after that. Incredibly bad behavior.


From their docs:

    Note: Please be patient. It may take a long time to train the data model the first time, even for a very small dataset. This is normal because PredictionIO implements a distributed algorithm by default, which is not optimized for small datasets. You can change that later.
Sums up my experience with the Mahout/Hadoop world nicely. Not a good fit for small-medium projects -- too complex, too cumbersome, too slow. By the time you really need the scale (=often never, save for using "Big Data" for marketing), you're big enough and know enough about your domain to roll a custom, efficient, domain-optimized solution.

Bringing machine learning to the masses is an honourable goal though, so thumbs up for PredictionIO.


Offtopic: Haha, I guess this is the extreme example of using original titles for submissions (see the discussion at: https://news.ycombinator.com/item?id=6572466)

I wonder if this title was given in jest.


It's great pg gave an explanation. Unfortunately, the justification and end result are no less absurd.

I think letting the community upvote and downvote the titles themselves would give a clear indication to mods for changing them. Instead, they choose to justify doing nothing by saying they don't have the resources to read all articles and evaluate each of them.

It just seems like intentional friction and reluctance to fix a recurring and aggravating issue that has so many viable solutions.


This is an example of people searching for things to complain about. If the YC folks spent the energy to address this problem, the same people would find something else to gripe about. The only difference made by not addressing the problem is that they didn't waste their time.


You can apply this logic to every problem and never address anything.


Only if every problem were as inconsequential as this one is.


Could you expand on how prediction.io would handle a real-world data set containing a few million items/users? How long would it take to generate a single user<->user recommendation at this scale? Does prediction.io require that I keep the whole dataset in main memory, and how much memory would I need?

I'm asking because for us (dawanda.com, one of the biggest e-commerce platforms in Germany) most of the development effort on our soon-to-be-open-sourced recommendation engine was spent on scaling the CF up from a few thousand test records to a 150 million record production data set.

In the first iteration we also built it completely in Scala, but as we put more and more data into it, memory usage exploded. We realized that boxed types had too much overhead and that we had to implement the whole sparse rating/similarity matrix in C [1]. We also decided to go for a hybrid memory/disk approach, which allowed us to process 80GB datasets on a machine with only 64GB of main memory.
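
To give a rough feel for what I mean by boxed-type overhead, here's a toy contrast (a sketch, not our actual code): the naive map-of-maps representation boxes every key and value and pays for a hash-table node per entry, while flat primitive arrays cost 12 bytes per non-zero entry and nothing more.

    object MemoryLayoutSketch extends App {
      // naive sparse rating matrix: user -> (item -> count); every Int key and
      // value gets boxed to java.lang.Integer and each entry drags a hash node along
      val naive = collection.mutable.Map[Int, collection.mutable.Map[Int, Int]]()

      // what the C rewrite amounts to: parallel primitive arrays,
      // 12 bytes per non-zero entry and no boxing (size the JVM heap accordingly)
      val nnz   = 150 * 1000 * 1000      // non-zero entries
      val rowId = new Array[Int](nnz)
      val colId = new Array[Int](nnz)
      val count = new Array[Int](nnz)
    }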

How did you manage to solve the memory consumption issue for prediction.io in Scala? Did you use Java raw memory access, or did you also swap out data to disk/SSD?

[1] http://github.com/paulasmuth/libsmatrix


PredictionIO is a serving and evaluation framework on top of a bunch of algorithms. Currently a majority of them come from the Apache Mahout library [1].

Computation time and resource requirements depend on the choice of technology. If a non-distributed implementation is chosen within the framework, the rule of thumb from Apache [2] is a good guideline. For distributed implementations based on Hadoop, the 10M MovieLens data set [3] finishes training on a single m1.large AWS instance (7.5GB RAM) within 30 minutes. Although we do not have an accurate account of how much computation time and resources would be required at your production data set's scale, a user with a production data set of similar size (2M users) has reported finishing training in about an hour using Amazon EMR.

That said, PredictionIO does not do anything special about memory consumption, nor does it have a special memory access model. It really depends on the underlying libraries that do the actual work.

We imagine your project requires a much faster turnaround time according to your spec, which is an interesting application to us as well.

PS. The work you posted is pretty cool. :)

[1] http://mahout.apache.org/ [2] https://cwiki.apache.org/confluence/display/MAHOUT/Recommend... [3] http://grouplens.org/datasets/movielens/


Has anyone integrated this into their project yet? Would be great to see examples of it working well.


We're using this at my work for a product recommendation engine. It's essentially a wrapper around Apache Mahout (which uses Hadoop). It makes the whole Hadoop/Mahout setup much more accessible, but it still has the same drawbacks (sucks memory like anything, lots of overhead in ramping up the jobs).


We use it for recommending games on http://clay.io. We opted to just use the easy-deploy version on AWS: https://aws.amazon.com/marketplace/pp/B00ECGJYGE

It was pretty straightforward to implement and has a nifty backend as well for managing the algorithms.


We're working on using it to recommend user-created events. Just getting started though; hopefully we'll have something to show next month. We haven't shown our site love in a while, but some info is at http://wobbleapp.me.


Interesting - the server is licensed under the AGPL, the clients are Apache v2.0, and there is a promise that a client is a separate "work".


I understand the reason the AGPL exists, but has it ever been used by a project successfully? AGPL projects can't (or rather, won't) be used by businesses, because corporate lawyers aren't dumb. So that leaves personal and academic projects. I've never seen a successful open source project that could survive on toy interest like that.


The MongoDB server is a clear example of this.


Thanks for pointing this out - I was under the assumption that MongoDB is under a dual GPL/commercial license.


Are there any benchmarks for the different prediction engines out there?


Can you please name the other engines you know of? We're currently looking for a good system, but have only found libs like Mahout so far. Thanks.


http://graphlab.org/ is also a good choice


We will open source our CF-based recommendation engine in the next few days. We use the code in production (on a site with a few hundred million views a month) with up to 80GB datasets for item<->item CF. The engine generates thousands of recommendations per second per core. HTTP round-trip time to fetch a recommendation from localhost is well below one millisecond.

I don't have the full engine source code or benchmarks to share right now (hopefully in the next two weeks, once it is approved by our legal department), but I can tell you all the heavy lifting is done by a library called "libsmatrix", which implements a fast, memory-efficient sparse matrix data structure that is used at the core of the CF algo. libsmatrix is 100% thread-safe and persists to disk (this way you can also work with datasets larger than the available main memory). This library is already open sourced:

    https://github.com/paulasmuth/libsmatrix
Using libsmatrix, you can build the rest of the "engine" in a matter of a hundred lines of C or so...

    #include "smatrix.h"
    smatrix_t* my_smatrix;

    // libsmatrix example: simple CF based recommendation engine
    int main(int argc, char **argv) {
      my_smatrix = smatrix_open(NULL);

      // one preference set = list of items in one session
      // e.g. list of viewed items by the same user
      // e.g. list of bought items in the same checkout
      uint32_t input_ids[5] = {12,52,63,76,43};
      import_preference_set(input_ids, 5);

      // generate recommendations (similar items) for item #76
      void neighbors_for_item(76);

      smatrix_close(my_smatrix);
      return 0;
    }

    // train / add a preference set (list of items in one session)
    void import_preference_set(uint32_t* ids, uint32_t num_ids) {
      uint32_t i, n;

      for (n = 0; n < num_ids; n++) {
        smatrix_incr(my_smatrix, ids[n], 0, 1);

        for (i = 0; i < pset->len; i++) {
          if (i != n) {
            smatrix_incr(my_smatrix, ids[n], ids[i], 1);
          }
        }
      }
    }

    // get recommendations for item with id "item_id"
    void neighbors_for_item(uint32_t item_id)
      uint32_t neighbors, *row, total;

      total = smatrix_get(my_smatrix, item_id, 0);
      neighbors = smatrix_getrow(my_smatrix, item_id, row, 8192);

      for (pos = 0; pos < neighbors; pos++) {
        uint32_t cur_id = row[pos * 2];

        printf("found neighbor for item %u: item %u with distance %f\n",
          item_id, cf_cosine(smatrix, cur_id, row[pos * 2 + 1], total));
      }

      free(row);
    }

    // calculates the cosine vector distance between two items
    double cf_cosine(smatrix_t* smatrix, uint32_t b_id, uint32_t cc_count, uint32_t a_total) {
      uint32_t b_total;
      double num, den;

      b_total = smatrix_get(smatrix, b_id, 0);

      if (b_total == 0)
        b_total = 1;

      num = cc_count;
      den = sqrt((double) a_total) * sqrt((double) b_total);

      if (den == 0.0)
        return 0.0;

      if (num > den)
        return 0.0;

      return (num / den);
    }


I'm gonna say the subtitle would be better for this submission :)

> PredictionIO, a machine learning server for data engineers and software developers.


Very cool, but for some reason I really just want to know if you're using text-to-speech for that short demo video. I honestly can't tell with certainty.


It's like they narrated the script with a TTS engine, then got an actual human to mimic it.


a collaborative filtering server.

It strikes me as a lot of overengineering with no real meat at the core.


Why no contact details? Is this a side project? If so, kudos to you on the execution. Very slick.


Server. What server? Do you mean an OS or just a server application?

edit

Why the downvotes? This is a good question. smh


This seems almost too good to be true. Does it solve everything?


Yes.

It. Solves. Everything.


So it just returns 42?


Is it functionally different from the Google Prediction APIs?


This project sucks, because there are no proper examples that actually tell me how I can make predictions.


Try their main site, http://prediction.io/. It has videos and other write-ups. It's still not thorough; I guess you might have to install it and play around with it to get a better understanding.



