I didn't implement the machine learning algorithms myself, because there are some really good packages out there and I know I don't have the smarts to better them.
Keep in mind that I didn't really have any success:
There seem to be two main ML packages, Weka and Orange. I personally preferred Orange: it has a nice graph-based UI for linking various components together, and once you've figured that out you can script it in Python. Orange also makes it easy to test your data set against various learning systems and compare their performance. Standard testing procedures like n-fold cross-validation are built in and really simple to use.
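For anyone who hasn't met n-fold cross-validation: you split the data into n folds, train on all but one, test on the one held out, and average the scores. Here is a minimal plain-Python sketch of that idea. It is not Orange's API; `k_fold_indices`, `cross_validate`, `train_majority` and `predict_majority` are hypothetical names made up for illustration, and the "learner" is just a majority-class baseline.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Split range(n_samples) into k shuffled, near-equal folds (needs k <= n_samples)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(rows, labels, k, train_fn, predict_fn, seed=0):
    """Return the mean accuracy over k train/test splits."""
    accuracies = []
    for test_idx in k_fold_indices(len(rows), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(rows)) if i not in held_out]
        model = train_fn([rows[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, rows[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)

# Hypothetical baseline learner: always predict the most common training label.
def train_majority(train_rows, train_labels):
    return max(set(train_labels), key=train_labels.count)

def predict_majority(model, row):
    return model
```

A baseline like this is worth running first: if a real learner can't beat it under cross-validation, the features probably aren't carrying any signal.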
Also you need data. I'm pretty sure more is always better. I actually started with greyhounds* and scraped mine from a website (in Python, use BeautifulSoup). I tried to come up with various statistics about the recent performance of the dogs. Unfortunately nothing I tried made the ML algorithms predict better than a random choice. A friend who's into gambling suggested greyhound racing was quite random by nature, so I've switched to horses recently. I'm still building that dataset, now trying out MongoDB just for fun.
I think the trouble is that you can have as much raw data as you like, but generating the predictive statistics requires a lot of knowledge of the problem domain. I'm not actually into gambling at all so I don't know if the track conditions are important, how much breeding or the age of the animal really matters etc. This made it hard to pick likely stats (and rebuilding datasets and retraining learners can take some time).
For horses there's a lot more information in forums and racing guides etc, so I'd start with horses. Just make sure you've tested your predictions with pretend bets before you commit any real money :)
Good luck!
*I began with greyhounds because of a dissertation posted on reddit where the authors suggested they'd had some success with a neural network and gave quite a lot of detail. That piqued my curiosity, and my initial version just re-implemented their work.
Yeah, I hated using Weka at uni. I'll look into Orange.
"I don't know if the track conditions are important, how much breeding or the age of the animal really matters etc."
Yeah, feature selection is a tough one. I'd thought that the system would pick up on good indicators by itself, but it might well be that that has to be a manual decision.
"Just make sure you've tested your predictions with pretend bets before you commit any real money"
Haha, yeah, absolutely. My plan was to train/test until the accuracy seemed good enough (using Monte Carlo) and then run the system on live data with pretend money for a few months to see what the actual performance is like, before investing real cash.
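One way to keep score during the pretend-money phase is a simple ledger of flat-stake bets. This is only a sketch: it assumes decimal odds and the same stake on every bet, and `paper_trade` is a made-up name.

```python
def paper_trade(bets, stake=1.0):
    """Total up pretend flat-stake bets.

    bets: iterable of (decimal_odds, won) tuples, one per pretend bet.
    Returns (profit, roi), where roi is profit divided by total staked.
    """
    staked = 0.0
    returned = 0.0
    for odds, won in bets:
        staked += stake
        if won:
            # Decimal odds include the stake, so a win at 3.0 returns 3x the stake.
            returned += stake * odds
    profit = returned - staked
    roi = profit / staked if staked else 0.0
    return profit, roi
```

For example, `paper_trade([(4.0, True), (2.0, False)])` stakes 2 units and gets 4 back, so it returns a profit of 2.0 and an ROI of 1.0. Over a few months of live data, an ROI that hovers at or below zero is the cue not to put real cash in.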
Do you have a link to the greyhound topic? I searched on google but couldn't find it.
I don't have a problem so much with having myriad statistics and picking the right ones, but not knowing which stats to generate in the first place from my database of results.
For example, I assume that a dog's past performance must be some indicator of its chances in the next race, but how do I account for the chances of dogs who didn't complete their last race? What weighting is the last race worth, compared to the ones before (perhaps it had a bad race, but on the whole is running well)?
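To make those two questions concrete, here is one possible form statistic, sketched under loud assumptions: recent races count more via an exponential decay, and a race the dog didn't finish is scored as last place. The function name `weighted_form`, the six-runner field, the decay of 0.5 and the DNF rule are all illustrative guesses, not anything validated against real results.

```python
def weighted_form(finishes, field_size=6, decay=0.5, dnf_position=None):
    """Recency-weighted form score in [0, 1]; 1.0 means it won every race.

    finishes: finishing positions (1 = won), most recent race first;
              None marks a race the dog did not complete.
    field_size: runners per race (assumed constant and greater than 1).
    decay: how much less each older race counts (assumption, needs tuning).
    dnf_position: position assigned to a DNF; defaults to last place.
    Returns None if there are no races on record.
    """
    if dnf_position is None:
        dnf_position = field_size
    total, weight_sum, weight = 0.0, 0.0, 1.0
    for pos in finishes:
        if pos is None:
            pos = dnf_position
        # Map position 1..field_size onto a score of 1.0..0.0.
        score = (field_size - pos) / (field_size - 1)
        total += weight * score
        weight_sum += weight
        weight *= decay
    return total / weight_sum if weight_sum else None
```

With these defaults a dog that ran last, then won twice before that (`[6, 1, 1]`) scores lower than one that won its last two after a bad race (`[1, 1, 6]`). The `decay` and `dnf_position` parameters are exactly the knobs you'd want to tune or search over rather than guess.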
I just don't know how to optimise for that sort of thing. I have a rough idea that some combination of genetic programming and GA could help - it would be an interesting challenge to build software that knew how to apply a selection of mathematical functions to my data, and then breed the results like a GA. But it's tricky, I'd have thought.
I've been treating the ML classifiers and learners as something of a black box; perhaps a more rigorous approach is required.
"software that knew how to apply a selection of mathematical functions to my data, and then breed the results like a GA"
Yeah, I'd envisaged using the accuracy of the neural net as the fitness function for a GA that mutates input parameters. It's another layer of complexity, and I've no clue how you'd start, but it seems like it would work.
In other words - use a GA to select features, using how well the NN trained on that set of features performs as the fitness function.
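As a rough starting point, the GA side of that could look something like the sketch below. It evolves a true/false mask over the features, with truncation selection, single-point crossover, bit-flip mutation and elitism. Everything here is an assumption for illustration: the name `ga_select_features`, the parameter defaults, and especially `toy_fitness`, which is a stand-in. In the real thing the fitness function would train the NN on the masked features and return its cross-validated accuracy, which is far slower.

```python
import random

def ga_select_features(n_features, fitness, pop_size=20, generations=30,
                       mutation_rate=0.05, seed=0):
    """Evolve a boolean feature mask; fitness(mask) returns a score to maximise.

    Assumes n_features >= 2 and pop_size >= 4.
    """
    rng = random.Random(seed)
    population = [[rng.random() < 0.5 for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:pop_size // 2]        # truncation selection
        children = [ranked[0]]                  # elitism: keep the best as-is
        while len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # single-point crossover
            child = mum[:cut] + dad[cut:]
            # Bit-flip mutation: occasionally toggle a feature on or off.
            child = [(not g) if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = children
    return max(population, key=fitness)

# Stand-in fitness (hypothetical): pretend features 0 and 2 help and the rest hurt.
def toy_fitness(mask):
    useful = {0, 2}
    return sum(1 if i in useful else -1 for i, on in enumerate(mask) if on)
```

Calling `ga_select_features(6, toy_fitness)` returns a six-element mask. Because every fitness call would mean a full train/test cycle with a real net, caching scores for masks already seen is probably the first optimisation worth making.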
Why did re-implementing the dissertation for greyhounds not work? Was the dissertation flawed?
If you're putting this much effort into it, why not stop by a horse racing track a few times and pick up some domain knowledge? Maybe you could even talk to race horse owners, jockeys, breeders?
I wonder if you could turn this into a product for breeders? Or maybe for people buying/selling race horses? Or people hiring jockeys, or even marketing an offshoot of this to the gamblers? Just some wild thoughts.
I've heard that the easiest way of predicting greyhound racing is to ignore the form book and monitor the odds changes following bets being placed at the very last minute by those with insider information...
Here's a random greyhound racing tip which a man in a bar told me, so it must be true: before the race, leave it as long as possible to bet and watch the dogs. The one that is quivering and dancing and looks most wound up generally wins. Wait until you get a race where only one dog looks that way.
You're welcome. One of the things I'm looking at now is adapting ranking systems from other sports or competitions. For example, I know that the Elo system from chess has been applied to other sports (I don't know the details, though, or what success they had).
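For what it's worth, here is one way Elo might be stretched to cover a race with more than two runners: treat the finishing order as a set of head-to-head results, one for every pair. That pairwise treatment is my own assumption for illustration, not a description of how any existing system does it, and `update_ratings`, the K-factor of 16 and the 1500 starting rating are arbitrary choices. Only `expected_score` is the standard Elo formula.

```python
from itertools import combinations

def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_ratings(ratings, finish_order, k=16, default=1500.0):
    """Update the ratings dict in place from one race and return it.

    finish_order: runner names, winner first. Every pair of runners is
    scored as a head-to-head win for whoever finished ahead (assumption).
    """
    current = {name: ratings.get(name, default) for name in finish_order}
    deltas = {name: 0.0 for name in finish_order}
    # combinations() preserves order, so `ahead` always finished before `behind`.
    for ahead, behind in combinations(finish_order, 2):
        gain = k * (1.0 - expected_score(current[ahead], current[behind]))
        deltas[ahead] += gain
        deltas[behind] -= gain
    for name in finish_order:
        ratings[name] = current[name] + deltas[name]
    return ratings
```

Starting from scratch, `update_ratings({}, ["a", "b", "c"])` moves the winner to 1516, leaves the middle runner at 1500 and drops the last to 1484. The rating difference between two runners could then go in as one more input feature for the learner.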