Okay, this is awesome (and easy to miss with just a cursory skim):
"5.3 Delicious Chocolate Chip Cookies
Vizier is also used to solve complex black–box optimization
problems arising from physical design or logistical problems.
Here we present an example that highlights some additional
capabilities of the system: finding the most delicious chocolate chip cookie recipe from a parameterized space of recipes...
We provided recipes to contractors responsible for providing desserts for Google employees. The head chefs among
the contractors were given discretion to alter parameters if
(and only if) they strongly believed it to be necessary, but
would carefully note what alterations were made. The cookies
were baked, and distributed to the cafes for taste–testing.
Cafe goers tasted the cookies and provided feedback via a
survey. Survey results were aggregated and the results were
sent back to Vizier. The “machine learning cookies” were
provided about twice a week over several weeks...
The cookies improved significantly over time; later rounds
were extremely well-rated and, in the authors’ opinions, delicious."
I worked for Google at the time they went through the cookie iterations in the cafes. Personally, I disliked all the cookies that ever came out of this project, mainly because they contained spices; I heard the same from many others.
After you got handed a cookie, you were asked to fill out a feedback form on a tablet, which is almost always annoying.
But more importantly, the process of optimizing the ingredients left out many of the things that make food enjoyable, temperature and texture for example.
I know this was meant to be a fun showcase of ML, but to me this is still my favourite example of misuse of ML technology, where a simpler statistical model supported by expert opinion would have produced better-tasting cookies. E.g., any cookie expert might know that people who like spicy cookies are only a small subset of all cookie lovers.
If I recall correctly, the spices were an intentional choice, a kind of pathological input to the optimizer, and it did what you might expect (it minimized the amount of cayenne to the smallest value in the acceptable range).
I've heard this was a problem because the spice level was optimized in an office where they make the cookie dough the day of service, but deployed in an office where they make the dough the night before and cook on demand. Sitting in the dough longer made the spices taste stronger than what they were optimized for.
I think this is exactly the kind of case where ML can go wrong. I would assume the people who hate the cookies probably won't bother to submit a survey and try again. The sample might be biased and unrepresentative of the general population, so you overfit and perform badly on a broader test set.
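As a toy illustration of that bias (all numbers invented, nothing to do with the actual survey): if the people who dislike a batch are less likely to respond, the observed average drifts up even though the recipe hasn't improved.

    # Toy sketch of survey non-response bias (invented numbers, not Google's data).
    import numpy as np

    rng = np.random.default_rng(0)

    # True opinions of 1000 cafe goers on a 1-5 scale: most are lukewarm on the
    # spicy cookies, a minority loves them.
    true_ratings = np.concatenate([
        rng.integers(1, 3, size=800),   # 800 people would rate it 1-2
        rng.integers(4, 6, size=200),   # 200 people would rate it 4-5
    ])

    # People who like the cookie are far more likely to bother with the tablet survey.
    response_prob = np.where(true_ratings >= 4, 0.8, 0.1)
    responded = rng.random(true_ratings.size) < response_prob

    print("true mean rating:    ", round(true_ratings.mean(), 2))             # ~2.1
    print("observed mean rating:", round(true_ratings[responded].mean(), 2))  # ~3.5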
It's great that they published this - I wasn't sure if I could share this externally. I see they left out the ingredient that unexpectedly improved the scores, presumably for proprietary reasons.
IIRC, if the average egg is 50g, the yolk is 20g and the white is 30g... it's still up to the foodmage to interpret which part to use for 1oz (yolk adds taste/color, white has binding properties).
They're most likely holding it back for a bigger media event where they'll show the cookies off and every tech blog on the internet will run a story about them.
Oh darn, you're right (to clarify, I'm pretty sure those are Vizier cookies, just not the Vizier cookies). I have the recipe sitting on my counter at home. I thought it was public, but it doesn't seem to exist online anywhere.
I noted this section when I read the paper as well...
This is cute, but "in the authors' opinions, delicious" does not contribute anything to scientific research. Yes, you can use black-box optimization for cookie recipes. No, you should not make any sort of performance claims nor talk about "significant" improvement unless you back it up.
> Cafe goers tasted the cookies and provided feedback via a survey. Survey results were aggregated and the results were sent back to Vizier.
This, from the paper, is a great start on measuring deliciousness. But if you go to that effort, why not provide a benchmark, such as the existing chocolate chip cookie recipe or the average scores from 10 random chocolate chip cookie recipes? Even a comparison as simple as the sketch below would do.
As a reader, I might suspect that performance was not that great compared to baselines but that the authors were able to mask it by making this qualitative claim ("the cookies were delicious") -- which I'm sure was still true!
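For what it's worth, the comparison I'm asking for would take one line to report; a sketch with made-up scores, just to show the shape of it:

    # Hypothetical comparison of survey scores: optimized recipe vs. the cafe's
    # existing recipe as a baseline. All numbers here are invented.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    baseline_scores = rng.normal(loc=3.4, scale=0.9, size=200)   # existing recipe
    optimized_scores = rng.normal(loc=3.7, scale=0.9, size=200)  # final Vizier round

    t, p = stats.ttest_ind(optimized_scores, baseline_scores)
    print(f"baseline mean = {baseline_scores.mean():.2f}, "
          f"optimized mean = {optimized_scores.mean():.2f}, t = {t:.2f}, p = {p:.3g}")

A single comparison like that would back up "improved significantly" far better than "in the authors' opinions, delicious".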
But it really isn't a great start on measuring deliciousness.
As others have said elsewhere in this thread, if after a few batches the algorithm settles into a local maximum that some people don't like but others really love, then the ones who dislike it will stop taking the cookies and thus stop rating them.
But does that mean they're less delicious? Nobody knows! The first and foremost step in measuring deliciousness is doing research on what deliciousness is and whether there are good methods to measure it that can be compared with other literature and research. If that fails, either measure something else or defend why you chose a target metric that can't be measured properly. The latter is worthwhile if your way of measuring is novel and you think it's better than what has been used so far in the field. But you have to provide a good argument for that, and the method used in this experiment wasn't particularly novel, simply flawed.
I mean you can be all giggly-googly about it, but food science is a thing.
This really assumes that there is one true deliciousness factor, which is patently false given the spaghetti sauce optimization story (in which the previously unknown chunky sauce was discovered to be a non-trivial market segment).
This seems more like a spanning/covering problem (how to cover the majority of preferences while not needing too many varieties) than a pure optimization problem, but then again it's just a sample use case to sell their API, so they probably didn't think too hard about it.
To the extent that deliciousness is a physiologically measurable phenomenon, or there are proxy measures that correlate with it, yes, there is an objective way. You give large numbers of people cookies and have them rate them. Tastes vary, but generally if you make something with umami, crunch, fat, and sugar, people will say it is delicious.
Whetlab, a startup founded by some former colleagues of mine, provided a service just like this 4 years ago (in fact, its open-source variant, Spearmint, is referenced in the Vizier paper), but unfortunately, it was acquired and shut down by Twitter: https://venturebeat.com/2015/06/17/twitter-acquires-machine-...
Black-box optimization is a hugely important problem to solve, especially when experiments require real wet-lab work (e.g. medicine, chemistry). Kudos to Google for commercializing this - I expect it will see a lot of use in those fields. But it's bittersweet to know that it's taken this long for this type of application to be promoted like this.
Yeah, SigOpt has clearly taken the space left by Whetlab. They must not be too pleased about this announcement; generally the worst thing that can happen to your startup is a tech giant launching in your space. Then again, the problem of hyperparameter/black-box optimization is so ubiquitous that there should be enough space for them both.
I wasn't familiar with SigOpt - this is really cool. I actually think SigOpt will see this as great publicity telling people about the usefulness of the concept - and whereas Google won't lift a finger for (insert hedge fund or pharmaceutical giant here) beyond maintaining uptime, SigOpt can provide those customers with customized advice about how to integrate, what pitfalls to watch for, how to design experiments to maximally take advantage of their technology.
And they're competitive with Spearmint (though not necessarily the closed-source versions of it used at Whetlab); how they compare with Vizier remains to be seen: https://arxiv.org/pdf/1603.09441.pdf
Thanks! I'm one of the co-founders of SigOpt (YC W15).
You hit the nail on the head. We've been trying to promote more sophisticated optimization approaches since the company formed 4 years ago and are happy to see firms like Google, Amazon, IBM, and SAS enter the space. We definitely feel like the tide of education lifts all boats. Literally everyone doing advanced modeling (ML, AI, simulation, etc) has this problem and we're happy to be the enterprise solution to firms around the world like you mentioned. We provide differentiated support and products from some of these methods via our hosted ensemble of techniques behind a standardized, robust API.
We're active contributors to the field as well via our peer reviewed research [1], sponsorships of academic conferences like NIPS, ICML, AISTATS, and free academic programs [2]. We're super happy to see more people interested in the field and are excited to see where it goes!
If you click through the explanation in the documentation:
> Cloud Machine Learning Engine is a managed service that enables you to easily build machine learning models that work on any type of data, of any size. And one of its most powerful capabilities is HyperTune, which is hyperparameter tuning as a service using Google Vizier.
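For anyone unfamiliar with what "hyperparameter tuning as a service" looks like from the client side, the loop is roughly the following. The Tuner here is a hypothetical stand-in (just random search), not the actual HyperTune, Vizier, or SigOpt API:

    # Sketch of the suggest / evaluate / report loop behind any tuning service.
    # RandomSearchTuner is a hypothetical local stand-in for the remote service.
    import random

    class RandomSearchTuner:
        def __init__(self, space):
            self.space = space        # {name: (low, high)}
            self.history = []         # (params, objective) pairs

        def suggest(self):
            return {k: random.uniform(lo, hi) for k, (lo, hi) in self.space.items()}

        def report(self, params, objective):
            self.history.append((params, objective))

    def train_and_evaluate(params):
        # Placeholder for an expensive training run; returns a validation metric.
        return 1.0 - (params["learning_rate"] - 0.01) ** 2 - 0.1 * params["dropout"]

    tuner = RandomSearchTuner({"learning_rate": (1e-4, 0.1), "dropout": (0.0, 0.5)})
    for _ in range(20):               # each trial is one full training run
        params = tuner.suggest()
        tuner.report(params, train_and_evaluate(params))

    best_params, best_value = max(tuner.history, key=lambda h: h[1])
    print(best_params, best_value)

The service's value is entirely in how suggest() picks the next trial (Bayesian optimization and friends instead of random sampling) and in running that loop reliably at scale.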
SigOpt (YC W15) is a related service that performs a superset of these features as a SaaS API (I am one of the co-founders).
We've been solving this problem for customers around the world for the last 4 years and have extended a lot of the original research I started at Yelp with MOE [1]. We employ an ensemble of optimization techniques and provide seamless integration with any pipeline. I'm happy to answer any questions about the product or technology.
We're completely free for academics [2] and publish research like the above at ICML, NIPS, AISTATS regularly [3].
It is referenced in the paper, but it's worth pointing out Yelp's MOE. This is an open source (Apache 2.0) implementation for black-box optimization. It doesn't look like it is well maintained, but it is interesting nonetheless. https://github.com/Yelp/MOE
Hi, I'm one of the co-authors of MOE (it was a fork of my PhD thesis).
We haven't been actively maintaining the open source version of MOE because we built it into a SaaS service 4 years ago as SigOpt (YC W15). Since then we've also had some of the authors of BayesOpt and HyperOpt work with us.
Let me know if you'd like to give it a shot. It is also completely free for academics [1]. If you'd like to see some of the extensions we've made to our original approaches as we've built out our ensemble of optimization algorithms check out our research page [2].
Only up to 64 variables? Why such small problems? And why are their results averaged across all of the different test functions? I'd like to see the performance difference between Rosenbrock and Rastrigin, thank you very much. And they have a weird fixation on stopping rules, when in general your stopping rule is how many evaluations you can afford. Was this written by interns? There's no discussion of the retrogression by CMAES on small problems? What parameters does their algorithm have and how sensitive is it to its parameterization? What parameters did they use for the competing algorithms? If I were reviewing this paper I'd have these and other pointed questions for them. It's as if they did a cursory lit review and then decided to reinvent the entire field. Their one contribution seems to be not allowing you to run it on your own hardware.
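For reference, these are the standard definitions of the two test functions I mean; averaging over them hides the fact that one is a smooth curved valley and the other is a lattice of local minima:

    # Textbook benchmark definitions (standard formulas, not taken from the paper).
    import numpy as np

    def rosenbrock(x):
        # Unimodal, but with a long curved valley that is slow to traverse.
        x = np.asarray(x, dtype=float)
        return np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1.0 - x[:-1]) ** 2)

    def rastrigin(x):
        # Highly multimodal: a regular grid of local minima around the global one.
        x = np.asarray(x, dtype=float)
        return 10.0 * x.size + np.sum(x ** 2 - 10.0 * np.cos(2.0 * np.pi * x))

    print(rosenbrock(np.ones(10)))   # 0.0 at the global minimum (1, ..., 1)
    print(rastrigin(np.zeros(10)))   # 0.0 at the global minimum (0, ..., 0)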
>Only up to 64 variables? Why such small problems?
This is a tool tuned for optimizing the hyperparameters of machine learning models. In practice, most models have, maybe, 10 hyperparameters in the normal sense (learning rate, nonlinearity, etc.). If you start talking about model structure as a hyperparameter, you can get vastly more than 64, but then these techniques aren't great anyway.
So why such small problems? Because the blackbox function you're optimizing takes hours to run. So as another user mentions, if it takes you 700 function evaluations, you're gonna be running for the better part of a year.
So the domain in which Vizier works is one where you are probably never using more than ~100 evaluations, and often even then you would prefer to stop early if you meet certain conditions (because wasting significant compute on unnecessary optimization is costly).
As a concrete example, a single iteration of a relatively small and non-SOTA algorithm takes 20 minutes and $40 to run. [0]
So stopping a day early saves you a day and $3000. Now scale that up 5 or 10x for larger models. (another way of putting this is that a day and $10K might be worth a .5% increase in accuracy, but isn't worth a .05% increase in accuracy. Stopping rules can encode that).
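Spelled out, with the illustrative numbers from above:

    # Back-of-the-envelope cost of one extra day of tuning at ~20 minutes and
    # ~$40 per training run (illustrative numbers from the comment above).
    minutes_per_run = 20
    cost_per_run = 40  # USD

    runs_per_day = 24 * 60 // minutes_per_run     # 72 runs
    cost_per_day = runs_per_day * cost_per_run    # 2880, i.e. roughly $3000

    print(f"{runs_per_day} runs/day, ~${cost_per_day}/day")
    print(f"at 5-10x the model size: ~${5 * cost_per_day}-${10 * cost_per_day}/day")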
It is 64 hyper-parameters, not variables, which is a huge number.
Stopping rules are also a requirement of tuning. You can't just have a while True:. If an objective function doesn't improve after 100 trials (or any number), there needs to be logic to stop the trials, since this system is apparently serving all of Alphabet.
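The logic involved is tiny; a patience-style rule is enough (a sketch, not what Vizier actually does):

    # Minimal "no improvement in the last N trials" stopping rule (a sketch,
    # not Vizier's actual logic). Assumes a maximization objective.
    def should_stop(objective_history, patience=100, min_delta=0.0):
        if len(objective_history) <= patience:
            return False
        best_recent = max(objective_history[-patience:])
        best_before = max(objective_history[:-patience])
        return best_recent <= best_before + min_delta

    # Example: the objective climbs for 50 trials, then plateaus at 0.9.
    history = [0.5 + 0.008 * i for i in range(50)] + [0.9] * 200
    for t in range(1, len(history) + 1):
        if should_stop(history[:t]):
            print(f"stop after trial {t}")   # trial 151: 100 flat trials past the peak
            break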
Surely that reinforces my point? That what matters isn't the stopping rule but your evaluation budget? If you can afford 100 evaluations, do 100 evaluations. Pick any stopping rule based on the objective function and I can construct you a scenario that either has you spending a long time on the wrong side of it, or stopping prematurely. Might as well have predictable behavior.
I think that in the context he meant, 64 variables = 64 hyperparameters, because it was about benchmark problems solved in a black-box setting. In other words, it is not about the common misconception of comparing the number of hyperparameters with the number of variables (weights) of some ML model, where the latter can be on the order of 10^7.
Please note that their work is primarily for expensive optimization where you cannot afford millions of function evaluations like you would need on rotated Rastrigin to solve it exactly for n>64.
> How long it will take CMAES to solve 10-dimensional Rosenbrock? Here, it takes about 700 function evaluations
It really depends on parameterization though. Did they use off-the-shelf parameters? Did they make any attempt to tune CMAES for their test functions? Did they make any attempt to tune their own optimizer for the test functions? It's really easy to put your thumb on the scale when you're writing a paper about a black-box optimizer, and my default reaction is to assume that any paper claiming to equal or outdo CMAES is pure B.S. This paper did nothing to convince me otherwise.
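To make the point concrete, this is the kind of sensitivity check I'd want to see; a quick toy comparison with the pycma package (pip install cma), using its documented cma.fmin interface. None of this is from their paper, and the exact numbers will vary by seed:

    # Toy check of how sensitive CMA-ES is to its initial step size (sigma0)
    # on 10-D Rosenbrock, using the pycma package.
    import cma

    def rosenbrock(x):
        return sum(100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (1.0 - x[i]) ** 2
                   for i in range(len(x) - 1))

    for sigma0 in (0.1, 0.5, 2.0):   # the main off-the-shelf knob one might tune
        res = cma.fmin(rosenbrock, [0.0] * 10, sigma0,
                       {'maxfevals': 3000, 'verbose': -9})
        print(f"sigma0={sigma0}: best f = {res[1]:.3g}")   # per the pycma docs, res[1] is the best value found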
I had come across this library - [1] - a while ago that claims to be the open source equivalent of Vizier. Does anyone know how well justified the claim is?
I know a guy who told me 8 years ago about an idea he had for a "black box" optimization consulting service. You give him a pile of data, he uses ML to optimize for whatever.
FYI, there is also a different "automl" suite of libraries maintained by Frank Hutter's lab at the University of Freiburg - [1]. His group has been leading a lot of research in the area of hyperparameter tuning.
"5.3 Delicious Chocolate Chip Cookies
Vizier is also used to solve complex black–box optimization problems arising from physical design or logistical problems. Here we present an example that highlights some additional capabilities of the system: finding the most delicious chocolate chip cookie recipe from a parameterized space of recipes... We provided recipes to contractors responsible for providing desserts for Google employees. The head chefs among the contractors were given discretion to alter parameters if (and only if) they strongly believed it to be necessary, but would carefully note what alterations were made. The cookies were baked, and distributed to the cafes for taste–testing. Cafe goers tasted the cookies and provided feedback via a survey. Survey results were aggregated and the results were sent back to Vizier. The “machine learning cookies” were provided about twice a week over several weeks...
The cookies improved significantly over time; later rounds were extremely well-rated and, in the authors’ opinions, delicious."