Google Vizier: A Service for Black-Box Optimization (ai.google)
208 points by captn3m0 on Aug 27, 2018 | 80 comments



Okay, this is awesome (and easy to miss with just a cursory skim):

"5.3 Delicious Chocolate Chip Cookies

Vizier is also used to solve complex black-box optimization problems arising from physical design or logistical problems. Here we present an example that highlights some additional capabilities of the system: finding the most delicious chocolate chip cookie recipe from a parameterized space of recipes... We provided recipes to contractors responsible for providing desserts for Google employees. The head chefs among the contractors were given discretion to alter parameters if (and only if) they strongly believed it to be necessary, but would carefully note what alterations were made. The cookies were baked, and distributed to the cafes for taste-testing. Cafe goers tasted the cookies and provided feedback via a survey. Survey results were aggregated and the results were sent back to Vizier. The “machine learning cookies” were provided about twice a week over several weeks...

The cookies improved significantly over time; later rounds were extremely well-rated and, in the authors’ opinions, delicious."


I worked for Google at the time they went through the cookie iterations in the cafes. Personally I disliked all the cookies that ever came out of this project, mainly because they contained spices. I heard the same thing from many others.

After you got handed a cookie you were asked to fill out a feedback form on a tablet, which almost always is annoying.

But more importantly, the process of optimizing the ingredients left out many parts of what makes food enjoyable, temperature and texture for example.

I know this was meant to be a fun showcase of ML, but to me this is still my favourite example for explaining the misuse of ML technology: a simpler statistical model supported by expert opinion would have produced better-tasting cookies. E.g., any cookie expert might know that the number of people who like spicy cookies is only a small subset of all cookie lovers.


If I recall, the spices were an intentional choice of a kind of pathological input to the optimizer, and iirc it did what you might expect (minimize the amount of cayenne to the smallest in the acceptable range).


I've heard this was a problem since the spice level was optimized in an office where they make the cookie dough the day of service, but deployed in an office where they make the dough the night before and cook on demand. The spices sitting in the dough longer made the flavor stronger than it was optimized for.

A good example of training/serving skew.


I think this is exactly how ML can go wrong. I would assume the people who hate the cookies probably will not bother to submit a survey and try again. The sample might be biased and not representative of the general population, so the result overfits and performs badly on a broader test set.


Our dystopian future, where we will be force-fed cookies in order to get accurate optimization.


There was an ice cream advertisement with that as a theme.

https://www.youtube.com/watch?v=j4IFNKYmLa8


That already happens :-)


Sounds like you got the less tasty side of the explore/exploit tradeoff.


I wonder if there was an ethnic bias at play with the population that participated since spice level tolerance is higher in Asian countries.

https://diversity.google/annual-report/


It's great that they published this - I wasn't sure if I could share this externally. I see they left out the ingredient that unexpectedly improved the scores the most, presumably for proprietary reasons.


Are you talking about orange extract? or cayenne pepper? The post is here: https://ai.google/research/pubs/pub46507 and direct link to paper is here: https://storage.googleapis.com/pub-tools-public-publication-...

And the recipe I found internally is "The Real California Cookie" in that paper:

Bake at 163°C (325°F) until brown:

• 167 grams of all-purpose flour.

• 245 grams of milk chocolate chips.

• 0.60 tsp. baking soda.

• 0.50 tsp. salt.

• 0.125 tsp. cayenne pepper.

• 127 grams of sugar (31% medium brown, 69% white).

• 25.7 grams of egg.

• 81.3 grams of butter.

• 0.12 tsp. orange extract.

• 0.75 tsp. vanilla extract.


What an odd combination of units. Here's a translation:

• 5⅞ oz. of all-purpose flour.

• 8⅝ oz. of milk chocolate chips.

• scant ½ + ⅛ tsp. baking soda.

• ½ tsp. salt.

• ⅛ tsp. cayenne pepper.

• 4½ oz. of sugar (31% medium brown, 69% white).

• 1 oz. of egg.

• 2¾ oz. of butter.

• ⅛ tsp. orange extract.

• ¾ tsp. vanilla extract.

Or, equivalently:

• 167 grams of all-purpose flour.

• 245 grams of milk chocolate chips.

• 2.96 mL baking soda.

• 2.46 mL salt.

• 0.62 mL cayenne pepper.

• 127 grams of sugar (31% medium brown, 69% white).

• 25.7 grams of egg.

• 81.3 grams of butter.

• 0.59 mL orange extract.

• 3.7 mL vanilla extract.
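For anyone who wants to check the teaspoon-to-millilitre figures, here is a minimal sketch of the arithmetic. It assumes the recipe means US customary teaspoons (about 4.93 mL each); the names `ML_PER_US_TSP`, `tsp_to_ml`, and `recipe_tsp` are made up for illustration.

```python
# Assumption: the recipe uses US customary teaspoons (about 4.93 mL each).
ML_PER_US_TSP = 4.92892159375


def tsp_to_ml(tsp: float) -> float:
    """Convert US teaspoons to millilitres."""
    return tsp * ML_PER_US_TSP


# Teaspoon quantities copied from the recipe quoted above.
recipe_tsp = {
    "baking soda": 0.60,
    "salt": 0.50,
    "cayenne pepper": 0.125,
    "orange extract": 0.12,
    "vanilla extract": 0.75,
}

recipe_ml = {name: tsp_to_ml(amount) for name, amount in recipe_tsp.items()}

for name, ml in recipe_ml.items():
    print(f"{name}: {ml:.2f} mL")
```

A metric teaspoon (5 mL) would give slightly larger numbers, so the choice of teaspoon matters at this precision.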


Still no good because what the hell is 1oz of egg?


Meanwhile I'm feeling pretty good for buying both the 1/10th tsp and the 1/8th tsp scoops right now.

Unfortunately my microliter precision pipette is in the shop otherwise I would be enjoying some of these right now!


iirc, if avg = 50g, yolk is 20g, white is 30g... still up to the foodmage to interpret which part for 1oz (yolk adds taste/color, white has binding properties).


Can we get it to push out more useful output on how precise these measurements have to be to affect taste?

(say, as spectra)


An article linked in a sibling comment to yours mentioned szechuan pepper but it didn't show up in any of the recipes I've seen so far.


They most likely are holding it back for a bigger media event where they will show them off and every tech blog on the internet will run a story about them.


You're not thinking of these gluten-free cardamom cookies? https://www.blog.google/technology/research/makings-smart-co...


That ingredient list is missing the "secret ingredient" that your parent post is talking about, so nah.


Weed?


Good application for Ben Krasnow's Cookie Perfection machine - http://benkrasnow.blogspot.com/2014/01/cookie-perfection-mac...


Seems to be missing the key ingredient of the best tasting chocolate chip cookies: espresso powder.


Is this possibly related to the cookie machine by Ben Krasnow from several years ago? [1]

[1] https://www.youtube.com/watch?v=8YEdHjGMeho


Did they post the recipe?



I heard they also provided samples of the cookies along with the recipes/writeup at NIPS.


Yup, they did. They handed them out at their booth.


They didn't.

I am very disappointed. I, for one, welcome our new robot overlords and also want to try their chocolate chip cookies.



These are not the vizier cookies.


Oh darn you're right (to clarify, I'm pretty sure those are vizier cookies, just not the vizier cookies). I have the recipe sitting on my counter at home. I thought it was public, but it doesn't seem to exist online anywhere.


The secret formula is love!


Why do I have an idea about recipe optimisation as a service?


I noted this section when I read the paper as well...

This is cute, but "in the authors' opinions, delicious" does not contribute anything to scientific research. Yes, you can use black-box optimization for cookie recipes. No, you should not make any sort of performance claims nor talk about "significant" improvement unless you back it up.


That's the authors' sense of humour. The more relevant part for science is "extremely well-rated".


Their back up: later rounds were extremely well-rated


How about "My brand new image recognition architecture performs extremely well on benchmarks"

But it's worse than the last 100 SOTA models...


> in the authors' opinions, delicious

Can there be/Is there an objective way to measure deliciousness? If not what can they really say?


> Cafe goers tasted the cookies and provided feedback via a survey. Survey results were aggregated and the results were sent back to Vizier.

This, from the paper, is a great start on measuring deliciousness. But if you go to that effort, why not provide a benchmark? Such as the existing chocolate chip cookie recipe or the average scores from 10 random chocolate chip cookie recipes?

As a reader, I might suspect that performance was not that great compared to baselines but that the authors were able to mask it by making this qualitative claim ("the cookies were delicious") -- which I'm sure was still true!


But it really is not a very great start on measuring deliciousness.

Like said elsewhere in this thread, if after a few batches the algorithm optimizes into a local maximum gully that some people don't like, but others really love, then the ones that dislike it will stop taking the cookies and thus stop rating them.

But does that mean they're less delicious? Nobody knows! Because certainly the first and foremost great start on measuring deliciousness, is doing research on what is deliciousness, and are there any good methods to measure deliciousness that can be compared with other literature and research, and if that fails, either measure something else or defend why you chose a target metric that can't be measured properly. The latter being valuable if your way of measuring is novel and you think it's better than what has been used so far in the field. But you gotta provide a good argument for that, and the method used in this experiment wasn't particularly novel, simply flawed.

I mean you can be all giggly-googly about it, but food science is a thing.
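The drop-out effect described in this comment can be illustrated with a toy calculation. This is only a sketch with invented numbers (40 tasters who rate the cookie 5, 60 who rate it 1, and most of the latter skipping the second round); none of it comes from the paper.

```python
def mean(ratings):
    """Average of a non-empty list of ratings."""
    return sum(ratings) / len(ratings)


# Hypothetical population: the cookie itself never changes between rounds.
lovers = [5] * 40   # tasters who rate the cookie 5 out of 5
haters = [1] * 60   # tasters who rate the cookie 1 out of 5

# Round 1: everyone takes a cookie and fills in the survey.
round1 = lovers + haters

# Round 2: most of the people who disliked it no longer take a cookie,
# so only 10 of the 60 low ratings are recorded.
round2 = lovers + haters[:10]

print(mean(round1))  # observed score when everyone responds
print(mean(round2))  # observed score after the drop-outs
```

The observed average rises between rounds even though the recipe and everyone's true preferences are identical, which is why survey scores alone cannot separate "the cookie got better" from "the unhappy tasters left".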


This technique is known as anchoring and should be used alongside a descriptive scale when making subjective comparisons.


Yep. There's a lot of pre-existing science that this research ignored ...


This really assumes that there is one-true-deliciousness factor, which is patently false if the spaghetti sauce optimization exists. (in which the previously unknown chunky sauce actually was discovered to be a non-trivial market segment)

This seems more like a spanning/optimization problem (how to cover the majority of preferences while not needing too many varieties) than an actual optimization problem, but then again it's just a sample use case to use to sell their API, so they probably didn't think too hard about it.


To the extent that deliciousness is a physiologically measurable phenomenon, or there are proxy measures that correlate with it, yes, there is an objective way. You give large numbers of people cookies and have them rate them. Tastes vary, but generally if you make something with umami, crunch, fat, and sugar, people will say it is delicious.


Whetlab, a startup founded by some former colleagues of mine, provided a service just like this 4 years ago (in fact, it's referenced in the Vizier paper as its open-source variant Spearmint), but unfortunately, it was acquired and shut down by Twitter: https://venturebeat.com/2015/06/17/twitter-acquires-machine-...

Black-box optimization is a hugely important problem to solve, especially when experiments require real wet-work (i.e. medicine, chemistry, etc.). Kudos to Google for commercializing this - I expect it will see a lot of use in those fields. But it's bittersweet to know that it's taken this long for this type of application to be promoted like this.


Whetlab was the first startup in this domain. It was followed by SigOpt (https://sigopt.com/), which is doing a great job at popularizing the concept.


Yeah SigOpt clearly has taken the space left by Whetlab. They must not be too pleased about this announcement, I mean generally the worst thing that can happen to your startup is a tech giant launching in your space. Then again the problem of hyperparameter/black-box optimization is so ubiquitous that there should be enough space for them both.


I wasn't familiar with SigOpt - this is really cool. I actually think SigOpt will see this as great publicity telling people about the usefulness of the concept - and whereas Google won't lift a finger for (insert hedge fund or pharmaceutical giant here) beyond maintaining uptime, SigOpt can provide those customers with customized advice about how to integrate, what pitfalls to watch for, how to design experiments to maximally take advantage of their technology.

And they're competitive with Spearmint (though not necessarily the closed-source versions of it used at Whetlab), though Vizier remains to be seen: https://arxiv.org/pdf/1603.09441.pdf


Thanks! I'm one of the co-founders of SigOpt (YC W15).

You hit the nail on the head. We've been trying to promote more sophisticated optimization approaches since the company formed 4 years ago and are happy to see firms like Google, Amazon, IBM, and SAS enter the space. We definitely feel like the tide of education lifts all boats. Literally everyone doing advanced modeling (ML, AI, simulation, etc) has this problem and we're happy to be the enterprise solution to firms around the world like you mentioned. We provide differentiated support and products from some of these methods via our hosted ensemble of techniques behind a standardized, robust API.

We're active contributors to the field as well via our peer reviewed research [1], sponsorships of academic conferences like NIPS, ICML, AISTATS, and free academic programs [2]. We're super happy to see more people interested in the field and are excited to see where it goes!

[1]: https://sigopt.com/research

[2]: https://sigopt.com/edu


Spearmint has quite a few practical drawbacks so I guess that both Vizier and SigOpt are much better by now.


This product is publicly available as a part of the hyperparameter tuning in Cloud ML Engine (although I haven't played with it): https://cloud.google.com/ml-engine/docs/tensorflow/hyperpara...


Why do you think it is Vizier, and not something developed specifically for Cloud?


If you click through the explanation in the documentation:

> Cloud Machine Learning Engine is a managed service that enables you to easily build machine learning models that work on any type of data, of any size. And one of its most powerful capabilities is HyperTune, which is hyperparameter tuning as a service using Google Vizier.

https://cloud.google.com/blog/products/gcp/hyperparameter-tu...


SigOpt (YC W15) is a related service that performs a superset of these features as a SaaS API (I am one of the co-founders).

We've been solving this problem for customers around the world for the last 4 years and have extended a lot of the original research I started at Yelp with MOE [1]. We employ an ensemble of optimization techniques and provide seamless integration with any pipeline. I'm happy to answer any questions about the product or technology.

We're completely free for academics [2] and publish research like the above at ICML, NIPS, AISTATS regularly [3].

[1]: https://github.com/yelp/MOE

[2]: https://sigopt.com/edu

[3]: https://sigopt.com/research


It is referenced in the paper, but it's worth pointing out Yelp's MOE. This is an open source (Apache 2.0) implementation for black-box optimization. It doesn't look like it is well maintained, but it is interesting nonetheless. https://github.com/Yelp/MOE


Hi, I'm one of the co-authors of MOE (it was a fork of my PhD thesis).

We haven't been actively maintaining the open source version of MOE because we built it into a SaaS service 4 years ago as SigOpt (YC W15). Since then we've also had some of the authors of BayesOpt and HyperOpt work with us.

Let me know if you'd like to give it a shot. It is also completely free for academics [1]. If you'd like to see some of the extensions we've made to our original approaches as we've built out our ensemble of optimization algorithms check out our research page [2].

[1]: https://sigopt.com/edu

[2]: https://sigopt.com/research


Only up to 64 variables? Why such small problems? And why are their results averaged across all of the different test functions? I'd like to see the performance difference between Rosenbrock and Rastrigin, thank you very much. And they have a weird fixation on stopping rules, when in general your stopping rule is how many evaluations you can afford. Was this written by interns? There's no discussion of the retrogression by CMAES on small problems? What parameters does their algorithm have and how sensitive is it to its parameterization? What parameters did they use for the competing algorithms? If I were reviewing this paper I'd have these and other pointed questions for them. It's as if they did a cursory lit review and then decided to reinvent the entire field. Their one contribution seems to be not allowing you to run it on your own hardware.


To expand on ScoutOrgo,

>Only up to 64 variables? Why such small problems?

This is a tool tuned for optimizing the hyperparameters of machine learning models. In practice, most models have, maybe, 10 hyperparameters in the normal sense (learning rate, nonlinearity, etc.). If you start talking about model structure as a hyperparameter, you can get vastly more than 64, but then these techniques aren't great anyway.

So why such small problems? Because the blackbox function you're optimizing takes hours to run. So as another user mentions, if it takes you 700 function evaluations, you're gonna be running for the better part of a year.

So the domain over which vizier works is one where you probably are never really using more than ~100 evaluations and often even then you would prefer to stop early if you meet certain conditions (because wasting significant compute doing unnecessary optimization is costly).

As a concrete example, a single iteration of a relatively small and non-SOTA algorithm takes 20 minutes and $40 to run. [0]

So stopping a day early saves you a day and $3000. Now scale that up 5 or 10x for larger models. (another way of putting this is that a day and $10K might be worth a .5% increase in accuracy, but isn't worth a .05% increase in accuracy. Stopping rules can encode that).

[0]: http://www.fast.ai/2018/08/10/fastai-diu-imagenet/
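The "a day and $3000" figure follows from the 20-minute, $40 run quoted above. A rough sketch of the arithmetic, assuming runs execute back to back on a single worker (the constant names are made up for illustration):

```python
# Figures quoted in the comment above.
MINUTES_PER_RUN = 20
DOLLARS_PER_RUN = 40

# Assumption: runs execute sequentially with no idle time in between.
MINUTES_PER_DAY = 24 * 60

runs_per_day = MINUTES_PER_DAY // MINUTES_PER_RUN
cost_per_day = runs_per_day * DOLLARS_PER_RUN

print(f"{runs_per_day} runs per day, ${cost_per_day} per day")
```

That works out to 72 runs and $2880 per day, which rounds to the $3000 mentioned.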


It is 64 hyper-parameters, not variables, which is a huge amount.

Stopping rules are also a requirement of tuning. Can't just have a while True:. If an objective function doesn't improve after 100 (or any #), there needs to be logic to stop the trials since this system is apparently serving all of Alphabet.
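To make the "can't just have a while True" point concrete, here is a minimal sketch of a trial loop with both a hard evaluation budget and a simple patience rule. It uses plain random search on a toy objective; it is not Vizier's algorithm or one of the stopping rules from the paper, and `run_trials`, `max_trials`, and `patience` are hypothetical names.

```python
import random


def run_trials(objective, sample, max_trials=1000, patience=100):
    """Maximize `objective` by random sampling, with two ways to stop."""
    best_x, best_y = None, float("-inf")
    stale = 0    # consecutive trials without improvement
    trials = 0
    while trials < max_trials:        # hard evaluation budget
        x = sample()
        y = objective(x)
        trials += 1
        if y > best_y:
            best_x, best_y, stale = x, y, 0
        else:
            stale += 1
            if stale >= patience:     # patience rule: stop after a plateau
                break
    return best_x, best_y, trials


# Toy usage: maximize -(x - 3)^2 over [-10, 10] with a seeded generator.
rng = random.Random(0)
best_x, best_y, trials = run_trials(
    objective=lambda x: -(x - 3.0) ** 2,
    sample=lambda: rng.uniform(-10.0, 10.0),
)
print(best_x, best_y, trials)
```

Setting `patience` equal to `max_trials` turns this into a pure fixed-budget run, so the same loop covers both positions in this subthread.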


Surely that reinforces my point? That what matters isn't the stopping rule but your evaluation budget? If you can afford 100 evaluations, do 100 evaluations. Pick any stopping rule based on the objective function and I can construct you a scenario that either has you spending a long time on the wrong side of it, or stopping prematurely. Might as well have predictable behavior.


I think that in the context that he meant, 64 variables = 64 hyperparameters because it was about benchmark problems solved in black-box settings. In other words, it is not about the common misconception of comparing #hyperparameters and #variables on some ML model, where the latter can be in 10^7.


You're correct -- I'm talking about how many variables control the behavior of the black box.


How long will it take CMAES to solve 10-dimensional Rosenbrock? Here, it takes about 700 function evaluations: https://indiesolver.com/tutorials/rosenbrock_python

Please note that their work is primarily for expensive optimization where you cannot afford millions of function evaluations like you would need on rotated Rastrigin to solve it exactly for n>64.
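For readers who don't have the two benchmark functions in their head, here is a sketch of the standard textbook definitions. Note this is the plain (unrotated) Rastrigin, not the rotated variant mentioned above, and it says nothing about how the paper or the linked tutorial configured them.

```python
import math


def rosenbrock(x):
    """Rosenbrock function: narrow curved valley, global minimum 0 at (1, ..., 1)."""
    return sum(
        100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (1.0 - x[i]) ** 2
        for i in range(len(x) - 1)
    )


def rastrigin(x):
    """Rastrigin function: many local minima, global minimum 0 at (0, ..., 0)."""
    return 10.0 * len(x) + sum(
        xi ** 2 - 10.0 * math.cos(2.0 * math.pi * xi) for xi in x
    )


print(rosenbrock([1.0] * 10))  # value at the global minimum
print(rastrigin([0.0] * 10))   # value at the global minimum
```

Rosenbrock is unimodal in low dimensions but ill-conditioned, while Rastrigin is highly multimodal, which is why averaging results across the two can hide very different optimizer behavior.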


> How long will it take CMAES to solve 10-dimensional Rosenbrock? Here, it takes about 700 function evaluations

It really depends on parameterization though. Did they use off-the-shelf parameters? Did they make any attempt to tune CMAES for their test functions? Did they make any attempt to tune their own optimizer for the test functions? It's really easy to put your thumb on the scale when you're writing a paper about a black-box optimizer, and my default reaction is to assume that any paper claiming to equal or outdo CMAES is pure B.S. This paper did nothing to convince me otherwise.


I had come across this library - [1] - a while ago that claims to be the open source equivalent of Vizier. Does anyone know how well justified the claim is?

[1] https://github.com/tobegit3hub/advisor


Very related open source project from the RISE Lab at UC Berkeley: http://ray.readthedocs.io/en/latest/tune.html


Very cool project! Glad to see support for PBT and HyperBand. Also related to https://www.microsoft.com/en-us/research/publication/hyperdr...


A related service currently in beta: https://indiesolver.com/access


Chrome on Windows 10 gave me this error (and no, Google is not blocked on my ISP):

"The page you are trying to view cannot be shown because the authenticity of the received data could not be verified."


This site is blocked at the company I work for?


Thanks for reporting! The site is hosted on Google Cloud Platform. If some of GCP's IPs are banned in your company, then it might be the reason.


Another open source tool for hyperparameter optimization inspired by Vizier https://github.com/kubeflow/katib


Totally calling the next GCP AI Service: VaaS (Vizier as a Service)


I know a guy who told me 8 years ago about an idea he had for a "black box" optimization consulting service. You give him a pile of data, he uses ML to optimize for whatever.


That's known nowadays as AutoML. Google recently released AutoML products for image and text classification. https://cloud.google.com/automl/


FYI there is also a different "automl" suite of libraries maintained by Frank Hutter's lab at the University of Freiburg - [1]. His group has been leading a lot of research in the area of hyperparameter tuning.

[1] https://www.automl.org/


This is already part of Google's Cloud ML Engine product.


Will the OSS version be called VaaSecotomy?



