Some quick context: I was inspired to build this by this HN post earlier today [1]. So thank you glorf for making the recipe dataset available.
Thought this would take me 1-2 hours to build, ended up taking about 6 hours - engineering estimates and all!
> Details about the Tech Stack:
The dataset has 2,231,142 recipes and is indexed on Typesense [2], an open source alternative to Algolia/ElasticSearch that a friend and I are working on.
The UI was built using the Typesense adapter for InstantSearch.js [3] and is a static site bundled using ParcelJS.
The app is hosted on S3, with CloudFront for a CDN.
The search backend is powered by a geo-distributed 3-node Typesense cluster running on Typesense Cloud [4], with nodes in Oregon, Frankfurt and Mumbai.
Jason, sorry to ask here rather than read your GitHub docs, but how does Typesense fare against non-romance languages that can't be segmented by whitespace?
I'm guessing you meant to say logographic languages. We don't yet support tokenization for logographic languages (like Chinese, Japanese, etc) but it's on our medium-term radar: https://github.com/typesense/typesense/issues/86
Basically...but I didn't know to use that term! So thanks for teaching me. Then there's also languages like Thai that are not whitespace separated on the word level, but that use an alphabet. So...I meant more 'non-Latin' but I think that's not actually a tight category. It's actually quite difficult to come up with the right term. I guess I was trying to be too clever, the best term is probably "non-whitespace delimited languages". Thanks for your response, and awesome speed to index the dataset and have it up and running in the same day.
Could I ask you a few more questions? What was the dataset size? What was the size of your index? How long (and how much RAM) did it take to index the dataset and what machine (and how many cores) did you do it on?
> Then there's also languages like Thai that are not whitespace separated on the word level, but that use an alphabet.
I did not know that! Good to know.
> What was the dataset size?
2.2GB in size, with ~2.2M records
> What was the size of your index?
2.7GB
> How long (and how much RAM) did it take to index the dataset
It took about 8 minutes to index that data. Typesense stores the entire index in memory, so the index took 2.7GB in RAM
> What machine (and how many cores) did you do it on?
It's running on a 3-node cluster, with each node having 4 vCPUs and 8GB of RAM. The nodes are distributed across data centers, so search requests are served by the closest node (like a CDN).
Yeah I meant some concept like "non-Latin derived" or "non-Roman alphabet" languages but then there's Cyrillic, etc. I was pretty sure "non-Romance" sounded like that right term, but not totally sure. I looked it up after and yeah, it wasn't. Actually I have no deep idea of the terms in this and just grabbed the first term that came to me. I thought I did pretty well and I appreciate the learning experience!
You have a UI bug: dismissing the modal recipe popup isn’t entirely reliable, and the site can get stuck in a state that doesn’t allow user interaction. This even survives the back button.
Thanks I was just looking for a cheap alternative for a search engine just today (Algolia is cool but very expensive if you need to index millions of records) - I will check out Typesense.
Typesense is looking great! I was going to use it on a side project. Not sure why I didn’t, but it must have been missing something I needed. I've been using Meilisearch, but I’ll definitely be checking out Typesense again.
Typesense would indeed be good for a general web page search system, just like Algolia. In fact, even Algolia stores web page data as structured JSON entities.
Did you use CloudFormation to create the infra? If not, I'd love to hear some details on how you did this. Any API Gateway being used? Seems to be offline at the moment.
We run similarly on CloudFormation. Interesting to see the ‘aws s3 cp’ command. I've started using ‘aws s3 sync --delete’ nowadays, after having issues with the pre-cleanup required for ‘cp’.
In my case since the assets have fingerprinted filenames and index.html references them, if I delete old files with each deployment, then a user who has a cached version of index.html will see a broken page. So I just leave old asset files as is.
This is awesome. If nothing else, the ability to view a recipe without having to scroll through a 10 page story about the first time the author saw an avocado makes it worth it.
Yeah true. I hate how simple recipe sites have been bloated with a) trackers, b) ads and c) long text to improve their SEO. Finding "just the recipe" has great value for me nowadays.
Have you considered the ability to filter out recipes that don't contain a particular ingredient?
Obvious use cases are food allergies and dislikes in general.
My specific use case is that I searched `guacamole` and got 2k that contained `salt` but only 1.8k that contained `avocados`. I want to see the recipes that don't use avocados.
EDIT: It appears that `avocados` and `avocado` are separate ingredients, as are `tomatoes` and `tomato`. I know pluralization rules are hard, particularly in English, but any chance of a cleanup pass for the... low hanging fruit?
As an avocado fan I have trouble grappling with the concept of guac made without them. Isn't that sorta like orange juice made from things that are not oranges? Feels almost oxymoronic. Good suggestion though.
When I say "I want to see the recipes without avocados" it's kind of like when you witness a car crash. You're not looking because you want to see someone crash, you're looking because it's horrifying. Same with avocado-less guacamole. The concept is horrifying so I need to know.
And yes, I would absolutely love to see the recipes for orange juice that don't include any oranges for the same reason.
Either way though, I think it's just because `avocado` and `avocados` are counted as different ingredients.
Not just allergies. I was just looking for biscotti recipes and there's basically two styles of biscotti: made with butter (the soft chewy kind) and made without (the crispy you'd better have a drink to dip it in kind). It's impossible to search for the latter without excluding butter.
Back when I worked on eHow, I used to find the most ridiculous pages to test with. There was one that was something like “how to make ice water”. Apparently enough people searched for it for it to be worth writing up.
Another good recipe I see butchered is quesadillas. Something with Mexican food and people messing up.
@Jabo: Great site. Some errors though, for example the "4 Ingredient Sauce for Roasted Lamb" says to use 12 cups of brown sugar. The source site has 1/2 cup (0.5), so I'm guessing it's a scraping issue. Wouldn't want to give someone diabetes with their lamb!
Ingredients
2 large ripe avocados
1 lime juice
2 garlic cloves, Minced
13 cup choppled scallion
13 cup choped red bell pepper
14 12 ounces diced tomatoes with jalapenos
14 cup snipped fresh cilantro
2 tablespoons Braggs liquid aminos
Also I wanted to mention that the most important piece of metadata we’ve learnt that users care about, after ingredients, is the website it’s sourced from. It’d be great to have that.
I do include the source website in the search results. It's the little icon on the bottom right of the search result card and it's also linked from the modal that opens up if you click on "Read cooking directions".
I see, that’s cool but I was referencing something different — That it was important for users to quickly recognize the publisher on the search result page itself. They develop more trust with some sources than others.
I wasn’t able to discern that. But if your goal is to showcase search capabilities to developers instead of a daily driver search engine it wouldn’t matter much.
As an SI unit user who's been looking up more recipes during lockdown, the whole cups, spoons, etc, units of measurements are really annoying. Some recipe sites have on the fly conversion, it would be nice to have here as well, but my first look didn't inspire much confidence that this site will go anywhere (Example search: pizza, top result: fruit pizza. Ohkayyy...)
I unfortunately did not find a good field to sort by in the original dataset from earlier today [1]. So it's just sorted by text match relevance scores, and then the order that they appear in the dataset.
I'm hoping they publish a popularity metric, which will fix the issues like the one you pointed out. Or, once I have sufficient data on popular searches from this site, I can append that metric to the dataset. Early stages, so please pardon the dust in the meantime!
re: SI units, I hear you. There's definitely scope for improvement! :)
Suggestion: Add open search support, so browsers prompt me to add it as a search backend. I added it to Firefox with the "eat" keyword so typing "eat butter chicken" gets me straight to the results page.
How do I tell it I really want to search for just "harissa" and not "Harrison", "Harriet", "Harris", "harrissee", etc.? I tried putting it in quotes, but no such luck.
I've been looking for an Algolia-style alternative (in terms of result quality/fuzziness) and this is great. Definitely going to use this on a project soon.
Wait a minute... What about copyright? Like I would love to have a blog where I can just copy and paste my favorite recipes, and add a few notes myself. But I don't do that because it seems like plagiarism.
Or another option is to use this site, and then use some kind of 1-5 star rating. And then just see my favorites without all the other bs that food sites show you.
IANAL, but as I understand it the ingredient list is not copyrightable, though the description may contain sufficient creative content to qualify. There’s even an FAQ covering it. https://www.copyright.gov/help/faq/faq-protect.html
> However, where a recipe or formula is accompanied by substantial literary expression in the form of an explanation or directions, or when there is a collection of recipes as in a cookbook, there may be a basis for copyright protection.
Emphasis mine. A website full of recipes certainly seems to be a collection in the sense protecting a cookbook.
Seriously. I did a similar massive recipe scraping project a few years ago but never would have redistributed it because it’s available for me to use, but not available for me to redistribute.
For a site that gets so opinionated about GPLv3 vs LGPL vs the rest, we really seem to have no qualms about licenses when it comes to actually using other people’s things.
The authors can file a DMCA. If they don't, there's no issue. If they do, the law will sort it out.
Though it sounds like a flippant response, I've spent a lot of time trying to decide how to feel about this. (I released 194k plaintext books as the books3 dataset.)
Copyrights apply to the literal text of a recipe, but not to a recipe itself. Recipes are not copyrightable last time I checked, just the literal text.
Yeah, the search could use some improvements. The first result for "pizza" is "Fruit Cookie Pizza", which I think almost no one searching for pizza will want.
I unfortunately did not find a good field to sort by in the original dataset from earlier today [1]. So it's just sorted by text match relevance scores, and then the order that they appear in the dataset.
I'm hoping they publish a popularity metric, which will fix the issues like the one you pointed out. Or, once I have sufficient data on popular searches from this site, I can append that metric to the dataset.
@Jabo although the search is really great, you instantly know the quality of the recipes is "aggregated or marketing" junk when one recipe contains quantities in millilitres, mystical cups and an oven temp in an unknown centigrade scale.
So here's an idea for you: automatically rate those things by investigating the unknowns, and help convert them to multiple temperature scales and to single or multiple comparable metrics. Then you've achieved the ultimate.
You are also missing direct links to the search results and to a single search result, and it's a hell of a task to find the link that opens the little modal with the recipe information (a click on the square should be enough).
Cups, spoons, etc, are convertible to millilitres.
Most recipes call for a preheated oven at 180ºC unless stated otherwise.
Cooking does not observe reproducible builds. You always, always need to taste, poke, look. If your flour is of a different type of grain or not as fine, if you use different varieties of vegetables, or if your kitchen is a few degrees warmer or cooler, you WILL get different results.
So go ahead and use any mystical cup you want. Ingredient proportions are what matters. If you fail, write down what went wrong so you know better next time.
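For anyone who wants the conversion spelled out, here is a minimal sketch. It assumes US customary cups, tablespoons and teaspoons (a metric cup is 250 ml, so results differ by region), and the function and table names are made up for illustration; nothing here comes from the site's code.

```python
# US customary volumes in millilitres, rounded to two decimals.
# Assumption: the recipe uses US units; a metric cup would be 250 ml.
ML_PER_UNIT = {"cup": 236.59, "tablespoon": 14.79, "teaspoon": 4.93}

# Hypothetical alias table for common spellings and abbreviations.
ALIASES = {
    "cups": "cup",
    "tbsp": "tablespoon",
    "tablespoons": "tablespoon",
    "tsp": "teaspoon",
    "teaspoons": "teaspoon",
}


def to_millilitres(amount: float, unit: str) -> float:
    """Convert a volume in cups/tablespoons/teaspoons to millilitres."""
    key = unit.strip().lower()
    key = ALIASES.get(key, key)
    if key not in ML_PER_UNIT:
        raise ValueError(f"unknown unit: {unit!r}")
    return round(amount * ML_PER_UNIT[key], 1)


def fahrenheit_to_celsius(degrees_f: float) -> int:
    """Convert an oven temperature to the nearest whole degree Celsius."""
    return round((degrees_f - 32) * 5 / 9)


if __name__ == "__main__":
    print(to_millilitres(0.5, "cup"))
    print(to_millilitres(2, "tbsp"))
    print(fahrenheit_to_celsius(425))
```

Volume only, of course: converting cups of flour to grams needs a density per ingredient, which is a much messier table.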
I've been cooking and programming for a long time now and wholeheartedly agree with everything you said except the default oven temp thing. Baking is a cargo cult of chemistry and I'd say most folks are well advised to follow those recipes exactly.
>> quantities in millilitres, mystical cups and a oven temp in unknown centigrade scale
As somebody who likes to cook and has not the slightest idea of what Fahrenheit, Ozs, Yards, Feet, Gallons are, nor anything about how Freedom Unit Related & Co. converts to simple decimal metric measures, you are speaking Klingon.
However, it's generally frowned upon to show the actual directions for a recipe. Which is why you see most recipe aggregators only show ingredients and link directly to the source to get the actual directions
My team used typesense for a recent project. 16 million records combined with https://www.algolia.com/doc/guides/building-search-ui/what-i.... Really fast and worth the investment. It doesn't have all the enterprise features of ES or Solr but for basic search features it's great.
Very fast and interesting, but I couldn't figure out how to search for ingredients like lime leaf. If I search for "salmon lime leaf" in the title section, none of the top hits have lime leaf in them. If I search in the ingredients section, I get variations on limes (juice, zest, etc.) but no lime leaf. Curiously, if I just search for "lime leaf" in the title (without salmon) I get things that have lime leaf in them.
The top bar only searches within recipe titles. So if a recipe title has those keywords, it will show up in the results. The best way to search for ingredients would be in the sidebar.
In this particular example, the issue is that the source dataset [1] has "lime leaves" in plural, so if you click on "show more" in the sidebar after searching for "lime", you should see it in the list. I'm going to work on normalizing singular / plural ingredients as much as I can as part of this: https://news.ycombinator.com/item?id=25368628
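A rough sketch of what such a singular/plural normalization pass could look like, purely as an illustration. The irregular-forms table and function names are hypothetical, and this is a naive heuristic rather than the cleanup actually planned: it only touches the last word, and it will mangle words like "molasses" or "berries" unless they are added to the table.

```python
# Hypothetical table of plurals that a plain "strip the s" rule gets wrong.
# A real list would be built by inspecting the dataset's ingredient values.
IRREGULAR = {
    "leaves": "leaf",
    "tomatoes": "tomato",
    "potatoes": "potato",
}


def singularize(word: str) -> str:
    """Naively map a plural English noun to a singular form."""
    lower = word.lower()
    if lower in IRREGULAR:
        return IRREGULAR[lower]
    # Leave words like "swiss" or "hummus" alone.
    if lower.endswith("s") and not lower.endswith(("ss", "us")):
        return lower[:-1]
    return lower


def normalize_ingredient(name: str) -> str:
    """Lowercase an ingredient name and singularize its last word."""
    words = [w.lower() for w in name.split()]
    if not words:
        return ""
    words[-1] = singularize(words[-1])
    return " ".join(words)


if __name__ == "__main__":
    for raw in ["avocados", "Avocado", "tomatoes", "lime leaves"]:
        print(raw, "->", normalize_ingredient(raw))
```

The point is only that both spellings end up under the same facet value, so `avocado` and `avocados` would be counted together.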
One issue I noticed: I looked up Chana Masala and the first recipe (Vegan Chana Masala) calls for “12 tsp salt” but the source calls for “1/2 tsp salt” :)
This is unfortunately an issue with the source dataset that I got from here[1].
I've now added a prominent warning to the UI, to check with the original site if the ingredient measurements seem off. Don't want any ruined dinners on my hands!
re: UX, I was trying to prevent accidental clicks to an external (ad-ridden) site from the search result cards. The little icon on the bottom right takes you to the source though.
Really love this, but I do see that there are some formatting issues on some recipes. I found that when I was looking into recipe-scraping solutions there were only certain compatible sites since they were formatted so differently, and often they would have to make updates for sites that changed their formatting.
Impressive all the same! Keen to dive into the source and learn a bit.
One thing that can be improved is the way the history is populated. Right now, every time a search is performed, a new entry is added to history. I was thinking how to spell Cauliflower as I was typing, now I have to press the back button for each character I typed. It will make more sense to only add a history entry when the input is blurred - onblur()
It's impressive how fast it is, even if this mostly says something about the state of the web today.
2M recipes is a huge database, and without any indication of the quality of a recipe this makes it really hard to tell which ones are worth trying. I hope you can add some sort of rating system in the future.
I'd make the title of the recipe the link to the summary, rather than the "Read Cooking Directions" text, which is sort of confusingly worded. Do you "cook" salad?
Also, as others have said, you're going to get sued if the originator of the recipe isn't linked prominently.
Hmm, I'd think so too. Looks like the source dataset [1] has it wrong :(
This is how it shows up (CSV format):
2230984,Buffalo Chicken Pizza!,"[""1 (9 3/4 ounce) canswanson premium white chunk chicken breast in water, drained"", ""2 tablespoons butter, melted"", ""1 (10 ounce) packageprepared thin pizza crust (12-inch)"", ""12 of a green pepper, thinly sliced"", ""14 cup crumbled blue cheese""]","[""Heat the oven to 425F Stir the chicken, hot sauce and butter in a medium bowl."", ""Spread the chicken mixture on the pizza crust to within 1/2-inch of the edge."", ""Top with the pepper and cheese."", ""Bake for 10 minutes or until the chicken mixture is hot and bubbling.""]",www.food.com/recipe/buffalo-chicken-pizza-394731,Recipes1M,"[""chicken"", ""butter"", ""crust"", ""green pepper"", ""blue cheese""]"
I see in the original post's GitHub link that they have a scrubbed list as well [1]. I am not sure when that was added, but it explains the 12 vs. 1/2 thing exactly.
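If the slash really was stripped during scrubbing, the damaged quantities can at least be flagged. Below is a minimal heuristic sketch, assuming the only damage is a dropped "/" in common fractions; the table and function name are invented for illustration. It cannot tell a collapsed "1/2" from a genuine "12" (twelve eggs would be flagged too), so it is only good for surfacing rows to check against the source site.

```python
import re
from typing import Optional

# Common recipe fractions as they look once the slash has been dropped.
COLLAPSED = {
    "12": "1/2",
    "13": "1/3",
    "14": "1/4",
    "18": "1/8",
    "23": "2/3",
    "34": "3/4",
}

# One leading integer, optionally followed by a second integer
# (e.g. "14 12 ounces" for what was probably "14 1/2 ounces").
LEADING_NUMBERS = re.compile(r"^\s*(\d+)(?:\s+(\d+))?\b")


def suspect_quantity(ingredient: str) -> Optional[str]:
    """Return a guessed original quantity, or None if nothing looks collapsed."""
    match = LEADING_NUMBERS.match(ingredient)
    if not match:
        return None
    first, second = match.group(1), match.group(2)
    if second is not None:
        # Whole number followed by a collapsed fraction.
        if second in COLLAPSED:
            return f"{first} {COLLAPSED[second]}"
        return None
    return COLLAPSED.get(first)


if __name__ == "__main__":
    rows = [
        "12 of a green pepper, thinly sliced",
        "14 cup crumbled blue cheese",
        "14 12 ounces diced tomatoes with jalapenos",
        "2 tablespoons butter, melted",
    ]
    for row in rows:
        print(repr(row), "->", suspect_quantity(row))
```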
Love the concept but please could you show the domain name of the source websites? I use that as a filter for quality and it's time-consuming hovering over the source icon to see the linked URL in the status bar for each recipe.
This is great! I thought that I could filter on ingredients with an intersection of those ingredients and instead got the union. I would love if this could toggle between those and possibly the exclusion as well.
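To make the three behaviours concrete, here is a small sketch of union, intersection and exclusion over ingredient sets. It is plain Python over an in-memory list, not how the site or Typesense implements filtering, and the record shape (`title`, `ingredients`) and function name are assumptions for illustration.

```python
def filter_recipes(recipes, include=(), exclude=(), mode="all"):
    """Filter recipes by ingredients.

    mode="all" keeps recipes containing every included ingredient
    (intersection); mode="any" keeps recipes containing at least one
    (union). Recipes containing any excluded ingredient are dropped.
    """
    if mode not in ("all", "any"):
        raise ValueError("mode must be 'all' or 'any'")
    wanted, banned = set(include), set(exclude)
    results = []
    for recipe in recipes:
        ingredients = set(recipe["ingredients"])
        if banned & ingredients:
            continue
        if wanted:
            if mode == "all" and not wanted <= ingredients:
                continue
            if mode == "any" and not wanted & ingredients:
                continue
        results.append(recipe)
    return results


# Hypothetical sample records.
RECIPES = [
    {"title": "Guacamole", "ingredients": ["avocado", "lime", "salt"]},
    {"title": "Mock Guacamole", "ingredients": ["peas", "lime", "salt"]},
    {"title": "Butter Biscotti", "ingredients": ["flour", "butter", "sugar"]},
    {"title": "Crisp Biscotti", "ingredients": ["flour", "egg", "sugar"]},
]

if __name__ == "__main__":
    both = filter_recipes(RECIPES, include=["lime", "avocado"], mode="all")
    either = filter_recipes(RECIPES, include=["lime", "avocado"], mode="any")
    no_butter = filter_recipes(RECIPES, include=["flour"], exclude=["butter"])
    print([r["title"] for r in both])
    print([r["title"] for r in either])
    print([r["title"] for r in no_butter])
```

The exclusion case is the one requested elsewhere in the thread for allergies and the butter-free biscotti search.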
I unfortunately did not find a good field to sort by in the original dataset. So it's just sorted by text match relevance scores, then the order that they appear in the dataset.
If you click on the little icon on the bottom right of each search result card, that should take you to the source website the recipe is from.
Ah, it does, but I'd like to second the suggestion to put the domain name at the bottom. All the cards I looked at looked like they had room.
First, it makes a big difference to me if it's from a site I know and trust. Second, I think that clear attribution is a good thing, even if you may be legally in the clear for copyright.
Chrome wouldn't load it without a valid SSL cert.
I'm very interested in seeing this though. Direct message me if you want a bit of free help with the infra.
in the meantime, is the recipe dataset open sourced and available somewhere, for other builders?
Could you show me a screenshot of what cert gets loaded for you? I did switch between infra providers in the last few hours. So I wonder if you're hitting the older infra due to stale DNS.
But TBH, logs are a unique beast in that searches are usually temporal and only a tiny portion of the dataset is typically queried. So it will be wasteful to store the entire index in memory 24x7, which is what Typesense (and Algolia) do. ElasticSearch on the other hand has mastered searching log datasets by storing the primary index on disk, so I'd recommend using ES for log data, instead of Algolia / Typesense. The tradeoff with ES is performance, since the ES index needs to be fetched from disk.
For any other structured dataset (like the dataset in this app), Typesense would be a good fit.
Could you show me a screenshot of what cert gets loaded for you? I did switch between infra providers earlier today. So I wonder if you're hitting the older infra due to stale DNS.
Here's the source code: https://github.com/typesense/showcase-recipe-search
[1] https://news.ycombinator.com/item?id=25356156
[2] https://github.com/typesense/typesense
[3] https://github.com/typesense/typesense-instantsearch-adapter
[4] https://cloud.typesense.org