Show HN: Instantly search 28M books from OpenLibrary (typesense.org)
424 points by jabo on Dec 14, 2020 | 80 comments



Quick context: I built this from the Open Library (Internet Archive) books dataset (https://openlibrary.org/), as a follow-up to this comment on the GoodReads post earlier today: https://news.ycombinator.com/item?id=25408186

Thank you @mekarpeles for helping me get access to the data quickly and giving me pointers about the schema.

I should add - I built this in about 12 hours as a weekend project, so there might be some lurking issues.

> Details about the Tech Stack:

The dataset has ~28.6 million books and is indexed on Typesense [1], an open source alternative to Algolia/ElasticSearch that a friend and I are working on.

The UI was built using the Typesense adapter for InstantSearch.js [2] and is a static site bundled using ParcelJS.

The app is hosted on S3, with CloudFront for a CDN.

The search backend is powered by a geo-distributed 3-node Typesense cluster running on Typesense Cloud [3], with nodes in Oregon, Frankfurt and Mumbai.

Here's the source code: https://github.com/typesense/showcase-books-search

[1] https://github.com/typesense/typesense

[2] https://github.com/typesense/typesense-instantsearch-adapter

[3] https://cloud.typesense.org


Someone really delivers after saying, "I could do that in a weekend." :)

Well done!


Thank you!


I did a few quick tests.

There are some issues with ranking and duplicates. I typed "Neal Stephenson" and got lots of duplicates of his latest novels; cleaning up the data might be necessary to fix this. The results also seem biased towards recent titles.

The ranking is also a bit tricky. If you type "Orwell" you get top results about George Orwell, rather than the book that made him famous (1984). I suspect this will be an issue for many well-known literary works, which have a lot of books written about them.

Spelling correction kind of works but surfaces a lot of noise as well. E.g. "Neal Stevenson" mostly surfaces titles and authors containing either "Neal" or "Stevenson". Impressively, though, it did find one Neal Stephenson title.

Building good search and ranking is hard. So, not bad for a 12 hour effort.


I’m currently ranking by text match score and publish date. Some of the records had publish dates that Date.parse() couldn’t handle, so I just set those to 1970: https://github.com/typesense/showcase-books-search/blob/2d38...
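For anyone curious, a minimal sketch of that fallback (illustrative only; the actual code is in the repo linked above):

    // Parse a record's publish date into a sortable timestamp; anything
    // Date.parse() can't handle falls back to the Unix epoch (i.e. 1970).
    function publishTimestamp(publishDate) {
      const parsed = Date.parse(publishDate); // NaN when unparseable
      return Number.isNaN(parsed) ? 0 : parsed;
    }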

I couldn’t find a popularity metric in the dataset, which would have solved the issue you pointed out about literary works having books written about them.

I’ve left the typo correction settings a little loose, so I need to reduce their aggressiveness. I’ll fix that.


Also, forgot to mention this earlier:

Please do contribute to OpenLibrary in any way you can. Here are some ways: https://news.ycombinator.com/item?id=25408744


Typesense’s performance is really impressive; it seems on par with the (also) impressive Meilisearch.

I'm going to try the free tier of typesense right now, perfect for my current use-case of site-search.

How large is the books dataset? How large do the nodes need to be?


Thank you!

The OpenLibrary dataset has ~28M records, takes up 6.8GB on disk and 14.3GB in RAM when indexed in Typesense. Each node has 4 vCPUs. It took me ~3 hours to index the 28M records. It could probably have been done in ~1.5 hours; most of the indexing time was due to cross-region latency between the 3 geo-distributed nodes in Oregon, Mumbai and Frankfurt.


That's a lot of text to index, impressive!

Is the indexed size 6.8GB or 14.3GB?


Thank you! The indexed size is 14.3GB.


I'm curious, what do you feel meilisearch offers that is better than elasticsearch or solr?


Pretty good for 12 hours of hacking!


Jabo,

I love to see this: more access for the book community, and especially how quickly it came together. Nice work, Jabo; excited to see Typesense in action.

Open Library gladly & thankfully tweeted its support of your hard work:

https://twitter.com/openlibrary/status/1338531822536310789

With every intention of being additive to Jabo's work, I thought I'd share a few resources folks may not be aware of:

1. Open Library's Full-text Search

Major hat tip to the Internet Archive's Giovanni Damiola for this one: Folks may also appreciate the ability to full-text search across 4M of the Internet Archive's books on Open Library: https://blog.openlibrary.org/2018/07/14/search-full-text-wit.... You can try it directly here: http://openlibrary.org/search/inside?q=thanks%20for%20all%20...

As usual, nearly all Open Library URLs are themselves APIs, e.g.: http://openlibrary.org/search/inside.json?q=thanks%20for%20a...

2. Open Library's search API

By adding .json to our search endpoint, you get Open Library's JSON search API (whose results also contain book cover links). You can also use this to fetch books by ISBN:

http://openlibrary.org/search.json?q=isbn:9781400067824
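A quick illustrative call to that endpoint (the numFound/docs field names are what the API returns at the time of writing; check the docs to be sure):

    // Look up a book by ISBN via the JSON search endpoint shown above.
    const res = await fetch('http://openlibrary.org/search.json?q=isbn:9781400067824');
    const data = await res.json();
    console.log(data.numFound, data.docs?.[0]?.title);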

3. Book Cover Service (has issues)

We need to improve our cover service; it's definitely been neglected. But the cover IDs from our data dump, Solr search index, and book APIs can be plugged into our cover store to point to or fetch book covers. We don't yet have a current bulk data dump of covers available (it's on the roadmap), but if you need "all" the covers, please email us at openlibrary@archive.org instead of hitting covers one by one. https://openlibrary.org/dev/docs/api/covers
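For reference, the URL pattern described in the covers API docs linked above looks roughly like this (the cover ID below is just a placeholder):

    // Build a cover image URL from a cover ID (sizes: S, M, L).
    const coverId = 240727; // placeholder; real IDs come from the dumps/APIs above
    const coverUrl = `https://covers.openlibrary.org/b/id/${coverId}-M.jpg`;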

4. The new (unannounced) Open Library Explorer

I'm honestly kind of hoping no one reads this far into the comments because we haven't announced this one yet and I'm sure there are major performance issues (it's beta), but cdrini recently added Dewey + Library of Congress classifications to our search index... making this masterpiece possible:

https://openlibrary.org/explore

The ability to browse Open Library digitally, with all the glory of a physical library.


Thanks @mekarpeles!


Straight up impossible for me to search for Isaac Asimov's "I, Robot". Some kind of input cleaning strips the "I" from "I robot" and just searches "Robot". "iRobot" doesn't get the results either. And the search doesn't accept commas. Just some of the fun that comes with searching, I suppose.


I’m using a stop word list on the client side at the moment: https://github.com/typesense/showcase-books-search/blob/01f2...
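Roughly what that client-side filtering looks like (a simplified sketch; the actual word list in the repo is longer):

    // Strip punctuation and common stop words before sending the query,
    // which is why "I, Robot" currently ends up as just "Robot".
    const STOP_WORDS = new Set(['a', 'an', 'and', 'i', 'of', 'the']);
    function cleanQuery(query) {
      return query
        .replace(/[^\w\s]/g, ' ')
        .split(/\s+/)
        .filter((w) => w && !STOP_WORDS.has(w.toLowerCase()))
        .join(' ');
    }
    cleanQuery('I, Robot'); // => 'Robot'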

Working on support for exact matches using quotes, which will also help here.


Word bigrams without stopword removal on the title would help in this case as well.
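E.g. something along these lines at index time (a sketch of the idea, not existing code):

    // Generate word bigrams from the raw title (no stop word removal),
    // so a query like "i robot" can match the title as a unit.
    function titleBigrams(title) {
      const words = title.toLowerCase().replace(/[^\w\s]/g, ' ').split(/\s+/).filter(Boolean);
      return words.slice(0, -1).map((w, i) => `${w} ${words[i + 1]}`);
    }
    titleBigrams('I, Robot'); // => ['i robot']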


It also strips quotation marks, so you can't search phrases at all.

I'm always sort of surprised by this stuff. Like, isn't 99% of the effort getting access to the database, figuring out how to search it efficiently, and displaying useful snippets? Why go to all that effort and not take a useful search syntax off the shelf?


OP:

> I built this in about 12 hours as a weekend project, so there might be some lurking issues.

> Yup, exact matching using quotes is on the horizon. Should be available in a few weeks.

> Looks like I’m filtering a little too aggressively, which is affecting the results. I’ll take a closer look at it in a few hours.


Can someone explain why using quotes is a thing that has to be added? Like, it treats words separately by breaking up the search string into tokens using spaces as a delimiter. Is the OP implementing this himself? I would have imagined there's a standard library for converting search strings into a structured query.


Looks like they are using a blanket stop word list - "I", "the", "and", etc. It also looks like diacritic folding has been applied to accented characters.


Cool! Small bit of feedback: typing each character adds an entry to the browser's history, which seems a bit excessive. Might consider using replaceState over pushState.
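Something along these lines (the query parameter name is hypothetical):

    // Update the URL on each keystroke without pushing a new history entry.
    function syncUrlWithQuery(query) {
      window.history.replaceState(null, '', `?q=${encodeURIComponent(query)}`);
    }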


Ah yes! Good point about using replaceState. I’ll update it.


One piece of usability feedback: The small OpenLibrary icon in the results looks a little bit like a trashcan. So after searching I was unsure what to do: I didn't want to go to Amazon, nor did I want to trash a result.

Suggestion: Link the red book title to the OpenLibrary page.


Interesting observation about the small Internet Archive icon.

I just pushed out an update to link the title to the OpenLibrary page.


Looks easier to use now, thank you.


A big problem with the OpenLibrary website is that there is no way to filter search results for books that are actually in the library, which I find quite odd TBH.

This "Instant Search" doesn't improve on that.

Also, some of the Amazon links are broken for me. They look like https://www.amazon.com/s?k=9798654289605


I see, let me check if this is a field that's available in the dataset...


Hey, thanks!


Hi markdown, there is a radio button on the /search page for only showing `ebooks`.

Do you mean this option doesn't work through the search.json API? If so, can you please make sure we have an issue open at https://github.com/internetarchive/openlibrary/issues/new/ch...?

You should be able to add has_fulltext:true to your query (which at least returns the set of books for which some full text exists):

e.g. http://openlibrary.org/search.json?q=has_fulltext:true%20AND...


Oh wow, I'm not sure how I missed that. Thanks for schooling me. That's really helpful.


I saw in the comment thread where you promised to do it, and bam, a few hours later, here it is. Kudos, and thanks!


Thank you! I thought it would take maybe 4-5 hours to build, but it ended up taking about 12 hours!


Bravo!!


I searched for 'Lovecraft' and opened the first 15 results; none of the books were in the library. Then I went to openlibrary.org and searched for 'Lovecraft' again, and among the first 20 results I could only borrow one book, for an hour. What is the point of showing books that aren't in the library, or am I missing something?


Yes, you missed that this was a weekend project that appears to simply be a search of the books in the OpenLibrary database.


I understand that it is a side project; that's why I went to the OpenLibrary website to check the results there. The thing I don't understand is why OpenLibrary shows me books that aren't available to borrow. In this case I was expecting the books in the search results to be available to borrow. As it is, after searching for a book you have to manually hunt for an available one. Why don't they add a checkbox, 'Show only available books', that would automatically filter out unavailable books?


Thanks for clarifying. I am not associated with the OpenLibrary project, so the following answer might not be completely correct. I think the goals of OpenLibrary extend beyond just loaning/borrowing books. I believe the original goal of the project was to catalog all books. You could reach out to the OpenLibrary folks to see if they have a search filter for books that are available to borrow, or if they are planning to add one.


Mek here w/ openlibrary

@rayrag, there are researchers, vendors, libraries, partners like Wikipedia, authors, and hundreds of thousands of people who use Open Library to track what they're reading, to promote books and reading, and to create curated lists.

Also, organizations like Internet Archive use such catalogs to manage what books to acquire (based on interest/demand from the community).

Having a home for book covers, citations, quotes, ISBNs, publishers, years, subjects, and related editions is valuable even if the book (source) is not yet available.

Hope this helps!


When I read the title, I thought it was going to search inside the 28M books, but it's just the title, subject, and author. Still a cool project though.


> I thought it was going to search inside the 28M books

Google Books does this, not sure how many books though.


Sadly it isn’t finding the first two books I looked for that are in the dataset.

https://openlibrary.org/works/OL7116092W/The_church_of_Chris...

And https://openlibrary.org/works/OL2525391W/Holiness?edition=


Looks like I’m filtering out stop words a little too aggressively, which is affecting the results in this case. I’ll take a closer look at it in a few hours.


You might want to clarify that this does not search 28M books; it searches metadata for 28M books. Very different.

I noticed that many titles did not have authors. I spot-checked this one: https://openlibrary.org/books/OL25434821M/Still_Star-Crossed

This title has no author in the search results, but does have one on the linked page. Perhaps an import issue.

Edit: Forgot to say "This is awesome. Particularly for 12 hours work!"


There needs to be a debounce. I'm currently seeing 2-10s+ for simple typing, and the results jitter around relative to what I typed.

Searching from Perth, Australia.


Strange, I didn't notice it from Los Angeles.

Just added a little debounce in any case. Hopefully that helps.
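For anyone curious, the debounce is the usual pattern (a generic sketch, not the exact code in the repo):

    // Delay firing the search until the user pauses typing for `delayMs`.
    function debounce(fn, delayMs = 250) {
      let timer;
      return (...args) => {
        clearTimeout(timer);
        timer = setTimeout(() => fn(...args), delayMs);
      };
    }
    const debouncedSearch = debounce((q) => console.log('searching for', q), 250);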


Same here (EU, the Netherlands). I thought it was broken at first. It took longer than 10 seconds (I think about 20-30s) before I started seeing results. Unusable.


Are you sure that isn't your connection? I am getting 20ms TTFB from Romania.


I'd suggest adding some kind of explanation as to how many records each of your cloud offerings can support (1GB = N records, and so forth).


Like another comment pointed out, record sizes can be wildly different depending on the data you index. So it’s hard to quantify it in terms of number of records.

That said, I’m planning to add some examples and pre-configured configs to that page so it’s easier to get a good picture.


Wouldn't that depend on the record size? A table 2 columns wide vs. 100 nested columns with, e.g., full book chapters.

And I don't think there's any "average" you could easily advertise without misinforming people.


Zero results for ISBN-13 978-1888118049 / ISBN-10 1888118040: Unintended Consequences by John Ross.

https://openlibrary.org/works/OL2964952W/Unintended_conseque...


I had to search by author and then add the general query via the URL. As OP said, they "built this in about 12 hours as a weekend project".

https://books-search.typesense.org/?b[query]=Unintended%20Co...


I would love, if possible, to do an exact multi-word search, e.g. "Bill Gates". Currently if the phrase is not found, you get results for one of the words. Putting the search in quotes does not work (like it does when performing a search on Google).


Yup, exact matching using quotes is on the horizon. Should be available in a few weeks.


I’ve asked about this before; it’s on their roadmap.


Great project. It seems, though, that the books search does not support phrase search and stop words are ignored: "king of england" becomes "king england". Is this by design? Does it affect only the books search or Typesense generally?


I excluded stop words on the client-side for just this site here: https://github.com/typesense/showcase-books-search/blob/01f2...

Typesense does support searching by phrases on the server-side.


Not sure if this is unintended behavior, but after clicking from this thread to the site and doing a few searches (all while remaining on the same page), I wanted to go back to this thread, but it took me 7-8 clicks of the back button to do so.


Not intended behavior. I have it on my list to fix: https://news.ycombinator.com/item?id=25415446


It's super uncomfortable when a site uses OR-based search. Is there a way to perform AND-based search, so that when I add more words the results narrow instead of growing?


The Typesense search backend supports this, I just haven't added a UI element for it.
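One way to get that behavior today is the drop_tokens_threshold search parameter (if I have the name right): setting it to 0 keeps every query word required. A sketch with the Typesense JS client (the collection and query_by fields below are placeholders, not this demo's exact schema):

    import Typesense from 'typesense';

    const client = new Typesense.Client({
      nodes: [{ host: 'localhost', port: 8108, protocol: 'http' }],
      apiKey: 'xyz',
    });

    // With drop_tokens_threshold set to 0, Typesense won't drop query
    // words to widen the result set, which gives AND-style matching.
    const results = await client.collections('books').documents().search({
      q: 'george orwell essays',
      query_by: 'title,authors',
      drop_tokens_threshold: 0,
    });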


Can't fault the performance! Nice work. My one UX suggestion would be to present search results in a more table-like layout, as reading both across and down to see every result is not ideal.


That’s a good point about reading directions, especially in a list that’s ranked. Thank you for the feedback.


Seems like this exposes a flaw in the dataset. E.g., searching for "cixin liu", many different variations of the same book show up; lots and lots of duplication.


I iterated through the "editions" dataset to build the index; it contains one record for each edition of a book, which is why you see one result per edition.

There's another dataset called "works" which has one record for all editions of a book, but it didn't have all the data points I wanted. So I decided to keep it simple for a weekend project and use the editions dataset.


I think that's a flaw [if you want to call it that] in the presentation of the search results, not the underlying dataset.

If you click through to the OpenLibrary dataset, you can see that the data model understands that the variations are editions of the same underlying text.


@jabo is it possible to connect Typesense (or any other instant search provider) to a PostgreSQL instance and have it “just work”?


The challenge is that there are several ways of doing this depending on your larger architecture. So the definition of "just work" depends on the context of your particular architecture.

For example: you could hook into your ORM and send change events to Typesense, like another commenter pointed out; if you have an event bus, you could hook into those messages and push to Typesense from a listener; you could use something like Maxwell to get a JSON stream from the DB binlogs and write the data you need into Typesense; or you could write a batch script that runs a set of SQL queries on a schedule and does a batch export + import into Typesense (sketch at the end of this comment).

Also, typically you'd want to curate the fields that you put in a search index, vs what you store in your transactional database, so there is some level of data transformation that's needed anyway.

So long story short, it depends on your architecture!
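As a rough sketch of the batch export + import option (the table, collection and field names are made up for illustration):

    // Periodically pull rows from Postgres and bulk-upsert them into Typesense.
    import pg from 'pg';
    import Typesense from 'typesense';

    const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL });
    const typesense = new Typesense.Client({
      nodes: [{ host: 'localhost', port: 8108, protocol: 'http' }],
      apiKey: 'xyz',
    });

    async function syncProducts() {
      // Select only the fields you actually want searchable.
      const { rows } = await pool.query(
        'SELECT id::text, name, description FROM products'
      );
      await typesense
        .collections('products')
        .documents()
        .import(rows, { action: 'upsert' });
    }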


For our setup with Typesense we created a mapping, did an initial bulk import and then set up ORM lifecycle events to sync insert/delete/updates from the database to Typesense. We are using Symfony with Doctrine as the ORM.


Keeping an external search system in sync with the primary DB is certainly a pain point. The biggest problem is that the records in the DB will likely be normalized, while the records in a search store will not be.

You can "poll" for new records for indexing by using a query, but handling deletes/updates is the real deal breaker. You will then have to use both the binlog (atleast for mysql) and querying, at which point it becomes quite complicated to reason about.


Is such a thing possible? Yes.

Has anybody built it? No :(.

That would totally be the dream though. No reindexing. pg_search is definitely fast enough for a large swath of use cases.

What I do: I’ve built a really barebones search controller that connects to Postgres. My controller only lets you search on one attribute, and I use it for autocompletes. For anything more involved, I use Typesense.


As I posted in another comment [1], the biggest problem is handling updates and deletes. It requires some kind of logical deletion plus a `last_updated_at` timestamp to fetch updates from the primary table.

[1] https://news.ycombinator.com/item?id=25417856


I don't quite follow. For a system reliant entirely on Postgres, you can just fetch the records, and join them in real time, no? It would be much slower than elasticsearch, but still _good enough_ for a lot of use cases.


I was referring to an integration between Postgres and an external search system such that the data sync is automated. That was what the parent comment was asking about (I think).


Ahh, I understand, this is my mistake.

What I really want is an integration between InstantSearch.js and Postgres.

Typesense has written a custom adapter between InstantSearch and Typesense. I would love to just hook InstantSearch up directly to my Rails app. You could use pg_search to get 90% of the value, with next to no complexity on the backend.


Never tried it out (it's on my never-getting-done list), but someone has built it: https://github.com/zombodb/zombodb


I think relevance could be improved, e.g.: https://books-search.typesense.org/?b%5Bquery%5D=linus%20tor...


Yeah, I was trying to work with the data I had:

https://news.ycombinator.com/item?id=25418661


In any language?



