Thank you @mekarpeles for helping me get access to the data quickly and giving me pointers about the schema.
I should add - I built this in about 12 hours as a weekend project, so there might be some lurking issues.
> Details about the Tech Stack:
The dataset has ~28.6 million books and is indexed on Typesense [1], an open source alternative to Algolia/ElasticSearch that a friend and I are working on.
The UI was built using the Typesense adapter for InstantSearch.js [2] and is a static site bundled using ParcelJS.
The app is hosted on S3, with CloudFront for a CDN.
The search backend is powered by a geo-distributed 3-node Typesense cluster running on Typesense Cloud [3], with nodes in Oregon, Frankfurt and Mumbai.
I do see some issues with ranking and duplicates. I typed "Neal Stephenson" and got lots of duplicates of his latest novels; cleaning up the data might be necessary to fix this. The ranking also seems biased towards recent titles.
The ranking is also a bit tricky. If you type "Orwell" you get top results about George Orwell, rather than the book that made him famous (1984). I suspect this will be an issue with many famous literary works that have a lot of books written about them.
Spelling correction is kind of working but surfaces a lot of noise as well. E.g. Neal Stevenson just surfaces a lot of titles and authors with either Neal or Stevenson in them. But impressively it did find one Neal Stephenson title.
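For what it's worth, this fuzzy behavior makes sense in edit-distance terms: "Stevenson" is exactly two single-character edits away from "Stephenson" (substitute v→p, insert h), so a typo tolerance of two edits catches it, along with plenty of one- and two-edit noise. Typesense's actual matching is more involved, but the classic Levenshtein distance illustrates the idea:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,              # deletion
                cur[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb)  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]
```

With a two-typo tolerance, `levenshtein("stevenson", "stephenson") == 2` is exactly on the boundary, which is why the match appears alongside the noise.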
Building good search and ranking is hard. So, not bad for a 12 hour effort.
I didn’t seem to find a popularity metric in the dataset, which would have solved the issue you pointed out about literary works having books written about them.
I’ve left the typo correction settings a little loose, so I need to reduce their aggressiveness. I’ll fix that.
The OpenLibrary dataset has ~28M records, takes up 6.8GB on disk and 14.3GB in RAM when indexed in Typesense. Each node has 4vCPUs. Took me ~3 hours to index these 28M records. Could have probably been done in ~1.5hrs - most of the indexing time was because of cross-region latency between the 3 geo-distributed nodes in Oregon, Mumbai and Frankfurt.
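For anyone curious what the indexing pipeline roughly looks like: the editions dump is a stream of JSON records that get flattened into search documents and batched for bulk import. A hypothetical sketch (the field names below are illustrative, not the app's actual schema):

```python
import json

def to_document(edition):
    """Map an OpenLibrary edition record to a flat search document.
    Field names are illustrative; the real schema may differ."""
    return {
        "id": edition["key"].split("/")[-1],
        "title": edition.get("title", ""),
        "authors": edition.get("authors", []),
        "publish_year": edition.get("publish_year", 0),
    }

def jsonl_batches(editions, batch_size=10000):
    """Yield newline-delimited JSON batches suitable for a bulk
    import endpoint, so 28M records go over in chunked requests."""
    batch = []
    for e in editions:
        batch.append(json.dumps(to_document(e)))
        if len(batch) == batch_size:
            yield "\n".join(batch)
            batch = []
    if batch:
        yield "\n".join(batch)
```

Each yielded string can be POSTed as one import request; smaller batches trade throughput for lower per-request latency, which matters more across regions.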
I love to see this, more access for the book community -- especially how quickly it came together. Nice work Jabo and excited to see Typesense in action.
Open Library gladly & thankfully tweeted its support of your hard work:
By adding .json to our search endpoint, you can use Open Library's JSON search endpoint (which also contains book cover links). You can also use this to fetch books by ISBN:
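For reference, here is one way to build those endpoint URLs; the paths below match the public docs as I understand them, but double-check against https://openlibrary.org/dev/docs before relying on them:

```python
from urllib.parse import urlencode

BASE = "https://openlibrary.org"

def search_url(query):
    # JSON search endpoint: /search.json?q=...
    return f"{BASE}/search.json?" + urlencode({"q": query})

def isbn_url(isbn):
    # fetch a single edition record by ISBN
    return f"{BASE}/isbn/{isbn}.json"

def cover_url(cover_id, size="M"):
    # cover image by numeric cover ID; size is S, M, or L
    return f"https://covers.openlibrary.org/b/id/{cover_id}-{size}.jpg"
```

The cover IDs returned in search results plug straight into `cover_url`, though as noted above, bulk cover needs should go through email rather than one-by-one fetches.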
We need to improve our cover service, it's definitely been neglected, but the cover IDs from our data dump, solr search index, and book APIs can be plugged in to our cover store to point to or fetch book covers. We don't yet have a current data bulk dump of covers available (it's on the roadmap) but if you need "all" the covers, please email us openlibrary@archive.org instead of hitting covers 1-by-1. https://openlibrary.org/dev/docs/api/covers
4. The new (unannounced) Open Library Explorer
I'm honestly kind of hoping no one reads this far into the comments because we haven't announced this one yet and I'm sure there are major performance issues (it's beta), but cdrini recently added Dewey + Library of Congress classifications to our search index, making this masterpiece possible:
Straight up impossible for me to search for Isaac Asimov's "I, Robot". Some kind of input cleaning strips the I from "I robot" and just searches for "Robot". "iRobot" doesn't get results either, and the search does not accept commas. Just some of the fun that comes with searching, I suppose.
It also strips quotation marks, so can't search phrases at all.
I'm always sort of surprised by this stuff. Like, isn't 99% of the effort to get access to the database, how to search it efficiently, how to display useful snippets? Why go to all that effort and not take a useful search syntax off the shelf?
Can someone explain why using quotes is a thing that has to be added? Like, it treats words separately by breaking up the search string into tokens using spaces as a delimiter. Is the OP implementing this himself? I would have imagined there's a standard library for converting search strings into a structured query.
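There are indeed standard approaches; the usual first step is a tokenizer that treats a double-quoted span as a single phrase token instead of splitting purely on whitespace. A minimal sketch:

```python
import re

def parse_query(q):
    """Split a raw query into tokens, keeping double-quoted
    phrases (e.g. "I, Robot") intact as single tokens."""
    tokens = []
    # alternation: a quoted phrase, or a run of non-whitespace
    for phrase, word in re.findall(r'"([^"]+)"|(\S+)', q):
        tokens.append(phrase if phrase else word)
    return tokens
```

The phrase tokens can then be matched as exact substrings while the remaining words go through the normal fuzzy pipeline; that split is where engines differ, but the parsing itself is this small.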
Cool! Small bit of feedback: typing each character adds an entry to the browser's history which seems a bit excessive. Might consider using replaceState over pushState.
One piece of usability feedback: the small OpenLibrary icon in the results looks a little bit like a trashcan. So after searching I was unsure what to do: I didn't want to go to Amazon, nor did I want to trash a result.
Suggestion: Link the red book title to the OpenLibrary page.
A big problem with the OpenLibrary website is that there is no way to filter search results for books that are actually in the library, which I find quite odd TBH.
I searched for 'Lovecraft' and opened the first 15 results: not one book was in the library. Then I went to openlibrary.org, searched for 'Lovecraft' again, and among the first 20 results I could only borrow one book, for an hour. What is the point of showing books that aren't in the library, or am I missing something?
I understand that it is a side project; that's why I went to the Openlibrary website to check the results there. The thing I don't understand is why Openlibrary shows me books that aren't available to borrow. I was expecting that books in the search results would be available to borrow. As it is, after searching for a book you have to manually hunt for an available one. Why don't they add a 'Show only available books' checkbox that automatically filters out unavailable books?
Thanks for clarifying. I am not associated with the OpenLibrary project, so the following answer might not be completely correct. I think the goals of OpenLibrary extend beyond just loaning/borrowing books. I believe the original goal of the project was to catalog all books. You could reach out to the OpenLibrary folks to see if they have a search filter for only books available for borrow or if they are planning on making this available.
@rayrag, there are researchers, vendors, libraries, partners like Wikipedia, authors, and hundreds of thousands of people who use Open Library to track what they're reading, to promote books and reading, and to create curated lists.
Also, organizations like Internet Archive use such catalogs to manage what books to acquire (based on interest/demand from the community).
Having a home for book covers, citations, quotes, ISBNs, publishers, years, subjects, and related editions is valuable even if the book (source) is not yet available.
When I read the title, I thought it was going to search inside the 28M books, but it's just the title, subject, and author. Still a cool project though.
Looks like I’m filtering out stop words a little too aggressively, which is affecting the results in this case. I’ll take a closer look at it in a few hours.
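One common way to make stop-word removal less aggressive is to back off whenever stripping would leave too little of the query, which is exactly the "I, Robot" failure mode reported above. A sketch (the stop-word list and threshold here are made up for illustration):

```python
# illustrative subset; real stop-word lists are much longer
STOP_WORDS = {"i", "a", "an", "the", "of"}

def strip_stop_words(tokens, min_remaining=2):
    """Drop stop words, but keep the original query when stripping
    would leave fewer than min_remaining tokens to search on."""
    kept = [t for t in tokens if t.lower() not in STOP_WORDS]
    return kept if len(kept) >= min_remaining else tokens
```

So "king of england" still becomes "king england", but "I Robot" survives intact because dropping the "I" would leave only one token.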
Same here (EU, The Netherlands). I thought it was broken at first. It took longer than 10 seconds (I think about 20-30s) before I started seeing results. Unusable.
Like another comment pointed out, record sizes can be wildly different depending on the data you index. So it’s hard to quantify it in terms of number of records.
That said, I’m planning to add some examples and pre-configured configs to that page so it’s easier to get a good picture.
I would love, if possible, to do an exact multi-word search, e.g. "Bill Gates". Currently if the phrase is not found, you get results for one of the words. Putting the search in quotes does not work (like it does when performing a search on Google).
Great project. It seems, though, that the books search does not support phrase search, and stop words are ignored: "king of england" becomes king england. Is this by design? Does this affect only the books search, or Typesense generally?
Not sure if this is unintended behavior, but after clicking through from this thread to the site and doing a few searches (all while remaining on the same page), I wanted to go back to this thread, but it took me 7-8 clicks of the back button to do so.
It's super uncomfortable when a site uses OR-based search. Is there a way to perform AND-based search, so that when I add more words the results narrow down instead of ballooning?
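In Typesense's case, I believe the `drop_tokens_threshold` search parameter controls this: setting it to 0 should stop the engine from dropping query tokens when results are sparse, giving AND-style behavior (check the Typesense docs to confirm). Conceptually, AND search is just an intersection of per-token posting sets:

```python
def and_search(inverted_index, tokens):
    """AND semantics: a document matches only if every query token
    matches. inverted_index maps token -> set of document ids."""
    postings = [inverted_index.get(t, set()) for t in tokens]
    return set.intersection(*postings) if postings else set()
```

Adding a token can only shrink (or keep) the intersection, which is the "more words, fewer results" behavior being asked for; OR-style engines union these sets instead.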
Can't fault the performance! Nice work. My one UX suggestion would be to present search results in a more table-like layout, as reading both across and down to see every result is not ideal.
Seems like this exposes a flaw in the dataset. E.g., searching for "cixin liu", many different variations of the same book show up: lots and lots of duplication.
I iterated through the "editions" dataset to build the index, which contains one record for each book (edition), which is why you see one result for each edition.
There's another dataset called "works" which has one record for all editions of a book, but it didn't have all the data points I wanted. So I decided to keep it simple for a weekend project and use the editions dataset.
I think that's a flaw [if you want to call it that] in the presentation of the search results, not the underlying dataset.
If you click through to the OpenLibrary dataset, you can see that the data model understands that the variations are editions of the same underlying text.
The challenge is that there are several ways of doing this depending on your larger architecture. So the definition of "just work" depends on the context of your particular architecture.
For eg: you could hook into your ORM and send change events to Typesense like another commenter pointed out; if you have an event bus, you could hook into those messages and push to Typesense from a listener; you could use something like Maxwell to get a JSON stream from the DB binlogs and write the data you need into Typesense; or you could write a batch script which runs a set of SQL queries on a schedule and batch export + import into Typesense.
Also, typically you'd want to curate the fields that you put in a search index, vs what you store in your transactional database, so there is some level of data transformation that's needed anyway.
So long story short, it depends on your architecture!
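As a concrete example of the last option, the batch script can be as simple as a scheduled query against an `updated_at` column (the table and column names here are hypothetical, shown with SQLite for a self-contained sketch):

```python
import sqlite3

def export_changed_rows(conn, since):
    """Fetch rows touched since the last sync run, as dicts ready
    to be transformed and bulk-imported into the search index."""
    cur = conn.execute(
        "SELECT id, title, updated_at FROM books "
        "WHERE updated_at > ? ORDER BY updated_at",
        (since,),
    )
    cols = [c[0] for c in cur.description]
    return [dict(zip(cols, row)) for row in cur]
```

The scheduler records the max `updated_at` it saw and passes it as `since` on the next run; this is the simplest of the options above, at the cost of sync latency equal to the schedule interval.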
For our setup with Typesense we created a mapping, did an initial bulk import and then set up ORM lifecycle events to sync insert/delete/updates from the database to Typesense. We are using Symfony with Doctrine as the ORM.
Keeping an external search system in-sync with the primary DB is certainly a pain point. The biggest problem is that the records in the DB will likely be normalized, while the records in a search store will not be.
You can "poll" for new records for indexing by using a query, but handling deletes/updates is the real deal breaker. You will then have to use both the binlog (atleast for mysql) and querying, at which point it becomes quite complicated to reason about.
That would totally be the dream though. No reindexing. Pg_search is def fast enough for a large swath of use cases.
What I do is I’ve built a real barebones search controller that connects to Postgres. My controller only lets you search on one attribute, and I use it for autocompletes. Anything more involved, I use typesense.
As I posted in another comment [1], the biggest problem is handling updates and deletes. It requires some kind of logical deletion, plus a `last_updated_at` timestamp to fetch updates from the primary table.
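Under that scheme, a poll returns both live and soft-deleted rows, and the sync step splits them into upserts and deletes against the search index. A sketch, assuming a hypothetical `deleted` flag column on the primary table:

```python
def plan_sync(rows):
    """Turn polled rows (dicts with an optional 'deleted' flag)
    into search-index operations: documents to upsert, ids to delete."""
    upserts = [r for r in rows if not r.get("deleted")]
    delete_ids = [r["id"] for r in rows if r.get("deleted")]
    return upserts, delete_ids
```

The key point is that hard-deleting rows in the primary table leaves the poller blind; keeping the row around with `deleted=1` (and a bumped `last_updated_at`) is what makes delete propagation possible without reading the binlog.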
I don't quite follow. For a system reliant entirely on Postgres, you can just fetch the records, and join them in real time, no? It would be much slower than elasticsearch, but still _good enough_ for a lot of use cases.
I was referring to an integration between Postgres and an external search system such that the data sync is automated. That was what the parent comment was asking about (I think).
What I really want is an integration between Instantsearch.js and Postgres.
Typesense has written a custom adapter for InstantSearch & Typesense. I would love to just hook InstantSearch up directly to my Rails app. You could use pg_search to get 90% of the value with next to no complexity on the backend.
Here's the source code: https://github.com/typesense/showcase-books-search
[1] https://github.com/typesense/typesense
[2] https://github.com/typesense/typesense-instantsearch-adapter
[3] https://cloud.typesense.org