About seven years ago, I got a contracting gig for a website that wanted a "search engine". I remember thinking "Solr/Lucene is old, not pure-functional, and therefore awful!" and decided to build my own. Somehow I even managed to convince the client that this was a good idea.
I spent about two days trying to reinvent stemming and indexing before realizing that this was a stupid thing to do on someone else's time. I called the client to tell them I was moving to Solr, and I got the project done ahead of schedule as a result.
====
I think for 99% of use cases involving search, Lucene/Solr/ES is perfectly fine. However, I do absolutely hate that some companies have decided to make it their primary database.
EDIT:
I just want to make it clear, I think it's totally valid to try and reinvent Solr for fun, or if that's something you're paid specifically to do; nothing is perfect, and I am actually a big fan of the "if it works, break it and make it better!" mentality.
I can chime in here that Lucene-based solutions are almost always sufficient. For a purely frontend, JS-based fuzzy search engine, check out Fuse.js: https://fusejs.io/
Can you explain a bit why ES isn't a good solution for storing data itself?
I inherited a legacy Mongo solution, and all the data is duplicated and indexed in ES, so I've always wondered why we're using both. Mongo has none of the SQL capabilities that would make my life easier, and the types of queries allowed by Mongo could be done with ES.
The v7 upgrade to a new cluster coordination protocol (zen2) has improved things, but overall the system has a long history of losing or destroying data. It's better to have a primary OLTP system that's ACID and reliable while using ES as the secondary search source. You can also disable the _source field if you just need matches without the original content.
It's common to see this pattern used with a relational database since, as you can see, Mongo doesn't buy you much else as another document store.
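For the _source point above, here is a minimal sketch of what the index mapping might look like. It only builds the request body you would send when creating the index; the helper name and the field names are made up for illustration, and every field is assumed to be plain `text`.

```python
import json


def build_matches_only_index_body(text_fields):
    """Build a create-index body whose mapping disables `_source`.

    `text_fields` is a hypothetical list of field names; each one is
    mapped as a plain `text` field for the sake of the example.
    """
    return {
        "mappings": {
            # ES will still index the fields for search, but it will not
            # keep the original JSON document alongside them.
            "_source": {"enabled": False},
            "properties": {name: {"type": "text"} for name in text_fields},
        }
    }


body = build_matches_only_index_body(["title", "description"])
print(json.dumps(body, indent=2))
```

The trade-off is that with _source disabled you can't reindex from ES itself or use features that need the original document, which is only acceptable because the primary database still holds the real data.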
Thank you very much. It seemed like we were just duplicating a bunch of schemaless JSON for no reason, but if ES can lose data, making it the only copy is probably not a good idea.
See my reply to bryanrasmussen for a full explanation.
Basically: you reindex ES periodically, so when a user is deleted from the primary, it will disappear from ES upon the next reindex. The old index is deleted at the file system level.
At some point, though, the pedantry can get out of hand. After all, 'deleting' at the file system level is just 'unlinking' the inode from the underlying data blocks... in fact, data forensics at the file system level is probably better understood than recovering deleted data from a Lucene shard.
At what point would you be able to 'delete' data without being in violation of GDPR?
I think the instructor's response in the linked article is a reasonable defense: you don't really know that data is deleted all the way down to the file level. It is just marked as deleted and could be retrieved by someone clever enough to do so. At some point in the future it will be really deleted.
I don't think the GDPR regulatory agencies operate at a technical level where they would argue that this was not a good enough deletion.
Finally, I have to ask this part: assuming ES is not your primary database, how does this get around the GDPR issues? If someone wants their data erased, you are supposed to erase it from wherever you store data. Does this mean that ES actually deletes data when it reindexes a primary store and finds the data gone, but keeps it around when it is told directly to delete something?
When using ES for indexing and not as the primary store, you can (and should) periodically fully reindex the data set. You can use a blue/green pattern: create a new index, then swap from the old one to the new one. ES supports aliases, making this swap transparent to the apps using the index. Now you have more options.
If it is easy to delete specific users from the primary database, the deleted users will naturally disappear during the next ES reindex.
Edit: The old index is deleted at the file system level.
If the reindexing occurs daily or weekly, perhaps this will satisfy GDPR.
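A rough sketch of the alias swap, in case it helps. It only builds the body for a `POST /_aliases` request, which applies the remove and add actions together; the index naming scheme and helper names are my own assumptions, and actually sending the request (and deleting the old index afterwards) is left out.

```python
from datetime import datetime, timezone


def new_index_name(alias, now=None):
    """Return a timestamped index name such as 'users-20200102030405'.

    The naming scheme is an assumption for this sketch, not an ES rule.
    """
    now = now or datetime.now(timezone.utc)
    return f"{alias}-{now:%Y%m%d%H%M%S}"


def build_alias_swap(alias, old_index, new_index):
    """Build the body for POST /_aliases that repoints `alias`.

    Both actions go in a single request, so apps querying the alias
    never see a moment where it points at no index.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names: 'users' is the alias the apps query.
old_index = "users-20200101000000"
fresh_index = new_index_name("users")
swap_body = build_alias_swap("users", old_index, fresh_index)
```

The order would be: fill the fresh index from the primary database (where the deleted users are already gone), send the swap, then delete the old index.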
There are other good reasons not to use ES as the primary data store. First, it isn’t entirely reliable. It’s good and I’ve never seen a corruption, but neither ES nor Lucene has a history as a reliable database. Second, if you want to change how you index, it is a bit easier to do if the source data is outside of ES.
Thanks, I wasn't arguing that using ES as primary was good. I just don't necessarily see the GDPR argument as a reasonable one. Although I've seen some startups using Mongo as primary, and I have to wonder if there would be that big a difference in using ES at that point (not a Mongo dig, as I've kept away from it for various reasons).