About seven years ago, I got a contracting gig for a website that wanted a "sear...

aliswe · on July 19, 2019

I can chime in here that Lucene-based solutions are almost always sufficient. For a purely frontend, JS-based fuzzy search engine, check out Fuse.js: https://fusejs.io/

hiram112 · on July 20, 2019

Can you explain a bit why ES isn't a good solution for storing data itself?

I inherited a legacy Mongo solution, and all the data is duplicated and indexed in ES, so I've always wondered why we're using both. Mongo has none of the SQL capabilities that would make my life easier, and the types of queries allowed by Mongo could be done with ES.

What are the negatives of ES alone?

manigandham · on July 20, 2019

It's not reliable: https://www.quora.com/Why-shouldnt-I-use-ElasticSearch-as-my...

The v7 upgrade to a new cluster protocol (zen2) has improved things but overall the system has a long history of losing or destroying data. It's better to have a primary OLTP system that's ACID and reliable while using ES as the secondary search source. You can also remove the _source field if you just need matches without the original content.
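To make the `_source` remark concrete, here is a minimal sketch of an index-creation body with `_source` disabled, using the typeless mapping format from ES 7. The function name and the field names are made up for illustration, and the code only builds the JSON; sending it to a cluster (e.g. `PUT /products`) is left out.

```python
import json


def index_body_without_source(text_fields):
    """Build a create-index request body that does not store _source.

    `text_fields` is a hypothetical list of field names to index as
    full-text fields; adjust the types to match your real schema.
    """
    return {
        "mappings": {
            # Matching still works, but the original JSON document is
            # not kept, so hits only give you back the document _id.
            "_source": {"enabled": False},
            "properties": {name: {"type": "text"} for name in text_fields},
        }
    }


no_source_body = index_body_without_source(["title", "description"])
print(json.dumps(no_source_body, indent=2))
```

The trade-off: without `_source` you can't use the update API, highlighting, or reindex from ES itself, so this only makes sense when the primary database can always be used to rebuild the index.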

It's common to see this pattern used with a relational database since, as you can see, Mongo doesn't buy you much as just another document store.

hiram112 · on July 26, 2019

Thank you very much. It seemed like we were just duplicating a bunch of schemaless JSON for no reason, but if ES can lose data, relying on it alone is probably not a good idea.

manigandham · on July 20, 2019

Follow up: MongoDB is adding full-text search capabilities: https://www.youtube.com/watch?v=4QUGWnz-XaA

penagwin · on July 19, 2019

> "if it works, break it and make it better!"

Fix it 'til it breaks is what I always say :D

Needless to say my 3D printer had a lot of down-time haha.

twic · on July 19, 2019

"If it ain't broke, open it up and find out what makes it so bloody special" - I think that's wisdom from the BOFH.

eivarv · on July 19, 2019

Agreed re: using ES as primary storage (which it is NOT meant as) - as far as I can tell, it might even put you in breach of the GDPR [0].

TLDR: Lucene < 7.5 won't merge segments larger than 5GB (default) unless they accumulate 50% deletions.
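A toy model of that rule, only to show why deleted documents can linger on disk. This is not Lucene's actual merge policy code: the function name and parameters are hypothetical, and the 5GB and 50% figures are simply the ones quoted in the TLDR.

```python
GB = 1024 ** 3


def eligible_for_merge(segment_bytes, deleted_fraction,
                       max_segment_bytes=5 * GB, delete_threshold=0.5):
    """Return True if a segment would be considered for merging.

    Simplified assumption: segments at or under the size cap are always
    candidates; oversized segments only become candidates once enough of
    their documents are marked as deleted.
    """
    if segment_bytes <= max_segment_bytes:
        return True
    return deleted_fraction >= delete_threshold


# A 6GB segment with 30% deleted docs is left alone, so the "deleted"
# documents stay on disk until the threshold is crossed.
print(eligible_for_merge(6 * GB, 0.30))
print(eligible_for_merge(6 * GB, 0.55))
print(eligible_for_merge(1 * GB, 0.05))
```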

I'm delivering a conference talk [1] about this later this year.

[0]: https://www.eivindarvesen.com/blog/2018/09/16/elasticsearch-...

[1]: https://2019.javazone.no/program/3f7cd8a7-a9ea-4874-a7dd-531...

balfirevic · on July 19, 2019

> Agreed re: using ES as primary storage (which it is NOT meant as) - as far as I can tell, it might even put you in breach of the GDPR [0].

How is the GDPR compliance of having data in Elasticsearch influenced by it being primary vs. secondary storage?

rickmode · on July 19, 2019

See my reply to bryanrasmussen for a full explanation.

Basically: you reindex ES periodically, so when a user is deleted from the primary, it will disappear from ES upon the next reindex. The old index is deleted at the file system level.

ipython · on July 19, 2019

At some point, though, the pedantry can get out of hand. After all, 'deleting' at the file system level is just 'unlinking' the inode from the underlying data blocks... in fact, data forensics at the file system level is probably better understood than recovering deleted data from a Lucene shard.

At what point would you be able to 'delete' data without being in violation of GDPR?

eivarv · on July 19, 2019

I know - it really is a matter of definition.

Though the EU has said it will consider intention etc., there's really no way of knowing for certain unless and until it's settled in a court case.

bryanrasmussen · on July 19, 2019

I think the instructor's response in the linked article is a reasonable defense: you don't really know that data is deleted all the way down to the file level. It is just marked as deleted and could be retrieved by someone clever enough to do so. At some point in the future it will be really deleted.

I don't think the GDPR regulatory agencies operate at a technical level where they would argue that this was not a good enough deletion.

Finally I have to ask this part: assuming ES is not your primary database, how does this get around the GDPR issues? If someone wants their data erased, you are supposed to erase it from wherever you store data. I suppose this means that when ES indexes a primary store and finds deleted data, it actually deletes it, but if it is told directly to delete something, it keeps it around?

rickmode · on July 19, 2019

When using ES for indexing and not the primary store, you can (and should) periodically fully reindex the data set. You can use a blue / green pattern — create a new index then swap from the old one to the new one. ES supports aliases, making this swapping transparent to the apps using the index. Now you have more options.

If it is easy to delete specific users from the primary database, the deleted users will naturally disappear during the next ES reindex.

Edit: The old index is deleted at the file system level.

If the reindexing occurs daily or weekly, perhaps this will satisfy GDPR.
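The alias swap at the heart of the blue / green pattern can be sketched as the body of a single `POST /_aliases` request. The index and alias names here are made up, and the code only builds the JSON rather than talking to a cluster.

```python
import json


def alias_swap_actions(alias, old_index, new_index):
    """Build the body for POST /_aliases that repoints `alias`.

    Both actions go in one request, so ES applies them atomically and
    the apps querying the alias never see it missing.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Rough order of operations for one reindex cycle (names hypothetical):
#   1. create users_green and fill it from the primary database
#   2. POST /_aliases with the body below
#   3. DELETE /users_blue, which removes the old index files from disk
swap_body = alias_swap_actions("users", "users_blue", "users_green")
print(json.dumps(swap_body, indent=2))
```

Since the new index is built only from what is currently in the primary database, anything deleted there never makes it into `users_green`.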

There are other good reasons not to use ES as the primary data store. First, it isn’t entirely reliable. It’s good and I’ve never seen a corruption, but ES and Lucene’s history isn’t that of a reliable database. Second, if you want to change how you index, it is a bit easier to do if the source data is outside of ES.

bryanrasmussen · on July 20, 2019

Thanks, I wasn't arguing that using ES as primary was good. I just don't necessarily see the GDPR argument as being a reasonable one. Although I've seen some startups using Mongo as primary and have to wonder whether there would be that big a difference in using ES at that point (not a Mongo dig, as I've kept away from it for various reasons).