This is a very useful tool. I’ve known people doing research on the evolution of grammar, vocabulary, literary style, etc. who use only small subsets of the Project Gutenberg data. I'm sure they would appreciate being able to search the entire corpus.
The corpus-search functions those researchers use include wildcards, exact-phrase specification with quotation marks, proximity searches, and Boolean search strings. When you have a chance, you might want to add a list of the syntax formats that currently work. (I tried using * as a wildcard in a phrase surrounded by quotation marks, and it didn’t seem to work.)
One small improvement you could make would be to widen the “Search terms” field so that longer search strings are visible.
Thank you for taking the time to lay out feature requests in such detail! I really appreciate it.
The current search box is a wrapper around Postgres's phraseto_tsquery [1], whilst the Discovery tab uses plainto_tsquery, so for now you could try either as a stand-in for some of these features, although special characters might get stripped or parsed incorrectly.
Do you know where the people you are talking about hang out online (for example, subreddits)? I'd love to get in touch with them once the features are built and for more general feedback.
I'm curious whether you ever tried MeiliSearch for this. I tried it recently for something unrelated and had a very good experience with it. Since you already have the corpus, it might be worth trying it to see if it speeds search up. I'd be interested in the results either way.
Thank you for the suggestion! The demo looks great and I'm curious to see how they automatically handle language. It would be nice to add support for Chinese, which is the big omission from this attempt, even at the cost of several other languages that in any case have few books transcribed to text in Gutenberg.
I will probably write a blog post once I've tried a few of the approaches suggested in this thread.
I’m afraid it’s been a few years since I heard their research presentations. The only specific tool I remember is AntConc [0], as I happen to know the developer. (I teach at a university in Japan, as does he.) But I think they mentioned other concordance and corpus-related tools as well.
Most of the researchers I remember were literature scholars. While they were comfortable using computers, not many of them seemed savvy enough to write their own programs, run their own servers, or even use grep.