This is a very useful tool. I’ve known people doing research on the evolution of...

Jabbles · on Jan 24, 2021

Surely someone doing serious research would download the entire corpus and use grep (or equivalent)?

It wouldn't have the same nice interface and searches may take several seconds, but there are only 60,000 books...
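For illustration, a minimal sketch of that "download everything and grep" approach in Python. It assumes the corpus has already been unpacked into a local directory of plain-text `.txt` files; the function name `grep_corpus` and the directory layout are hypothetical, not anything Project Gutenberg or this site provides.

```python
import re
from pathlib import Path
from typing import Iterator, Tuple


def grep_corpus(
    root: str, pattern: str, ignore_case: bool = True
) -> Iterator[Tuple[str, int, str]]:
    """Yield (file path, line number, line) for each line matching `pattern`.

    `root` is assumed to be a local directory of plain-text books.
    """
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    # Walk the directory tree in a stable order so results are reproducible.
    for path in sorted(Path(root).rglob("*.txt")):
        # Older texts come in mixed encodings; replace undecodable bytes
        # instead of aborting the whole scan.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line_number, line in enumerate(handle, start=1):
                if regex.search(line):
                    yield str(path), line_number, line.rstrip("\n")
```

Something like `for hit in grep_corpus("books/", r"\bwhale\b"): print(*hit, sep=":")` would then print grep-style output. It is a linear scan with no index, so the "several seconds" caveat applies.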

gutensearch · on Jan 24, 2021

Thank you for taking the time to lay out feature requests in such detail! I really appreciate it.

The current search box is a wrapper around Postgres phraseto_tsquery [1], whilst the Discovery tab uses plainto_tsquery. You could play with either as a stand-in for some of these features for now, although special characters might get stripped or parsed incorrectly.

Do you know where the people you are talking about hang out online (for example, subreddits)? I'd love to get in touch with them once the features are built and for more general feedback.

[1] https://www.postgresql.org/docs/12/textsearch-controls.html
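To make the difference between the two query styles concrete, here is a rough Python sketch of their matching behaviour. It is not how Postgres implements them: the real functions also normalise words into lexemes (stemming, stop-word handling), which this toy version skips. The helper names `tokenize`, `plain_match`, and `phrase_match` are made up for the illustration.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-word characters (no stemming, no stop words)."""
    return re.findall(r"\w+", text.lower())


def plain_match(document: str, query: str) -> bool:
    """Rough analogue of plainto_tsquery: all query words present, any order."""
    terms = tokenize(query)
    if not terms:
        return False  # an empty query matches nothing
    words = set(tokenize(document))
    return all(term in words for term in terms)


def phrase_match(document: str, query: str) -> bool:
    """Rough analogue of phraseto_tsquery: query words adjacent and in order."""
    terms = tokenize(query)
    if not terms:
        return False  # an empty query matches nothing
    doc = tokenize(document)
    # Slide a window of len(terms) over the document tokens.
    return any(
        doc[i:i + len(terms)] == terms
        for i in range(len(doc) - len(terms) + 1)
    )
```

With the text "The white whale was seen near the ship", the query "whale white" satisfies `plain_match` but not `phrase_match`, while "white whale" satisfies both. Punctuation in the query is dropped by the tokenizer, which loosely mirrors the caveat about special characters being stripped.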

stavros · on Jan 24, 2021

I'm curious whether you ever tried MeiliSearch for this. I tried it recently for something unrelated and had a very good experience with it. Since you already have the corpus, it might be worth trying it to see whether it speeds search up. I'd be interested in the results either way.

gutensearch · on Jan 24, 2021

Thank you for the suggestion! The demo looks great and I'm curious to see how they automatically handle language. It would be nice to add support for Chinese, which is the great absence from this attempt, even at the cost of several other languages that in any case have few books transcribed to text in Gutenberg.

I will probably write a blog post once I've tried a few of the approaches suggested in this thread.

stavros · on Jan 24, 2021

That would be great, thanks!

iiv · on Jan 24, 2021

I know /r/CompLing on Reddit is quite popular.

andai · on Jan 24, 2021

Do you know what tools they're using for that?

tkgally · on Jan 25, 2021

I’m afraid it’s been a few years since I heard their research presentations. The only specific tool I remember is AntConc [0], as I happen to know the developer. (I teach at a university in Japan, as does he.) But I think they mentioned other concordance and corpus-related tools as well.

Most of the researchers I remember were literature scholars. While they were comfortable using computers, not many of them seemed savvy enough to write their own programs, run their own servers, or even use grep.

[0] https://www.laurenceanthony.net/software/antconc/