Show HN: Search code in GitHub repos using regular expressions

jasoncwarner · on Feb 23, 2020

This is awesome!

@danfox, sent you an email though commenting here too.

I'm the CTO @ GitHub. Would love to talk to you about this and other things we are building in this area at GitHub.

Feel free to email direct to jason at github.com

sovietmudkipz · on Feb 24, 2020

I do enjoy moments like these on hacker news when someone presents a project for X and the CTO of X shows up and wants to talk. It shows how direct an impact one can potentially have in this community.

I hope this means we’re getting grep searches for github soon. Cheers.

travbrack · on Feb 24, 2020

If OP's goal was to get a job at Github, I'd say this was very well played.

sdesol · on Feb 24, 2020

It would certainly be an expensive play. When he mentioned a 20 core system, I'm assuming it is some VM system, since I don't know of any 20 core CPUs. I'm guessing he is using DigitalOcean and he has two of them, so he is looking at $1000 a month in hosting cost.

samcrawford · on Feb 24, 2020

It's an expensive side project for sure, but it doesn't have to be anywhere near as expensive as that.

My own side project uses a server with 20 cores (2x E5-2690v2 CPUs), with 256GB RAM and a 2TB SSD. This is a dedicated server I rented from tier.net in Texas, after seeing it listed on webhostingtalk [1]. It costs about $160/mo, and that's recently fallen further by paying for 3 months up front.

1. https://www.webhostingtalk.com/forumdisplay.php?f=36

latenightcoding · on Feb 23, 2020

github's code search is notoriously bad, feels like a huge missed opportunity. Nice to see you guys reaching out to other people working in this area.

sophiebits · on Feb 23, 2020

Good news – they're working on it: https://help.github.com/en/github/searching-for-information-...

lucasverra · on Feb 23, 2020

I'm not a dev and what i like to do is go searching for code (a la exact match) to replace whatever variable or text should be changed. Github search in repo kinda worked at some moment, then not.

Then i had to download repo in my local; run VS code (updating first), search there, modify, push. I wish i could do this on Ghub web GUI

erikpukinskis · on Feb 23, 2020

The fact that you can’t search for file names is the funniest part to me.

WhiteOwlLion · on Feb 24, 2020

If you fork a repo, you can't search on the forked repo. Confirmed grep.app behaves the same way as github search.

DrJones1098 · on Feb 23, 2020

You can but only in the repo itself not on a site wide scale.

cole-h · on Feb 24, 2020

Really? I searched for `filename:home.nix` (which brought me to https://github.com/search?utf8=%E2%9C%93&q=filename%3Ahome.n...). That seems site-wide to me, unless I'm misunderstanding you.

giancarlostoro · on Feb 24, 2020

These kind of keywords really should be next to the search box with a question mark next to them or something.

TIL some of them are on this page that you only see if you search for an empty string:

https://github.com/search?q=

Click on 'prefixes'. This kind of thing should be readily available from any search box that searches through GitHub.

DrJones1098 · on Feb 24, 2020

That's awesome. I didn't know about the keyword filename: I've been using the button to the left of "Clone or Download" this whole time. Thanks for the info!

aantix · on Feb 24, 2020

Agree. You can’t even do a literal search with symbols.

I think there were better solutions in the early 2000’s.

sixwing · on Feb 23, 2020

hah. you beat me to it, Jason.

@danfox, i'm always down to talk code search as well - rand@github.com

glangdale · on Feb 24, 2020

iirc, Github uses (used?) my old project (https://github.com/intel/hyperscan) at Intel. It's probably faster than the alternatives, although if you want to support all types of regex you'll need to use Hyperscan as a prefilter for a richer regex engine like PCRE.

This project looks like it pulls literal factors out of the regex that I type in, maybe to an index a la that Russ Cox blog post a while back about Code Search. It seems to Not Like things that have very open-ended character classes (e.g. \w) unless there is a decent length literal involved somewhere.

It seems to have a very rudimentary literal extraction routine, as it decides to give a partial result set when fed an alternation between two literals that it handles pretty well on their own.

glangdale · on Feb 24, 2020

pattern is, btw: (teakettle\w|abcd)

Either pattern in alternation works fine, but even a simple alternation of the two goes back to the behavior that you might expect to get from awful patterns like \d..\d..\w...\s...\d (i.e. reporting only a partial set of matches).
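The literal-factor idea discussed above can be sketched roughly as follows. This is only an illustration of a trigram prefilter in the spirit of the Russ Cox article, not how grep.app or Hyperscan actually work; the documents, the function names, and the hand-supplied literals are all assumptions made for the example. For an alternation, the candidate set is the union of the candidates for each branch, and the real regex is then run only on those candidates.

```python
import re
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(docs):
    """Map each trigram to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index


def candidates_for_literal(index, literal, all_ids):
    """Documents containing every trigram of the literal (a superset of true matches)."""
    grams = trigrams(literal)
    if not grams:  # literal shorter than 3 chars: the index cannot narrow anything
        return set(all_ids)
    result = set(all_ids)
    for gram in grams:
        result &= index.get(gram, set())
    return result


def search_alternation(docs, index, literals, pattern):
    """Union the candidates of each branch literal, then confirm with the real regex."""
    candidate_ids = set()
    for literal in literals:
        candidate_ids |= candidates_for_literal(index, literal, docs.keys())
    regex = re.compile(pattern)
    return sorted(d for d in candidate_ids if regex.search(docs[d]))


# Hypothetical three-file corpus.
docs = {
    "a.c": "int teakettles = 3;",
    "b.c": "char *s = \"abcd\";",
    "c.c": "return 0;",
}
index = build_index(docs)
# The literals are supplied by hand here; a real engine would extract them from the regex.
hits = search_alternation(docs, index, ["teakettle", "abcd"], r"(teakettle\w|abcd)")
print(hits)
```

A pattern with no usable literal (such as \d..\d..\w) gives the index nothing to intersect, which is one plausible reason such queries would fall back to scanning and return partial results.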

michaelmior · on Feb 24, 2020

Did you mean either pattern in "isolation"?

glangdale · on Feb 24, 2020

Yup, you're right - I wound up saying the opposite of what I meant!

WhiteOwlLion · on Feb 24, 2020

How about the ability to search code on forks? GitLab allows it. At least have feature parity? Thanks.

simonw · on Feb 23, 2020

Impressive! Really fast, full featured code search across a huge corpus.

1. How did you build the index? Did you use a GitHub dump of some sort? How often do you refresh it?

2. Is it Elasticsearch or similar or a completely custom engine?

3. What kind of RAM/CPU are you using to power it?

4. Any plans to open source the code or commercialize the technology?

I could absolutely imagine paying for a private code search engine like this to run against a large internal company codebase spread across many repositories.

danfox · on Feb 23, 2020

Thanks! It's built on top of Solr. It fetches the repos from GitHub - it should pick up any updates to repos within a few days. It's running on a couple servers with 20 cores each, which is not really enough for the traffic it's getting right now.

rattray · on Feb 23, 2020

Have you seen livegrep?

Blazing fast multi-repo regex code search. May be more expensive to run in prod, not sure.

ngold · on Feb 23, 2020

This is so good I imagine you're gonna need more.

1024 · on Feb 24, 2020

Cool! I love it.

heipei · on Feb 24, 2020

I'd be curious how you built the step from regex to ElasticSearch. My guess would be an n-gram (3-gram) index in ElasticSearch and then translating the regexes to that, but just curious if you built that custom or used something off-the-shelf. Love the site!

speedplane · on Feb 24, 2020

> I'd be curious how you built the step from regex to ElasticSearch. My guess would be an n-gram (3-gram) index in ElasticSearch and then translating the regexes to that, but just curious if you built that custom or used something off-the-shelf. Love the site!

I'm pretty sure Elasticsearch supports regex search, it's just that it's horrendously slow and can blow up the system.

dang · on Feb 23, 2020

I still miss Google Code Search, which was a great way to find examples of anything I wanted to learn about in programming and usually answered my questions better than anything else, including Stack Overflow. Has it really been 8 years... https://news.ycombinator.com/item?id=3112029

If this tool can fill that hole in my world, I'll be stoked. I've bookmarked it.

londons_explore · on Feb 25, 2020

Google code search still exists as long as you want to search Chromium source code.

[1]: https://cs.chromium.org/

londons_explore · on Feb 25, 2020

The main difference it has IMO is it indexes a symbolic code graph extracted from halfway through the compilation process. That means when you search, it knows which functions are frequently called. For example, the LOG() macro is defined in hundreds of places, but the one in logging.h is the one everyone calls, so that's the one that comes top of the results.

It also keeps track of back references, so you can search "who calls any function in this file", which is very hard to do with any other search system.

Major disadvantages are it only indexes one build config, so if you're debugging android code in a multi-platform project and the indexing was done on the windows version, you won't find much (apart from dumb text based search which it does in addition).

The difficulty of compiling every project to build a decent index would make this approach hard on a GitHub scale - all it takes is one missing header file from a dependency not in the repo and the build fails and the whole project can't be indexed. Also, have fun with things like JavaScript which are so dynamic you have to solve the halting problem to know which bit of code calls which other.

thanatos_dem · on Feb 23, 2020

Next post from danfox - “how to get 3 job offers in 3 hours”.

Already has been publicly contacted by:

- GitHub CTO

- SerpApi CEO

- SourceGraph CEO

Search is hot right now!

swat535 · on Feb 23, 2020

Actually, It would more be like: "How I failed at 3 interviews, despite being directly contacted by execs."

wolco · on Feb 24, 2020

Couldn't whiteboard a solution without the temp variable.

SlowRobotAhead · on Feb 24, 2020

I’m in a field and physical area with a pool so shallow - that it seems like straight up madness to throw questions like that at people and kick them out the door for it.

TheSpiceIsLife · on Feb 24, 2020

New game show idea:

CTOs from software companies interview at other software companies.

mtnGoat · on Feb 24, 2020

this would be awesome, i would watch this!

i bet they would all go back home and immediately fix their own hiring practices.

DrJones1098 · on Feb 23, 2020

Sure, you built an app on multiple 20 core machines with functionality to search hundreds of millions of lines of code almost instantaneously, but are you someone I'd drink a beer with?

yakshaving_jgt · on Feb 24, 2020

This snide remark dismisses the fact that working on software does mean working with other humans, not just unemotional robots devoid of any kind of irrational ideas. Being able to “drink a beer with” (and reasonably substituting the drinking of beer for just about any other social interaction) is an important part of being able to work with someone. Unless of course you believe an office environment consisting of a tyrannical manager barking orders at worker drones is a healthy relationship.

IshKebab · on Feb 24, 2020

100%. I don't really care if you're a super genius if you're also a massive dick that everybody hates.

DrJones1098 · on Feb 24, 2020

Are you having intimate romantic relationships with all of your co-workers?

If they get me out of work at 4:30 pm and keep the project I'm working on in quality code so I have less fires to deal with, that's good enough for me.

fjp · on Feb 24, 2020

I think when people talk about this, they mean to push back against the fact that people will often be biased to hire someone they think they could be casual friends with, share interests with, etc.

I like my coworkers and I find them perfectly fine to work and make small talk with, but I don't share interests with many of them and wouldn't really care to hang with them outside of work. That shouldn't be a criterion for hiring.

I have found it highly annoying to work in engineering orgs where everyone seems to have the same interests. Everyone talking about Star Wars, Dungeons and Dragons, Lord of the Rings, etc. constantly because it's assumed everyone else around also enjoys that conversation.

DrJones1098 · on Feb 24, 2020

It's an ego thing to want to work with someone just like you instead of adapting yourself to others. It's basically bro culture. It's kind of what's wrong with technology culture.

Give me someone who is talented who makes great code so I can be home at 4:30pm and I don't care what their personality is like. Additionally someone who tells me when something is an issue even at my ego's expense is extremely valuable, over back patters and schmoozers who just want to keep everyone happy. That leads to a terrible product. I would not like to see what whatever product you're working on is like.

You all should take a long look at yourselves and ask why you have to work with people who are just like you instead of being adaptive to other walks of life, personality, and backgrounds. Try getting out of yourselves for a minute. You might even learn something now outside of your own tiny tiny worlds!

yakshaving_jgt · on Feb 24, 2020

That’s a pretty unfortunate interpretation of my comment, and not entirely logically consistent either.

I mean, if one person who rejects bro culture only wants to collaborate with other people who also reject bro culture, does that mean they are now proponents of bro culture?

I also find it frankly a bit weird for you to make grand sweeping assumptions about who some strangers on an Internet forum choose to associate and collaborate with. How do you know people here don’t work with people from other backgrounds?

DrJones1098 · on Feb 24, 2020

I found your interpretation of my original 'snide' comment pretty unfortunate.

And not a single thing you just said makes any logical sense.

I do know I would never want to work on any project that you're in charge of because I guarantee they're nightmare environments.

Best of luck to you nonetheless.

tmpz22 · on Feb 23, 2020

For some of those companies it would be "drink a La Croix with"

runawaybottle · on Feb 23, 2020

That would be quite the dystopian interview nightmare.

nonbirithm · on Feb 23, 2020

If only the answer to "how" was as simple as "writing a web service for searching GitHub repos with regexes," even though the problem is probably in itself non-trivial if there's this much interest in search at all. At least the specification is clear enough.

I guess what I mean to ask is, how would people know this is a "correct" answer to the "how" question beforehand? Is the answer literally just "search" because that's simply what's trending right now?

sdesol · on Feb 23, 2020

It also probably goes without saying he should be careful with what details to share.

Existenceblinks · on Feb 23, 2020

I'm surprised as well, think why big tech companies didn't have this awesome search already.

thanatos_dem · on Feb 23, 2020

If this were to be offered by an actual company (a first party solution), there are some features that'd be expected that make the problem space a lot harder. Here's an "intro to search" article that's a good read, and I'll use it to highlight some of the things that'd be different in a first party solution - https://medium.com/startup-grind/what-every-software-enginee...

(See the "Theory: the search problem" section)

Size: This is only indexing ~500k public repos. A first party solution would be expected to index all of it, public and private.

Indexing speed: This can take up to a few days to index. A first party solution would be expected to have a much lower index latency - seconds to minutes.

Query language: This can (and does) have its own simple query language. A first party solution would need to have support embedded into and not break backwards compatibility with the current query language.

Context-dependence: A first party solution would be expected to index private repos as well, and now the query context (logged in user) becomes another variable in an already multi-variate problem space.

Latency: Gets harder with scale, and a first party solution would likely provide a SLA/SLO around latency.

Access control: Same issue as context-dependence, with private repos being included.

There's also unknown but likely considerations around compliance and internationalization, which are quite tricky problems.

Note - I don't mean for this to be critical of the author at all. This is an awesome and useful tool, with a fantastic UX. I just want to make it clear that search at scale is a lot harder than it seems at first glance, especially as the feature requirements increase.

fjania · on Feb 23, 2020

Engineering manager for code search at GitHub here... this is an excellent summary of many of the concerns we have as we work on code search at GitHub scale!

sdesol · on Feb 23, 2020

For GitHub, I would have to imagine only being able to search public repos with regexp would be good enough. GitHub has many strategies, but the main one is, they want to maintain, if not, expand their open source mind share.

The more reasons you give people to go to GitHub, the better off they will be in the future. So I do agree with you that as a commercial solution, this may not be viable, but for GitHub's public repos, this can turn into a very positive thing.

marceloabsousa · on Feb 24, 2020

That might well be true but to scale this type of service to all public repos with decent latency and update ratio is a major technical challenge and likely very costly to maintain.

sdesol · on Feb 24, 2020

This is my personal observation, but GitHub appears to be a much more ambitious company, now that they are part of Microsoft. With a CEO that understands both the open source and the enterprise world and with Microsoft cash at hand, I don't think spending money to make search better would cause any concerns.

Doing technical things that GitLab, Bitbucket, etc. can't is quite valuable. It also helps with recruiting, since smart people want to work on difficult problems.

It may well be costly to maintain, but I think the operating cost would be well within the realm of an incumbent that wants to maintain and expand their reach. I've been studying the code hosting space for quite sometime and GitHub, from an outsiders perspective, appears to be much more focused and ambitious, which should cause serious concerns for GitLab.

neonate · on Feb 23, 2020

Also by the co-creator of Django: https://news.ycombinator.com/item?id=22397023

sqs · on Feb 23, 2020

This is really cool. What are you using it for? Usage examples, debugging, etc.?

I'm the CEO at Sourcegraph (universal code search for companies to use on their internal code). Our product is really optimized for searching a company's internal code right now, but soon we'll start working on offering much better search for public and open-source code as well. If you'd like to help out or just chat, please reach out! sqs@sourcegraph.com

edwinyzh · on Feb 24, 2020

Sorry, but his code search covers far more languages than yours the last time I tried yours :)

akavel · on Feb 24, 2020

Doesn't sourcegraph allow to just search regex over any files in a repo? This is textual search, so how are languages relevant to it? I didn't seem to have problems with that

edwinyzh · on Feb 24, 2020

Sorry, maybe I have confused SourceGraph with https://searchcode.com, but last time I tried, it supports only most widely used languages such as Java, Python and so on, but not the language I use (Delphi/Object Pascal). I'm sorry if I'm wrong.

sqs · on Feb 24, 2020

Sourcegraph CEO here. You can definitely search all languages (and all files, and cross-repo, and all commits, etc.) with Sourcegraph.

edwinyzh · on Feb 24, 2020

Great! Do you have a live demo? like the one being Showed HN?

sqs · on Feb 24, 2020

Here ya go: https://sourcegraph.com/search?q=open+repo:edwinyzh+lang:pas... (search) and https://sourcegraph.com/github.com/edwinyzh/EditBone@d9ec56a... (find references in Pascal)

Sourcegraph.com is universal code search and navigation across all public repositories. To use it on private code inside your company, run a self-hosted instance at https://docs.sourcegraph.com/#quickstart.

We've been so focused on internal code search for companies. See https://about.sourcegraph.com for some of the logos of well-known companies whose devs all use Sourcegraph. Because of that, our "public demo" site at Sourcegraph.com has a few limitations that we're working on lifting, such as only searching across a subset of popular repositories by default (unless you specify a specific subset with `repo:` in the query).

franciscop · on Feb 24, 2020

This is amazing! One thing that allows me to do, which I wasn't before, is to do a search for the repos that use some of my open source.

While there were some tools for this, they fall short for older projects where using a library meant copy/pasting it into your project, which is not reported in the CDN stats, npm installs or github "uses".

Now I can run a search with a bit of code that is only present in my library and find reliably those who copy/pasted it. While I publish my code under the MIT, this would also be very useful for those publishing under the GPL to detect bad actors.

danielecook · on Feb 23, 2020

Wow. This is incredibly helpful. You can use it to see how someone may have used a function with named parameters:

  my_function(label=x, option_1=2)
  my_function.*option_1 # search
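As a rough local illustration of what that query does, here is the same pattern applied with Python's `re` module to a few made-up lines. The sample lines are hypothetical; only `my_function` and `option_1` come from the example above.

```python
import re

# Hypothetical source lines; my_function and option_1 come from the example above.
lines = [
    "my_function(label=x, option_1=2)",
    "my_function(label=y)",
    "other_function(option_1=5)",
    "result = my_function(x, option_1=compute())",
]

# Same pattern as the search: the function name, anything, then the parameter name.
pattern = re.compile(r"my_function.*option_1")
matches = [line for line in lines if pattern.search(line)]
for line in matches:
    print(line)
```

Only calls that mention both the function and the named parameter on the same line are kept, which is what makes the query useful for finding usage examples.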

SlowRobotAhead · on Feb 24, 2020

That was my first thought. I’ll have to wait until tomorrow to try it, but I have one super rarely used function in a rare package I’d love to see how other people are using.

hoorayimhelping · on Feb 23, 2020

to grep specific repos locally, I use a tool called Hound, https://github.com/hound-search/hound developed by a couple of engineers at Etsy while I was there, but never released officially.

glouwbug · on Feb 23, 2020

Amazing, why Microsoft hasn't built this for GitHub yet is beyond me.

Can it grep on individual repos?

hcm · on Feb 24, 2020

I built https://grephub.com for that. It doesn’t maintain an index so it’s not super snappy, but it’s good enough / better than you’d expect in many cases!

funklute · on Feb 23, 2020

Why would you want to use this tool to grep individual repos? If you know the repo you're interested in, you can just clone it and then grep it locally...?

m3kw9 · on Feb 23, 2020

So you don’t have to clone it and grep it locally

tehlike · on Feb 23, 2020

I like to grep a code pattern throughout all repos. I use gstreamer, and sometimes i just don't know how to use it to do a specific thing. So i search substrings to find out usage patterns by other people.

big_chungus · on Feb 23, 2020

Some things can take a while to clone. On the top end, repos like blink, webkit, and gecko can take half an hour or more.

leni536 · on Feb 23, 2020

Even with --depth=1?

oefrha · on Feb 23, 2020

A tangent, my biggest gripe with GitHub code search (within a repo) off the top of my head is the inability to blacklist directories or only search whitelisted directories. Often times I want to look up the implementation of a function, and bam, three pages of results from tests.

Noctem · on Feb 24, 2020

I'm glad I'm not the only one. It's very common that I'll be searching for a keyword that only appears in the actual code a handful of times but hundreds of times in tests. GitHub's search is practically useless in those cases.

I almost always just resort to cloning and searching with ripgrep, which can be annoying if I have no other reason to have the codebase on my machine or it's just a one-off.

cynicalreason · on Feb 24, 2020

yeap .. having this issue as well, trying to easily find where a method is defined in JS/TS I'd so much want to be able to exclude `*.(spec|test).(js|ts|jsx|tsx)`
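For anyone doing this locally in the meantime, the exclusion above can be approximated with a small path filter. This is just a sketch: the file list is invented, and the glob-like pattern from the comment is rewritten as a regular expression.

```python
import re

# Matches names such as foo.spec.ts or bar.test.jsx (the pattern above, written as a regex).
TEST_FILE = re.compile(r"\.(spec|test)\.(js|ts|jsx|tsx)$")


def without_tests(paths):
    """Keep only the paths that do not look like JS/TS test files."""
    return [p for p in paths if not TEST_FILE.search(p)]


# Hypothetical repository listing.
paths = [
    "src/parser.ts",
    "src/parser.spec.ts",
    "src/ui/Button.test.jsx",
    "src/ui/Button.jsx",
    "docs/testing.md",
]
print(without_tests(paths))
```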

patrickdevivo · on Feb 24, 2020

This is really cool! Awesome work. I assume you've seen https://sourcegraph.com/ as well? This to me seems much clearer and a bit more intuitive (though I've only spent a little time in sourcegraph). Really really cool. Does it also search code comments?

edwinyzh · on Feb 24, 2020

last time I tried sourcegraph doesn't cover the language I use, so it's useless to me.

akavel · on Feb 24, 2020

For regex?? how's language relevant?

edwinyzh · on Feb 24, 2020

Sorry, maybe I have confused SourceGraph with https://searchcode.com, but last time I tried, it supports only most widely used languages such as Java, Python and so on, but not the language I use (Delphi/Object Pascal)

hartator · on Feb 23, 2020

Excellent work!

I am the CEO at SerpApi. If you need a job, shoot me an email at julien _at_ serpapi.com.

fanf2 · on Feb 23, 2020

I wonder how this compares to Debian Code Search (https://codesearch.debian.net/about) and Russ Cox’s code search tools (https://swtch.com/~rsc/regexp/regexp4.html).

Obviously the source material is different (Debian packages vs GitHub repos) and grep.app also uses re2, but that is all I can see from a look at the “about” blurb.

sciurus · on Feb 24, 2020

Another related tool is

https://searchfox.org/

https://github.com/bgrins/searchfox

nickjj · on Feb 23, 2020

Hey Dan, if you ever wanted to come on my podcast to talk about your tech stack (how your site is developed / deployed, lessons learned, etc.), I'd love to have you on.

That podcast is at: https://runninginproduction.com/, drop me a line at nick.janetakis@gmail.com if you're interested.

edwinyzh · on Feb 24, 2020

@danfox, Without revealing your tech/business secrets, I wonder if you can share some tips about building such a search app :)

lol768 · on Feb 23, 2020

How did you pick the 500k repositories to index out of the 28 million or so which are public?

danfox · on Feb 23, 2020

It was based on the number of stars/forks and the size of the repository.

atxbcp · on Feb 23, 2020

There must be something else or something wrong, because you indexed one of my small repo (~100 stars, ~20 forks, ~20Mb) and not the bigger ones (~500 stars, ~100/150 forks, ~150Mb)

giovannibonetti · on Feb 23, 2020

Maybe he is limiting it to repositories of 50 MB or less, for example.

tempay · on Feb 24, 2020

Looking around at repositories I'm familiar with this seems to be the case.

patrickdevivo · on Feb 24, 2020

or possibly there's a "time decay" element where more recently "popular" repos are prioritized, not just based on absolute star/fork count

carom · on Feb 23, 2020

I do not have a great example to try on my phone, but are results deduplicated? My big peeve with GitHub search is getting 5 pages of the same forked repo.

danfox · on Feb 24, 2020

There isn't any deduplication, although that will hopefully be less of an issue at this point since there's a limited number of repositories in the index.

aodj · on Feb 23, 2020

You have no idea how often I've wanted something like this for GitHub. Thanks so much!

j1elo · on Feb 23, 2020

GitHub confirmed to me that their search is not able to find in substrings; this is annoying because if you want to find all affected code among all possibly involved repositories, before a change, you need to clone them and grep locally. In the end this means you need to clone absolutely everything you work with, because otherwise you might miss changing that one repo you didn't think of:

https://stackoverflow.com/questions/43891605/search-partial-...

I've used Sourcegraph and it was cool; will have a look at this new tool too. But, GitHub pretty please add plain good old grep abilities to your search!

w-m · on Feb 25, 2020

Amazing feat!

Something I found when testing the regexp: the highlights seem to be off sometimes. When grepping for '<.*?@gmail.com>' (sorry, just the first thing that came to mind to try out the regexp), the second highlight in the first result seems to be in the wrong location:

https://grep.app/search?q=%3C.%2A%3F%40gmail.com%3E&regexp=t...

https://imgur.com/a/VyUXhcF

sn4pp · on Feb 24, 2020

Seems to be good for stuff like

api_key="[a-z0-9]+"

Ty

bananaeater · on Feb 24, 2020

"We didn't find any matching results."

rafi_kamal · on Feb 25, 2020

You need to enable regular expression.

ferenczy · on Feb 23, 2020

I would say this needs a list of indexed repos and mainly an explanation of how it exactly works to be usable (how's the index built and how often it's refreshed, what types of files are being indexed, etc.). Otherwise, there's no much value in searching in an unknown data, is it?

Anyway, to not only criticize, good job! It's definitely one of GitHub's missing features. And I can imagine it's not an easy job to build something like that. But as I wrote, it really has to be well explained to be actually usable.

clarry · on Feb 24, 2020

> there's no much value in searching in an unknown data, is it?

So you know exactly how Google's index works?

I think "best effort", whatever it is, is useful even if I don't know the specifics of what it captures or misses. As long as it returns useful results.

tekkk · on Feb 25, 2020

Superb work. You built a better code search than Github (well with some of its features missing sure) with a lot less resources. Shows how stagnated the progress in big companies is after a service is deemed "good enough". Good for you kicking them in their butts to lead the way. Hope you get out of this something else too than HN karma.

Really like the minimalistic design, not too designy but still easy on my eyes. Just the way I want it to let me focus on the task at hand

jakear · on Feb 23, 2020

Any plans to include backrefs? I'd like to see how many examples of /(\w+) && \1\./ are out there in .js/.ts compared to /(\w+)\?\./
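To make the comparison concrete, here is a small sketch of counting both patterns locally with Python's `re` module, which is a backtracking engine that does support backreferences. The JavaScript snippets are invented for the example.

```python
import re

guarded_access = re.compile(r"(\w+) && \1\.")  # e.g. user && user.name
optional_chain = re.compile(r"(\w+)\?\.")       # e.g. user?.name

# Hypothetical lines of JS/TS code.
snippets = [
    "const n = user && user.name;",
    "const n = user?.name;",
    "if (a && b.ready) {}",
    "return cfg && cfg.debug ? 1 : 0;",
]

guarded = [s for s in snippets if guarded_access.search(s)]
chained = [s for s in snippets if optional_chain.search(s)]
print(len(guarded), len(chained))
```

The `\1` is what requires the name after `&&` to be the same identifier as the one before it, so `a && b.ready` is not counted.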

tyingq · on Feb 23, 2020

The about blurb mentions it uses RE2. So backreferences aren't likely. See https://github.com/google/re2/issues/101

jakear · on Feb 23, 2020

Ripgrep is based on RE2 and supports backrefs. Wonder why they didn't use that.

burntsushi · on Feb 23, 2020

Not quite. ripgrep uses Rust's regex engine, not RE2. Rust's regex engine is descended from RE2, but there is no code sharing.

Rust's regex engine does not support backreferences. RE2 does not either. ripgrep does however have a -P/--pcre2 flag which causes it to use PCRE2 instead of Rust's regex engine. PCRE2 supports backreferences and other things, like look-around. (ripgrep also has an --auto-hybrid-regex flag, which will automatically enable PCRE2 for you if you write a regex with backreferences or look-around.)

The reason not to use an engine like PCRE2 for a project like this is because it would be trivially exposed to ReDoS: https://en.wikipedia.org/wiki/ReDoS
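A minimal sketch of the ReDoS problem, using Python's `re` module as a stand-in for a backtracking engine (an assumption for illustration; PCRE2 has its own optimizations and limits, so its exact behavior will differ). The pattern and inputs are a textbook case, not anything taken from grep.app.

```python
import re
import time

# A classic pathological pattern: nested quantifiers over the same character.
pattern = re.compile(r"(a+)+$")

for n in (10, 14, 18):
    text = "a" * n + "b"  # the trailing "b" forces the overall match to fail
    start = time.perf_counter()
    result = pattern.match(text)
    elapsed = time.perf_counter() - start
    # The engine tries many ways of splitting the run of a's before giving up,
    # so the time tends to grow very quickly as n increases.
    print(f"n={n}: match={result}, {elapsed:.4f}s")
```

On a public search box, anyone could submit such a pattern, which is why engines with linear-time guarantees such as RE2 or Rust's regex are the safer choice there.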

manthideaal · on Feb 25, 2020

Perhaps to protect against ReDoS the client should use an extended finite automata (1).

https://www.arl.wustl.edu/~pcrowley/a25-becchi.pdf

(1) Extending Finite Automata to Efficiently Match Perl-Compatible Regular Expressions.

burntsushi · on Feb 25, 2020

Nope. That still supports backreferences, and resolving backreferences is an NP-complete problem.[1] And I don't see anything in that paper that addresses that. Note that there may be some versions of the problem that maybe aren't NP-complete[2], but again, not addressed by that paper.

Besides, that paper was published 12 years ago. Where is the productionized version of it? Or are you suggesting that the OP go spend a few years writing a regex engine? :-) Doesn't seem like a particularly practical suggestion.

[1] - https://perl.plover.com/NPC/NPC-3SAT.html

[2] - https://branchfree.org/2019/04/04/question-is-matching-fixed...

manthideaal · on Feb 25, 2020

In the paper there are some bounds about the number of states in the automata as a function of the length of the input. So one could limit the length of the input when using back references to bound the complexity of the algorithm. They have used their algorithm for snort (network intrusion detection) using asic. The author could contact the authors of the paper and ask for (or pay for) an implementation.

By the way, good work ripgrep and rust.

jakear · on Feb 24, 2020

Thanks for the clarification. As an aside, how difficult do you think it would be to compile ripgrep to wasm? In VS Code we use ripgrep for full-workspace search and Node's regex library for in-memory searches, but this leads to discrepancies and issues such as catastrophic backtracking in the in-memory search.

burntsushi · on Feb 24, 2020

I've never tried to compile to WASM. It really depends on how much of the OS APIs need to be fixed. e.g., I don't think WASM supports memory maps as one example. In that case, ripgrep could be made to compile without support for memory maps with a bit of work. But that's an easy case. What other things does WASM not support? What about typical file/directory APIs? I don't think it does, or at least, it looks like Rust's standard library doesn't implement anything for them: https://github.com/rust-lang/rust/blob/master/src/libstd/sys...

At that point, it would be hopelessly difficult to build ripgrep. The right path then would be to build a new application that uses whatever of ripgrep's libraries make sense.

Popping up a level though, why would you want to compile to WASM? If you're using Node, then surely you can build an FFI bridge to Rust's regex library. At least at that point, you'd be using the same regex engine. I even maintain official C bindings for them: https://github.com/rust-lang/regex/tree/master/regex-capi

EDIT: Oh, and not sure if this is useful, but the regex crate itself should compile to WASM just fine. I know I've seen people run it in the browser before. If there's a problem here, then please file a bug!

jakear · on Feb 24, 2020

Thanks for all the advice. We'd just be running it on single buffers so I agree it makes more sense to start from the rust regex library than ripgrep. We do however need to continue supporting backrefs and lookaround, so we'll need to add `--auto-hybrid-regex` functionality to fall back to either Node's engine or a webassembly PCRE2.
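The fallback idea can be sketched as a pre-check on the pattern: use the automaton-based engine unless the pattern contains a construct it rejects. This is a rough heuristic scanner, not a real regex parser, and `needs_backtracking_engine`, `pick_engine`, and the engine labels are hypothetical names for illustration. It is not how ripgrep implements `--auto-hybrid-regex`.

```python
# Lookaround and named-backreference openers that automaton engines reject.
FANCY_OPENERS = ("(?=", "(?!", "(?<=", "(?<!", "(?P=")

def needs_backtracking_engine(pattern: str) -> bool:
    """Heuristic: does the pattern use a backreference or lookaround?"""
    i, in_class = 0, False
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            nxt = pattern[i + 1 : i + 2]
            # An escaped digit 1-9 outside a character class is a backreference.
            if not in_class and nxt and nxt in "123456789":
                return True
            i += 2  # skip the escaped character
            continue
        if in_class:
            in_class = ch != "]"
        elif ch == "[":
            in_class = True
            # A leading '^' and a ']' right after the opening belong to the class.
            if pattern.startswith("^", i + 1):
                i += 1
            if pattern.startswith("]", i + 1):
                i += 1
        elif pattern.startswith(FANCY_OPENERS, i):
            return True
        i += 1
    return False

def pick_engine(pattern: str) -> str:
    """Return a label for the engine a hybrid search would dispatch to."""
    return "backtracking" if needs_backtracking_engine(pattern) else "automaton"
```

For example, `pick_engine(r"(\w+) \1")` selects the backtracking engine, while `pick_engine(r"\d+")` stays on the automaton.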

As for wasm vs FFI, it would ideally work in browser (Monaco), which makes wasm the best bet I believe.

burntsushi · on Feb 24, 2020

Ah yeah, for backrefs you'll need to find a way to use PCRE2. Not sure what the WASM story is there. But at that point, if your only problem with Node regexes is catastrophic backtracking, then you might as well just stick with Node. PCRE2 will have the same problem.
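As an illustration of catastrophic backtracking, here is a minimal sketch using Python's `re` module as a stand-in for any backtracking engine. The pattern is the textbook nested-quantifier example, not one from VS Code, and `time_failed_match` is a hypothetical helper. Timings depend on the machine, so none are quoted here.

```python
import re
import time

# Nested quantifiers: when the trailing 'b' prevents a match, a backtracking
# engine retries the many ways of splitting the run of 'a's between the
# inner and outer '+' before giving up.
PATTERN = re.compile(r"(a+)+$")

def time_failed_match(n: int):
    """Match n 'a's followed by 'b'; return (match result, elapsed seconds)."""
    text = "a" * n + "b"
    start = time.perf_counter()
    result = PATTERN.match(text)
    elapsed = time.perf_counter() - start
    return result, elapsed

if __name__ == "__main__":
    # Keep n small: the work grows roughly exponentially with n.
    for n in (10, 14, 18):
        result, elapsed = time_failed_match(n)
        print(n, result, f"{elapsed:.4f}s")
```

An automaton-based engine such as Rust's regex crate avoids this blow-up by design, which is why it cannot support backreferences.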

dabei · on Feb 24, 2020

It’s interesting how it took so many years for such an obviously useful tool to emerge. I guess hosting this is finally getting cheap enough.

edwinyzh · on Feb 24, 2020

I've been wondering the same thing for many years. And I don't know why Google killed Code Search

blackandblue · on Feb 24, 2020

thank you so much for doing this! i hope it continues to open more doors of opportunities to you!

primo, this is a crazy snappy proof that shows that github search can be done. next, the UI is amazing. and finally, all my queries worked!

i am now going to remove "github search sucks" from my to-be-published rants because this post demonstrates that 1. people care 2. github was already working on it.

mrkramer · on Feb 24, 2020

Very similar to https://news.ycombinator.com/item?id=18565239

Backend for codegrep was Play framework + Elasticsearch and you could search by programming languages.

Screenshot: http://archive.is/0mFML

edwinyzh · on Feb 24, 2020

Awesome! To me it looks like the comeback of "Google Code Search", which I've been missing for many years!

enriquto · on Feb 23, 2020

Curious that I found many "secret forks" of my stuff, but none of my repos is directly indexed.

justanotheratom · on Feb 23, 2020

Can you elaborate how you found them?

enriquto · on Feb 23, 2020

I looked for strings that I am sure only appear in my code, and I found several copies of them, but not mine.

polyphonicist · on Feb 23, 2020

Can you provide detailed steps to reproduce? What strings did you search? Two examples of repos that appeared in the results? What is the link to your repo that did not appear in the results?

Details like this would help the OP to track down the exact cause of why it has indexed the forks but not the original repo.

enriquto · on Feb 23, 2020

The authors are quite explicit that this site only includes a fraction of all github repos. Thus, this is not a "bug" that needs to be corrected.

In my case, I am not talking about forks but about people who copied my files into their repositories (with proper attribution and respecting the license). I just searched for my surname and was happily surprised to see it in major projects like ffmpeg, pytorch, bytedeco, scikit and opencv.

welder · on Feb 24, 2020

Can I search only additions/deletions? Recently when searching GitHub I wanted to find if anyone had replaced the usage of a deprecated method with the new one, because the docs for that library don't mention the non-deprecated method name.

yuz · on Feb 23, 2020

Do you index the default branch of every repo? Or do you just index the master branch?

danfox · on Feb 24, 2020

It indexes the default branch of each repo.

yuz · on Feb 24, 2020

Cool. Keep up :) definitely gonna share with my co-workers.

Can't wait for filename filters which would make this the perfect solution

danfox · on Feb 24, 2020

Thanks :) If you type into the path filter box, that'll match against the full path for each file, so you can use that to filter on a filename.
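The thread does not say what syntax the path filter accepts. Assuming it is a regular expression searched against the whole path, a filename can be targeted by anchoring on the last path segment. A small sketch with made-up paths:

```python
import re

# Hypothetical paths; the filter syntax itself is an assumption.
paths = [
    "src/Makefile",
    "docs/Makefile.md",
    "Makefile",
    "tools/build/makefile_gen.py",
]

# "(^|/)" pins the match to the start of a path segment and "$" to the end
# of the path, so only files named exactly "Makefile" are kept.
filename_only = re.compile(r"(^|/)Makefile$")

matches = [p for p in paths if filename_only.search(p)]
```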

cddotdotslash · on Feb 23, 2020

The interface for this is really clean and nice - did you use a theme or framework?

danfox · on Feb 24, 2020

Thanks! It's using Elastic's Search UI (https://github.com/elastic/search-ui) and Ant Design (https://github.com/ant-design/ant-design).

inetknght · on Feb 23, 2020

I was going to say that I didn't want javascript on this.

But it's actually pretty #neat. It's all tidied up into a single app without any dependencies.

This rocks and, so far, seems way way WAY better than Github's own search tool.

bilekas · on Feb 24, 2020

This is cool, reminds me of the vulnerability search too.

https://shhgit.darkport.co.uk/

Existenceblinks · on Feb 23, 2020

^(.)'(.)'(.)$
I got a tooltip saying:
Error: JSON.parse: unexpected character at line 1 column 1 of the JSON data

Update: Oh, ^(.)"(.)"(.)$ works, and fast.

danfox · on Feb 24, 2020

I think that error was just because the server was overloaded - sorry about that.

stagas · on Feb 24, 2020

I wish there was something this fast, but for searching error outputs instead (along with discussions/solutions).

AdrianEGraphene · on Feb 24, 2020

Feels like magic to me! Lets me easily see who's working on similar topics. Thanks!

edwinyzh · on Feb 24, 2020

Can you share your search string? Thanks.

OutsmartDan · on Feb 24, 2020

This is one of the fastest, most responsive searches i've ever used. Great work!

thrownaway954 · on Feb 24, 2020

might be a good idea to have some sort of clickable "demo" search or "try these" example on the frontend page to show off the capabilities of this.

KhoomeiK · on Feb 24, 2020

How is it that fast?

chasers · on Feb 24, 2020

How do you handle expensive regex statements?

doubleorseven · on Feb 24, 2020

My last name (Ament) is really rare where I come from, so I've used the tool to find other people with the same last name. Was not disappoint. Thank you!

mtnGoat · on Feb 24, 2020

this is awesome stuff, thank you! great work!

habit20 · on Feb 25, 2020

Hello world

sonicxxg · on Feb 23, 2020

[flagged]

dang · on Feb 23, 2020

There's no need for personal attack. We ban accounts that do that, so please don't.

Cherry-picking one post from a statistical cloud and calling it typical is dodgy. Even the distribution in this thread doesn't match your description. Actually, even the comment you're picking on doesn't match your description.

We detached this subthread from https://news.ycombinator.com/item?id=22397156.

dbielik · on Feb 23, 2020

[flagged]

_bz2r · on Feb 24, 2020

This seems unrelated.

I hope u/dang sees your comment history; you are basically just spamming nerdydata.com

whatever1 · on Feb 23, 2020

Why regex still exists? It is unintuitive, requires mastering an obscure syntax, it is very hard to debug, and very difficult to explain to others how it works. It feels like we are trying to write intermediate code by ourselves, while we should have a human readable language that generates regex.

roryokane · on Feb 24, 2020

You might be interested in “Eggex”, which aims to be a human-readable language that generates regexes. It’s currently written as a feature of the Oil shell, but in theory any tool could support them. Eggex docs: https://www.oilshell.org/release/latest/doc/eggex.html. Recent blog post about their development: https://www.oilshell.org/blog/2019/12/22.html.

However, Eggexes are a thin, mostly-syntactic layer over regexes. You still have to understand the regex engine to use them. If this sounds useless to you because you don’t currently understand any flavor of regex or parsing, I encourage you not to give up on learning regexes. (https://www.regular-expressions.info/ was how I learned; it’s a great tutorial.) Text-parsing engines, including regex engines, are a powerful concept that can be used in many situations, and I think it’s worth spending the effort learning them until, to paraphrase another commenter, regexes become the human-readable language you were searching for. Or Eggexes, at least.

GrantZvolsky · on Feb 23, 2020

The investment into learning regexes is worth it if you write or read enough of them. They become the human readable language you speak of, eventually. The question is where the threshold lies.

frabert · on Feb 23, 2020

Do it! You will find that it's very easy, but the result will either be extremely verbose or just like regex. Since most regexes (at least for me) are meant as one-time-use, the extra verboseness has no added benefit. If you have complex needs, you should probably be using something other than regex, anyways.

thanatos_dem · on Feb 23, 2020

Extremely verbose is right. Here's one such approach in java that I found last year - https://github.com/sgreben/regex-builder.

Yeah, regex can be a bit clunky at times and has a steeper learning curve, but they're pretty industry standard at this point, and portable across languages with a few caveats.
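A middle ground between a builder library and a raw one-liner is to compose a pattern from named pieces and use a verbose mode with comments. A minimal sketch using Python's `re.VERBOSE` flag (the date pattern and the constant names are illustrative only):

```python
import re

# Named pieces, each readable on its own.
YEAR = r"(?P<year>\d{4})"
MONTH = r"(?P<month>0[1-9]|1[0-2])"
DAY = r"(?P<day>0[1-9]|[12]\d|3[01])"

# In VERBOSE mode, whitespace is ignored and '#' starts a comment.
ISO_DATE = re.compile(
    rf"""
    ^ {YEAR}  -   # four-digit year
      {MONTH} -   # month 01-12
      {DAY}   $   # day 01-31
    """,
    re.VERBOSE,
)

if __name__ == "__main__":
    m = ISO_DATE.match("2020-02-24")
    print(m.groupdict() if m else None)
```

The result is still a regex underneath, so the point made above stands: the readability comes from naming and layout, not from a different matching model.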

tyingq · on Feb 23, 2020

"Why regex still exists?"

Is there an alternative that is clearly superior?

samatman · on Feb 23, 2020

Your mileage may vary, but to my taste, the lpeg flavor of Parsing Expression Grammars is clearly superior.

It uses operator overloading to build patterns from component parts. I don't think anything can replace the terseness of regex for command line use, or vim searching, cases like that.

But for a program, give me lpeg every time.
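To give a flavor of the operator-overloading style, here is a toy sketch in Python. It borrows lpeg's conventions (`*` for sequence, `+` for ordered choice, `^n` for "at least n", written `**` in Python), but the `Pattern` class and the `lit` and `char_set` helpers are hypothetical and implement only a tiny fraction of what lpeg does.

```python
class Pattern:
    """A matcher wrapping fn(text, pos) -> end position, or None on failure."""

    def __init__(self, fn):
        self.fn = fn

    def __mul__(self, other):  # sequence: self, then other
        def run(text, pos):
            mid = self.fn(text, pos)
            return None if mid is None else other.fn(text, mid)
        return Pattern(run)

    def __add__(self, other):  # ordered choice: try self first, else other
        def run(text, pos):
            end = self.fn(text, pos)
            return end if end is not None else other.fn(text, pos)
        return Pattern(run)

    def __pow__(self, n):  # at least n repetitions, greedy, no backtracking
        def run(text, pos):
            count = 0
            while True:
                nxt = self.fn(text, pos)
                if nxt is None:
                    break
                count += 1
                if nxt == pos:  # empty match: stop to avoid looping forever
                    break
                pos = nxt
            return pos if count >= n else None
        return Pattern(run)

    def match(self, text):
        """Match at the start of text; return the end index or None."""
        return self.fn(text, 0)

def lit(s):
    return Pattern(lambda text, pos: pos + len(s) if text.startswith(s, pos) else None)

def char_set(chars):
    return Pattern(
        lambda text, pos: pos + 1 if pos < len(text) and text[pos] in chars else None
    )

# Build a number grammar from named parts.
digit = char_set("0123456789")
integer = (lit("-") + lit("")) * digit**1
number = integer * ((lit(".") * digit**1) + lit(""))
```

Here `number.match("-12.5")` consumes the whole string, while `number.match("abc")` fails. Each part is an ordinary value that can be named, reused, and tested on its own.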

GordonS · on Feb 24, 2020

Because it's really powerful, and some people actually like it (I'm one of them).

I can understand that a complex pattern might look scary if you're unfamiliar, but if you work with it long enough, you can put patterns together with relative ease.