Show HN: Search code in GitHub repos using regular expressions (grep.app)
614 points by danfox on Feb 23, 2020 | 155 comments



This is awesome!

@danfox, sent you an email though commenting here too.

I'm the CTO @ GitHub. Would love to talk to you about this and other things we are building in this area at GitHub.

Feel free to email direct to jason at github.com


I do enjoy moments like these on Hacker News when someone presents a project for X and the CTO of X shows up and wants to talk. It shows how direct an impact one can potentially have in this community.

I hope this means we’re getting grep searches for github soon. Cheers.


If OP's goal was to get a job at Github, I'd say this was very well played.


It would certainly be an expensive play. When he mentioned a 20 core system, I'm assuming it is some VM system, since I don't know of any 20 core CPUs. I'm guessing he is using DigitalOcean and he has two of them, so he is looking at $1000 a month in hosting cost.


It's an expensive side project for sure, but it doesn't have to be anywhere near as expensive as that.

My own side project uses a server with 20 cores (2x E5-2690v2 CPUs), with 256GB RAM and a 2TB SSD. This is a dedicated server I rented from tier.net in Texas, after seeing it listed on webhostingtalk [1]. It costs about $160/mo, and that's recently fallen further by paying for 3 months up front.

1. https://www.webhostingtalk.com/forumdisplay.php?f=36


github's code search is notoriously bad, feels like a huge missed opportunity. Nice to see you guys reaching out to other people working in this area.



I'm not a dev, and what I like to do is search for code (an exact match) to replace whatever variable or text should be changed. GitHub's in-repo search kind of worked at one point, then it didn't.

Then I had to download the repo locally, run VS Code (updating it first), search there, modify, and push. I wish I could do this in GitHub's web GUI.


The fact that you can’t search for file names is the funniest part to me.


If you fork a repo, you can't search on the forked repo. Confirmed grep.app behaves the same way as github search.


You can, but only in the repo itself, not on a site-wide scale.


Really? I searched for `filename:home.nix` (which brought me to https://github.com/search?utf8=%E2%9C%93&q=filename%3Ahome.n...). That seems site-wide to me, unless I'm misunderstanding you.


These kinds of keywords really should be next to the search box with a question mark next to them or something.

TIL some of them are on this page that you only see if you search for an empty string:

https://github.com/search?q=

Click on 'prefixes'. This kind of thing should be readily available from any search box that searches through GitHub.


That's awesome. I didn't know about the filename: keyword. I've been using the button to the left of "Clone or Download" this whole time. Thanks for the info!


Agree. You can’t even do a literal search with symbols.

I think there were better solutions in the early 2000s.


hah. you beat me to it, Jason.

@danfox, i'm always down to talk code search as well - rand@github.com


iirc, Github uses (used?) my old project (https://github.com/intel/hyperscan) at Intel. It's probably faster than the alternatives, although if you want to support all types of regex you'll need to use Hyperscan as a prefilter for a richer regex engine like PCRE.

This project looks like it pulls literal factors out of the regex that I type in, maybe to an index a la that Russ Cox blog post a while back about Code Search. It seems to Not Like things that have very open-ended character classes (e.g. \w) unless there is a decent length literal involved somewhere.

It seems to have a very rudimentary literal extraction routine, as it decides to give a partial result set when fed an alternation between two literals that it handles pretty well on their own.


pattern is, btw: (teakettle\w|abcd)

Either pattern in alternation works fine, but even a simple alternation of the two goes back to the behavior that you might expect to get from awful patterns like \d..\d..\w...\s...\d (i.e. reporting only a partial set of matches).
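To make the alternation point concrete, here is a rough sketch of how a trigram prefilter could treat an alternation of two literals. This is only an assumption about how such an index might work, in the spirit of the Russ Cox write-up mentioned above, not grep.app's actual implementation; `trigrams`, `alternation_query` and `is_candidate` are hypothetical helper names, and the `\w` from the original pattern is dropped so that each branch is a plain literal.

```python
def trigrams(text):
    """Return the set of 3-character substrings of a string."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def alternation_query(literals):
    """Build an OR-of-ANDs query: one required trigram set per branch."""
    return [trigrams(literal) for literal in literals]


def is_candidate(doc, query):
    """A document is a candidate if ANY branch has ALL of its trigrams present."""
    doc_grams = trigrams(doc)
    return any(branch <= doc_grams for branch in query)


# Simplified version of the pattern above: (teakettle|abcd)
query = alternation_query(["teakettle", "abcd"])

print(is_candidate("xx abcd yy", query))
print(is_candidate("two teakettles", query))
print(is_candidate("abce tea", query))
```

The point of the sketch is that the alternation is no harder than its branches: each branch keeps its own trigram set and the sets are OR-ed together. An extraction routine that gives up on alternations (or tries to AND the branches) would lose that selectivity.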


Did you mean either pattern in "isolation"?


Yup, you're right - I wound up saying the opposite of what I meant!


How about the ability to search code on forks? GitLab allows it. At least have feature parity? Thanks.


Impressive! Really fast, full featured code search across a huge corpus.

1. How did you build the index? Did you use a GitHub dump of some sort? How often do you refresh it?

2. Is it Elasticsearch or similar or a completely custom engine?

3. What kind of RAM/CPU are you using to power it?

4. Any plans to open source the code or commercialize the technology?

I could absolutely imagine paying for a private code search engine like this to run against a large internal company codebase spread across many repositories.


Thanks! It's built on top of Solr. It fetches the repos from GitHub - it should pick up any updates to repos within a few days. It's running on a couple servers with 20 cores each, which is not really enough for the traffic it's getting right now.


Have you seen livegrep?

Blazing fast multi-repo regex code search. May be more expensive to run in prod, not sure.


This is so good I imagine you're gonna need more.


Cool! I love it.


I'd be curious how you built the step from regex to ElasticSearch. My guess would be an n-gram (3-gram) index in ElasticSearch and then translating the regexes to that, but just curious if you built that custom or used something off-the-shelf. Love the site!
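For anyone wondering what the n-gram guess would look like in practice, here is a minimal sketch: index every 3-gram, intersect the posting lists for a literal that must appear, then run the real regex only on the surviving candidates. It is an illustration under assumptions, not how grep.app works; the helper names and the tiny `docs` corpus are made up, and the required literal is passed in by hand rather than extracted from the regex.

```python
import re
from collections import defaultdict


def trigrams(text):
    """Return the set of 3-character substrings of a string."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(docs):
    """Map each trigram to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index


def search(docs, index, required_literal, pattern):
    """Narrow candidates with trigram postings, then confirm with the regex."""
    grams = trigrams(required_literal)
    if not grams:
        # Literal shorter than 3 characters: the index cannot help, scan all.
        candidates = set(docs)
    else:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    regex = re.compile(pattern)
    return sorted(d for d in candidates if regex.search(docs[d]))


# Hypothetical three-file corpus.
docs = {
    "a.c": "gst_element_link(src, sink);",
    "b.c": "gst_element_link_many(src, filter, sink, NULL);",
    "c.c": 'printf("hello");',
}
index = build_index(docs)
print(search(docs, index, "gst_element_link", r"gst_element_link\w*\("))
```

The expensive regex only ever runs on documents that already contain every trigram of the literal, which is why patterns without a decent-length literal are the hard case for this kind of design.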


> I'd be curious how you built the step from regex to ElasticSearch. My guess would be an n-gram (3-gram) index in ElasticSearch and then translating the regexes to that, but just curious if you built that custom or used something off-the-shelf. Love the site!

I'm pretty sure Elasticsearch supports regex search, it's just that it's horrendously slow and can blow up the system.


I still miss Google Code Search, which was a great way to find examples of anything I wanted to learn about in programming and usually answered my questions better than anything else, including Stack Overflow. Has it really been 8 years... https://news.ycombinator.com/item?id=3112029

If this tool can fill that hole in my world, I'll be stoked. I've bookmarked it.


Google code search still exists as long as you want to search Chromium source code.

[1]: https://cs.chromium.org/


The main difference it has IMO is it indexes a symbolic code graph extracted from halfway through the compilation process. That means when you search, it knows which functions are frequently called. For example, the LOG() macro is defined in hundreds of places, but the one in logging.h is the one everyone calls, so that's the one that comes top of the results.

It also keeps track of back references, so you can search "who calls any function in this file", which is very hard to do with any other search system.

A major disadvantage is that it only indexes one build config, so if you're debugging Android code in a multi-platform project and the indexing was done on the Windows version, you won't find much (apart from the dumb text-based search, which it does in addition).

The difficulty of compiling every project to build a decent index would make this approach hard on a GitHub scale - all it takes is one missing header file from a dependency not in the repo and the build fails and the whole project can't be indexed. Also, have fun with things like JavaScript which are so dynamic you have to solve the halting problem to know which bit of code calls which other.


Next post from danfox - “how to get 3 job offers in 3 hours”.

Already has been publicly contacted by:

- GitHub CTO

- SerpApi CEO

- SourceGraph CEO

Search is hot right now!


Actually, it would be more like: "How I failed at 3 interviews, despite being directly contacted by execs."


Couldn't whiteboard a solution without the temp variable.


I’m in a field and physical area with a pool so shallow - that it seems like straight up madness to throw questions like that at people and kick them out the door for it.


New game show idea:

CTOs from software companies interview at other software companies.


this would be awesome, i would watch this!

i bet they would all go back home and immediately fix their own hiring practices.


Sure, you built an app on multiple 20-core machines that can search hundreds of millions of lines of code almost instantaneously, but are you someone I'd drink a beer with?


This snide remark dismisses the fact that working on software does mean working with other humans, not just unemotional robots devoid of any kind of irrational ideas. Being able to “drink a beer with” (and reasonably substituting the drinking of beer for just about any other social interaction) is an important part of being able to work with someone. Unless of course you believe an office environment consisting of a tyrannical manager barking orders at worker drones is a healthy relationship.


100%. I don't really care if you're a super genius if you're also a massive dick that everybody hates.


Are you having intimate romantic relationships with all of your co-workers?

If they get me out of work at 4:30 pm and keep the project I'm working on in quality code so I have less fires to deal with, that's good enough for me.


I think when people talk about this, they mean to push back against the fact that people will often be biased toward hiring someone they think they could be casual friends with, share interests with, etc.

I like my coworkers and I find them perfectly fine to work and make small talk with, but I don't share interests with many of them and wouldn't really care to hang with them outside of work. That shouldn't be a criterion for hiring.

I have found it highly annoying to work in engineering orgs where everyone seems to have the same interests. Everyone talking about Star Wars, Dungeons and Dragons, Lord of the Rings, etc. constantly because it's assumed everyone else around also enjoys that conversation.


It's an ego thing to want to work with someone just like you instead of adapting yourself to others. It's basically bro culture. It's kind of what's wrong with technology culture.

Give me someone who is talented who makes great code so I can be home at 4:30pm and I don't care what their personality is like. Additionally someone who tells me when something is an issue even at my ego's expense is extremely valuable, over back patters and schmoozers who just want to keep everyone happy. That leads to a terrible product. I would not like to see what whatever product you're working on is like.

You all should take a long look at yourselves and ask why you have to work with people who are just like you instead of being adaptive to other walks of life, personality, and backgrounds. Try getting out of yourselves for a minute. You might even learn something new outside of your own tiny tiny worlds!


That’s a pretty unfortunate interpretation of my comment, and not entirely logically consistent either.

I mean, if one person who rejects bro culture only wants to collaborate with other people who also reject bro culture, does that mean they are now proponents of bro culture?

I also find it frankly a bit weird for you to make grand sweeping assumptions about who some strangers on an Internet forum choose to associate and collaborate with. How do you know people here don’t work with people from other backgrounds?


I found your interpretation of my original 'snide' comment pretty unfortunate.

And not a single thing you just said makes any logical sense.

I do know I would never want to work on any project that you're in charge of because I guarantee they're nightmare environments.

Best of luck to you nonetheless.


For some of those companies it would be "drink a La Croix with"


That would be quite the dystopian interview nightmare.


If only the answer to "how" was as simple as "writing a web service for searching GitHub repos with regexes," even though the problem is probably in itself non-trivial if there's this much interest in search at all. At least the specification is clear enough.

I guess what I mean to ask is, how would people know this is a "correct" answer to the "how" question beforehand? Is the answer literally just "search" because that's simply what's trending right now?


It also probably goes without saying he should be careful with what details to share.


I'm surprised as well; I wonder why big tech companies didn't already have search this good.


If this were to be offered by an actual company (a first party solution), there are some features that'd be expected that make the problem space a lot harder. Here's an "intro to search" article that's a good read, and I'll use it to highlight some of the things that'd be different in a first party solution - https://medium.com/startup-grind/what-every-software-enginee...

(See the "Theory: the search problem" section)

Size: This is only indexing ~500k public repos. A first party solution would be expected to index all of it, public and private.

Indexing speed: This can take up to a few days to index. A first party solution would be expected to have a much lower index latency - seconds to minutes.

Query language: This can (and does) have its own simple query language. A first party solution would need to be embedded into the current query language without breaking backwards compatibility with it.

Context-dependence: A first party solution would be expected to index private repos as well, and now the query context (logged in user) becomes another variable in an already multi-variate problem space.

Latency: Gets harder with scale, and a first party solution would likely provide a SLA/SLO around latency.

Access control: Same issue as context-dependence, with private repos being included.

There are also unknown but likely considerations around compliance and internationalization, which are quite tricky problems.

Note - I don't mean for this to be critical of the author at all. This is an awesome and useful tool, with a fantastic UX. I just want to make it clear that search at scale is a lot harder than it seems at first glance, especially as the feature requirements increase.


Engineering manager for code search at GitHub here... this is an excellent summary of many of the concerns we have as we work on code search at GitHub scale!


For GitHub, I would have to imagine only being able to search public repos with regexp would be good enough. GitHub has many strategies, but the main one is that they want to maintain, if not expand, their open source mind share.

The more reasons you give people to go to GitHub, the better off they will be in the future. So I do agree with you that as a commercial solution, this may not be viable, but for GitHub's public repos, this can turn into a very positive thing.


That might well be true, but scaling this type of service to all public repos with decent latency and update rate is a major technical challenge and likely very costly to maintain.


This is my personal observation, but GitHub appears to be a much more ambitious company, now that they are part of Microsoft. With a CEO that understands both the open source and the enterprise world and with Microsoft cash at hand, I don't think spending money to make search better would cause any concerns.

Doing technical things that GitLab, Bitbucket, etc. can't is quite valuable. It also helps with recruiting, since smart people want to work on difficult problems.

It may well be costly to maintain, but I think the operating cost would be well within the realm of an incumbent that wants to maintain and expand their reach. I've been studying the code hosting space for quite some time and GitHub, from an outsider's perspective, appears to be much more focused and ambitious, which should cause serious concerns for GitLab.


Also by the co-creator of Django: https://news.ycombinator.com/item?id=22397023


This is really cool. What are you using it for? Usage examples, debugging, etc.?

I'm the CEO at Sourcegraph (universal code search for companies to use on their internal code). Our product is really optimized for searching a company's internal code right now, but soon we'll start working on offering much better search for public and open-source code as well. If you'd like to help out or just chat, please reach out! sqs@sourcegraph.com


Sorry, but his code search covers far more languages than yours the last time I tried yours :)


Doesn't Sourcegraph allow you to just search by regex over any files in a repo? This is textual search, so how are languages relevant to it? I didn't seem to have problems with that.


Sorry, maybe I have confused SourceGraph with https://searchcode.com, but last time I tried, it supported only the most widely used languages such as Java, Python and so on, but not the language I use (Delphi/Object Pascal). I'm sorry if I'm wrong.


Sourcegraph CEO here. You can definitely search all languages (and all files, and cross-repo, and all commits, etc.) with Sourcegraph.


Great! Do you have a live demo, like the one in this Show HN?


Here ya go: https://sourcegraph.com/search?q=open+repo:edwinyzh+lang:pas... (search) and https://sourcegraph.com/github.com/edwinyzh/EditBone@d9ec56a... (find references in Pascal)

Sourcegraph.com is universal code search and navigation across all public repositories. To use it on private code inside your company, run a self-hosted instance at https://docs.sourcegraph.com/#quickstart.

We've been so focused on internal code search for companies. See https://about.sourcegraph.com for some of the logos of well-known companies whose devs all use Sourcegraph. Because of that, our "public demo" site at Sourcegraph.com has a few limitations that we're working on lifting, such as only searching across a subset of popular repositories by default (unless you specify a specific subset with `repo:` in the query).


This is amazing! One thing this allows me to do, which I couldn't before, is search for the repos that use some of my open source.

While there were some tools for this, they fall short for older projects, where using a library meant copy/pasting it into your project, which is not reported in the CDN stats, npm installs or GitHub "uses".

Now I can run a search with a bit of code that is only present in my library and reliably find those who copy/pasted it. While I publish my code under the MIT license, this would also be very useful for those publishing under the GPL to detect bad actors.


Wow. This is incredibly helpful. You can use it to see how someone may have used a function with named parameters:

  my_function(label=x, option_1=2)
  my_function.*option_1 # search
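As a quick local illustration of what that search pattern does, here is a small sketch using Python's `re` module. The sample lines are invented, and `my_function` / `option_1` are just the placeholder names from the comment above.

```python
import re

# Same pattern as the search above: the function name, then anything, then the keyword.
pattern = re.compile(r"my_function.*option_1")

lines = [
    "my_function(label=x, option_1=2)",
    "my_function(label=y)",
    "result = my_function(option_1=5, label=z)",
    "other_function(option_1=1)",
]

matches = [line for line in lines if pattern.search(line)]
for line in matches:
    print(line)
```

One caveat: `.` does not match a newline by default, so a call whose arguments are spread over several lines would be missed by a line-oriented search like this.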


That was my first thought. I’ll have to wait until tomorrow to try it, but I have one super rarely used function in a rare package I’d love to see how other people are using.


to grep specific repos locally, I use a tool called Hound, https://github.com/hound-search/hound developed by a couple of engineers at Etsy while I was there, but never released officially.


Amazing, why Microsoft hasn't built this for GitHub yet is beyond me.

Can it grep on individual repos?


I built https://grephub.com for that. It doesn’t maintain an index so it’s not super snappy, but it’s good enough / better than you’d expect in many cases!


Why would you want to use this tool to grep individual repos? If you know the repo you're interested in, you can just clone it and then grep it locally...?


So you don’t have to clone it and grep it locally


I like to grep a code pattern throughout all repos. I use gstreamer, and sometimes I just don't know how to use it to do a specific thing. So I search substrings to find usage patterns by other people.


Some things can take a while to clone. On the top end, repos like blink, webkit, and gecko can take half an hour or more.


Even with --depth=1?


A tangent, my biggest gripe with GitHub code search (within a repo) off the top of my head is the inability to blacklist directories or only search whitelisted directories. Often times I want to look up the implementation of a function, and bam, three pages of results from tests.


I'm glad I'm not the only one. It's very common that I'll be searching for a keyword that only appears in the actual code a handful of times but hundreds of times in tests. GitHub's search is practically useless in those cases.

I almost always just resort to cloning and searching with ripgrep, which can be annoying if I have no other reason to have the codebase on my machine or it's just a one-off.


yeap .. having this issue as well, trying to easily find where a method is defined in JS/TS I'd so much want to be able to exclude `*.(spec|test).(js|ts|jsx|tsx)`
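Until something like that exists in the search UI, the exclusion is easy to sketch locally. The snippet below is an assumption about how one might express that glob-like pattern as a regex over file paths; `TEST_FILE`, `non_test_paths` and the sample paths are hypothetical and not part of GitHub or grep.app.

```python
import re

# Matches paths ending in .spec.<ext> or .test.<ext> for the JS/TS extensions above.
TEST_FILE = re.compile(r"\.(spec|test)\.(js|ts|jsx|tsx)$")


def non_test_paths(paths):
    """Keep only the paths that do not look like spec/test files."""
    return [p for p in paths if not TEST_FILE.search(p)]


paths = [
    "src/user.ts",
    "src/user.spec.ts",
    "src/App.test.jsx",
    "src/App.jsx",
    "lib/latest.js",
]
print(non_test_paths(paths))
```

Anchoring on the literal dots matters here: `lib/latest.js` contains "test.js" but is kept, because the pattern requires `.test.` rather than any occurrence of "test".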


This is really cool! Awesome work. I assume you've seen https://sourcegraph.com/ as well? This to me seems much clearer and a bit more intuitive (though I've only spent a little time in sourcegraph). Really really cool. Does it also search code comments?


last time I tried sourcegraph doesn't cover the language I use, so it's useless to me.


For regex?? how's language relevant?


Sorry, maybe I have confused SourceGraph with https://searchcode.com, but last time I tried, it supported only the most widely used languages such as Java, Python and so on, but not the language I use (Delphi/Object Pascal)


Excellent work!

I am the CEO at SerpApi. If you need a job, shoot me an email at julien _at_ serpapi.com.


I wonder how this compares to Debian Code Search (https://codesearch.debian.net/about) and Russ Cox’s code search tools (https://swtch.com/~rsc/regexp/regexp4.html).

Obviously the source material is different (Debian packages vs GitHub repos) and grep.app also uses re2, but that is all I can see from a look at the “about” blurb.



Hey Dan, if you ever wanted to come on my podcast to talk about your tech stack (how your site is developed / deployed, lessons learned, etc.), I'd love to have you on.

That podcast is at: https://runninginproduction.com/, drop me a line at nick.janetakis@gmail.com if you're interested.


@danfox, without revealing your tech/business secrets, I wonder if you can share some tips about building such a search app :)


How did you pick the 500k repositories to index out of the 28 million or so which are public?


It was based on the number of stars/forks and the size of the repository.


There must be something else or something wrong, because you indexed one of my small repos (~100 stars, ~20 forks, ~20Mb) and not the bigger ones (~500 stars, ~100/150 forks, ~150Mb)


Maybe he is limiting it to repositories of 50 MB or less, for example.


Looking around at repositories I'm familiar with this seems to be the case.


or possibly there's a "time decay" element where more recently "popular" repos are prioritized, not just based on absolute star/fork count


I do not have a great example to try on my phone, but are results deduplicated? My big peeve with GitHub search is getting 5 pages of the same forked repo.


There isn't any deduplication, although that will hopefully be less of an issue at this point since there's a limited number of repositories in the index.


You have no idea how often I've wanted something like this for GitHub. Thanks so much!


GitHub confirmed to me that their search is not able to match substrings; this is annoying because if you want to find all affected code among all possibly involved repositories, before a change, you need to clone them and grep locally. In the end this means you need to clone absolutely everything you work with, because otherwise you might miss changing that one repo you didn't think of:

https://stackoverflow.com/questions/43891605/search-partial-...

I've used Sourcegraph and it was cool; will have a look at this new tool too. But, GitHub, pretty please add plain good old grep abilities to your search!


Amazing feat!

Something I found when testing the regexp: the highlights seem to be off sometimes. When grepping for '<.*?@gmail.com>' (sorry, just the first thing that came to mind to try out the regexp), the second highlight in the first result seems to be in the wrong location:

https://grep.app/search?q=%3C.%2A%3F%40gmail.com%3E&regexp=t...

https://imgur.com/a/VyUXhcF


Seems to be good for stuff like

api_key="[a-z0-9]+"

Ty


"We didn't find any matching results."


You need to enable regular expression.


I would say this needs a list of indexed repos and, mainly, an explanation of exactly how it works to be usable (how the index is built and how often it's refreshed, what types of files are being indexed, etc.). Otherwise, there's no much value in searching in an unknown data, is it?

Anyway, to not only criticize, good job! It's definitely one of GitHub's missing features. And I can imagine it's not an easy job to build something like that. But as I wrote, it really has to be well explained to be actually usable.


> there's no much value in searching in an unknown data, is it?

So you know exactly how Google's index works?

I think "best effort", whatever it is, is useful even if I don't know the specifics of what it captures or misses. As long as it returns useful results.


Superb work. You built a better code search than Github (well, with some of its features missing, sure) with far fewer resources. Shows how stagnant progress at big companies becomes once a service is deemed "good enough". Good for you kicking them in their butts to lead the way. Hope you get something more out of this than HN karma.

Really like the minimalistic design, not too designy but still easy on my eyes. Just the way I want it to let me focus on the task at hand


Any plans to include backrefs? I'd like to see how many examples of /(\w+) && \1\./ are out there in .js/.ts compared to /(\w+)\?\./
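For reference, this is what those two patterns do in an engine that does support backreferences. The sketch uses Python's `re` module (a backtracking engine, unlike RE2), and the JavaScript-looking sample lines are made up for illustration.

```python
import re

# `x && x.` : a guard followed by a property access on the same name (needs \1).
guarded_access = re.compile(r"(\w+) && \1\.")
# `x?.` : optional chaining, no backreference required.
optional_chain = re.compile(r"(\w+)\?\.")

samples = [
    "if (user && user.name) {}",
    "const n = user?.name;",
    "if (a && b.c) {}",
]

for line in samples:
    print(line, bool(guarded_access.search(line)), bool(optional_chain.search(line)))
```

Only the first pattern needs a backreference, which is why the optional chaining count is the easy half of the comparison. Adding `\b` before the group would avoid matching just the tail of a longer identifier.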


The about blurb mentions it uses RE2. So backreferences aren't likely. See https://github.com/google/re2/issues/101


Ripgrep is based on RE2 and supports backrefs. Wonder why they didn't use that.


Not quite. ripgrep uses Rust's regex engine, not RE2. Rust's regex engine is descended from RE2, but there is no code sharing.

Rust's regex engine does not support backreferences. RE2 does not either. ripgrep does however have a -P/--pcre2 flag which causes it to use PCRE2 instead of Rust's regex engine. PCRE2 supports backreferences and other things, like look-around. (ripgrep also has an --auto-hybrid-regex flag, which will automatically enable PCRE2 for you if you write a regex with backreferences or look-around.)

The reason not to use an engine like PCRE2 for a project like this is because it would be trivially exposed to ReDoS: https://en.wikipedia.org/wiki/ReDoS
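To give a feel for why backtracking engines are exposed this way, here is a simplified model rather than a real regex engine. It counts the ways a pattern like `(a+)+` can split a run of n 'a' characters into groups; a naive backtracker fed that run followed by a non-matching character may end up trying every such split before reporting failure. The function name is hypothetical and the model ignores engine-specific optimizations.

```python
def backtracking_paths(n):
    """Count the ways `(a+)+` can split a run of n 'a' characters into groups."""
    if n == 0:
        return 1  # nothing left to split: one complete path
    # The first group takes 1..n characters; the remainder is split recursively.
    return sum(backtracking_paths(n - take) for take in range(1, n + 1))


for n in (5, 10, 15):
    print(n, backtracking_paths(n))
```

The count doubles with every extra character (it equals 2 to the power n-1), so a few dozen characters in a user-supplied input are enough to stall a worker. Automata-based engines like RE2 avoid this by never exploring the splits one by one.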


Perhaps to protect against ReDoS the client should use an extended finite automaton (1).

https://www.arl.wustl.edu/~pcrowley/a25-becchi.pdf

(1) Extending Finite Automata to Efficiently Match Perl-Compatible Regular Expressions.


Nope. That still supports backreferences, and resolving backreferences is an NP-complete problem.[1] And I don't see anything in that paper that addresses that. Note that there may be some versions of the problem that maybe aren't NP-complete[2], but again, not addressed by that paper.

Besides, that paper was published 12 years ago. Where is the productionized version of it? Or are you suggesting that the OP go spend a few years writing a regex engine? :-) Doesn't seem like a particularly practical suggestion.

[1] - https://perl.plover.com/NPC/NPC-3SAT.html

[2] - https://branchfree.org/2019/04/04/question-is-matching-fixed...


In the paper there are some bounds on the number of states in the automaton as a function of the length of the input. So one could limit the length of the input when using backreferences to bound the complexity of the algorithm. They have used their algorithm for Snort (network intrusion detection) using an ASIC. The author could contact the authors of the paper and ask for (or pay for) an implementation.

By the way, good work ripgrep and rust.


Thanks for the clarification. As an aside, how difficult do you think it would be to compile ripgrep to wasm? In VS Code we use ripgrep for full-workspace search and Node's regex library for in-memory searches, but this leads to discrepancies and issues such as catastrophic backtracking in the in-memory search.


I've never tried to compile to WASM. It really depends on how much of the OS APIs need to be fixed. e.g., I don't think WASM supports memory maps as one example. In that case, ripgrep could be made to compile without support for memory maps with a bit of work. But that's an easy case. What other things does WASM not support? What about typical file/directory APIs? I don't think it does, or at least, it looks like Rust's standard library doesn't implement anything for them: https://github.com/rust-lang/rust/blob/master/src/libstd/sys...

At that point, it would be hopelessly difficult to build ripgrep. The right path then would be to build a new application that uses whatever of ripgrep's libraries make sense.

Popping up a level though, why would you want to compile to WASM? If you're using Node, then surely you can build an FFI bridge to Rust's regex library. At least at that point, you'd be using the same regex engine. I even maintain official C bindings for them: https://github.com/rust-lang/regex/tree/master/regex-capi

EDIT: Oh, and not sure if this is useful, but the regex crate itself should compile to WASM just fine. I know I've seen people run it in the browser before. If there's a problem here, then please file a bug!


Thanks for all the advice. We'd just be running it on single buffers so I agree it makes more sense to start from the rust regex library than ripgrep. We do however need to continue supporting backrefs and lookaround, so we'll need to add `--auto-hybrid-regex` functionality to fall back to either Node's engine or a webassembly PCRE2.

As for wasm vs FFI, it would ideally work in browser (Monaco), which makes wasm the best bet I believe.


Ah yeah, for backrefs you'll need to find a way to use PCRE2. Not sure what the WASM story is there. But at that point, if your only problem with Node regexes is catastrophic backtracking, then you might as well just stick with Node. PCRE2 will have the same problem.


It’s interesting how it took so many years for such an obviously useful tool to emerge. I guess hosting this is finally getting cheap enough.


I've been wondering the same thing for many years. And I don't know why Google killed Code Search


thank you so much for doing this! i hope it continues to open more doors of opportunities to you!

primo, this is a crazy snappy proof that shows that github search can be done. next, the UI is amazing. and finally, all my queries worked!

i am now going to remove "github search sucks" from my to-be-published rants because this post demonstrates that 1. people care 2. github was already working on it.


Very similar to https://news.ycombinator.com/item?id=18565239

Backend for codegrep was Play framework + Elasticsearch and you could search by programming languages.

Screenshot: http://archive.is/0mFML


Awesome! To me it looks like the comeback of "Google Code Search", which I've been missing for many years!


Curious that I found many "secret forks" of my stuff, but none of my repos are directly indexed.


Can you elaborate how you found them?


I looked for strings that I am sure only appear in my code, and I found several copies of them, but not mine.


Can you provide detailed steps to reproduce? What strings did you search? Two examples of repos that appeared in the results? What is the link to your repo that did not appear in the results?

Details like this would help the OP to track down the exact cause of why it has indexed the forks but not the original repo.


The authors are quite explicit that this site only includes a fraction of all github repos. Thus, this is not a "bug" that needs to be corrected.

In my case, I am not talking about forks but about people who copied my files into their repositories (with proper attribution and respecting the license). I just searched for my surname and was happily surprised to see it in major projects like ffmpeg, pytorch, bytedeco, scikit and opencv.


Can I search only additions/deletions? Recently when searching GitHub I wanted to find if anyone had replaced the usage of a deprecated method with the new one, because the docs for that library don't mention the non-deprecated method name.


Do you index the default branch of every repo? Or do you just index the master branch?


It indexes the default branch of each repo.


Cool. Keep it up :) Definitely gonna share with my co-workers.

Can't wait for filename filters, which would make this the perfect solution.


Thanks :) If you type into the path filter box, that'll match against the full path for each file, so you can use that to filter on a filename.


The interface for this is really clean and nice - did you use a theme or framework?


Thanks! It's using Elastic's Search UI (https://github.com/elastic/search-ui) and Ant Design (https://github.com/ant-design/ant-design).


I was going to say that I didn't want javascript on this.

But it's actually pretty #neat. It's all tidied up into a single app without any dependencies.

This rocks and, so far, seems way way WAY better than GitHub's own search tool.


This is cool, reminds me of the vulnerability search tool.

https://shhgit.darkport.co.uk/


^(.)'(.)'(.)$

I got a tooltip saying:

Error: JSON.parse: unexpected character at line 1 column 1 of the JSON data

Update: Oh, ^(.)"(.)"(.)$ works, and it's fast.


I think that error was just because the server was overloaded - sorry about that.


I wish there was something this fast, but for searching error outputs instead (along with discussions/solutions).


Feels like magic to me! Lets me easily see who's working on similar topics. Thanks!


Can you share your search string? Thanks.


This is one of the fastest, most responsive searches I've ever used. Great work!


might be a good idea to have some sort of clickable "demo" search or "try these" example on the frontend page to show off the capabilities of this.


How is it that fast?


How do you handle expensive regex statements?


My last name (Ament) is really rare where I come from, so I've used the tool to find other people with the same last name. Was not disappoint. Thank you!


this is awesome stuff, thank you! great work!


Hello world


[flagged]


There's no need for personal attack. We ban accounts that do that, so please don't.

Cherry-picking one post from a statistical cloud and calling it typical is dodgy. Even the distribution in this thread doesn't match your description. Actually, even the comment you're picking on doesn't match your description.

We detached this subthread from https://news.ycombinator.com/item?id=22397156.


[flagged]


This seems unrelated.

I hope u/dang sees your comment history; you are basically just spamming nerdydata.com


Why regex still exists? It is unintuitive, requires mastering an obscure syntax, it is very hard to debug, and very difficult to explain to others how it works. It feels like we are trying to write intermediate code by ourselves, while we should have a human readable language that generates regex.


You might be interested in “Eggex”, which aims to be a human-readable language that generates regexes. It’s currently written as a feature of the Oil shell, but in theory any tool could support them. Eggex docs: https://www.oilshell.org/release/latest/doc/eggex.html. Recent blog post about their development: https://www.oilshell.org/blog/2019/12/22.html.

However, Eggexes are a thin, mostly-syntactic layer over regexes. You still have to understand the regex engine to use them. If this sounds useless to you because you don’t currently understand any flavor of regex or parsing, I encourage you not to give up on learning regexes. (https://www.regular-expressions.info/ was how I learned; it’s a great tutorial.) Text-parsing engines, including regex engines, are a powerful concept that can be used in many situations, and I think it’s worth spending the effort learning them until, to paraphrase another commenter, regexes become the human-readable language you were searching for. Or Eggexes, at least.


The investment into learning regexes is worth it if you write or read enough of them. They become the human readable language you speak of, eventually. The question is where the threshold lies.


Do it! You will find that it's very easy, but the result will either be extremely verbose or just like regex. Since most regexes (at least for me) are meant for one-time use, the extra verbosity has no added benefit. If you have complex needs, you should probably be using something other than regex anyway.


Extremely verbose is right. Here's one such approach in Java that I found last year - https://github.com/sgreben/regex-builder.

Yeah, regexes can be a bit clunky at times and have a steep learning curve, but they're pretty much an industry standard at this point, and portable across languages with a few caveats.
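
To make the verbosity point concrete, here is a minimal sketch of the builder style in Python. The helpers `literal`, `any_of`, `one_or_more`, `optional` and `sequence` are hypothetical, invented for this illustration, and are not the API of the Java library linked above.

```python
import re


def literal(text: str) -> str:
    """Match text exactly, escaping regex metacharacters."""
    return re.escape(text)


def any_of(chars: str) -> str:
    """Match any single character from chars."""
    return "[" + re.escape(chars) + "]"


def one_or_more(pattern: str) -> str:
    return f"(?:{pattern})+"


def optional(pattern: str) -> str:
    return f"(?:{pattern})?"


def sequence(*patterns: str) -> str:
    return "".join(patterns)


DIGIT = r"\d"

# Builder form of a signed decimal number such as "-12.5".
number = sequence(
    optional(any_of("+-")),
    one_or_more(DIGIT),
    optional(sequence(literal("."), one_or_more(DIGIT))),
)

# The same pattern written directly.
terse = r"[+-]?\d+(?:\.\d+)?"
```

Both forms accept the same strings here; the builder spends eleven lines of helpers plus five lines of use on what the terse form says in one.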


"Why regex still exists?"

Is there an alternative that is clearly superior?


Your mileage may vary, but to my taste, the lpeg flavor of Parsing Expression Grammars is clearly superior.

It uses operator overloading to build patterns from component parts. I don't think anything can replace the terseness of regex for command line use, or vim searching, cases like that.

But for a program, give me lpeg every time.
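
For readers who haven't seen the style, here is a rough sketch of building patterns through operator overloading. It is Python, not Lua, and the class `P` and helpers `lit` and `char_range` are hypothetical stand-ins rather than lpeg's actual API. The operators mirror lpeg's conventions: `*` for sequence, `+` for ordered choice, and `**` standing in for lpeg's `^` repetition.

```python
from typing import Callable, Optional

# A matcher takes (text, position) and returns the new position, or None.
Matcher = Callable[[str, int], Optional[int]]


class P:
    """A tiny PEG pattern built from composable parts."""

    def __init__(self, matcher: Matcher) -> None:
        self.matcher = matcher

    def __mul__(self, other: "P") -> "P":
        # Sequence: match self, then other from where self stopped.
        def run(text: str, pos: int) -> Optional[int]:
            mid = self.matcher(text, pos)
            return None if mid is None else other.matcher(text, mid)
        return P(run)

    def __add__(self, other: "P") -> "P":
        # Ordered choice: try other only if self fails.
        def run(text: str, pos: int) -> Optional[int]:
            end = self.matcher(text, pos)
            return end if end is not None else other.matcher(text, pos)
        return P(run)

    def __pow__(self, n: int) -> "P":
        # Greedy repetition, at least n times.
        def run(text: str, pos: int) -> Optional[int]:
            count = 0
            while True:
                nxt = self.matcher(text, pos)
                if nxt is None:
                    return pos if count >= n else None
                if nxt == pos:
                    return pos  # an empty match could repeat forever; stop here
                pos, count = nxt, count + 1
        return P(run)

    def match(self, text: str) -> Optional[int]:
        """Return the end index of a match anchored at index 0, or None."""
        return self.matcher(text, 0)


def lit(s: str) -> P:
    return P(lambda text, pos: pos + len(s) if text.startswith(s, pos) else None)


def char_range(lo: str, hi: str) -> P:
    return P(
        lambda text, pos: pos + 1
        if pos < len(text) and lo <= text[pos] <= hi
        else None
    )


digit = char_range("0", "9")
sign = lit("+") + lit("-") + lit("")  # the empty alternative makes the sign optional
integer = sign * digit ** 1
decimal = integer * (lit(".") * digit ** 1 + lit(""))
```

Because `digit`, `sign` and `integer` are ordinary values, larger grammars can reuse and test them individually, which is hard to do with fragments of a regex string.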


Because it's really powerful, and some people actually like it (I'm one of them).

I can understand that a complex pattern might look scary if you're unfamiliar, but if you work with it long enough, you can put patterns together with relative ease.



