Where was this 10 years ago when I was reverse engineering the Google robots.txt parser by feeding example robots.txt files and URLs into the Google webmaster tool? I actually went so far as to build a convoluted honeypot website and robots.txt to see what the Google crawler would do in the wild.
Having written the robots.txt parser at Blekko, I can tell you that the standards that do exist are incomplete and inconsistent.
Robots.txt files are usually written by hand in random text editors ("\n" vs "\r\n" vs a mix of both!) by people who have no idea what a programming language grammar is, let alone how to follow the BNF from the RFC. There are situations where adding a newline completely negates all your rules: specifically, extra newlines between user-agent lines, or between user-agent lines and their rules.
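For example, a parser that wants to survive in the wild has to normalize line endings before doing anything else. A minimal sketch in Python (illustrative only, not Blekko's or Google's actual code):

    import re

    def robots_lines(raw_bytes):
        # Decode leniently and split on \r\n, bare \r, or bare \n,
        # since hand-edited files routinely mix all three.
        text = raw_bytes.decode("utf-8", errors="replace")
        return [line.strip() for line in re.split(r"\r\n|\r|\n", text)]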
My first inclination was to build an RFC compliant parser and point to the standard if anyone complained. However, if you start looking at a cross section of robots.txt files, you see that very few are well formed.
With the addition of sitemaps, crawl-delay, and other non-standard syntax adopted by Google, Bing, and Yahoo (RIP), the RFC is clearly just a starting point, and what ends up on websites can be broken and hard to interpret; often you can only guess at the author's meaning. For example, the Google parser allows for five possible spellings of DISALLOW, including DISALLAW.
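To give a flavor of that leniency, the key matcher ends up accepting common misspellings as a matter of policy. A hypothetical Python sketch (the exact alternate spellings Google tolerates live in robots.cc and may differ from this list):

    # Hypothetical list of tolerated misspellings; Google's actual set may differ.
    DISALLOW_SPELLINGS = {"disallow", "dissallow", "disalow", "disallaw"}

    def key_is_disallow(key):
        return key.strip().lower() in DISALLOW_SPELLINGS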
If you read a few webmaster boards, you see that many website owners don't want a lesson in Backus–Naur form and are quick to get the torches and pitchforks if they feel some crawler is wasting their precious CPU cycles or cluttering up their log files. Having a robots.txt parser that "does what the webmaster intends" is critical. Sometimes, I couldn't figure out what some particular webmaster intended, let alone write a program that could. The only solution was to draft off of Google's de facto standard.
(To the webmaster with the broken robots.txt and links on every product page with a CGI arg with "&action=DELETE" in it, we're so sorry! but... why???)
Accidentally deleting someone's entire website because they don't understand the difference between GET and POST requests is virtually a rite of passage when writing a web crawler.
That's why it's good to crawl twice, so if the site got deleted by the first crawl, you can check and then discard the results. Saves a bit of disk space.
There's a similar problem with antivirus software automatically unsubscribing all of your customers from their spam (and newsletters) the first time you start scanning emails for malicious links. It's not entirely solvable, in fact.
I’m trying to find a link to it but there was an incident based on this issue somewhere around 1999-2001 where Microsoft added a sort of prefetching thing to IE (or was it Netscape?!) and it would effectively click all the links on the page in order to get all the content in the cache.
Lots of us really didn’t know what we were doing and we’d made all the action buttons in the listing screens regular links. As you can imagine, pandemonium ensued.
Hey, at least we’d figured out that sql injection was a thing.
In 1997 we had this crazy notion of web "channels" (like RSS feeds) and offline viewing, where a client on a painfully slow dial-up connection could download and cache the resources required to display complex web pages.
Microsoft did this via an explicit web manifest; the web page author needed to list all of the resources they wanted to use in offline or pre-cache mode.
Netscape tried to do this by urging web authors to Be Very Careful with the links on a page, which usually required a specially-crafted offline-crawler-only version of the site. Predictably, hilarity ensued.
The term of art at the time was "push technology" or "web push", the irony of which was not lost upon those tasked with making it work.
It's an easy fix if Google cared: have an online tool that validates whether a robots.txt file is correct, and send out an announcement that files that don't meet the spec will be penalized in terms of SEO.
I've been in disagreements with SEO people quite frequently about a "Noindex" directive for robots.txt. There seem to be a bunch of articles that are sent to me every time I question its existence[0][1]. Google's own documentation says that noindex should be in the meta HTML but the SEO people seem to trust these shady sites more.
I haven't read through all of the code, but assuming this is actually what's running on Google's scrapers, this section [2] seems to be pretty conclusive evidence to me that this Noindex thing is bullshit.
Yuck though! Imagine if you were writing a compiler. Would you make it accept “unsinged” “unnsigned” “unssined” and “unsined” as keywords, just to catch spelling mistakes? Not sure I like that pattern.
It's a little different in that case, since the person using the parser is also the person writing the input to the parser. So if the input fails the parser, the author of the code can simply correct it. As I understand it, there's no single standard that captures how all robots.txt files are formatted, so there's no "standard parser" that the authors of these files could be expected to pass.
Google has been very clear lately (via John Mueller) regarding getting pages indexed or removed from the index.
If you want to make sure a URL is not in their index then you have to 'allow' them to crawl the page in robots.txt and use a noindex meta tag on the page to stop indexing. Simply disallowing the page from being crawled in robots.txt will not keep it out of the index.
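Concretely, that combination looks something like this (illustrative snippet, not taken from any particular site):

    # robots.txt: leave the page crawlable (no Disallow rule covering it)
    User-agent: *
    Disallow:

    <!-- on the page you want kept out of the index -->
    <meta name="robots" content="noindex">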
In fact, I've seen plenty of pages still rank well despite the page being disallowed in robots.txt. A great example of this is the keyword "backpack" in Google. You'll see the site doesn't want it indexed (it's disallowed in robots.txt) but the site still ranks well for a popular keyword.
> However, you will not see any information like the meta description on these blocked URLs.
True, but that's not the only thing. If it ever was in the index, it takes forever to be removed, if it gets removed at all. Send 404 or 410, Disallow it or set it to noindex - you may get lucky or you may not. You can of course "hide it from search results", but that only works for 90 days (iirc, may be 120, something in that range). Those leftovers will typically lose rankings, but they often stay indexed, easy to spot with a site: query.
Reindexing a page is dynamic based on noteworthiness and volatility iirc, but individual links can be reindexed on the fly since the Percolator index. The 90d number was from an old system when indexes were broken into shards that had to be swapped out wholesale.
I don't mean reindexing, I mean "hiding from the index" ("Remove URLs" in GSC). It works instantly, but only for a limited time, after which it will re-appear in the index if you haven't gotten it out of the index (via 410, noindex or disallow). Since these other ways don't always work, if you're unlucky and want it to stay gone, you need to hide it again (and again and again). I've had clients that were hacked and had spammy content injected into their site and it took (literally!) years for that to get removed (we tried combinations of 404, 410, noindex and disallow).
Exactly, there is no guaranteed way to remove anything, HTTP status, meta-tags, headers, and robots.txt only have advisory status. They are usually followed when a resource is hit first, but once it's in the index, "keeping the result available" seems to be a top priority. I do understand the idea - it might still be a useful result for a user, but otoh if it's 410 (or continuously 404), it won't be of any use because the content that was indexed is no longer available (especially in case of 410).
Granted, these are edge cases, in most circumstances, 410 + 90 day hiding means they are hidden instantly and don't resurface. These edge cases do make me take Google's official statements on how to deal with things with a grain of salt though: bugs exist, and unless you happen to know somebody at Google there's no way to report them.
No, disallow means that you are not allowed to crawl the page, and you have to crawl the page to know you cannot index it. So how does a page you cannot crawl get indexed? Well, if another page that you can crawl and index points to the uncrawlable page as authoritative on a keyword, then it can end up in the index for that keyword, even though you do not have the actual crawled content of the page.
Not if you buy the government first. I think they’re already a victim of their own greed though.
It’s bad enough I started using DDG for search because the results are now more relevant. Google’s advertising algorithms are designed to subtly nudge sites into paying for placement — which means there’s a “non-content” element to the search results that makes it into the user experience. I feel like there was a tipping point a year or two ago where the results just stopped being useful — The best analogy I can find is how search engines used to be in the days before AltaVista. Then AltaVista came out and the results were far more relevant (if not perfect). Google -> DDG feels like that in 2019.
That “non-content” element will only grow over time as Google seeks revenue growth — growth across all of Google’s non-advertising revenue streams combined are not enough to move the needle compared to the scale their ad business has — of which search ads are by far the most profitable. So they will further try to monetize search; it’s their cash cow but I think a small player like DDG could easily overtake them as the quality of Google’s search results (to the end user) continue to decline.
Agreed re: DDG search quality. It's my own default and preferred choice. Google remains useful for Scholar and Books, but relevance is rapidly declining and deceptive ads on SERPs are on the rise.
It's like recommending a book you haven't read, and newspapers do that every day.
Basically Google finds the link in other places -> oh that must be interesting, I'm indexing it, without even reading it. So they don't have the actual content, and just use the texts from the sites that link to it.
But they do have the actual content, since they show the meta title and description, on top of what I assume is heavy NLP to drive the search engine itself.
Usually, creating "410 Gone"[0] response for the URL and running the URL through the URL Inspection Tool [1] can help make things a bit faster. But yeah, it does take a while to get these 404s removed.
That distinction exists in many systems. E.g. for cloud events, 404 is treated with skepticism because it could be a race condition in provisioning or a transient issue, whereas 410 requires data streams to be cut off.
5xx means that the server made a mistake. 4xx means that the caller made a mistake. Sending a request to a GONE url is canonically classified as a “user” or sender error.
> but the SEO people seem to trust these shady sites more.
It makes more sense when you realize that the SEO people (with a few exceptions) are usually pretty shady as well. You rarely hear them recommending that you write better content to get better results, it's always nonsense like "put nofollow on everything so your score doesn't leak".
But I understand, there's a lot of snake-oil and "one weird trick to rank first" that brings a bad name to the SEO world.
I've seen people go on Fiverr and expect to find top-notch SEOs there.
There's more to SEO than just writing good content. There's a lot of technical stuff that can bite you and your awesome content will never rank.
Stuff like improving site structure, canonicals, learning to deal with multi-language versions of your content, implementing proper redirects, etc., etc. is something that a good SEO should be able to fix and improve.
I mean if you're hiring a SEO person isn't this literally what you're paying for -- tricks to increase your search ranking without changing your content?
I'd argue most of those are necessary for good content (if we don't view content separately from presentation)
> Putting important text inside of images
I'm sure the reason for this is that it's hard to parse text from images, and while Google could use their AI to figure it out, they don't bother. But it also prevents blind people from being able to read the text, so it does worsen the experience.
> Duplicate content
This makes the site harder to navigate for users as well.
> Page performance issues
Quite obviously makes the experience worse.
> Broken mobile support.
-..-
SEO should be a bridge between technical and non-technical people that build out sites.
No site's output is 100% because of the tech team: content writers can put in weird code, marketers can add all sorts of stuff to, say, Tag Manager, and the robots.txt is likely from 2008. And a site built with code as the primary goal is likely lacking in some marketing oomph somewhere.
Someone whose job it is to find the right balance, and aim to maximise the returns from the single largest source of traffic, is pretty valuable.
It looks like it's too late for me to edit my comment, but I've been proved right. Putting a Noindex directive directly in robots.txt is frequently suggested, but this seems like definitive proof that that does nothing (at least with Google).
As far as I can tell the inception of this idea was that it was briefly mentioned by some Google employee in an interview. Maybe it was supported in the past or maybe he just misspoke, but I bet even now we'll see people still using this tag.
I'm not sure I understand your reasoning, why should Google honor noindex everywhere but on .gov websites? What about other countries' government TLDs? What about publicly traded companies? What about personal websites of elected officials? What about accounts of elected officials on 3rd party websites?
That seems like a can of worms not really worth opening.
This might be controversial but everything is fair game everywhere. If you can crawl it, tough luck. It's there and everyone can get to it anyways, why not a crawler?
Because the rules a well-functioning society runs by are more nuanced than "Is it technically possible to do this?"
If you'd like a specific example of why people might seek this courtesy, someone might have a page or group of pages on their site that works fine when used by the humans who would normally use it, but which would keel over if bots started crawling it, because bot usage patterns don't look like normal human patterns.
A society is composed of humans. But there are (very stupid) AIs loose on the Internet that aren't going to respect human etiquette.
By analogy: humans drive cars and cars can respond to human problems at human time-scales, and so humans (e.g. pedestrians) expect cars to react to them the way humans would. But there are other things on, and crossing, the road, besides cars. Everyone knows that a train won't stop for you. It's your job to get out of the way of the train, because the train is a dumb machine with a lot of momentum behind it, no matter whether its operator pulls the emergency brake or not.
There are dumb machines on the Internet with a lot of momentum behind them, but, unlike trains, they don't follow known paths. They just go wherever. There's no way to predict where they'll go; no rule to follow to avoid them. So, essentially, you have to build websites so that they can survive being hit by a train at any time. And, for some websites, you have to build them to survive being hit by trains once per day or more.
Sure, on a political level, it's the fault of whoever built these machines to be so stupid, and you can and should go after them. But on a technical, operational level—they're there. You can't pre-emptively catch every one of them. The Internet is not a civilized place where "a bolt from the blue" is a freak accident no one could have predicted, and everyone will forgive your web service if it has to go to the hospital from one; instead, the Internet is a (cyber-)war-zone where stray bullets are just flying constantly through the air in every direction. Customers of a web service are about the same as shareholders in a private security contractor—they'd just think you irresponsible if you deployed to this war-zone without properly equipping yourself with layers and layers of armor.
Honestly that is the site owner's problem. If it can be found by a person, it's fair. I genuinely respect the concept of courtesy but I don't expect it. People can seek courtesy, but they should temper their expectations of whether or not it will happen.
Techies forget the rule of law. A DoS has intent. A bot crawling a poorly designed website and accidentally causing the site owner problems does not have malicious intent. They can choose to block the offender, just like a restaurant can refuse service. But intent still matters.
This thread is about what behavior we should design crawlers to have. One person said crawlers should disregard noindex directives on government sites, and you replied that they should ignore all robots.txt directives and just crawl whatever they can. If you intentionally ignore robots.txt, that has intent, by definition.
Not intentionally ignore it by going out of their way to override it, just not be required to implement the feature in their crawler. Apparently parsing those files is tricky, with plenty of edge cases. Ignoring that file is absolutely on the table. People can of course adhere to it, but it's not required and in my opinion shouldn't even be paid attention to.
In my younger years the only time I ever dealt with robots.txt was to find stuff I wasn't supposed to crawl.
If you don’t want something public, don’t allow a crawler to find it or access it. The people you want to hide stuff from are just going to use search engines that ignore robots.txt
The interesting thing about robots.txt is that there really isn't a standard for it. This [0] is the closest thing to one and almost every modern website deviates from it.
For instance it explicitly says "To exclude all files except one: This is currently a bit awkward, as there is no "Allow" field."
And the behavior is so different between different parsers and website implementations that, for instance, the default parser in Python can't even successfully parse twitter.com's robots.txt file because of the newlines.
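For instance, a rough sketch of what you end up doing with Python's stdlib parser: pre-normalize the lines yourself before handing them over (whether this particular workaround fixes twitter.com's current file, I haven't checked):

    from urllib.robotparser import RobotFileParser
    import urllib.request

    raw = urllib.request.urlopen("https://twitter.com/robots.txt").read()
    # Normalize line endings before handing the file to the stdlib parser.
    text = raw.decode("utf-8", errors="replace")
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")

    rp = RobotFileParser()
    rp.parse(lines)
    print(rp.can_fetch("mybot", "https://twitter.com/search"))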
Most search engines obey it as a matter of principle but not all crawlers or archivers [1] do.
It's a good example of missing standards in the wild.
> The interesting thing about robots.txt is that there really isn't a standard for it. This [0] is the closest thing to one and almost every modern website deviates from it.
Yeah, my first reaction to Google heading yet another standard was to cringe but this is one of the situations where I think it makes a lot of sense. They're dominant in the search industry and most other engines tend to take their cue so having them spearhead it seems like a good move.
When I read "This library has been around for 20 years and it contains pieces of code that were written in the 90's" my first thought was "that commit history must be FASCINATING".
From Google's perspective it's probably too much work. I would assume this was part of the crawler code and was extracted into a library over time while part of the monorepo, so changesets probably didn't touch only this code but also other parts, and this code probably depended on internal libraries (now it depends on Google's public Abseil library). Publishing all that needs lots of review (also considering names and other personal information in commit logs, TODO comments and the like).
Not only that, code libraries that weren't designed to be open source often have things in them that Google might not want to show: codenames, profanity, calling out specific companies…
Also, even if it is authoritatively managed in git now, the whole 20-year history certainly wasn't (since git is only 14 years old, and Google probably didn't adopt it on day one), and it's quite likely the commit history wasn't converted, so it's quite possible Google couldn't easily make the whole history available when publishing it to GitHub even if they wanted to.
I assume the authoritative version is still in Google's Piper-based repo and previously was in Perforce, and I assume it was for a while... so if there were interest, Google could dig deep. But I assume there are other projects where this would be even more interesting (how ranking changed over time, how storage formats for the index changed, ...).
I can attest to this. I work in a very large monorepo with tens of thousands of commits. Even files that aren't changed often have regular updates - usually repo-wide CodeMods. This makes the blame less useful and the history quite noisy. I figure the robots.txt parser's history would be in a similar state - not very useful or interesting to read.
// A user-agent line is expected to contain only [a-zA-Z_-] characters and must
// not be empty. See REP I-D section "The user-agent line".
// https://tools.ietf.org/html/draft-rep-wg-topic#section-2.2.1
So you may need to adjust your bot’s UA for proper matching.
(Disclosure, I work at Google, though not on anything related to this.)
The strictness is in what may be listed in the robots.txt, not the User-Agent header as sent by bots. The example given in the linked draft standard [0] makes it abundantly clear that it's on the bot to understand how to interpret the corresponding lines of robots.txt.
Of course, in practice robots.txt tend to look less like [1] and more like [2].
Sorry, I mean for matching, and I did try to imply it was a limitation of the standard and not the library. Though to avoid confusion, I do personally think keeping the user agent minimal is wise, since users might have difficulty guessing what value to use if it differs sufficiently from the real user agent that's sent.
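For what it's worth, as I read the draft the matching amounts to reducing the full UA header to its product token before comparing it (case-insensitively) against the group names in robots.txt. A rough Python sketch (the bot name here is made up):

    import re

    def product_token(user_agent_header):
        # The REP draft matches groups on a product token of [a-zA-Z_-]
        # characters, case-insensitively, so strip the version and comment.
        m = re.match(r"[a-zA-Z_-]+", user_agent_header)
        return m.group(0).lower() if m else ""

    product_token("ExampleBot/2.1 (+https://example.com/bot)")  # -> "examplebot"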
I wonder how much noindex contributes to lax security practices like storing sensitive user data on public pages and relying on not linking to the page to keep it private. I wonder how much is in the gap between "should be indexed" and "really ought to restrict access to authorized users only".
If I recall correctly there was a large company several years ago who tried to prosecute a whitehat who discovered their user account pages included the users' e-mail addresses and that changing the address to that of a different user would drop you right into that user's page with all their personal information listed.
> how should they deal with robots.txt files that are hundreds of megabytes large?
What do huge robots.txt files like that contain? I tried a couple domains just now and the longest one I could find was GitHub's - https://github.com/robots.txt - which is only about 30 kilobytes.
They enumerate every page on the site, sometimes specifically for different crawlers.
Or they have a ton of auto-generated pages they don't want crawled and call them out individually because they don't realize robots.txt supports globbing.
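For example, instead of enumerating every generated URL, a couple of wildcard rules usually cover it (the * and $ patterns are a Google/Bing extension rather than part of the original 1994 spec):

    User-agent: *
    Disallow: /*?sessionid=
    Disallow: /*.pdf$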
I was actually trying to find an example when I made my initial comment, but was unable to. It's been a long time since I did web scraping. Since then there are a lot more frameworks that help you build a website (and a correspondingly sane robots.txt), so there may not be as many as before.
I doubt there are any vulns in the code, seeing as its job for the last 20 years has been to parse input from the wild west that is the internet, and survive.
Can this be seen as an initiative to make Google's robots.txt parser the internet standard? Every webmaster will want to be compliant with Google's corner cases...
There is a difference between robots.txt blocking a page and noindexing a page.
Blocking in robots.txt will stop Googlebot downloading that page and looking at the contents, but the page may still make it into the index on the basis of links to that page making it seem relevant (it will appear in the search results without a description snippet and will include a note about why).
To have a page not appear in the index you need to use a 'noindex' directive [1] either in the file itself or in the HTTP headers. However, if the file is blocked in robots.txt then note Google cannot read that noindex directive.
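For non-HTML resources, the HTTP-header form of that directive is the X-Robots-Tag response header, e.g.:

    HTTP/1.1 200 OK
    X-Robots-Tag: noindex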
Also, in the StackOverflow response you linked to, the user agent is listed just as 'Google', but it should be 'Googlebot' as per the 'User agent token (product token)' table column listed in [2].
That's actually nice and straightforward and relatively simple. I had expected something over-engineered, with at least parts of the code dedicated to demonstrating how much smarter the code's author is than you. But it's not. Just a simple parser.
I expected the same (complex project structure, too many files, difficult to read, etc), but I love everything about this library. Easy to read, concise code, in two simple files. Very well tested, both by automated tests and the real world. Sticks to the Unix philosophy: does one thing and does it well.
Can you imagine how many billions of time this code has been executed? I love software like this.
You're very much taking that out of context. Read the entire section. There's a grand total of 7 sentences in there. It literally says:
If you're going to have to explain it at the next code review, you should comment it now. Complicated operations get a few lines of comments before the operations commence. Non-obvious ones get comments at the end of the line.
The section you're quoting says:
On the other hand, never describe the code. Assume the person reading the code knows Python (though not what you're trying to do) better than you do.
I think you might be misreading the meaning of this? If the person reviewing knows the language better than you, then they are hopefully less likely to tolerate "clever code", not more.
By "clever code" we're talking about weird unidiomatic tricks and hacks that maybe writes things in a slightly shorter or in a fractionally more (unnecessarily) optimised way, and makes you feel clever, but makes it harder and more time consuming for anyone else to understand what your code is doing, or verify that it's actually doing what it's supposed to.
I'm not sure how you could over-engineer code on a whiteboard? There isn't enough space. Instead you'd expect people to be extremely good at writing short and concise code which is still correct and simple enough for an interviewer to understand.
Seems strange to get excited about a robots.txt parser, but I feel oddly elated that Google decided to open source this. Would it be too much to hope that additional modules related to Search get released in the future? Google seems all too happy to play the "open" card except where it directly impacts their core business, so this is a good step in the right direction.
I don't understand the entire architecture behind search engines, but this seems like a pretty decent chunk of it.
What are the chances that Google is releasing this as a preemptive response to the likely impending antitrust action against them? It would allow them to respond to those allegations with something like, "all the technology we used to build a good search engine is out there. We can't help it if we're the most popular." (And they could say the same about most of their services: gmail, drive, etc.)
This is the sort of code you write a binding to and call it a day, since the entire point is to absolutely precisely match the behavior of this code, which is basically a specification-by-code. You can never be sure a re-implementation would be absolutely precisely the same in behavior, so it's not worth doing.
The C++ implementation is <1000 lines. Doesn't seem like a correct port would be particularly difficult, especially with a reasonably large test corpus.
I mean, I get it; it feels that way to me intuitively too. But I'd still recommend against trying it, because I've learned the hard way the intuition here is, if not wrong, at the very least very badly underestimating the cost, especially in the "unknown unknown" department.
I'm not saying that isn't true for some things. I don't think it's true here, given that this is a nice, narrowly scoped library that does a single thing and has well-defined semantics.
Adding a cgo dependency is generally something that isn't done lightly by teams. Having a port to Go instead of a cgo wrapper would be much more likely to see widespread adoption.
Do you even need to match Google's robots.txt parsing behavior? With less than 1000 lines you can be pretty sure they are not doing it right and are breaking plenty of people's assumptions about it. Either way you have to test it on real world data.
The point of this code release seems to be to release Google's precise logic. That you may incorporate it into something else is, IMHO, less interesting; we've got plenty of other solutions that "do robots.txt" well enough. If it was just about that, Google's release of this would not be worth anything. The point is so that non-Google parties can see exactly what Google is seeing in your robots.txt.
That's why I'm saying there's no point trying to re-implement this. If you were going to re-implement this, there's probably already a library that will work well enough for you. The value here is solely in being exactly what Google uses; anything that is a "re-implementation" of this code but isn't exactly what Google uses is missing the point.
If they formalize it into a spec, others may then implement the spec, but they can and should do that by implementing the spec, not porting this code.
As I understand it, the point of the Go complaint is to parse actual real-world robots.txt files, for which you don't need to behave exactly as this library does.
> Do you even need to match Google's robots.txt parsing behavior? With less than 1000 lines you can be pretty sure they are not doing it right and are breaking plenty of people's assumptions about it.
This seems like a weird assertion. The specification isn't particularly complex (ignoring the implicit complexities of unicode). There are ~5 keywords and like 3 control characters. Why would you expect to need all that much?
I'm not talking about the formal specification, but the implicit specification of what people have been using for decades. That only has 5 keywords and a couple control characters. The formal spec is based on that informal spec, which again, isn't that complicated.
To be more direct: what are all of these assumptions you assume google's parser is mishandling?
It definitely feels excessively risky for a third party to port it, but Google can either canary it or run both parsers in production and compare results to accurately assess confidence in the port's correctness.
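A sketch of that kind of shadow comparison, assuming both the C++ library and the port are wrapped behind callables with the same (robots_txt, user_agent, url) -> bool shape (the names here are hypothetical):

    import logging

    def shadow_compare(cpp_allowed, port_allowed, corpus):
        # corpus: iterable of (robots_txt, user_agent, url) samples from production traffic
        mismatches = 0
        for robots_txt, ua, url in corpus:
            a = cpp_allowed(robots_txt, ua, url)
            b = port_allowed(robots_txt, ua, url)
            if a != b:
                mismatches += 1
                logging.warning("mismatch for %s (ua=%s): cpp=%s port=%s", url, ua, a, b)
        return mismatches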
Is Golang significantly slower than C++? I thought Google had invented Golang to handle precisely this kind of code for their internal use.
I had thought most of the systems code inside Google would be Golang by now. Is that not the case?
The code doesn't look too big - I don't think porting is the big issue.
Quite the contrary: three Google employees who are very vocal against C++, and well-known personalities, got fed up using it and created Go; eventually they got support from upper management.
Why do it in the first place? Just because you can? The code works and it's written in a popular language which plenty of people know. What's the upside?
I would much prefer a library such as this be done in C/C++ so it could be packaged up as a library that could be called from other languages. Pretty much every major language has some form of FFI to call out to C/C++ code. This way, you can get consistent behavior if you need to parse robots.txt in python vs ruby vs java vs etc.
I know it's a meme to say "C is not C++" but in this context C is really not C++. Calling into C through FFI is significantly easier than calling into C++. Very few languages have decent FFI with C++, while many have great support for C.
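To illustrate, if you added a small extern "C" shim around the C++ library exporting something like robots_allowed(robots_txt, user_agent, url) -> int (hypothetical, not something the released library ships), calling it from Python is only a few lines:

    import ctypes

    # Hypothetical shim built around the C++ library; not provided by Google.
    lib = ctypes.CDLL("./librobots_shim.so")
    lib.robots_allowed.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p]
    lib.robots_allowed.restype = ctypes.c_int

    def allowed(robots_txt, user_agent, url):
        return bool(lib.robots_allowed(robots_txt.encode(), user_agent.encode(), url.encode()))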
> I had thought most of the systems code inside Google would be golang by now.
Google has gazillions of lines of system code already built. Why rewrite everything in go? There is so much other stuff to do. All rewriting achieves is add additional risk because the new code isn't battle tested.
Depends on the context, but in general, yes. C++ is very close to C on this front, trading memory safety for performance.
Concerning Google, as far as I know the codebase is mostly C++, Java, and Python. Go will surely eat a bit of the Java and Python projects, but it's unlikely we'll see C++ being replaced any time soon.
> Is Golang significantly slower than c++ ?
> Depends the context, but in general, yes.
I don't believe this is the case. Most optimized, natively compiled languages all perform similarly. Go, C, CPP, Rust, Nim, etc. I'm sure there are edge-cases where this isn't the case, but they all perform roughly the same.
The performance rift only starts when you introduce some form of a VM and/or use an interpreted language. Even then, under certain workloads their optimizations can put them close to their native counterparts, but otherwise they are generally slower.
The real reason Google didn't re-write this in Go is likely because the library is already finished, it works, a re-write would require more extensive testing, etc. Why spend precious man-hours on a needless re-write?
This argument usually comes from people who have not done projects of significant scale or that required high performance, which is fine; not everyone works on that level of project. But a small difference of 10 ms per operation, when you have to do a million operations, is nearly 2.8 hours of extra time. Even 1 ms adds an extra quarter of an hour or so. These things start adding up when you are talking about doing millions of operations. And there is nothing wrong with Go or Rust or Python; they just aren't always the right tool in the toolbox when you need raw performance. Neither is C/C++ the right tool if you don't need that level of control/performance.
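The back-of-the-envelope arithmetic, for reference:

    # 10 ms of extra latency per operation, over a million operations:
    print(0.010 * 1_000_000 / 3600)   # ~2.78 hours
    # and at 1 ms per operation:
    print(0.001 * 1_000_000 / 3600)   # ~0.28 hours (about 17 minutes)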
When doing distributed systems or embedded work you generally learn these rules quickly as one "ok" performing system can wreck a really well planned system, or start costing a ton of money to spin up 10x the number of instances just because of one software component isn't performant.
Rust was created by/for systems programmers who are in exactly that situation - where performance and control are not optional - and thus have been stuck writing C++ for decades. Although C++ has evolved over the years, there are pain points, particularly regarding modularity, that persist and may require a clean break.
It's still somewhat early but I do already see software being written in Rust with best in class performance (take ripgrep for a prominent example), so lumping it in with Go and Python is really a category error in my opinion.
Personally, I'm still writing C++ for the platform support, etc. but not pretending to like it.
Yea, I see your point, lumping Rust in with Python or Go isn't really fair nor accurate.
Totally agree C++ definitely has pain points still, but I do love the fact C++ is getting pretty regular updates so it is getting better and less painful generally. Rust is something I want to use in production but haven't seen the right opportunity to do it where the risk to reward ratio was right, yet.
I concede that I mostly work on web applications that don't require that level of performance.
You're certainly right, there IS a performance difference, and it matters in high-computing workloads, such as the one this parser is used for.
From a "regular" web developer perspective (i.e. where you have only a few servers/VPSs max), a lot of newcomers often worry about performance, and usually for most web development the answer is "Yes, language [here] is faster than Python/Javascript/Ruby/etc., but those languages/frameworks allow us to develop our application far faster, and ~10ms isn't an issue." Only after performance bottlenecks are discovered would we consider breaking out pieces into a lower-level language.
You're completely right though, in HPC it is totally worth worrying about every millisecond, I took the wrong perspective with the implications of the performance differences.
To be fair, most people do not work on applications that need that level of performance.
Most of the time, and to your point, that level of performance isn't necessary so using a language that is less likely to let you take your foot off is generally the best & most correct choice. I only resort back to C/C++ when I need the pure raw performance like this parser would, or when doing embedded work. Otherwise I reach for other tools in the tool-bag that are less likely to let me maim myself unintentionally.
“Most code” is debatable, but one of golang’s goals is ”Go compiles quickly to machine code” (https://golang.org/doc/). Because of that, I can see it being slower than C++ on code that benefits a lot from optimization. That makes it not the right choice for code that runs a lot, as this code likely does (I expect this code runs for many CPU-years each day)
I didn't realize how far off Go's performance was from C/C++; I have a feeling a lot of it is because of the 25+ years of optimization the C/C++ compilers have gotten.
Go has a "stop the world" garbage collector, and some language features also carry a performance penalty (defer is well known for being slow). Just to say that it's not only a question of time; even if you wait and invest a huge amount of time and money, you will see differences in performance because of language design choices.
“Because the REP was only a de-facto standard for the past 25 years, different implementers implement parsing of robots.txt slightly differently, leading to confusion. This project aims to fix that by releasing the parser that Google uses.”
The amount of arrogance in this sentence is insane.
In terms of “what should a robots.txt file look like to be parsed correctly,” yes, because they’re the ones who are going to be doing most of that parsing. Yes, ideally it would be an entirely independent standardization process, but it’s not arrogant of them.
Never before has a company stood on such a mountain of open source code, achieved so much money with it, and contributed so little.
No, really. Microsoft? The BSD TCP/IP stack for Win95 maybe saved them, but there was Trumpet Winsock, and they probably would have survived long enough to write their own in the next release.
Google doesn't get off the ground and has literally no products and no services without the GPL code that they fork, provide remote access to a process running their fork and contribute nothing back. Good end run around the spirit of the GPL there and that has made them a fortune (they have many fortunes, that's just one of them).
New projects from google? They're only open source if google really need them to be, like Go which would get nowhere if it wasn't and be very expensive for google to have to train all their engineers rather than pushing that cost back on their employees.
At least they don't go in for software patents, right? Oh, wait...
At least they have a motto of "Don't be evil" Which we pretty much all have personally but it's great a corporation backs it. Corporate restructurings happen, sure, oh wait, the motto is now gone. "Do the right thing" Well this is fine and google do it, for all values of right that equal "profitable to google and career enhancing for senior execs".
But this is great a robots.txt parser that's open source. Someone other than google could do something useful for the web with that like writing a validator, because google won't. Seemingly because it's not their definition of "do the right thing."
"Better than facebook, better than facebook, any criticism of google is by people who don't like google so invalid." Only with more words. Or none just one button. Go.
So you aren't wrong that google is built on the shoulders of giants, but I will point out that every single company today running their SaaS offering on top of linux/BSD is doing the exact same thing.
The only reason Linux is as mainstream as it is today, is exactly because of this freedom to leverage the code. You even point out that the cause for Golang's success is for precisely the same reason. Overall opensource isn't about making money, it has never been about making money. Its been about making an impact, and bettering the world around us all by giving a piece of technology to be freely used by everyone. There are a variety of opensource licenses that can/will protect your code from any/all closed source uses, for example AGPL explicitly states if your application so much as interacts with the code over a TCP connection or furthermore a single UDP packet it must be opensource as well. However you will rarely see libraries/applications using this license. Why you might ask? The answer is simple, it reduces the impact that code can have.
Really at the end of the day, it comes down to a choice of the developer(s), do you want to make money? i.e. go the Microsoft/Apple route? or do you want to make an impact? i.e. go the Linux/BSD route?
Let me ask one final question, which of the above operating systems do you think are more widely used, or have changed the world in a more dramatic manner?
I couldn't care less about other companies that have existed for 5 minutes in the SaaS space; my comment was that nobody has ever derived more value and, given that, contributed less back.
Google is built on an end run around the spirit and intent of the GPL. "Don't distribute software, distribute thin client access to it! No GPL! Hurrah! Money!"
Decide for yourself what you think of that but it happened. Without it, no google.
But hey, list anyone you think derived more value and contributed less back. It's a reasonable thing to do. Doesn't affect criticism of google.
Here's the Perl for the Blekko robots.txt parser. https://github.com/randomstring/ParseRobotsTXT