Hacker News
Most used words in programming languages (anvaka.github.io)
194 points by kissgyorgy on Jan 20, 2017 | 182 comments



In most languages, the top words are self/this and function/def. On the other hand, in Go, the most common word is "err", and "if err != nil" are three of the top four words. I really wonder how big a part of the code in Go is just propagating errors.


One day, maybe half a decade from now, Rob Pike will wake up from a nightmare and suddenly realise they got this one wrong. Maybe it'll occur at the same time as the generics one.


There are some ways to make error handling much less annoying in go. Since error is just an interface it's pretty easy to make monadic constructs that can carry the error information and let you write pipelined code. IMHO it makes the code much cleaner and easier to read, but a lot of golang enthusiasts haven't really adopted the idea because it's quite different from the established idioms around error handling. Hard-core gophers love them some verbose and stupidly explicit code; but I'm hoping that as the language gets adoption in the wider community the voice of reason will win here.


Interfaces require casting and hurt performance, though.

You could come up with `Either` types, but due to the lack of generic structs you'd have to have a whole lot of those.

Interestingly, Rust has basically the same error handling idiom, but verbosity is considerably reduced by the '?' operator (previously the try! macro).

When I was coding in Go, i would have loved that feature. The mind-numbing error checking in go is hugely annoying.


> When I was coding in Go, i would have loved that feature. The mind-numbing error checking in go is hugely annoying.

Ugh. Yes. I really want to love go. No, I really love go. But this is so annoying.


Could you link to some instructions?


I've been working on putting together an in-depth series of blog posts about it. I'm presenting on the topic of better error handling in go at the Agile Tech Conference this year, and I'd like to start getting everything in order soon; I'll probably post here on HN.

In the short term I have a really rudimentary library I've put up here: https://github.com/asteris-llc/gofpher and a presentation on it here https://github.com/rebeccaskinner/presentations/tree/master/... (should build with pdflatex, I need to get an actual pdf built soon)

Rob Pike has a blog post that looks at a similar, although ideologically different, approach to doing it: https://blog.golang.org/errors-are-values

I think the latter is more go-ish, although also less generic than the monadic approach- it's also more in keeping with the ideology of go, but in practice I still never see it being used much in production code.


Same with many node apps. Every other statement is:

    if (err !== null) {
      return callback(err);
    }


With es2015+ this is no longer the case, it's back to good old try/catch statements if using async/await, and .catch(...) if using promises.


I was thinking about forking the language just so I could add a Maybe monad.

I now typically deal with this via a Must() function that takes two values and returns the non-err value or panics if there was an err.

That means you can call it like con := Must(db.Connect(...)).(db.Connection)


This is not super-conventional, but some of the standard library uses panics in a similar way.

I think as long as panics don't cross library boundaries you're fine. And panics can cross library boundaries if the function is called MustConnect(). (Which means it'd be neat to have a pre-processor that generated MustX from X if it wasn't already there...)


To be honest, part of the reason for this is that in Go, 'self' is any name you see fit in the particular context, otherwise it may still be self.


This made me smile as well. While I'm now used to it, I have trouble saying there's nothing wrong in having `err` a more common word than `if` :)


Letting errors surface is a good thing.


having 3 out of 4 lines in your codebase dedicated to letting errors surface is a bad thing.


3 out of 4 most common words are related to error handling. How does that imply 75% lines are error handling?


    err := someOperation(args...)
    if err != nil {
         return err
    }
that's 4 lines, 3 of them are error handling


"err". Because "error" is too hard to type, right?


No, because error variables have extremely limited scope and the length of a variable name should be proportional to its scope.


Heh, I always use longer name variables for broader scope somewhat naturally, but only after reading your comment realized that.


    if theErrorReturnedByThePreviousFunction != nil {
         panic(theErrorReturnedByThePreviousFunction)
    }


A real brogrammer can write COBOL in any language.


From which it follows that you should use the most terse names possible, in order to make it harder to write too-long functions.

(I am only half trolling).


I figure it's probably proportional to its importance and difficulty of understanding, as a longer more descriptive name aids in understanding, but things with broader scope tend to be more important.


Well, `error` is the type, and `err` is the most common variable name. It's not so unusual, just convention!


Also, it cuts the characters you have to type and read in half, while not being obtuse.

That's an odd thing to complain about.


I wonder the same thing about "elif" in Python. It's two extra characters to make it the more pleasant, readable "elseif". I don't understand why dropping those letters was felt necessary.


A lot of the older parts of Python inherit the naming style common in C, which always seems to me to have an obsession with shortening everything, e.g. ls for list, mk for make. Newer Python parts are more “healthy” in this regard. But this also results in some weird inconsistencies like mkdir vs makedirs.


Lots of uses of "the" in C++ code, massively skewed by giant projects that have the massive section of boilerplate copyright info in a comment at the top of every single file. I didn't realise until I looked at this, but although that's really common to see in C, C++, and Java you barely ever see it in the languages that I spend time in (Rust, Haskell, C#, Python, various flavours of Lisp, JavaScript).


Wait another fifteen to twenty years or so until Python and Javascript are "enterprise ready" and this will change.


At least nobody is keeping a revision history at the top of the file any more.


You'd be surprised...


In C# the first word is "summary" which is massively influenced by the way Visual Studio formats comments. I wonder though if this is an indication that a lot of people write comments.


Would be nice to see Haskell included; would ".", "$", "<$>", "<*>", ">>=", etc. count as words? ;)

Made even better by the use of the language's logo as the word cloud's shape; the Haskell logo is ">λ=" https://www.haskell.org/static/img/haskell-logo.svg


Added: https://anvaka.github.io/common-words/#?lang=hs

Unfortunately the symbols will not show up, because I'm ignoring them: https://github.com/anvaka/common-words/blob/master/data-extr...


> Added: https://anvaka.github.io/common-words/#?lang=hs

> Unfortunately the symbols will not show up, because I'm ignoring them

Makes sense. I notice that some funky unicode stuff has still managed to come out quite high, e.g. ⊇ ("superset of or equal to") :)


The layout algorithm for the word cloud is awesome! How is it made?


This is explained in the README of the project's Github repository: https://github.com/anvaka/common-words#how-are-word-clouds-r...


There's a description of the method in the Readme

https://github.com/anvaka/common-words#how-are-word-clouds-r...


Unfortunately it's not using size as a metric like most word clouds. This confused me at first. Look at the size of 'err' in the go layout.


It uses it as long as there is enough space. Once it fails to find a rectangle to fit a new word, it tries to reduce the size of the word.

In general, word clouds are bad for comparing sizes. For that reason I used a plain list in the sidebar on the left (or at the bottom if you are on mobile).


It's humorous that in Go err/nil happens so often that it's "unbelievable" in a literal sense


It looks to me as if it is. There just are a lot of uses of "err" in typical Go code.


The slow adoption of modern C++ is very apparent. Some things are not even listed: forward, unique_ptr, shared_ptr, tuple, constexpr. nullptr is much lower than NULL and move is quite low. These features are now 6+ years old! ;)


Smart pointers really need a native way to be specified, similar to how we use & to declare references and * for pointers. Filling your code with shared_ptr<Something> and the like is just not going to win the majority over, even if it's useful.


A common way is having a "using FooPtr = shared_ptr<Foo>" somewhere and then only use FooPtr. This also reduces the occurrences in that word cloud.


> This also reduces the occurrences in that word cloud

Ah, yes, very good point.


Nowadays, you can do:

    auto sh_ptr = make_shared<Foo>(....); // Since C++11
or

    auto uniq_ptr = make_unique<Foo>(....); // Since C++14
It saves you from having to repeat the type on the left-hand side, and you just need to pass in the parameters of the constructor. It's not as terse as & or *, but it's not that bad.


I agree it's made C++ a bit verbose, though something like the using keyword helps (which is 30 spots below typedef!)


This might just be my Python upbringing, but... am I the only one to be troubled by Go's single-letter words? I've always found Go code very hard to read because it isn't self-descriptive at all.


Depends, if it's self evident in the context what it means, like in a method which is declared as (p *player) Attack, p makes perfect sense (to me) rather than typing 'player' everywhere in the function, just like typing i instead of index.

In short, I don't think there is a particular rule that applies, that said my impression is that a lot of variable names in Go code are typically three letters or at least two, like err, buf, src, dst, ok etc.


Go is a language in the almost-forgotten algebraic tradition (loosely descending Fortran → Algol → BCPL → C → Go), in which people prefer E=mc² to MULTIPLY REST-MASS BY SPEED-OF-LIGHT-IN-A-VACUUM BY SPEED-OF-LIGHT-IN-A-VACUUM GIVING ENERGY.


Did you mean single-syllable words?


When you have a strong static type system like Go has, you don't need descriptive names that much, because even single letter variable names are evident, because the declarations are there, close to the variable names. It needs a bit of getting used to if you only programmed dynamic or scripting languages.


> you don't need descriptive names that much, because even single letter variable names are evident

That's an annoying lie, and makes codebases in languages with genuinely strong static type systems (e.g. Haskell) much harder to read than necessary.

Hell, C#'s type system is stronger than Go's, yet there is no such prevalence of meaningless variable names (first one comes in at #23, Go has 7 single-letter variable names in the top 23)


Still, I think GP's comment is interesting. I do agree that type name and variable/argument name can be redundant in statically typed languages, such as c++ or java. (I've never used Go)

For example,

    int getAge(Person p);  
is clear enough, whereas

    int getAge(Person person) ;
is redundant.

Haskell does type inference, so even if types are static they are not explicit. That's why you still need explicit variable names.


I think that depends on the length of getAge(). If it's 3 lines, 'p' probably is fine. If it's 200 -- which, perhaps it shouldn't be but that's another issue -- then person is probably a better choice because you may lose the original context as you scan the method.

Also in Java if you use intellij then you'll probably get 'person' as an autocomplete, which actually makes it roughly as easy to type out as 'p', (and a better choice if your entire team has standardized on intellij.)


> I do agree that type name and variable/argument name can be redundant in statically typed languages, such as c++ or java.

Except neither exposes that pattern (they essentially have only one such variable in their top list, and it's "i"). And I already mentioned C# which is very similar to Java.

> is redundant.

I'm not saying single-letter variables (or no variable at all, which is also possible in Haskell) are universally bad; I'm saying it is disturbing to find how absolutely ubiquitous they are in Go. Variable names are sometimes redundant, but that is not a universal constant. That redundancy is a factor of the expressiveness of the type system, and that's hardly Go's claim to fame.

> Haskell does type inference, so even if types are static they are not explicit.

Go has local type inference, and leveraging Haskell's global type inference is usually recommended against.

> That's why you still need explicit variable names.

No, it is not.


My background is scientific and systems programming in C and C++. Currently I write a lot of Python for my job and a lot of Rust for my spare time hobby programming, which consists of things like writing a BLAS, reimplementing other C and C++ projects, etc. So I haven't only programmed in dynamic or scripting languages.

I strongly disagree with the assertion regarding single letter variables. I would gently correct a junior programmer who tried to do that and I would chew the hell out of a senior programmer if he tried it. In any language.


> It needs a bit of getting used to if you only programmed dynamic or scripting languages.

I'm an embedded C developer. Our shop doesn't let single-letter variables pass in code review.


Sounds like a good shop! :)

Even for loop variables I tend to use `idx` these days in languages without smarter constructs. Just slightly better and more readable for no effort.


I disagree, we're so used to reading i that the (actually less phonetic) idx is really slightly harder to read.

I take my bikesheds green, thank you.


I am quite surprised that "self" is so much more used than the next word in the list "if". I know many languages where "self" is not a keyword at all, but I cannot think of a single language where "if" is not a keyword.

EDIT: Oops, I missed that there is a language filter. Ignore my comment :)


This is broken down by language, the linked list is for python, so it just means that self is more frequently used than if in python.


I don't think if is a keyword in Smalltalk. IO might not have it either.


> I don't think if is a keyword in Smalltalk.

It's so not a keyword it doesn't even exist. At least in the Smalltalks I've used, they provided ifTrue: and ifFalse: (and compositions thereof).


assignments are probably way more common than if-driven program flow control


Besides "return", C/C++ doesn't have a particular word that stands out. Probably because you just write things, e.g. you don't put "function" in front of a function.


Yes, I also thought that a list for each language shows exactly what's wrong with that language. For some it's self/this, Java is import and return, etc. for C++ there's no clear winner, because it is minimalistic by nature (mostly due to C heritage of course).


> for C++ there's no clear winner, because it is minimalistic by nature

I presume I'm missing the sarcasm here...


Cool stats about Java. "import" is the most frequently used word. It looks like everything already exists in Java, so just import all the things, add some glue code and you are done.


Then why isn't this the case for Python? It's just as kitchen-sink, right?


Java doesn't have syntax for importing more than one member of a package in one statement.

Either you import a single member, or all with *.

Combine that with IDEs that auto-create the import statement, and there you go.


I just remembered: also, each public top-level class in Java has to be in its own separate file. That might very well be the biggest factor.


In Python you usually import a module and use that as a symbol, or `from module import Symbol1, Symbol2, Symbol3`.

Java doesn't have the former, and the latter requires an import per symbol (it also has the ability to import every symbol in a namespace, and so does Python, but that's usually discouraged)
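As a minimal sketch of the two Python styles described above (the standard-library `math` module is only a convenient stand-in, and the variable names are made up for illustration), one `import` line can cover a whole module or several names at once:

```python
# Style 1: import the module and use it as a namespace.
import math

# Style 2: pull several names from one module in a single statement.
from math import floor, pi, sqrt

hypotenuse = math.sqrt(3 ** 2 + 4 ** 2)   # accessed through the module name
same_value = sqrt(3 ** 2 + 4 ** 2)        # accessed directly by name
whole_part = floor(pi)                    # two more names from the same import line
```

Either way the word "import" appears once or twice here, where a per-symbol import style would need one line per name.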


I think if you grouped the import statements by package the numbers would be similar but typically people import individual classes.


The world would be a much better place if self(Python) and this(Javascript ES6 classes) were implicit.


I wholeheartedly disagree. Implicit code is easy to write but hard to read. I'd much rather have the source of a value be very explicit, both to avoid hard-to-see bugs and to make the code easier to understand from a fragment.


I find it unnecessarily bloats code and in most cases makes it harder to read due to redundant text, e.g.

    # 1
    def length(self):
    	return math.sqrt(self.x*self.x + self.y*self.y + self.z*self.z)
    	
    # 2	
    def length(self):
    	return math.sqrt(self.x**2 + self.y**2 + self.z**2)
    	
    # 3
    def length():
    	return math.sqrt(x*x + y*y + z*z)
    
I prefer version 3 by far. Unfortunately, Javascript decided to take the same path with ES6 classes which forces you to use this in the body. Fortunately, it does not force you to use this in the argument list.


I think that goes against the python dogma of "Explicit is better than implicit." In #1 and #2 there is no question about where 'x' comes from, while in #3 it could be a class variable or a global variable or from just about anywhere.


Agree. Even better, though:

    # 4
    def length():
    	return math.sqrt(.x**2 + .y**2 + .z**2)
Ambiguity and "pseudo-keyword" are both gone!

[EDITED code for better readability]


If you find self too bloated, you can already change it to whatever you like. The `self` is merely idiomatic.

    def length(s):
        return math.sqrt(s.x**2 + s.y**2 + s.z**2)
The beauty with self in Python is that self is not at all magic : it merely indicates that the object instance you're using will be passed as first argument of the class method.
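To make that concrete, here is a small sketch (the `Vec` class is hypothetical, not from the thread) showing that a method call is just a function call with the instance passed as the first argument, whatever that parameter is named:

```python
import math

class Vec:
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

    # The first parameter is named "s" here purely to show the name is a convention.
    def length(s):
        return math.sqrt(s.x**2 + s.y**2 + s.z**2)

v = Vec(2, 3, 6)
bound_call = v.length()        # instance passed implicitly as the first argument
plain_call = Vec.length(v)     # same function, instance passed by hand
```

Both calls run the same function on the same object, so they return the same value.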

Also you can get a custom font with typographic ligatures (e.g. for self, lambda, and so on) in order to make it more visually appealing. For instance (self > 圖) :

    def length(圖):
        return math.sqrt(圖.x**2 + 圖.y**2 + 圖.z**2)


Huh, I experience python as quite the opposite of "Explicit is better than implicit":

- No static types

- A variable might belong to the scope of a method, object or class, depending on where it was first set and changed afterwards

- Implicit execution of code on import of a module (__init__.py, including parent packages)

- Any object is truthy or falsey, i.e. conditional statements don't require an explicit boolean
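The last point in the list above can be illustrated with a short sketch; the `describe` helper and the sample values are made up for this example:

```python
# Values that count as false in a condition: None, zero, and empty containers.
falsy_values = [None, 0, 0.0, "", [], {}, set()]
# Non-empty or non-zero values count as true, even when they "look" empty.
truthy_values = [1, -1, "0", [0], {"k": None}, object()]

def describe(value):
    # No explicit boolean is needed; the object itself is tested.
    if value:
        return "truthy"
    return "falsy"

results_falsy = [describe(v) for v in falsy_values]
results_truthy = [describe(v) for v in truthy_values]
```

Note that the string `"0"` and the list `[0]` are truthy because they are non-empty, which is the kind of implicit rule the comment is pointing at.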


You can get the "Python dogma" with `import this`. "Explicit is better than implicit" is part of it, but so is "practicality beats purity" (and "There should be one-- and preferably only one --obvious way to do it, although that way may not be obvious at first unless you're Dutch").
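For anyone who wants to check those quotes programmatically, a rough sketch follows. It relies on a CPython implementation detail (the `this` module keeps its text ROT13-encoded in an attribute named `s`), so treat it as a curiosity rather than a supported API:

```python
import codecs
import this  # importing the module prints the Zen of Python once

# Implementation detail: the text is stored ROT13-encoded in `this.s`.
zen = codecs.decode(this.s, "rot13")
lines = zen.splitlines()
```

With the decoded lines in hand, both aphorisms quoted above can be looked up side by side.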


I suppose "dogma" was the incorrect word in that case.


For JS, you can do

    function length() {
      const { x, y, z } = this
      return Math.sqrt(x * x + y * y + z * z)
    }
I think it's pretty short, readable and explicit.

EDIT: formatting


And make every one-liner a two-liner?


In rust you can use pattern matching in arguments:

    struct Foo { x: f64, y: f64, z: f64 }
    fn length(Foo {x, y, z}: Foo) -> f64 {
        (x*x + y*y + z*z).sqrt()
    }
...Doesn't work with `self` because that would be implicit again, though. Anyway, I think that it is a minor inconvenience that isn't important for oneliners and extremely helpful when reading larger functions.


> Doesn't work with `self` because that would be implicit again, though.

AFAIK it doesn't work on self because the [&[mut ]]self parameter is more or less a keyword determining ownership interaction with the call subject.

The UFCS RFC would have made it sugar for `self: [&[mut ]]Self` (IIRC) but I believe that floundered.


It actually uses fewer characters, since you've eliminated all the repeated uses of 'this.'. As a general rule I'm in favor of turning a one-liner into a two-liner if it improves readability, especially when it reduces actual typing.


I agree that it's mostly noise.

Python's self in the arglist is a C struct pointer sneaking in from the '80s, which is when Python was designed. It could be excused, but there should be a deprecation PEP by now. Make it a keyword and let us type it only when we need it.


That might seem nice for your limited example, but it breaks down when you realize that in example #3 there would be no easy way to differentiate between scopes.

  y = 100
  x = 15
  
  class MyClass(object):
    x = 50
    def __init__(self, x):
      self.x = x

    def length():
      return x * y
What does that mean? Does MyClass(10).length() raise an AttributeError because MyClass doesn't have an attribute named y? Does it automatically recognize there's a y in the outer scope and use that, or does it call __getattr__ first (i.e. method that gets called when a missing attribute is accessed)? Furthermore, how do I specify that I want to access the class attribute x, or that I want to access the nonlocal x?
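For comparison, here is a minimal sketch of how current Python answers those questions when `self` is explicit. The method names `bare_names` and `explicit_names` are made up for illustration:

```python
y = 100
x = 15

class MyClass:
    x = 50                      # class attribute

    def __init__(self, x):
        self.x = x              # instance attribute

    def bare_names(self):
        # Bare names skip the class body entirely: these are the module-level x and y.
        return x * y

    def explicit_names(self):
        # Each source has to be spelled out: instance, class, then module global.
        return self.x, type(self).x, x

obj = MyClass(10)
```

Here `obj.bare_names()` multiplies the two module-level values, while `obj.explicit_names()` returns the instance, class, and global `x` in that order, so the three lookups never shadow each other.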


In my opinion, there are far bigger problems with Python's scoping. Why is for/if/etc not its own scope? You can define a temporary variable in there and mistakenly use it somewhere down the line. In a "for i in ..", the i should be cleared after the loop but it isn't. Why can't you create a new scope within a function? This allows you to bundle related stuff together and be sure the scope is cleared afterwards, so as to not mistakenly reuse variables. The thing that you're promoting as an advantage of explicit self is already broken by having a single scope for the whole function.

Blocks like this are an immensely useful way to keep scopes clean:

   int someMethod(){
   	...
   
   	{ // some small stuff that doesn't warrant a new method but you don't want it to bleed into the remaining part of the function, e.g.:
   		float x = readNextFloat();
   		float y = readNextFloat();
   		float z = readNextFloat();
   		float length = sqrt(x*x + y*y + z*z);
   		cout << length;
   	}
   	
   	// do some other things without worrying about potentially initialized variables
   	...
   }
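The Python behavior complained about above can be seen in a short sketch. The function names are hypothetical, and the nested function is just one possible workaround, not something the comment proposes:

```python
def leaky():
    for i in range(3):
        temp = i * 10           # temporary defined inside the loop body
    # Both names are still bound after the loop ends.
    return i, temp

def contained():
    def read_block():
        # A nested function is one way to get a fresh scope.
        x, y, z = 1.0, 2.0, 2.0
        return (x*x + y*y + z*z) ** 0.5
    length = read_block()
    # x, y and z are not visible here; only `length` escaped.
    return length, "x" in locals()
```

`leaky()` returns the last loop value and the last temporary, which shows that neither was cleared, while `contained()` reports that `x` never reached the outer function's scope.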


Technically "self" could be anything you want it to be called, it's typically "self" by convention. I think it's fine, but that's just my opinion.


Yes, I usually call it "s" to reduce its impact but I'm still not happy with it.


Don't call it "s", because then you break consistency with basically the whole Python ecosystem. That's not very nice for fellow programmers who are already used to reading and understanding "self".


I look at #3 and ask "What are x, y, and z? How did they get into this scope?"


from all the code i've seen javascript developers have more troubles with explicit "this" and bindings than ruby developers have with implicit self.


Javascript's 'this' binding behaves in quite unexpected ways if you're coming from something like Python or Ruby, which is a major source of trouble for people.


Do people really use implicit selfs in ruby? Note that @foo is not an implicit self.


Literally all the time.

In fact, our coding style demands that we avoid using instance variables in preference of adding `attr_accessor` to explicitly encourage the use of implicit self.


self.attribute = attribute in an ActiveRecord class when the left side is a field of a table and the right side is a variable. A different name for the variable fixes it without using self.


> self.attribute = attribute in an ActiveRecord class when the left side is a field of a table and the right side is a variable

That seems to be an example of not using an implicit self.

> A different name for the variable fixes it

There's nothing to fix, it works just fine.


every time you call a method without explicit receiver you're using implicit self


I would have liked a shorthand, like prefixing `@`. E.g, `self.member` would be `@member`. It's a lot less to type.


For example in C++ where it's implicit most code styles require adding prefix "m_" or something similar to indicate that the variable is a class member.

So then I expect the same would happen in Python and JavaScript. I don't think it would be better that way.


Exactly. Implicit 'this' also complicates the lookup rules. explicit 'this' should have been a requirement in c++ (and a reference, not a pointer).

edit: Also, it is already required for dependent names in templates.


"Explicit is better than implicit."

https://www.python.org/dev/peps/pep-0020/


"Beautiful is better than ugly."

https://www.python.org/dev/peps/pep-0020/


Point of order: this comment has been downvoted into greyness. My understanding is that we should downvote when a comment is low-value, not when we disagree with it, but i suspect the latter has happened here. Could anyone who has or is tempted to downvote this instead make a comment to present their critique?


Don't complain about down votes, especially after only 15 minutes. There is inherent randomness in votes and it only takes one or two votes to grey out a comment. Most of the time these things work themselves out within an hour or two. In this case the comment was black within 12 minutes of your post.


This would improve a lot if they filtered out comments.


I was thinking the opposite, I would like to see a version with all of the language reserved symbols removed.


Optional filtering that can filter out either code or comments would be nice.


Really? I think it's interesting to see "TODO" feature quite prominently in Python, for example :)


Perhaps both?

I also think filtering out comments would improve it - especially because so many source files include a copyright statement at the top, and the same licenses (MIT, GPL, Apache, etc) are found repeated in many different files and it distorts the results somewhat.


The copyrights are filtered out, because indeed there was a lot of them.


How interesting is it to see "summary" is the top C# term?


In that particular case, I'd say it's quite interesting indeed!

Presumably "summary" appears quite a lot because C# developers use markup like `<summary>` in their comments, so automated systems can build documentation (I've never used a Microsoft programming language, but a quick search brought me to https://msdn.microsoft.com/en-us/library/z04awywx.aspx ).

In that sense, it's not really a comment anymore: it's one machine-readable language embedded inside another.

That's certainly interesting, to me at least. It tells me about the signal/noise ratio of the language, the prevalence of various forms of documentation (e.g. <summary> is conventional, whilst something like <precondition> is not), etc.

Such terms clearly have an effect on a system's documentation, even if they don't have an effect on the CPU instructions being executed. But I'm a programmer, not a CPU; text files containing source code are my main I/O interface, and they most certainly do contain such markup, and hence I find it interesting to see statistics about. In comparison, I don't step through very much assembly day to day, so I don't really care very much about the compiler output (the part which the comments don't affect). I prefer to reason at the level of the language I'm using, where not only do comments appear, they're very useful!


> Presumably "summary" appears quite a lot because C# developers use markup like `<summary>` in their comments, so automated systems can build documentation

Yes, and the IDE will auto-generate a doc comment with a <summary> because that's pretty much the most basic doc comment you can get.

> In that sense, it's not really a comment anymore: it's one machine-readable language embedded inside another.

My issue is not that it's a comment, it's that it is essentially worthless as your IDE's basic "add method" intention (or whatever) is going to add it automatically.

> That's certainly interesting, to me at least. It tells me about the signal/noise ratio of the language, the prevalence of various forms of documentation (e.g. <summary> is conventional, whilst something like <precondition> is not), etc.

<summary> is not conventional, it's the primary tag used by the C# documentation system and shown by IntelliSense. <precondition> is not that.


> it is essentially worthless as your IDE's basic "add method" intention (or whatever) is going to add it automatically

Just because an IDE will write boilerplate automatically, that doesn't mean the boilerplate wasn't written, checked into version control, presented to developers, etc. Even if such boilerplate were added by an IDE, and hidden from developers (e.g. using code folding), it's still there in the language.

In this case, the language is C#, not e.g. some "C#-like" language which gets preprocessed/transpiled by an IDE into C# by scattering boilerplate around.

Whilst tooling can help us live with a language's deficiencies, it doesn't remove those deficiencies ;)


Well, there is a sense in which the language you write (which may not be the language you read) is defined by how you interact with the development environment to produce code.

Which is why I prefer a language where I just need to learn one language, and not a separate input language because the language-as-read is too unergonomic to write, so that a different language needs to be defined for productively writing code.


I think it speaks well towards the ferocity and forced completeness of Go's error handling that err is the most used word.


I would say it shows that this is a place where the compiler could help :D


As much as I hate all the typing I do for error checking in go, I just don't think a compiler can handle errors that well on their own yet.

The explicitness of go's error handling forces you to handle every error specifically. There isn't a chance (for the most part) that you'll get an error from deep in the program that you can't easily handle.

Forcing you to handle them everywhere and anywhere forces error handling to be a part of your architecture.


Erlang took this the other way around by designing a system that enables you to not handle all these errors.

I am more from this school of designing a system around the reality instead of trying to patch it everywhere, praying we have enough fabrics to catch it all.

But it would mean rethinking how we build stuff. That was not at all a goal of Go.


I do appreciate this, but word clouds really are a terrible visualization method for text data. With regard to the Python example, I cannot grasp at all whether the frequencies of self and None are similar or drastically different. The table of values is more informative and less likely to be misread.


Looking at Scala, the difference between val and var is huge, with val being at 2nd, and var at 38.


Usage of `var` as idiomatic Scala is an oft-used trolling mechanism.

I wonder what percentage of `_` usage is value/type discarding in pattern matching and type signatures vs. function application.

Would be nice to somehow ditch `case`:

    adt match {
      Foo(x) if cond x => ...
      Bar(x) => ...
    }

    pairs.map{ (a,b) =>
      ...
    }


And else is used more than if...


One can make very interesting conclusions based on purely this. Examples:

- Python developers do not follow Clean Code (à la Uncle Bob) as much as Ruby developers, because the if statement is more frequent than def and return.

- Ruby makes it possible to write in a much more functional style than Python. OR Ruby developers like to develop more in a functional style than Python developers.

- People don't really care about good variable names ("a" is a terrible variable name in scripting languages like JS and Python, still top 11)

- PHP developers might practice "return early" in functions (more return than function keywords) OR their functions just do too much :)


'a', 'the' etc seem to be from comments, not 'actual' code

(click them and you can see examples of how each word is being used)


> - Ruby makes it possible to write in a much more functional style than Python. OR Ruby developers like to develop more in a functional style than Python developers.

I instinctively think you might be right (at least with your second statement), but what are you using as your metric here?


I am not sure Ruby devs like to do much functional style; the parent mentioned Uncle Bob and his influence on the Ruby community with Clean Coders. I think they are more into OO.


Well Python also doesn't have a Match/switch flow. So that would bump ifs up


This, and the fact that the page doesn't filter out comments, may affect results as well.


> One can make very interesting conclusions based on purely this.

With a large grain of salt

1. generally, the thing seems to mix words from all contexts, for instance #6 in Ruby is "should", click on the word and it's mostly comments, same for Rust's #4 "the".

2. "if __name__ == '__main__'" (seriously, TFA counts >600k of those)

3. also unclear what codebases are parsed, lots of django-isms in the conditional examples (if context is None, if request.method == 'POST', if form.is_valid)

See also: Go, where most every function call requires an `if`

> - Ruby makes it possible to write in a much more functional style than Python. OR Ruby developers like to develop more in a functional style than Python developers.

Don't confuse writing more functions/methods and writing in a more functional style, they're very different things.


Yep, the tool should skip strings and comments.
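
For Python sources at least, the standard library's tokenizer already tells names apart from comments and strings. A minimal sketch, assuming we only want identifiers and keywords (`count_code_words` and the sample are made up for illustration, this is not how the site does it):

```python
import io
import tokenize
from collections import Counter


def count_code_words(source: str) -> Counter:
    """Count NAME tokens (identifiers and keywords) in Python source.

    COMMENT and STRING tokens, docstrings included, are never counted.
    """
    counts = Counter()
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    for tok in tokens:
        if tok.type == tokenize.NAME:
            counts[tok.string] += 1
    return counts


# Made-up sample: "the" appears only in a comment, a docstring and a string.
sample = '''
# the the the: this comment should not count
def greet(name):
    """Return the greeting (docstring, also skipped)."""
    if name is None:
        return "the default"
    return name
'''

counts = count_code_words(sample)
print(counts.most_common(3))
```

Other languages would need their own lexer, which is probably why a regex-over-lines approach is tempting for a multi-language tool.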


Or provide for classification/filtering.

Comparing word clouds of comments would actually be interesting. But then you hit the issue of "what is a comment", e.g. many languages use comments for item documentation, but Python uses (doc)strings.


I was really amused by the logos / names in the word clouds. Nice project!


Thank you :)


The Rust compiler seems somewhat overrepresented in this data set. I see "ccx", "fcx", and "CrateContext", which are only used in the Rust compiler itself.


It is probably a sign of a good language when its most used words are similar in frequency to their use in pseudo code. When words like "end" (Ruby), "self" (Python), "import" (Java), "err"/"error" (Go and Node) are over-represented, it's likely a sign that the language is introducing accidental complexity. By this metric Swift looks astonishingly sane.


It would be interesting to see the frequency of words found in comments (TODO, FIXME, LATER, OMG) for each language too.


Also, there is the glorious phrase "Should never happen", with 24 million results on Github: https://github.com/search?q=should+never+happen&ref=simplese...


Perfect time to throw an UnreachableCode exception or something, hah


Pretty cool! Some unexpected results, or at least not what I guessed. "summary" as the top for C#, "SELECT" all the way down at #43 for SQL, "err" as the top for Go (I'm sure that will spawn some pleasant discussion).


I think for SQL their sample includes just ".sql" files, which tend to contain schema definitions and data dumps, hence CREATE and INSERT. Most of it not handwritten also.


I was surprised by the SQL thing too; your explanation makes perfect sense!


This is pretty neat! I was surprised to see that for SQL, SELECT was so far down the list.

I also wonder what the criteria are for which languages to analyze? There are a few other languages I would like to see, but maybe on GitHub they aren't well represented...


Thanks!

I'm using the file extension to differentiate between languages.

You can request other languages here: https://github.com/anvaka/common-words/issues/4

As long as a language's extension is unique, I think I can make a visualization of it.
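
To illustrate why uniqueness matters, here is a minimal sketch of an extension-based split. The mapping and `group_by_language` are hypothetical, not taken from the project: an extension that is missing from the table (or shared between languages, like `.h`) simply cannot be assigned.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical mapping for illustration; the real project may use a different table.
EXTENSION_TO_LANGUAGE = {
    ".py": "Python",
    ".go": "Go",
    ".rb": "Ruby",
    ".cs": "C#",
    ".sql": "SQL",
}


def group_by_language(paths):
    """Group file paths by language, skipping unknown or ambiguous extensions."""
    groups = defaultdict(list)
    for path in paths:
        ext = PurePosixPath(path).suffix.lower()
        language = EXTENSION_TO_LANGUAGE.get(ext)
        if language is not None:
            groups[language].append(path)
    return dict(groups)


# Made-up paths: ".h" could be C or C++, so it is deliberately left unmapped.
groups = group_by_language(["app/models.py", "cmd/main.go", "include/util.h", "README"])
print(groups)
```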


Good site.

What did you use to get and analyze code and how much code did you analyze?

I thought about something like this, but for the most used words in variable names, method names, etc., or even a variable name generator based on a Markov chain.



This is really nifty. Unfortunately it includes comments, and so with thousands of files all including copyright notices, 'the' is the 3rd most popular word in C++ files.


I tried to exclude copyright lines as much as I could. I used "license markers" for that, but I might have missed something.

Here is more information about it: https://github.com/anvaka/common-words#how


That's good to hear. I didn't look into it in too much depth, I just thought it was strange that 'the' was so high for C++, so I clicked on it to see example usage and got things like:

   ** use the contact form at http://qt.digia.co/contact-us.

   furnished to do so, subject to the following conditions:

   * This file is part of the LibreOffice project.

   // with this library; see the file COPYING3. If not see
    
So assumed licenses had not been excluded.

Having a brief look at the source, I think with the licence marking approach it's still leaving in quite a few lines from each licence (see above for examples).
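
A minimal sketch of what I mean, assuming a purely line-based filter with a made-up marker list (this is a guess at the general approach, not the project's actual code): only lines that contain a marker get dropped, so the rest of the licence text survives.

```python
# Hypothetical marker list; the real project's markers may differ.
LICENSE_MARKERS = ("copyright", "license", "licence", "warranty")


def drop_marked_lines(lines):
    """Drop only the lines that contain a license marker (case-insensitive)."""
    return [
        line
        for line in lines
        if not any(marker in line.lower() for marker in LICENSE_MARKERS)
    ]


header = [
    "// Copyright (C) 2017 Example Author",  # made-up line that contains a marker
    "furnished to do so, subject to the following conditions:",
    "* This file is part of the LibreOffice project.",
    "// with this library; see the file COPYING3. If not see",
]

kept = drop_marked_lines(header)
print(len(kept))
```

The last three lines carry no marker word, so they stay in the counts. Dropping the whole leading comment block once any marker is seen might work better, but that is just a suggestion.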


And PHP has "div", which seems to be an HTML tag name.


Contrary to popular opinion neither `s' nor `t' are words. At least not in English anyway. :/ Or do they mean that these characters appeared as variable names?


I totally expected "should" to be at the top in ruby. Code bases with 2:1 or even more test code to implementation code ratio = standard.


So happy I couldn't find "goto" :)


You wish. It's #325 after you switch to cpp.


Well, Lua's word list starts with index 0...


anyone else find it at all ironic that the *least* common word in JS appears to be "validate"?


Where do the language files get pulled from? GitHub API or web scraping? Or is it not file parsing and some other method?

Super rad project :)


Thanks! The data comes from a GitHub snapshot stored on BigQuery. Here are more details about it: https://github.com/anvaka/common-words#how


I don't know whether I could believe this study as "foo" and "bar" aren't on the list.


Well I will slay my developers before they put the metasyntactic variables on master. I allow i to pass though. Go to j and you have O(n^2) and I will slay them again.


summary is the most used word in C#! Only from comments! Amazing. EDIT: it looks like this is reading files that are common to all projects... which is why the sentence "// The following GUID is for the ID of the typelib if this project is exposed to COM" appears 459k times!


> EDIT: it looks like this is reading files that are common to all projects…

Yeah, autogenerated files/comments feature extremely prominently e.g. the top two items for "should" in Ruby are autogenerated Rails comments.


At least in C#, we could probably exclude auto-generated files by looking for the whitespace keywords "partial class|interface|struct".


Java: funny to see that "if" is used a lot to point to "visit oracle.com if..."


Interesting that for Ruby the most used word is "option", which is not even a Ruby word but a Rails one.


Interesting that in JS, `let` is ranked 536, far lower than I expected.


Is anyone investigating how well these numbers fit a Zipf distribution?
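
A crude first check, as a minimal sketch: fit a straight line to log(frequency) against log(rank) and look at the slope, which classic Zipf's law puts near -1. `zipf_slope` is a made-up helper and the input below is synthetic, not the site's data. A log-log least-squares fit is only a rough indicator, not a proper goodness-of-fit test.

```python
import math


def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Frequencies are sorted in descending order and ranked from 1.
    """
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two non-zero frequencies")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic counts that follow 1/rank exactly, so the slope should be about -1.
ideal = [1_000_000 / rank for rank in range(1, 201)]
slope = zipf_slope(ideal)
print(round(slope, 6))
```

Feeding it the per-language counts from the list view would give one slope per language to compare.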


"that didn't work"


Word clouds may look cool but they are horrible for conveying information. This should have been just a simple ordered list.


There is an ordered list on the page. If you're on a phone you can tap "show list" on the bottom.


django > try (20 vs 21) try again


In my django projects, I have many lines like:

    from django.<path> import <stuff>
for every try block.


this study is worthless without a section on frequency of swearwords and virulence of same.


and the most used word in .cs files is "summary"


A lot of meaningful and legible words out there

/s



