it'd be interesting to see these heatmaps in some sort of normalized way. for ex...

alextgordon · on Sept 8, 2011

Just did it for 28000 C files. Here are the results:

    a  0.772163
    b  1.2679
    c  1.78209
    d  1.1195
    e  0.881398
    f  1.47252
    g  0.924242
    h  0.358954
    i  1.06756
    j  0.835313
    k  1.41458
    l  0.981729
    m  1.08955
    n  0.9156
    o  0.73849
    p  1.74468
    q  4.2497
    r  1.21577
    s  1.05023
    t  1.03627
    u  1.2967
    v  1.77662
    w  0.396003
    x  13.7292
    y  0.47566
    z  3.78748

The numbers are (relative frequency in C) / (relative frequency in English). So "b" is slightly more common in C than English, but "w" is a lot more common in English than C.
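A minimal sketch of how such a ratio could be computed, assuming you have the C source and an English reference corpus as plain strings. The function names are made up for illustration, and this is not the script that produced the table above.

```python
from collections import Counter
import string


def relative_letter_freq(text):
    """Share of each ASCII letter among all ASCII letters in text (case-folded)."""
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    total = sum(counts.values())
    return {
        ch: (counts[ch] / total if total else 0.0)
        for ch in string.ascii_lowercase
    }


def frequency_ratio(code_text, english_text):
    """(relative frequency in code) / (relative frequency in English), per letter."""
    code = relative_letter_freq(code_text)
    english = relative_letter_freq(english_text)
    # Letters that never occur in the English sample are skipped
    # so that we never divide by zero.
    return {ch: code[ch] / english[ch] for ch in string.ascii_lowercase if english[ch]}
```

A value above 1 means the letter is relatively more common in the code sample, a value below 1 means it is relatively more common in the English sample.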

The raw counts for symbol characters:

    _  22890057
    ,  10895692
    )  10749798
    (  10745839
    *  9211904
    ;  8187969
    -  6628768
    =  5878296
    >  4428291
    /  3468260
    .  3011078
    {  2212412
    }  2211783
    "  2120264
    &  1647188
    :  1032587
    +  962554
    #  909859
    [  889538
    ]  888722
    <  839910
    |  643903
    %  583092
    !  561462
    \  540456
    '  454201
    @  131199
    ?  112488
    ~  84629
    ^  19064
    $  17922
    `  7272
    [space] 74199965

dredmorbius · on Sept 8, 2011

What would be interesting here would be a difference analysis or regression giving the preference for any given key in a given language. E.g.: '|' is highly predictive of shell, '$' of perl, '()' for lisp. Might be fun to do in R.
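As a rough illustration of the idea (in Python rather than R, and a naive per-character score rather than a real regression): for each character, take its relative frequency in each language and normalize across languages. That is the share you would get from Bayes' rule with an equal prior on every language. The sample snippets in the usage note are made up for the example, not measured data.

```python
from collections import Counter


def char_language_scores(samples):
    """samples maps language name -> source text.

    Returns {char: {language: share}}, where share is that language's
    relative frequency of the character divided by the sum over all languages.
    """
    rel = {}
    for lang, text in samples.items():
        counts = Counter(text)
        total = sum(counts.values())
        rel[lang] = {ch: n / total for ch, n in counts.items()}

    chars = set().union(*(r.keys() for r in rel.values()))
    scores = {}
    for ch in chars:
        denom = sum(r.get(ch, 0.0) for r in rel.values())
        scores[ch] = {lang: r.get(ch, 0.0) / denom for lang, r in rel.items()}
    return scores


# Hypothetical toy snippets, only to show the call.
samples = {
    "shell": "cat log.txt | grep error | wc -l",
    "perl": 'my $n = $x + $y; print "$n\\n";',
    "lisp": "(defun add (x y) (+ x y))",
}
scores = char_language_scores(samples)
```

With these toy inputs `scores["|"]` puts all of its weight on shell and `scores["$"]` on perl, simply because the other snippets never use those characters. Real corpora would give softer numbers.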

pwnguin · on Sept 9, 2011

I really need to do reading and research on this, but I'm pretty sure that's what Hidden Markov Models are for. You could watch a webpage go from HTML to JavaScript and back!

zokier · on Sept 9, 2011

Could I ask for one more data point: the total number of characters and maybe lines? That way symbol/alpha/line ratios could be compared to other languages.

alextgordon · on Sept 9, 2011

Yeah, when I get a chance I'll gather together some stats on all the languages I have data on (about 40).

Finder reports 630,942,867 bytes for the whole directory. Assuming most files will be plain ASCII, that should give a good approximation for the total number of characters.

zokier · on Sept 9, 2011

Based on those numbers I gathered some stats about keyboard layouts:

* 18% of all characters are symbols, 12% are spaces and 70% are alphabetic

* 20% of all non-space characters are symbols and 80% are alphabetic.

* US kb layout users need to use shift for 64% of symbols

* Finnish/Swedish kb layout users need to use shift for 73% of symbols and AltGr for 7% of symbols.

* Fi/Swe layout users thus need to use 25% more modifier keys for symbols.

Conclusion: Fi/Swe layout sucks.

edit: https://gist.github.com/1205728 python script used to get these numbers (percentages calculated with OOo Calc).
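For anyone who wants to check the first few figures without opening the gist, here is a small sketch that recomputes them from the symbol counts and the byte total quoted earlier in the thread. It is not the linked script. The set of shifted symbols assumes a standard US ANSI layout, and the byte total is treated as the character count, as suggested above.

```python
# Symbol counts and totals as posted earlier in the thread.
SYMBOL_COUNTS = {
    "_": 22890057, ",": 10895692, ")": 10749798, "(": 10745839,
    "*": 9211904, ";": 8187969, "-": 6628768, "=": 5878296,
    ">": 4428291, "/": 3468260, ".": 3011078, "{": 2212412,
    "}": 2211783, '"': 2120264, "&": 1647188, ":": 1032587,
    "+": 962554, "#": 909859, "[": 889538, "]": 888722,
    "<": 839910, "|": 643903, "%": 583092, "!": 561462,
    "\\": 540456, "'": 454201, "@": 131199, "?": 112488,
    "~": 84629, "^": 19064, "$": 17922, "`": 7272,
}
SPACE_COUNT = 74199965
TOTAL_BYTES = 630942867  # Finder's figure, used as an approximate character count

# Assumption: symbols that need Shift on a standard US layout.
US_SHIFTED = set('_)(*>{}"&:+#<|%!@?~^$')

symbols = sum(SYMBOL_COUNTS.values())
shifted = sum(n for ch, n in SYMBOL_COUNTS.items() if ch in US_SHIFTED)

symbol_share = symbols / TOTAL_BYTES
space_share = SPACE_COUNT / TOTAL_BYTES
symbol_share_nonspace = symbols / (TOTAL_BYTES - SPACE_COUNT)
us_shift_share = shifted / symbols

for label, value in [
    ("symbols / all characters", symbol_share),
    ("spaces / all characters", space_share),
    ("symbols / non-space characters", symbol_share_nonspace),
    ("US layout: shifted / all symbols", us_shift_share),
]:
    print(f"{label}: {value:.1%}")
```

Rounded to whole percentages these come out at 18%, 12%, 20% and 64%. Note that in this sketch everything that is neither a listed symbol nor a space (digits, tabs, newlines) ends up in the remaining share, so the "alphabetic" figure is an upper bound.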

troxy · on Sept 9, 2011

Do you know why you have slightly different numbers of (){}[] characters? In C or C++ shouldn't those all be paired up to match?

zeteo · on Sept 9, 2011

It can probably be accounted for by comments (e.g. people sometimes comment out half a block). Although the comments should be left out, so as not to mix the C and the English.

alextgordon · on Sept 9, 2011

Also string and character literals. It's common to write something like

    if (c == '{')

and not need to test for the matching one.
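To see how much of the mismatch comes from comments and literals, one could count brackets only in actual code. A rough sketch, assuming a simplified view of C (no trigraphs, no backslash-continued `//` comments, no special handling of the preprocessor); `count_brackets` is a made-up name:

```python
def count_brackets(source):
    """Count ( ) { } [ ] in C source, ignoring comments and string/char literals."""
    counts = {ch: 0 for ch in "(){}[]"}
    state = "code"  # one of: code, line_comment, block_comment, string, char
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        nxt = source[i + 1] if i + 1 < n else ""
        if state == "code":
            if ch == "/" and nxt == "/":
                state = "line_comment"
                i += 2
                continue
            if ch == "/" and nxt == "*":
                state = "block_comment"
                i += 2
                continue
            if ch == '"':
                state = "string"
            elif ch == "'":
                state = "char"
            elif ch in counts:
                counts[ch] += 1
        elif state == "line_comment":
            if ch == "\n":
                state = "code"
        elif state == "block_comment":
            if ch == "*" and nxt == "/":
                state = "code"
                i += 2
                continue
        else:  # inside a string or character literal
            if ch == "\\":
                i += 2  # skip the escaped character
                continue
            if (state == "string" and ch == '"') or (state == "char" and ch == "'"):
                state = "code"
        i += 1
    return counts
```

On a line like `if (c == '{') { /* ( */ f("["); } // )` this counts two `(`, two `)`, one `{` and one `}`, whereas a raw character count would see unbalanced totals.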

zokier · on Sept 9, 2011

Heatmap for symbols only: http://i.imgur.com/yc6fe.png

wccrawford · on Sept 8, 2011

For this reason, I think it would have been more interesting to ignore the alphabet keys and just heatmap the rest.

delinka · on Sept 8, 2011

But then you lose the impact of the entire set of reserved words. At the point of ignoring the entire alphabet, you're looking at developer preferences for spacing and operators. Might be a nice sidebar to the existing heat maps.

pyre · on Sept 8, 2011

  > you're looking at developer preferences for spacing
  > and operators

Not really. For example, in languages that use $ (e.g. Perl, PHP) to denote a variable, that's not developer preference. I'm actually surprised that there aren't more operators being used in Ruby. Though I'm not a Ruby programmer, it does not look as devoid of punctuation characters as the heatmap suggests.

cpeterso · on Sept 8, 2011

What would a programming language look like if it was optimized so its reserved keywords used mostly home row letters (and, in Unix tradition, preferably alternated left/right hands) and operators without shifting? This would be tough, since the home row only has one vowel: a.

aangjie · on Sept 8, 2011

I was wondering the same thing and noticed that the vowels almost always feature in the top 10, and DVORAK has them all on one hand. Yay. In fact, from a very casual observation, I think only 'r' seems to be the letter out of the DVORAK layout. I looked across languages though. Guess this makes me a DVORAK evangelist. :-) And to complete that image I will add this DVzine link: http://www.dvzine.org/

zokier · on Sept 9, 2011

http://i.imgur.com/yc6fe.png

Based on alextgordon's numbers (see http://news.ycombinator.com/item?id=2974381 )

kevindication · on Sept 8, 2011

Except for Lisp, of course, where ( and ) are more common than e.

swannodette · on Sept 8, 2011

This is not true. See my other comment on this thread. Dominance of ( ) only reflects a particular coder's naming convention.