I actually think raw byte-count is a pretty good metric. Documentation size and ...

gruseom · on Sept 26, 2011

But this penalizes programmers who like to use long readable names. I'm not one of them (though I used to be), but they have a strong case here.

Take any program. Replace all the names with the smallest possible character sequences. Have you made the program simpler? Or smaller in any meaningful way? Surely not. I'd say what you've done is left its logical structure precisely intact (another way of saying that token count is a good metric) while reducing its readability.
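To make the renaming experiment concrete, here is a minimal sketch in Python using the standard `tokenize` module. The two sample functions and the helper names `byte_count` and `token_count` are made up for illustration, and the choice to ignore comments and layout tokens is an assumption, not a standard definition of token count.

```python
import io
import tokenize

# Token types treated as layout rather than logical structure (an assumption).
_SKIP = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}

def byte_count(source: str) -> int:
    """Raw size of the source in UTF-8 bytes."""
    return len(source.encode("utf-8"))

def token_count(source: str) -> int:
    """Number of lexical tokens, ignoring comments and layout tokens."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return sum(1 for tok in tokens if tok.type not in _SKIP)

# The same function, once with long names and once with minimal names.
LONG_NAMES = (
    "def total_price(unit_price, quantity):\n"
    "    return unit_price * quantity\n"
)
SHORT_NAMES = (
    "def f(a, b):\n"
    "    return a * b\n"
)

if __name__ == "__main__":
    for label, src in (("long", LONG_NAMES), ("short", SHORT_NAMES)):
        print(label, "bytes:", byte_count(src), "tokens:", token_count(src))
```

Under these assumptions the byte count drops after renaming while the token count stays the same, which is the sense in which the logical structure is left intact.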

anon_d · on Sept 26, 2011

This metric relies on the assumption that people are trying to produce readable code. IMHO long variable names are much more helpful in complex codes than simple ones.

gruseom · on Sept 26, 2011

Ok, but now I'm wondering if we have opposite views of code size. In my view, code size is bad bad bad. More code means more complexity. Any time you add code, you're subtracting value; it's just that (if it's good code) you're adding more value than you're subtracting. So a higher score in a code size metric is a bad thing to aspire to, and we should greatly favor approaches to writing software that -- all other things being equal -- lead to smaller programs. I don't think that programmers who use long names for readability should have their programs discounted as longer (and thus more complex). Just because their names are longer doesn't mean their programs are.

anon_d · on Sept 26, 2011

No no no. My logic is this: Take tight, readable code with short names and replace them with long names, and you'll have worse code. The converse isn't true because complex (bad) codes are more readable with long variable names.

Complexity -> Code Size
Code Size -> Long Variable names (win for big codes)
Complexity is bad

Therefore long variable names are a symptom of a problem, but not the problem themselves. Long variable names aren't bad, but they are still a good predictor of badness. Since size metrics are meant to predict badness, long identifiers should increase size metrics.

gruseom · on Sept 26, 2011

Oh, I see. You sound like an APLer. We have similar tastes, but many good programmers disagree, so I doubt that long variable names are a predictor of program badness. Not every long name is FactoryManagerFactoryManagerFactory.

Consider a language like K, in which variables usually have one-letter names. The real code-size win for K is not that. It's that the language is so powerful that complex things can be expressed in remarkably compact strings of operators and operands. (Short variable names, I'd argue, are an epiphenomenon. It's because the programs are so small that you don't need anything longer, and longer names would drown out the logical structure of the program and make it harder to read.) Token count is a good metric here. Both line count and byte count come out artificially low, but token count can't.

gruseom · on Oct 2, 2011

I came back to say I've thought about your argument a couple more times and I think you're on to something there. The idea that long variable names, even when they add to readability, are a secondary indicator of code badness (because the code is too complex to get away with short names) is a subtle and interesting way to frame the problem. I'm surprised it didn't get more pushback from the 95+% of programmers who take the opposing view. I suppose this little corner of the thread is a quiet enough backwater that nobody noticed.

But I still don't see how you get around the objection that, according to your preferred metric, if you replace all the names with arbitrarily small character sequences, you get significantly smaller code - yet clearly not better code.

icefox · on Sept 26, 2011

Also reduced its maintainability.

jacques_chester · on Sept 26, 2011

One metric I've seen is gzip-compressed size, which has the nice property that it identifies the size of the incompressible elements -- i.e. it discounts repetitive boilerplate.
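A rough sketch of the gzip idea, using Python's standard `gzip` module. The helper names and the sample accessor snippet are hypothetical, chosen only to show how repeated boilerplate shrinks under compression.

```python
import gzip

def raw_size(source: str) -> int:
    """Size of the source in UTF-8 bytes."""
    return len(source.encode("utf-8"))

def gzip_size(source: str) -> int:
    """Size of the source after gzip compression, headers included."""
    return len(gzip.compress(source.encode("utf-8")))

# A hypothetical accessor, pasted 50 times to imitate copy/paste boilerplate.
ACCESSOR = "def get_value(self):\n    return self._value\n"
BOILERPLATE = ACCESSOR * 50

if __name__ == "__main__":
    print("raw bytes: ", raw_size(BOILERPLATE))
    print("gzip bytes:", gzip_size(BOILERPLATE))
```

The compressed figure for the pasted block is far smaller than its raw size, since the repeats cost almost nothing once the first copy has been encoded. Note that gzip adds a fixed header and trailer, so very small inputs can come out larger than they went in.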

Another interesting set of metrics is Halstead's "software science" metrics[1]. They fell out of favour because initially they were hard to count and didn't seem to correlate with anything else.

[1] http://en.wikipedia.org/wiki/Halstead_complexity_measures
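For a feel of how the Halstead measures are counted, here is a small sketch for Python source. The formulas (vocabulary n = n1 + n2, length N = N1 + N2, volume V = N * log2(n), difficulty D = (n1 / 2) * (N2 / n2)) are the usual Halstead definitions. How tokens are split into operators and operands is a convention; the split used here (punctuation and keywords are operators; names, numbers and strings are operands) is an assumption for illustration.

```python
import io
import keyword
import math
import tokenize
from collections import Counter

def halstead(source: str) -> dict:
    """Compute basic Halstead measures for a piece of Python source."""
    operators, operands = Counter(), Counter()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        is_keyword = tok.type == tokenize.NAME and keyword.iskeyword(tok.string)
        if tok.type == tokenize.OP or is_keyword:
            operators[tok.string] += 1   # assumed: punctuation and keywords
        elif tok.type in (tokenize.NAME, tokenize.NUMBER, tokenize.STRING):
            operands[tok.string] += 1    # assumed: identifiers and literals
    n1, n2 = len(operators), len(operands)                    # distinct counts
    N1, N2 = sum(operators.values()), sum(operands.values())  # total counts
    vocabulary = n1 + n2
    length = N1 + N2
    volume = length * math.log2(vocabulary) if vocabulary else 0.0
    difficulty = (n1 / 2) * (N2 / n2) if n2 else 0.0
    return {
        "vocabulary": vocabulary,
        "length": length,
        "volume": volume,
        "difficulty": difficulty,
    }

if __name__ == "__main__":
    print(halstead("def f(a, b):\n    return a * b\n"))
```

Because the measures depend only on which tokens occur and how often, renaming every identifier consistently leaves all four numbers unchanged.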

anon_d · on Sept 26, 2011

I never understood the gzip one. Repetitive boilerplate is bad; why hide it?

jacques_chester · on Sept 26, 2011

You're trying to understand the "true" size of the software in spite of the idiosyncrasies of a given language.

As I noted somewhere above, "size" is an abstract, dimensionless quality. It can only be approached through proxies. The more the merrier, I reckon, especially if they turn out to correlate with different things.

jcromartie · on Sept 26, 2011

In the case of most projects, copy/paste code is not just because of the language. It's because of lousy programmers. I've seen large codebases which are made up of a full 40% duplicate code. There's no way to blame that on the language.

pyre · on Sept 26, 2011

You're missing the point of the parent: There is no 'one true' metric. If you use different metrics (actual lines, logical lines, gzip'd size, etc) you may well find different correlations.

gruseom · on Sept 26, 2011

But repetitive boilerplate is exactly the last thing that should get away scot-free in a measurement of code size.

jacques_chester · on Sept 26, 2011

It depends on what you want to know. Pure physical lines is one thing, "size" is another.

gruseom · on Sept 26, 2011

I want a way to measure how complicated a program is that's independent of language and obviously extraneous things like line length.

jacques_chester · on Sept 26, 2011

You may find that the Halstead metrics I mentioned are closer to what you're after.

gruseom · on Sept 26, 2011

I've changed my mind. I'm interested in what I originally said: what's the best way to measure code size, and what are those studies (if they exist). Otherwise we get into debates about size vs. complexity, which is actually less interesting IMO. Size as a proxy for complexity is good enough for me.