Not that I'm a big fan of Jython, but including startup time in a benchmark is only useful for very short-running command-line tools. It says nothing about the speed of the JIT or the code that's being tested.
I thought about that, but decided against subtracting out startup time. This is a real-world benchmark of how long it takes to run some actual code that I care about. Startup time counts, but the cumulative execution time will be dominated by time taken to run the slower programs, not startup time in the trivial ones, so I don't think I'm counting it too much.
But I think next time I do this I'll increase the max runtime from 1 minute to 3. That will keep more of the slower programs in the benchmark, and make startup time count that much less. Without actually removing it, because that seems too artificial and contrived to me.
Sure, and I did mention command line utilities. The issue I see with the benchmark is that it compares apples and oranges. For some programs it tests JIT performance and for others it tests startup time. If I want to test startup time, I just use a hello world program. To test JIT/interpreter performance, I try to exclude startup time.
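For what it's worth, the usual way to exclude startup time is to take timestamps inside the process itself rather than timing the whole invocation from the shell. A minimal sketch, where `workload` is a hypothetical stand-in for the code under test:

```python
import time

def workload():
    # Hypothetical stand-in for the code being benchmarked.
    return sum(i * i for i in range(1_000_000))

# perf_counter starts after the interpreter (or JVM, under Jython) is
# already up, so startup cost is excluded from the measurement.
start = time.perf_counter()
result = workload()
elapsed = time.perf_counter() - start
print(f"workload took {elapsed:.3f}s (startup excluded)")
```

By contrast, `time python script.py` from the shell measures startup plus the workload, which is the apples-and-oranges mix being discussed here.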
Hello world programs only ever rely on the bare runtime, though; real utilities have to load libraries after the runtime loads, which incurs further delays (which might depend on JITing/interpreting speed if they're not pre-compiled).
The same can be said about benchmarking numeric vs. I/O code. Is it apples and oranges and bananas then? Most programs have a mix of everything. Whether Euler's problems are a good mix that represents your workload is up to you to decide.
In addition to what fauigerzigerk said, if you run a lot of command line utilities where startup time is relevant, there are established solutions to that (nailgun for example) so it still isn't a fair comparison. If you want a comparison of state of the art solutions, which this post obviously is, put nailgun into the mix.
Those utilities are often shell-scripted to run many times in a row. At least in interactive use, the multiplied startup times can grow annoyingly long.
You could use one of those Java background daemons to do that, but anyway, what I was trying to say is just that this benchmark doesn't test JIT or interpreter performance in the case of Jython.
Also, the HotSpot VM needs warmup to achieve maximum performance, since it takes some time to detect and JIT-compile the performance-critical parts of the code with all optimizations applied.
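A benchmark harness typically accounts for that by running and discarding some initial iterations before measuring. A minimal sketch of the idea (the `bench` helper and its parameters are illustrative, not from any particular library):

```python
import time

def bench(fn, warmup=5, runs=5):
    # Warmup iterations: give a JIT (e.g. HotSpot under Jython) a chance
    # to detect and compile hot code paths; results are discarded.
    for _ in range(warmup):
        fn()
    # Measured iterations: report the best time, which is least affected
    # by residual compilation work and other noise.
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return min(times)
```

Real harnesses like JMH do this far more carefully (forked JVMs, statistical analysis), but the basic shape is the same.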