
Huh? What?

Several points:

* Who said the 10MB is all used at once?

* I don't know your hardware, but there is a very, very good chance you are actually quite wrong about <10MB of cache. These days, most magnetic disks have more cache than that, and if you are using SSDs, there's a boatload more cache than that in there.

* If you were referring strictly to CPU cache, then I'm even more confused, because the entire existence of that stuff is predicated on it being faster than memory, so... (and even still, if your total CPU cache isn't 10MB, it likely isn't that much smaller).

* It's not like the whole package would sit in RAM the whole time anyway. By your same assertion, I could say that one of my CPU registers is only 64 bits wide, so I imagine all programs larger than 64 bits can't run faster than L3 cache...

I'm not sure why you'd say it is too big. The article page is 1.4 MB alone... and it still needs to leverage a general purpose runtime/JIT that is orders of magnitude larger to do its single fixed purpose.




> Who said the 10MB is all used at once?

The parent was suggesting that this was all that was actually needed out of the 100MB or so download. If you think the JVM is smaller, how small is it exactly?

> If you were referring strictly to CPU cache, then I'm even more confused, because the entire existence of that stuff is predicated on it being faster than memory, so... (and even still, if your total CPU cache isn't 10MB, it likely isn't that much smaller).

http://www.intel.co.uk/content/www/uk/en/processors/core/cor...

I don't have anything with 10MB cache.

> It's not like the whole package would sit in RAM the whole time anyway. By your same assertion, I could say that one of my CPU registers is only 64-bits wide, so I imagine all programs larger than 64-bits can't run faster than L3 cache...

If you get into L1, you get about 1000x faster.

http://tech.marksblogg.com/benchmarks.html

> I'm not sure why you'd say it is too big.

Maybe I have a different perspective? If a 600kb runtime is 1000x faster, I want to know what I get by being 10x bigger. I'm quite surprised that there are so many responders defending it given that these benchmarks were just on Hacker News a few days ago.


Unless you linearly scan the whole binary all the time, your CPU makes sure that only the stuff you're currently using is in the cache, so only the data your hot loop is touching.
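To make that concrete (numbers here are illustrative, not from the thread): a hot loop's working set, not the size of the binary on disk, is what the cache has to hold. The loop below touches only a 4KB array no matter how many megabytes the surrounding runtime occupies.

```java
public class HotLoop {
    // A 4KB working set: 512 longs * 8 bytes. After the first pass it is
    // cache-resident, regardless of how large the rest of the program is.
    static final long[] WORKING_SET = new long[512];

    static {
        for (int i = 0; i < WORKING_SET.length; i++) WORKING_SET[i] = i;
    }

    static long sum(int passes) {
        long total = 0;
        for (int p = 0; p < passes; p++)
            for (long v : WORKING_SET) total += v; // touches only 4KB
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sum(1000)); // prints 130816000 (sum 0..511 = 130816, times 1000)
    }
}
```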

You could easily see that your assumption is wrong by observing that a typical C application is not 1000 times faster than a typical Java application.


> Unless you linearly scan the whole binary all the time, your CPU makes sure that only the stuff you're currently using is in the cache, so only the data your hot loop is touching.

Cache fills optimize for linear scans, and have nothing to do with eviction.
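As an illustration of what "optimize for linear scans" means in practice (a sketch, not anyone's benchmark): traversing a 2-D array row-by-row follows the linear memory layout that hardware prefetchers are built for, while traversing it column-by-column jumps a full row-width between accesses. Both loops compute the same sum; only the access pattern differs.

```java
public class Traversal {
    static final int N = 1024;

    // Build an N x N matrix of deterministic values.
    static int[][] build() {
        int[][] m = new int[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) m[i][j] = i ^ j;
        return m;
    }

    static long rowMajor(int[][] m) {
        long s = 0;
        // Linear in memory: consecutive elements of a row are adjacent, so
        // each 64-byte cache line fill serves 16 ints and the prefetcher
        // can stream ahead of the loop.
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) s += m[i][j];
        return s;
    }

    static long colMajor(int[][] m) {
        long s = 0;
        // Strided: successive accesses land a whole row apart, which defeats
        // the prefetcher; on matrices larger than cache this is typically
        // several times slower despite doing identical arithmetic.
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++) s += m[i][j];
        return s;
    }

    public static void main(String[] args) {
        int[][] m = build();
        System.out.println(rowMajor(m) == colMajor(m)); // prints true: same result
    }
}
```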

> You could easily see that your assumption is wrong by observing that a typical C application is not 1000 times faster than a typical Java application.

What assumption are you talking about?

Where do you find your typical applications? Spark is supposed to be one of the fastest Java implementations of a database system, and it's 1000x slower than the fastest C database systems, but this is clearly a problem limited by memory.

What about problems that are just CPU-bound? C is at least 3x faster than Java for those[1], so just by being "a little bit faster" (if 3x is a "little" faster) then as soon as we introduce latency (like memory, or network, or disk, and so on) this problem magnifies quickly.

[1]: https://benchmarksgame.alioth.debian.org/u64q/compare.php?la...


> Spark is supposed to be one the fastest Java-implementations of a database system, and it's 1000x slower than the fastest C-implementation database systems, but this is clearly a problem limited by memory.

Wow... so much wrong here, I'm not sure how to unpack it all.

a) Spark is Scala, not Java, though both do use the JVM, so I'll give you that.

b) Spark is not a database system, though it is a framework for manipulating data.

c) Spark is generally considered to be much faster than Hadoop, and does its job well, but I'm not sure it qualifies as the fastest anything.

d) By any reasonable interpretation, the fastest Java database system is definitely not Spark. You will find that benchmarks of Java database systems generally don't even include Spark (as an example https://github.com/lmdbjava/benchmarks/blob/master/results/2...)

e) Fast is an ambiguous term... usually you are looking at things like latency, throughput, efficiency, etc. I'm not sure which you mean here.

f) If you know anything at all about runtimes, you'd know that if you've found a Java based system that is 1000x slower than a C based system, either your benchmark is extremely specialized, broken, or you are comparing apples & oranges.

Look, Java certainly has some overhead to it, and sometimes it significantly impacts performance. Before you get too excited about attributing it to runtime size, you might want to look at the size of glibc...


> By any reasonable interpretation, the fastest Java database system is definitely not Spark

What database would you recommend for solving the taxi problem using the JVM?

> Spark is Scala, not Java, though both do use the JVM, so I'll give you that.

What does JVM stand for? I was under the impression that we were talking about its size (10MB vs. 100MB).

> You will find that benchmarks of Java database systems generally don't even include Spark

And? What are we talking about here?

> If you know anything at all about runtimes, you'd know that if you've found a Java based system that is 1000x slower than a C based system, either your benchmark is extremely specialized, broken, or you are comparing apples & oranges.

Why?

We're talking about business problems, not about microbenchmarks.

If this is a business problem, and I solve it in 1/1000th the time, for roughly the same cost, then what exactly is your complaint?

> Fast is an ambiguous term... usually you are looking at things like latency, throughput, efficiency, etc. I'm not sure which you mean here.

It's not ambiguous. I'm pointing to the timings for a specific, and realistic business problem.

> Look, Java certainly has some overhead to it, and sometimes it significantly impacts performance. Before you get too excited about attributing it to runtime size, you might want to look at the size of glibc...

Does Java include glibc?

What exactly is your point here?


> What database would you recommend for solving the taxi problem using the JVM?

You have me at a disadvantage here... The only taxi problem that comes to mind is a probability problem that I'd not likely use a database for at all...

> If this is a business problem, and I solve it in 1/1000th the time, for roughly the same cost, then what exactly is your complaint?

If you came to the conclusion that your business problem runs 1000x faster because of differences in the runtime... you've made a mistake. It is far more likely your benchmark is flawed, or there are significant differences in the compared solutions beyond just the runtimes.

Seriously, I've spent a career dealing with situations exactly like that: "hey, this is 1000x slower than what we were doing before... can you fix that?". Once you are dealing with optimized runtimes, while there can be important differences between them, there just isn't that much room left for improvement.

> It's not ambiguous. I'm pointing to the timings for a specific, and realistic business problem.

The problem is perhaps not ambiguous to you, but you haven't described it in terribly specific terms. More importantly, though, you haven't described what you mean by "faster". That's the ambiguity.

> Does Java include glibc?

> What exactly is your point here?

C programs do. Lots of very efficient, high performance C programs.


> The only taxi problem that comes to mind

It's the problem that I linked to previously.

http://tech.marksblogg.com/benchmarks.html

Finding good benchmarks is hard. Business problems are good ones because they reflect how experts will actually solve problems with these tools, and we can discuss the choice of tooling, whether this is the right way to solve the problem, and even what the best tool for this problem is -- in this case, GPU beats CPU, but what's amazing is just how close a CPU-powered solution gets by turning it into a memory-streaming problem (which the GPU needs to do anyway).
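For context, the queries in that benchmark are simple aggregations over the taxi-trip table; the first one is, if memory serves, essentially a count grouped by cab type. A toy sketch of the same shape (the sample rows here are invented; the real benchmark runs this over roughly 1.1 billion rows) in plain Java streams:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TaxiQuery {
    // Hypothetical miniature of the trips table.
    record Trip(String cabType, double fare) {}

    // Same shape as: SELECT cab_type, count(*) FROM trips GROUP BY cab_type
    static Map<String, Long> countByCabType(List<Trip> trips) {
        return trips.stream()
                .collect(Collectors.groupingBy(Trip::cabType, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Trip> trips = List.of(
                new Trip("yellow", 9.5),
                new Trip("green", 7.0),
                new Trip("yellow", 12.0));
        System.out.println(countByCabType(trips).get("yellow")); // prints 2
    }
}
```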

> If you came to the conclusion that your business problem runs 1000x faster because of differences in the runtime...

I haven't come to any conclusion.

There are a lot of differences between a JVM-powered business solution and a KDB-powered business solution, however one striking difference is the cache-effect.

However the question remains: What exactly do we get by having a big runtime? That we get to write loops?


> what's amazing is just how close a CPU-powered solution gets by turning it into a memory-streaming problem (which the GPU needs to do anyway).

Yes, it turns out the algorithmic approach you use to solve the problem tends to dwarf other factors.

> There are a lot of differences between a JVM-powered business solution and a KDB-powered business solution, however one striking difference is the cache-effect.

Wait, you looked at those benchmarks and came to the conclusion that the language runtimes were the key to the differences?

> However the question remains: What exactly do we get by having a big runtime? That we get to write loops?

There is absolutely no intrinsic value in a big runtime.

Now, one can trivially make a <1KB read-eval-print runtime. So I'll answer your question with a question: why do people not use <1KB runtimes?
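For concreteness, the eval core of such a runtime really is tiny -- here's a toy read-eval step (a postfix-expression evaluator, purely illustrative) whose whole point is that smallness alone buys you nothing useful:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class TinyRepl {
    // A toy "eval": evaluate a postfix (RPN) expression. The eval core is a
    // few lines; usefulness, not size, is what bigger runtimes buy you.
    static double eval(String expr) {
        Deque<Double> stack = new ArrayDeque<>();
        for (String tok : expr.trim().split("\\s+")) {
            switch (tok) {
                case "+" -> stack.push(stack.pop() + stack.pop());
                case "*" -> stack.push(stack.pop() * stack.pop());
                default -> stack.push(Double.parseDouble(tok));
            }
        }
        return stack.pop();
    }

    public static void main(String[] args) {
        System.out.println(eval("3 4 + 2 *")); // prints 14.0, i.e. (3+4)*2
    }
}
```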


> Wait, you looked at those benchmarks and came to the conclusion that the language runtimes were the key to the differences?

At the risk of repeating myself: I don't have any conclusions.

> There is absolutely no intrinsic value in a big runtime.

And yet there is cost. It is unclear if that cost is a factor.

> Now, one can trivially make a <1KB read-eval-print runtime. So I'll answer your question with a question: why do people not use <1KB runtimes?

Because they are not useful.

We are looking at a business problem, thinking about the ways people can solve that problem, and cross-comparing the tooling used by those different solutions.

Is there really nothing to be gained here?

The memory-central approach clearly wins out, and the fact that we can map-reduce across cores or machines as the problem gets bigger is a huge advantage of the KDB-powered solution. It's also the obvious implementation for a KDB-powered solution.
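The "map-reduce across cores" part can be sketched in miniature with Java parallel streams (single-machine only; distributing across machines is what Spark or KDB add on top):

```java
import java.util.stream.LongStream;

public class ParallelSum {
    // Map-reduce across cores in miniature: the range is split into chunks,
    // each core sums its chunk (map), and the partial sums are combined
    // (reduce). The result is identical to the sequential sum.
    static long parallelSum(long n) {
        return LongStream.rangeClosed(1, n).parallel().sum();
    }

    public static void main(String[] args) {
        System.out.println(parallelSum(1_000_000)); // prints 500000500000
    }
}
```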

Is this Spark-based solution not the typical way Spark is implemented?

Could a 10MB solution do the same if it can't get into L1? Is it worth trying to figure out how to make Spark work correctly if the JVM's size is the limit? Is it really a limit?

There are a lot of questions here that require more experiments to answer, but one thing stands out to me: Why bother?

If I've got a faster tool, that encourages the correct approach, why should I bother trying to figure these things out? Or put perhaps more clearly: What do I gain with that 10MB?

That CUDA solution is exciting... There is stuff to think about there.


> At the risk of repeating myself: I don't have any conclusions.

For someone who doesn't have any conclusions, you're making a lot of assertions that don't jibe with reality.

> And yet there is cost. It is unclear if that cost is a factor.

It's a factor... just not the factor you think it is.

> Because they are not useful.

I think you grokked it.

> The memory-central approach clearly wins out so heavily (and the fact we can map-reduce across cores or machines as our problem gets bigger) is a huge advantage in the KDB-powered solution. It's also the obvious implementation for a KDB-powered solution.

KDB is a great tool, but you are sadly mistaken if you think the trick to its success is the runtime. That its runtime is so small is impressive, and a reflection of its craftsmanship, but it isn't why it is efficient. For most data problems, the runtime is dwarfed by the data, so the efficiency with which the runtime organizes and manipulates the data dominates other factors, like the size of the runtime. This should be obvious, as this is a central purpose of a database.
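One concrete way data organization dominates (a hedged sketch, not KDB's actual internals): scanning one field of row-oriented records drags every other field through the cache, while a columnar layout, the approach KDB is known for, streams only the bytes the query needs. Both functions below return the same answer; the difference is how many bytes the scan must pull through memory.

```java
public class Layouts {
    // Row-oriented: each record's fields sit together, so summing one field
    // pulls the whole record through the cache.
    record Row(long id, double fare, long timestamp) {}

    static double sumFaresRows(Row[] rows) {
        double s = 0;
        for (Row r : rows) s += r.fare();
        return s;
    }

    // Column-oriented: each field is its own dense array, so summing fares
    // is a pure linear scan over exactly the bytes needed.
    static double sumFaresColumn(double[] fares) {
        double s = 0;
        for (double f : fares) s += f;
        return s;
    }

    public static void main(String[] args) {
        int n = 1000;
        Row[] rows = new Row[n];
        double[] fares = new double[n];
        for (int i = 0; i < n; i++) {
            rows[i] = new Row(i, i * 0.5, i);
            fares[i] = i * 0.5;
        }
        System.out.println(sumFaresRows(rows) == sumFaresColumn(fares)); // prints true
    }
}
```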

> There are a lot of questions here that require more experiments to answer, but one thing stands out to me: Why bother?

Yes, you almost certainly shouldn't bother.

Spark/Hadoop/etc. are intended for massively distributed compute jobs, where the runtime overhead on an individual machine is trivial compared to the inefficiencies you might encounter from failing to orchestrate the work efficiently. They're designed to tolerate cheap heterogeneous hardware that fails regularly, so they make a lot of trade-offs that hamper getting to anything resembling peak hardware efficiency. You're talking about a runtime fitting in L1, but these are distributed systems that orchestrate work over a network... Your compute might run in L1, but the orchestration sure as heck doesn't. Consequently, they're not terribly efficient for smaller jobs. There is a tendency for people to use them for tasks that are better addressed in other ways. It is unfortunate and frustrating.

Until you are dealing with such a problem, they're actually quite inefficient for the job... but that inefficiency is not a function of JVM.

Measuring the JVM's efficiency with Spark is like measuring C++'s efficiency with Firefox.

> If I've got a faster tool, that encourages the correct approach, why should I bother trying to figure these things out? Or put perhaps more clearly: What do I gain with that 10mb?

If you read the documentation, the gains should be clear. If you are asking the question, likely the gains are irrelevant to your problem. I would, however, caution you to worry less about the runtime size and more about the runtime efficiency. The two are often at best tenuously related.


If your assumption that a 10MB JVM kills the cache were true, then the alioth benchmarks you posted wouldn't show a speed difference of only ~3x. I suggest you learn a bit more about how CPUs work and what benchmarks mean before posting bold claims.


Why not? Those problems fit into cache.


Because they are 333x slower than you'd expect.


> I don't have anything with 10MB cache.

The link you provided was to three distinct models of i7 processors... all with 8MB of L3 cache. I would argue that 8MB isn't much smaller than 10MB, but I will understand if you disagree. However, even the slowest of those processors also has 1MB of L2 cache and 256KB of L1 cache, not to mention other "cache-like" memory in the form of renamed registers, completion queues, etc. Adding those up: 8MB + 1MB + 256KB = 9.25MB, so at most we're talking <800KB shy of 10MB in cache.

> If you get into L1, you get about 1000x faster.

I think you are making my point for me.

> Maybe I have a different perspective? If a 600kb runtime is 1000x faster, I want to know what I get by being 10x bigger.

You are assuming that at all times all of that 10MB must be touched by the processor at once. You can have a 10MB runtime where most of the cycles are being spent on a hot spot of <4KB of data. Having a hot spot that is orders of magnitude smaller than the full runtime is totally unsurprising. It's particularly true when your runtime has a JIT in it. With a JIT, most of the time, the bytes that are being executed aren't part of that 10MB, but rather are generated by it. Are you going to penalize your 600KB runtime for the size of the source code? ;-)



