So why is the TypeScript bench slower?



My best guess: it was probably written by someone who doesn't know how to write performant TS. I say this with no skin in the game; I write neither JS nor TS. What I have observed in language benchmarks over the years is that the benchmarks are rarely written by an expert, but usually by someone with cursory knowledge of the language, i.e. just enough to be dangerous.

Oftentimes, these sorts of benchmarks are done with prejudice (not necessarily malice). The benchmarks are written by someone with something to prove: my chosen tech stack performs better, and let me show you why. A favorite of mine is Perl vs Python comparisons, where you see an idiomatic Perl implementation vs a non-idiomatic Python implementation (or the other way around). Typically, in a head-to-head comparison, the benchmarks are developed by the same individual, who likely has above-average knowledge in their favorite and below-average knowledge in the target they're trying to show as inferior.

You'll see this time and time again in internet benchmarks comparing performance. Unless you can see the code for all benchmarks involved, my suggestion is to avoid them. I mean, for all I know, the author of the benchmark was unaware of the built-in sort and bubble sorted instead.
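To make the kind of gap I mean concrete, here's a deliberately silly sketch (not anyone's actual code, and nothing like this is in any real benchmark I've seen):

    // Hand-rolled bubble sort: O(n^2) comparisons and swaps.
    function bubbleSort(a: number[]): void {
      for (let i = 0; i < a.length - 1; i++) {
        for (let j = 0; j < a.length - 1 - i; j++) {
          if (a[j] > a[j + 1]) {
            const t = a[j];
            a[j] = a[j + 1];
            a[j + 1] = t;
          }
        }
      }
    }

    // The built-in sort: O(n log n) and heavily optimized by the engine.
    const xs = [5, 3, 1, 4, 2];
    xs.sort((x, y) => x - y);

On small inputs nobody would notice; on benchmark-sized inputs the difference dominates everything else being measured.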


This is unfortunately pure speculation on top of pure speculation, which is the problem I have with the top comment. You’re assuming incompetence when you could just go look it up. Why assume it’s someone who doesn’t know? Why use that to wander off into rant land about prejudices and make broad claims that internet benchmarks are bad, when you admit to having zero idea what the actual specific problem here is?

The test that lowered TypeScript’s score in the paper is called fannkuch-redux, and here are the sources in question:

https://github.com/greensoftwarelab/Energy-Languages/blob/ma...

https://github.com/greensoftwarelab/Energy-Languages/blob/ma...

They are both contributed by the same person, and there is no bubble sort involved. So now you know.

I don't see an obvious reason one would be slower, but they're also quite different. Maybe the algorithmic complexity is different. Maybe the cross-compilation is doing something bad with memory allocation. Note that the input sizes for this test are very small; it would be easy for a difference in the temporary variables the compiler injects to cause a serious problem.
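For instance (a hypothetical sketch of mine, not something I've verified against the actual submission), an idiomatic TypeScript swap can pick up a hidden allocation when compiled down to ES5:

    // Hypothetical: an element swap via destructuring, the kind of
    // operation fannkuch-redux performs constantly while reversing
    // permutation prefixes.
    function swap(perm: number[], i: number, j: number): void {
      [perm[i], perm[j]] = [perm[j], perm[i]];
    }

    // Roughly what `tsc --target ES5` emits for the body:
    //   _a = [perm[j], perm[i]], perm[i] = _a[0], perm[j] = _a[1];
    // A temporary array is allocated on every swap -- GC pressure that a
    // hand-written three-assignment swap in plain JavaScript avoids.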

What is not obvious is any prejudice, malice, or incompetence.


> Why use that to wander off into rant land about prejudices and make broad claims that internet benchmarks are bad

What are you talking about? Where did that happen?

> when you admit to having zero idea what the actual specific problem here is?

This is the comment section for a submission about an article referencing the paper. I brought it up for discussion. It is perfectly valid to bring up a question that you don't know the answer to.

> What is not obvious is any prejudice, malice, or incompetence.

Please stop.

Edit: From another comment, and some deeper digging of my own from that, you might find the archived results of the fannkuch-redux test interesting. From 2017-08-01[1] to 2017-09-18[2], the benchmark changed from a running time of 1,204.93 seconds to a running time of 131.39 seconds, roughly a ninefold drop. The paper was released in October 2017.

1: https://web.archive.org/web/20170901020804/http://benchmarks...

2: https://web.archive.org/web/20170918163900/http://benchmarks...


> What are you talking about? Where did that happen?

I was responding directly to @hermitdev. Did you get your threads crossed? What I'm talking about happened immediately above in the parent comment, beginning with "Often times, these sorts of benchmarks are done with prejudice" https://news.ycombinator.com/item?id=19527057

"You'll see this time and time again in internet benchmarks comparing performance."

> It is perfectly valid to bring up a question that you don't know the answer to.

I agree. It's a bummer that's not really what happened here.

>> What is not obvious is any prejudice, malice, or incompetence.

> Please stop.

The parent comment explicitly stated an assumption of both incompetence and prejudice, and I responded directly to that.

From the HN guidelines: "Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith."

If you'd like me not to call out speculation, then please assume good faith and don't speculate next time.

> From 2017-08-01[1] to 2017-09-18[2], the benchmark changed from a running time of 1,204.93 seconds to a running time of 131.39 seconds.

Yes! Now we are getting somewhere. It appears that would change the outcome of the paper. Perhaps it was a mistake. That might mean it was nothing more than an oversight that already got fixed. It doesn't mean there is any other coloring of the study at all, nor that there was any intention or agenda to make TypeScript look bad, right?


> The parent comment explicitly stated an assumption of both incompetence and prejudice and I responded directly to that.

Perhaps I misinterpreted what you said. You started the paragraph referring to the top-level comment, which was mine. I took the "you're" in "You're assuming incompetence when you could just go look it up" to be a general "you", and commentary on my original comment.

> From the HN guidelines: "Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith."

I actually looked this up before the GP comment, and almost included it myself. I can see now that you were implicating the comment you replied to. I didn't think that was the case, because I apparently didn't interpret that comment remotely the same way you did.

> Perhaps it was a mistake. That might mean it was nothing more than an oversight that already got fixed. It doesn't mean there is any other coloring of the study at all, nor that there was any intention or agenda to make TypeScript look bad, right?

I never implied it was. For that matter, I didn't really interpret the comment in question as stating that either. The more charitable interpretation is not that they are trying to make another language look bad, but that they are trying to make their favorite language look good. That doesn't require purposefully tanking one benchmark; it just requires them to be much better versed in optimizing one language than another, and unaware of that gap. As they say, never attribute to malice what can be explained by incompetence. In fact, if you read the comment carefully, they even call this out with the "not necessarily malice" remark.


> The more charitable interpretation is not that they are trying to make another language look bad, but that they are trying to make their favorite language look good.

The project doesn't talk about favorites or seem to want to make certain languages look good. Jumping to the conclusion that bias is involved isn't the good-faith interpretation, even if you state it with a positive-sounding framing. The good-faith interpretation is to take the stated project goals at face value and assume that the participants have done a good job.


> The project doesn’t talk about favorites or seem to want to make certain languages look good. Jumping to the conclusion that bias is involved isn’t the good faith interpretation

I didn't see anywhere that the comment in question called any project's bias into question. It instead noted that in a situation where work is crowdsourced, people with their own intentions and motivations will put out bad benchmarks, whether in the benchmarks game itself or in a specific benchmark or comparison put forth in an article or blog. I've personally witnessed the latter multiple times just from HN submissions.

I just want to end with this: as someone who's brought up viewing comments in an uncharitable light, you seem to have done a lot of that in this discussion. You've repeatedly taken your interpretation of a comment, rephrased it in a harsher way, stated it as fact as though it were what the other person was saying, and then responded to that. I would think actually trying to find a charitable interpretation should at least include a question at the beginning to confirm whether what you think is being said is entirely correct. Note that I started with that when I thought you were attributing statements to me that I did not say; my first words were a solicitation, "What are you talking about? Where did that happen?", to confirm what was going on.

You've been doing this from your first response to my top-level comment, when you stated "But you're using that assumption to cast slippery-slope doubt on the whole project without knowing anything specific." That's a very uncharitable rephrasing of what you think I was doing, and it certainly wasn't my intention. I've already outlined exactly what I was trying to do and why, and in doing so I also stated that I felt you were misinterpreting me. There's a clear trend here as I see it, and your repeatedly bringing up good-faith assumptions just puts it into clear highlight.

I think we've covered about all there is to say on these topics. I'll let you have the last word if you wish. I'll read and promise to consider any points you raise, but I don't think my responding would be very fruitful, and this discussion has digressed far enough.


Do you agree that those very different times were measurements of the same TypeScript fannkuch-redux program?

5 July, Node 8.1.3, TypeScript 2.4.1

https://web.archive.org/web/20170715120038/http://benchmarks...

1 Sep, Node 8.4.0, TypeScript 2.5.2

https://web.archive.org/web/20170922144419/http://benchmarks...

----

How should we now assess your "suspiciously like entirely different algorithms were used in each implementation" comment?


> Do you agree that those very different times were measurements of the same TypeScript fannkuch-redux program?

Yes.

> How should we now assess your "suspiciously like entirely different algorithms were used in each implementation" comment?

The suspicion was incorrect. That's why it was presented as a suspicion, not as fact. I have no reason to defend it if it's incorrect, but I still defend that it was valid to raise questions, given the facts on the ground. We've now shown there was something that changed very drastically at that time, and while it's less likely to be the benchmarks themselves (unless one or both of those are fairly out-of-date Node versions)[1], it still points towards something to be aware of in the results presented. Namely, they rely on a lot of underlying assumptions, which should be examined if you care about the numbers.

1: Also, I imagine the V8 devs probably considered the performance of TypeScript in that case to be a bug, given how horrible the regression from JavaScript is and that it's still JavaScript running. It's possible that TypeScript was doing something really odd, but given the exposure, and Microsoft's backing and developer time, I think that's a less likely scenario than some optimization that should have been triggered going missing, which happens quite often.
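If anyone wants to poke at that theory, V8 will report its optimization decisions. Something like this (the file name and input size are illustrative, and the flag output format varies across Node versions):

    # Compile the TS submission and run it, watching for functions V8
    # optimizes and then bails out of ("deoptimizes").
    tsc --target ES5 fannkuchredux.ts
    node --trace-opt --trace-deopt fannkuchredux.js 11

A hot function that keeps getting deoptimized is the classic signature of a roughly 10x regression like the one in those archived results.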


Please add a correction to your original comment, to prevent readers from being misled. (If it's closed to edits, I'm sure HN staff will open it when you ask).

> I still defend that it was valid to raise questions

Of course, it's valid to question a measurement that looks strange, but your comment went further than that -- your comment, without evidence, assumed a cause; and, without evidence, implied that the assumed cause led to widespread problems with the analysis.

In other words -- innuendo.


> Please add a correction to your original comment, to prevent readers from being misled.

Corrections are for facts; I put forth a theory. People being misled by a theory is something I have limited power to affect, and people who take theories they read on the internet as fact have larger problems than a correction would solve.

This discussion is the correction, and a better one than most would be willing to read. Were it within the two-hour edit window, I would throw in an edit; I've done so numerous times in the past. I will ask HN to amend its rules so I can correct a statement I made about something I suspected.

> Of course, it's valid to question a measurement that looks strange but your comment went further than that -- your comment, without evidence, assumed a cause

This is incorrect. I had evidence: numbers that did not line up with my understanding of how things should have been, given my knowledge of the subject. I presented that as a theory by using the word "suspect". All I implied is that if that theory was correct, which I made sure not to assert as fact, then it might affect some other languages. I did not assume a cause; I assumed a possible cause, and presented it as such.

I am very particular with my language. I try not to state things as fact when they are not. I try my absolute hardest (and I believe I succeed) to always speak in good faith, where I'm trying to raise a point I think is worthwhile or ask a question where I think there is benefit. I'm actually rather bothered by how some people interpreted my words and intentions, and that includes you. Since you're not the only one (although I do believe you're in the minority), I'll assume there's something I could have done better to represent my point. I don't think all the blame lies with me, though.

There should be some way for me to posit a question and advance a theory without people assuming bad faith, so my question to you is: what way is that? How could I have expressed concern over the results without triggering that interpretation from you? I don't think doing personal research on a problem is an acceptable prerequisite for raising a question. In this case, I could have spent hours looking into something I was unfamiliar with and come away with more answers, but many people may not have the knowledge to do so, yet have enough to think something is wrong. Should they just keep their mouths shut? Are we in a time where raising a concern that turns out to be unfounded (or in this case, just more complicated and slightly misdirected) is unacceptable under any circumstance? I refuse to accept that.


The honest concern is that the reported time measurements for those JavaScript and TypeScript fannkuch-redux programs seem too different.

The honest question is -- Can someone please confirm that those programs implement the same algorithm?


I thought the benchmarks game was set up so every language's advocates could tune their language's programs. The only chance at a fair comparison is if every language gets the best implementation it can find for the challenges. Nothing else approaches a fair comparison of apples and oranges.



