I remember the moment at uni when I realized how fast computers are. I was in an algorithms course, and one of our projects was to make a program which would read in the entire dataset from IMDB of films and actors, and calculate the shortest path between any actor and Kevin Bacon using actors and movies as nodes and roles as edges.
I was working in C, and looking back I came up with a quite performant solution mostly by accident: all the memory was allocated up front in a very cache-friendly way.
The first time I ran the program, it finished in a couple seconds. I was sure something must have failed, so I looked at the output to try to find the error, but to my surprise it was totally correct. I added some debug statements to check that all the data was indeed being read, and it was working totally as expected.
I think before then I had a mental model of a little person inside the CPU looking over each line of code and dutifully executing it, and that was a real eye-opener about how computers actually work.
>> The first time I ran the program, it finished in a couple seconds. I was sure something must have failed
At one of my first jobs I was a DBA supporting a CRUD app in the finance industry. The app had one report that took forever and usually timed out, and I was told to take a look at it. The DB query was just missing a couple of indexes, so I added those.
After I added them, my boss told one of the users of the app to try out the report and she said it was still broken. He asked what she meant and she said she clicked the button and the page with the results came up right away. She thought it was broken because it didn't take forever.
If I recall correctly, the first few ATMs near Wall Street had the same issue. They were too fast and people were suspicious. They had to add in a delay so folks would feel alright using them.
That's pretty funny, and is literally what I did with a CLI tool I made once. It was supposed to loop through something that was over 10,000 entries long. It finished in under a second.
I decided to add a small fraction of a second every X iterations and output some garbled data to the terminal. I got paid a nice little sum because of that. Sometimes, knowing how to make something look complicated is as important as doing something complicated.
I was working at my uni library around '97 when a fresh T3 (~45 Mbps) line was just installed... We also got brand new top-of-the-line Micron computers there. I was the first person to test the connection, and after years of working on 56k modems I couldn't believe how everything I clicked suddenly worked at the speed of light. Videos I clicked on (on MTV's web site back then) loaded instantly; it almost felt as if they loaded before I clicked on the links... I have never had anything load as quickly since, even on my home Internet, where I'm directly connected to my router on a 250 Mbps plan.
I blame all the ads, tracking, and bloatware that is prevalent now most of all.
It’s unclear that’s the case, because there are other limitations at work! Merging is a huge time cost and slows down two lanes; the more lanes, the more merges are necessary to use the new lanes. There’s some work suggesting it makes more than three lanes in one direction essentially useless in urban areas - and that’s where the traffic is…
I mean it in the literal sense, past a certain point the city no longer exists as the lanes replace all other land uses and by definition without any source of traffic, the traffic will cease to exist as well.
For most of my work CPUs from the last decade will work just fine. It’s the memory and, especially, disk IO that kills the performance. SSDs have helped big time.
I'd argue that SSDs have done more harm than good. Since the worst case is now far superior to what it used to be (HDDs), most developers see no need to optimize any further. For example, plenty of video game engines will stream copious amounts of data from disk instead of optimizing memory usage and asset size, or in general finding more creative solutions (e.g. shader effects instead of GBs of redundant assets). If hitting the disk slowed everything to a crawl, then maybe software would've been designed in much more efficient ways. Good (enough) is the enemy of great.
It's not about (in)efficiency but creativity and development budgets.
If you have "unlimited" fast storage, the most technically efficient way to render highly-detailed realistic assets is to underpay a bunch of artists to make a metric ton of highly-detailed realistic assets, then stream them in off disk.
If you don't have that storage, the most efficient way might be to make a smaller number of assets modulated by some technical work, which is more accessible to smaller teams who have one top-shelf programmer but no army of contract artists. Or, the team is forced into a non-realistic art style which gives artists and the industry as a whole more space to design in.
It also means that when you do blow your technical budgets for whatever reason (e.g. nobody upgrades their SSDs for 2 years due to a chip shortage so your median performance projections for release were way off), it starts getting much worse very fast.
> I'd argue that SSDs have done more harm than good
I can't imagine the thought process that would lead to a statement like that. SSDs are the single best thing to happen to personal computing in the past 15 years. Absolutely no question and not even arguable.
Games are a bad choice as an example. (Some) Games are always trying to squeeze the most out of the latest hardware. You can't have a massive world with 4K textures and no loading screens using an HDD and 8GB of RAM without performance degradation.
Games are a good choice as an example. *Some* games are trying to squeeze the most out of the latest hardware, but most just target a minimum acceptable framerate for common hardware and then move on. Good enough to ship is pretty much the game industry's entire mode of operation.
Also, games generally don't make good use of extra resources you have. Have 128 GiB of RAM and plenty of VRAM? In almost all games you're not going to see any fewer loading screens than someone with 8 GiB of RAM, even in really simple scenarios like going back to the area you just came from.
Has that even happened yet? PS5 and Xbox Series are the first consoles to use SSDs instead of HDDs and they're only beginning to gain serious steam. PS4 games are still being released and they have to cater to that stock 5400 RPM HDD.
It's not "the C part" that makes code run fast, but memory access patterns. C just happens to not get in the way between the coder and the machine when it comes to explicit control over memory layout. In the late 60's and early 70's this was probably an "accidental feature", but with the widening CPU/memory performance gap it turned out that later languages (from the late 90's and early 00's) had bet on the wrong horse by trying to abstract memory away. More recently this trend is reversing again and plenty of C alternatives are starting to appear with explicit control over memory layout (Zig, Nim, Odin, Rust, ...).
So a good language should not abstract that memory pyramid away, but instead make you painfully aware of it while developing: rewarding DOD, punishing OO. But that results in more education time for programmers, which no company is willing to pay for.
What is needed instead is an intermediate language that takes the constructs of object orientation and the instruction flow and allows rearranging them for maximum memory efficiency. Like, strip those OO objects into arrays, or directly pack them into hot-loop structs that have little regard for the objects they started out with.
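A rough sketch of that rearrangement, done by hand in Go since no compiler in this thread does it for you. The `Particle` type, its fields, and the helper names are hypothetical and only there for illustration: the same data is regrouped from an array of structs into a struct of arrays, so that a hot loop over one field reads a single contiguous slice.

```go
package main

import "fmt"

// Particle is the "object" view: one struct per entity (array of structs).
// The type and its fields are made up for this sketch.
type Particle struct {
	X, Y, Z float64
	Mass    float64
	Name    string // cold data that the hot loop never touches
}

// Particles holds the same data regrouped by field (struct of arrays).
type Particles struct {
	X, Y, Z []float64
	Mass    []float64
	Name    []string
}

// ToSoA copies every field of every Particle into its own slice.
func ToSoA(ps []Particle) Particles {
	out := Particles{
		X:    make([]float64, len(ps)),
		Y:    make([]float64, len(ps)),
		Z:    make([]float64, len(ps)),
		Mass: make([]float64, len(ps)),
		Name: make([]string, len(ps)),
	}
	for i, p := range ps {
		out.X[i], out.Y[i], out.Z[i] = p.X, p.Y, p.Z
		out.Mass[i] = p.Mass
		out.Name[i] = p.Name
	}
	return out
}

// TotalMassAoS steps over whole Particle values to reach each Mass field.
func TotalMassAoS(ps []Particle) float64 {
	total := 0.0
	for i := range ps {
		total += ps[i].Mass
	}
	return total
}

// TotalMassSoA reads only the contiguous Mass slice.
func TotalMassSoA(ps Particles) float64 {
	total := 0.0
	for _, m := range ps.Mass {
		total += m
	}
	return total
}

func main() {
	ps := []Particle{{Mass: 1.5, Name: "a"}, {Mass: 2.5, Name: "b"}, {Mass: 4, Name: "c"}}
	fmt.Println(TotalMassAoS(ps), TotalMassSoA(ToSoA(ps)))
}
```

Both loops compute the same total; whether the second one is actually faster depends on struct size, data volume and the machine, so it is worth measuring rather than assuming.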
Can someone from the old guard tell me, how often we have been here in the language design cycles through the desert?
I believe that the direction of a managed language with value types is a good goal.
I especially like Java’s plan of introducing them: according to the latest design iteration, there will be 3 buckets of objects: current identity-having ones, value classes and primitive classes.
The second category would drop identity but keep nullability. For example, two DateTime instances with the same value will be considered equal at the VM level, allowing optimizations such as allocating DateTimes serially inside an array, or stack allocation. Nullability is important because there is no sane default value for something like a DateTime, but the null case can be encoded cleverly, similar to what Rust does with Option.
The third category will lose nullability as well, and current primitives will be migrated to it. So a ComplexInt “class” will be possible to implement with “zero” overhead.
My point being that there are two ways of improving performance, either showing more knobs to manipulate, or to raise the abstraction level, allowing cleverer optimizations. C# does the former, while Java never went that route, and I think that the latter approach better fits a managed language and can easily push it for like 90% better performance at a fraction of developer complexity.
I was going to say OOP aside from the basic animals and calls example, takes years of indoctrination for people to find it the simple default way to do things.
Functional programming is much simpler, but we don't spend years hammering the concept into people's brains.
With helping acquaintances learning programming from 0, I have a sample size of n = 2 that starting with FP got them to grok basic programming within a very short amount of time.
And IME moving from FP to OOP is MUCH easier than the other way around.
Don't really have the knowledge to compare iterative vs functional for a complete beginner, though I suspect that even there moving from FP to iterative is easier.
I program Java to earn my living and any hierarchy deeper than two is a bad smell for me. Not that it cannot have its place, but most of the time you're right, some find it so fascinating to inherit everything from everything...
I remember sitting in these lectures thinking, why not just use a function without all the boilerplate? And a decade later the programming world finally came back to some semblance of sanity.
Doesn't Rust mostly abstract the memory management away as well? It tends to be low overhead, and has sensible defaults with respect to memory management, but it's built around RAII and for instance if you use a lot of reference types as far as I know there's nothing keeping you from having fragmented memory the same way you would with another high level language.
I know Rust also offers arenas and other purpose built tools for more optimized allocation strategies, but Rust doesn't seem like the language you would reach for if your number one priority is memory performance.
It seems like there is a necessary trade-off between truly top-end memory performance and memory safety.
One of the big differences memory-wise between the C/C++/Rust family and the Javas of the world is having first class support for by-value object types (and collections thereof).
Yes, you can trash your cache in all these languages if you choose to do everything with references to a multitude of individual heap allocations... but in Java-likes you don't have the choice not to do that.
Well yeah, you can write slow code in any language if you don't think about memory layout ;) I see Rust roughly in the same bucket as C++. You can abstract the details of memory management away - which mostly also means giving up control over how things are arranged in memory - but if needed the low level explicit memory management features are there.
I guess the difference is that C++ still gives you pretty much direct access to memory (i.e. pointer arithmetic). Rust tries very hard to keep you at arm's length from the actual memory as a rule, and forces you to work through a safe abstraction unless you use an escape hatch.
It's not even just memory access patterns. It's any and all abstractions - C doesn't provide any, so you write things manually, and thus won't do things that are not required for your use-case (whether it be separated loops for actions, pre-initialization/zeroing, multiple allocations where one or none could do, and higher-level stuff like no need for iterator stability, a vector push that assumes reserved space, etc).
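One item from that list, "a vector push that assumes reserved space", can be sketched in Go for comparison. This is only an analogy under the assumption that a Go slice stands in for the vector: capacity is reserved once with `make`, so later appends never have to grow the backing array. The function names are invented for the example.

```go
package main

import "fmt"

// FillReserved reserves capacity once, so no append has to reallocate.
// It returns the slice and the number of capacity changes observed.
func FillReserved(n int) ([]int, int) {
	return fill(make([]int, 0, n), n)
}

// FillGrowing starts from a nil slice and lets append grow it as needed.
func FillGrowing(n int) ([]int, int) {
	return fill(nil, n)
}

// fill appends 0..n-1 and counts capacity changes as a proxy for
// reallocations of the backing array.
func fill(s []int, n int) ([]int, int) {
	reallocs := 0
	for i := 0; i < n; i++ {
		before := cap(s)
		s = append(s, i)
		if cap(s) != before {
			reallocs++
		}
	}
	return s, reallocs
}

func main() {
	_, reserved := FillReserved(1000)
	_, growing := FillGrowing(1000)
	fmt.Println("reallocations with reserve:", reserved, "without:", growing)
}
```

Go's `append` still checks the capacity on every call, so this is not the unchecked push the comment describes for C; it only shows the part about paying for the allocation once.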
Well, lack of abstraction can easily hinder performance as well. Just compare C’s string management story with that of C++. (Also, for very performance-oriented workloads, C++ will be preferred). In the case of C you only have dumb C strings, over which you will iterate many, many times completely needlessly. It is both error-prone and less performant than C++’s strings, which can do small string optimizations (storing the string inside the structure if it fits, and being a pointer to it otherwise).
I'd say the case of C strings is more a case of a bad abstraction C has (at least historically), rather than lack of one. But the things that expect them are just the standard library (which doesn't have many useful general-purpose things anyway), so you can write your strings as a pair/struct of char*+length or whatever else you may want just fine.
Small string optimizations, while nice (and probably do average out to being beneficial), aren't always needed, and the extra generated code for handling both cases could make it not worth it if you've got a fast allocator, and can even make some operations just outright slower. (and if your code doesn't actually have strings anywhere near hot loops, all you get is a larger binary). File paths, for example, are often large enough to not fit the small case, but still small enough where even the check for whether it is small can be a couple percent of allocation/freeing.
Being error-prone, though, is something that I can agree with. That's the cost of doing things manually.
(I'd also like to note that malloc/free are a much more important case of a bad abstraction - they have quite a bit of overhead for being able to handle various lengths & multithreading, while a big portion of allocations is on the same (often, only) thread with a constant size, and said size being known at free-time, which is a lot more trivial to handle; not even talking about the cost of calling a non-inlined function spilling things to the stack)
Well the reason this one may be a good example is that you couldn’t do this optimization in C even if you wanted to. You would have to call a function at every use-site to handle your “abstraction”. And the same thing applies in other cases.
Also, I’m not sure the added conditional branch will increase the binary too much, and the reason it is inside the C++ stdlib is that it was likely measured and proved beneficial.
But I do agree that maybe your allocation example is a better one, though the solution to that is perhaps a full GC, which does have a few tradeoffs (which are worth taking more often than not).
Right; making abstractions in C when you really know you should can be messy, but still possible (not with the same syntax, but I personally like the fact that a[b] is 100% guaranteed to be, at most, a single load; similarly for all operations other than function calls; makes it easier to reason about performance at a glance).
Recommending a GC is hard for me though; over your system malloc/free, maybe, but alternative allocators can be very fast, without the drawbacks of pauses (or slower execution as a result from a non-pausing GC).
I've done video4linux stuff in Go, and passing an unsafe.Pointer to a Go struct in an ioctl() worked fine, which tells me that Go structs are isomorphic to C structs. Even though Go has garbage collection, it allocates everything it can on the stack, so only long-lived shared-between-goroutines objects are subject to garbage collection.
Go abstracts concurrency, completely removing all concurrent features from a language except for the "go" keyword (that launches a goroutine - which is basically a tiny virtual thread), channels (which are selectable queues) and "select" keyword that waits for the first "input" from a static set of channels.
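A minimal sketch of two of those pieces, the `go` keyword and a channel, working together. The function names `Squares` and `Sum` are made up for the example.

```go
package main

import "fmt"

// Squares launches one goroutine that sends the squares of 1..n on a
// channel and closes the channel when it is done.
func Squares(n int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for i := 1; i <= n; i++ {
			out <- i * i // blocks until the receiver is ready
		}
	}()
	return out
}

// Sum drains a channel until it is closed and adds up the values.
func Sum(in <-chan int) int {
	total := 0
	for v := range in {
		total += v
	}
	return total
}

func main() {
	fmt.Println(Sum(Squares(4))) // 1 + 4 + 9 + 16
}
```

The sender and the receiver never touch a lock or a thread handle; the unbuffered channel does the hand-off and the synchronization.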
Go strikes me as one of the best "good enough" languages we have right now. You're not going to do HPC in Go, but it's performant enough to run circles around a lot of high level languages and dynamic languages. It abstracts away the stuff that's super error prone about manual memory management, and it's so brutally simple that it's hard for one of your colleagues to write code you're not going to be able to understand.
There's a few features that could improve it, like proper ADT's, and it's a bit lacking in expressiveness for me to choose it for personal hobby projects, but I would recommend it any time for general-case professional software development.
Go can be faster at small, specific programs where memory lifetimes are deterministic and you can use value types. Otherwise, Java will beat every other managed language by a huge margin when it comes to GC-related workflows. Sure, it does so at higher memory usage, but that is a good tradeoff for many use-cases (especially server).
So all in all, for bigger programs it is hard to do a good comparison, but that is exactly where JIT compilers shine and where the memory tradeoff and the like pay off.
Isn't a lot of server code these days small programs that just pull data out of a database and feed it to an HTTPS response?
It seems like it's a pretty good option to have your infrastructure code implemented in a systems programming language like C or Rust (probably AWS or GCP is doing this for you), and just implement your business logic in Go as the type of small, well-defined programs you're talking about.
But then why Go? You can also implement that logic in a likely much more readable way in Python (as at that point performance doesn’t matter), or just write the whole thing in Java/C#/Scala/Kotlin or whatever, which in my opinion are more expressive for business logic.
Go is much faster than Python, uses less memory, and compiles down to a statically-linked native binary, making containerization trivial. And (IMHO) it's even more readable than Python - nowadays Python code is as easily turned into an unreadable mess as Java or C# code. Just try reading the Python standard library and the Go standard library - the difference is monumental.
We are talking about business logic. The infrastructure is already in a lower level language, so the performance is not a concern.
And we will have to disagree on C#/Java/Python being an unreadable mess. In my experience all 3 can be written in a really well maintainable way. I don’t have much experience with Go, but out of these, I would vote for it as the least maintainable (just because each line is trivial to understand doesn’t make the whole program flow easy to read. Otherwise why not just write assembly? Every line is even more trivial there).
> In my experience all 3 can be written in a really well maintainable way.
That's true, but that is generally true of any (non-toy) language. But in the modern world of rapid development, it matters how hard it is to write code in a non-maintainable way - i.e. how well it tolerates modifications by different people. And to me, it seems easier to write readable code in Go than it is to write unreadable code.
It seems to come from the lack of features - Java, Python and C# have too many features, and any problem can be solved in N different ways, each one with its own warts. If you want to work on a wide range of codebases, you have to know each one of the approaches and their warts and footguns.
Meanwhile, Go feels like it really reached the "there should be one obvious way to do it" ideal of Python, while Python has over the years evolved into something more Perl-like. Want to build a concurrent application? Choose your tradeoff - either you get CPU scalability (multiprocessing) but lose memory sharing, or you get a simple concurrency model (threading) that isn't scalable, or you get I/O scalability (asyncio) at the cost of function coloring, error-proneness and single-threadedness. Go solved the whole thing with the goroutine model - internally it multiplexes coroutines onto a set of OS threads, but all blocking calls are wrapped by the Go runtime, which makes every coroutine behave and feel like an ordinary thread, without the massive memory use of OS threads.
The number of ways to write something is only very loosely correlated with maintainability. The ease of maintenance is IMHO more a function of how much information about the properties of the code you can easily read from the text, and how well the abstractions in the code map to the abstractions you'd use when describing the solution to a friend. Lack of features doesn't help in that regard. That's probably why Go has just added generics, despite the long tradition of Go promoters claiming "lack of generics is a good thing" ;)
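To make the goroutine model concrete, here is a small sketch in which every job gets its own goroutine and simply blocks, thread-style, with no async/await coloring. `FetchAll` is a hypothetical name, and `time.Sleep` stands in for a blocking network call.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// FetchAll runs one goroutine per job. Each goroutine makes an ordinary
// blocking call; time.Sleep is a stand-in for real I/O.
func FetchAll(jobs int, latency time.Duration) []int {
	results := make([]int, jobs)
	var wg sync.WaitGroup
	for i := 0; i < jobs; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			time.Sleep(latency)  // blocks this goroutine only
			results[id] = id * 2 // each goroutine writes only its own slot
		}(i)
	}
	wg.Wait() // wait for every goroutine to finish
	return results
}

func main() {
	start := time.Now()
	res := FetchAll(10000, 50*time.Millisecond)
	fmt.Println(len(res), "jobs finished in", time.Since(start).Round(time.Millisecond))
}
```

Because the sleeps overlap, the total wall time should be far closer to one latency period than to ten thousand of them, though the exact figure depends on the machine.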
Languages with very little type information, e.g. dynamic ones, tend to be quite hard to maintain, unless the original developers kept the discipline of good naming and verbose commenting. Go and Java with their somewhat static, but limited typing and elements of dynamism (interface{}, Object, reflection), sit somewhere in the middle between PHP/JS and Rust/Scala/Haskell.
Languages with little expressive / abstraction power, so the ones limited in features or low-level, are also often hard to maintain, because you have to reverse-engineer the high-level stuff from all the details you see. Take assembly as an example - while it may be quite obvious what the program is doing at the bits and bytes level, understanding the sense of that bit-level manipulation may be a much harder task. The assembly language might actually be very simple, but that does not help. I remember when we had a MIPS class, the whole spec was just a few pages and could be learned in an hour.
Could you expand a bit on why do you think Java has too many features? It is a very small language, that is often berated because it picks up features way too slowly if anything.
I would go as far to claim that Java is an easier language than Go, or at least in the same ballpark.
> Could you expand a bit on why do you think Java has too many features?
I think you're asking for technical details, but I'm afraid I don't know Java well enough to do an objective comparison. I'll try with a subjective explanation of why I think so.
I learned Go in 20 minutes following the Go Tour. A few months later, I feel like there isn't a single thing I don't know about Go. It's dead simple. When I open a Go repository, it's easy for me to get into the codebase, as all code is more-or-less the same.
I learned Java back in high school, and to this day I don't feel like I "know" the language. I've tried reading some Java repositories, and every time I feel like there's some kind of friction - some implicit knowledge about it that I just don't understand.
Maybe it's just me, and I haven't spent enough time learning Java. But then again, I've spent even less time learning Go, and yet I have a much easier time using it. That's what I mean by "a very small language".
Performance is always a concern. For instance, if running a Python interpreter introduces latency for each request, that can add up to perceptibly worse performance when applied throughout a product.
In my experience, Python is far less readable than languages like Go. The information density and semantic whitespace of Python really hurt readability.
Having read the rules it's difficult to know what's considered "fair" for this test - all GC tuning is off the table, sure. But what's bugging me is "Leaf nodes must be the same as interior nodes - the same memory allocation." So what constitutes "the same memory allocation" - literally the exact same call to some opaque internal allocator? If so, shouldn't Java also have to disable JIT to be fair?
Let me offer an alternate interpretation: I will do the same memory allocation if I need to allocate a node, but if my language lets me not allocate a node yet still use that node why should I? Or an alternate argument if you don't like that one: Why must my "node" be `Tree`, rather than `*Tree`?
A central idiom of Go is that zero-values of a type can be useful; a two-line change, no new special-cases, no pooling or such gauche hacks:
    // Count the nodes in the given complete binary tree.
    func (t *Tree) Count() int {
        if t == nil {
            return 1
        }
        return 1 + t.Right.Count() + t.Left.Count()
    }

    // Create a complete binary tree of `depth` and return it as a pointer.
    func NewTree(depth int) *Tree {
        if depth > 0 {
            return &Tree{Left: NewTree(depth - 1), Right: NewTree(depth - 1)}
        } else {
            return nil
        }
    }
I'm sure someone will tell me I "optimized away the work" - but in the end I believe I'm making exactly the same number of method calls on the same type of receiver. If that's not the work, what is?
And then the Java etc programs are re-written to do the same and the test value is increased (to compensate for the reduced memory allocation) and we're back where we started?
- Practically, Java would rather have to `return 3` when it detects a null child, effectively precomputing the penultimate level.
- Semantically, Java could no longer distinguish between an absent child and a child with no children.
Honestly, I have other variants that still don't use pooling but are less idiomatic; I find this exercise is begging the question hard. Any tools, however idiomatic, the language is giving you to reduce the effects of allocation seem to be off-limits for GCd languages. Whereas then e.g. C can just throw them all in a third-party pool library. And JIT languages are presumably allowed to fuse anything they want.
The answer to "… but if my language lets me not allocate a node yet still use that node why should I?" is: because allocating a node is the basis of comparison with the other programs!
Change that for the Go programs and you change that for all the other programs; otherwise it's just special pleading for Go.
Wow, I had no idea that Java was so fast. One thing I like about Go is that you can cram many thousands of concurrent requests into the same process. But it looks like Java has some pretty robust async tools... So I bet you could do something similar.
The JVM is a real beast, which makes sense as a good chunk of the whole internet runs on top of it (almost every big corp has plenty of infrastructure running Java), so it had plenty of engineering time poured into it.
Regarding concurrency, I wouldn’t choose existing reactive frameworks and whatnot for a new system. Java will soon get Project Loom, which will introduce Go-like virtual threads - so that one can write a web server that spawns a new thread for each request as well. Since the Java ecosystem is written almost exclusively in Java itself (no FFI), basically everything will turn automagically non-blocking.
He said "easy to read" not "filled with weird syntax choices that make anyone from a C background barf". What the hell are those channel arrows and why do they point the wrong way?
Channel arrows are basically just read/write operations.
If `ch` is a channel, then this expression means "value obtained from reading from the channel":
<-ch
And this expression means "write value x into the channel":
ch <- x
Both expressions can be used as a case inside select statement:
    select {
    case val := <-ch:
        // ...
    case ch <- val:
        // ...
    }
Which will execute exactly one case, depending on which channel becomes "ready" first - channel is ready for reading if there is another goroutine blocked on a write operation, and ready for writing if there is another goroutine blocked on a read operation.
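Here is a runnable sketch of that behavior. It differs from the description above in two labeled ways: it uses buffered channels, so readiness is determined by buffer state rather than by another blocked goroutine, and it adds a `default` case so the call returns instead of blocking when nothing is ready. `Step` is a made-up helper name.

```go
package main

import "fmt"

// Step performs at most one channel operation: a receive from in or a
// send of v on out, whichever is ready. If both are ready, select picks
// one at random; if neither is ready, the default case runs.
func Step(in <-chan int, out chan<- int, v int) string {
	select {
	case got := <-in:
		return fmt.Sprintf("received %d", got)
	case out <- v:
		return fmt.Sprintf("sent %d", v)
	default:
		return "nothing ready"
	}
}

func main() {
	in := make(chan int, 1)
	out := make(chan int, 1)
	fmt.Println(Step(in, out, 7)) // in is empty, out has room
	fmt.Println(Step(in, out, 8)) // in is empty, out is now full
	in <- 5
	fmt.Println(Step(in, out, 9)) // in has a value, out is still full
}
```

The three calls print "sent 7", "nothing ready" and "received 5", because in each call at most one of the two operations can proceed.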
You say you're from a C background - if you've ever worked with file descriptors, you will notice that channels are basically userspace file descriptors. Channel reading and writing is isomorphic to read() and write() syscalls, and the select keyword is isomorphic to the select() syscall.
Hopefully this clears up the whole channel syntax thing. I just hope you weren't trolling.
> Channel reading and writing is isomorphic to read() and write() syscalls
This is a very bad mental model because channel operations cannot be canceled (without using select on two channels) or return any error status (at all).
While technically true, I don't really see the impact of the difference. It is idiomatic to use contexts for any kind of cancellation during channel operations, and it works well.
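For readers who haven't seen it, the context idiom looks roughly like this. It is a sketch with made-up helper names (`Recv`, `RecvTimeout`), and it is in fact a select over two channels, the data channel and `ctx.Done()`, which is the caveat raised in the parent comment.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Recv waits for a value on ch but gives up once ctx is cancelled,
// returning the context's error in that case.
func Recv(ctx context.Context, ch <-chan int) (int, error) {
	select {
	case v := <-ch:
		return v, nil
	case <-ctx.Done():
		return 0, ctx.Err()
	}
}

// RecvTimeout is a convenience wrapper that cancels after timeout.
func RecvTimeout(ch <-chan int, timeout time.Duration) (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return Recv(ctx, ch)
}

func main() {
	silent := make(chan int) // nobody ever sends on this channel
	_, err := RecvTimeout(silent, 20*time.Millisecond)
	fmt.Println("silent channel:", err)

	ready := make(chan int, 1)
	ready <- 42
	v, err := RecvTimeout(ready, time.Second)
	fmt.Println("ready channel:", v, err)
}
```

Note that the error comes from the context, not from the channel; the channel itself still carries only values.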
The syscall comparison was made to give intuition about general behavior of channels to someone with a background in C. Of course it's not completely identical.
The ability of read/write to communicate via errors as well as the actual data transferred is significant - there's a reason Go's i/o model is io.Reader/io.Writer and not chan []byte.
You might as well explain channels in terms of any blocking operation if the bar for "isomorphic" (now backtracked to "intuitively" I guess) is that low.
I don't know what's the bar for "isomorphism", but I know that the word literally means "same shape", so just because some nerd that got killed in a duel over a girl used the same word in a mathematical context doesn't give him dibs over its general use.
Unless, of course, system calls can be modeled as operators over sets. In which case, please tell me how.
Well, in my case I used it to signify same behavior in terms of process/goroutine communication. I think it's "specific enough" to warrant the use of word "isomorphic".
Rust is that if you keep the code simple without too many trait objects and macros and type magic. But then it becomes even more constrained than it actually is. It'll also be boilerplate heavy and hard to write.
Julia is fast, easy to read, and easy to write. But it's not easy to maintain. There is a direct tradeoff between dynamism on one hand making things easier to read/write and static enforceability on the other making it easier to maintain.
Haskell has an elegance, and can be written as simply as C. Usually fast enough, and can be optimized as well. The downside is that you will be sucked into a rabbit hole of academic type theory and wonder how best to express your system as a Free Monad instead of bashing it out like any sane C programmer. Just kidding, someone already figured out those hard parts for you, you just forgot to browse for it on Hackage.
It's a fair criticism of Haskell that you can fairly easily blow up your time/space complexity without realising it though. I think in many ways it's better but Haskell specifically demands a lot even for a functional language.
Pascal dialects and Modula-2 are of a similar age, while others like JOVIAL are a decade older, but they did not come with a very big killer feature: an OS like UNIX.
Then there were BLISS, Mesa and PL/I, but the OSes that made use of them lost to UNIX, so.
With exception of Mac OS, written in Object Pascal and later ported to a mix of Object Pascal and C++.
Having said this, plenty of alternatives with AOT compilers exist nowadays.
The only thing C has going for it, is historical weight, UNIX/POSIX ecosystem, and some domains that are closed to any alternative suggestions, due to tooling or cargo cult against alternatives.
Because "easy to read and maintain" is about humans, and "speed of C" is about machines, and there is a vast gulf between the 2 that always forces you to compromise on one or the other to get the 2 together, and usually on both.
Code being easier to read and maintain is a function of how close it is to human semantics. The more the algorithm is presented in terms and notations humans like and find familiar, the easier. Code being performant is a function of how close it is to machine semantics, the more the algorithm is presented as steps that the machine likes and finds familiar, the faster it will run, as the machine is doing less to execute each step.
There is a fundamental tension between the 2, even if compilation from high level languages might, at first glance, give us the illusion that we can have both. We can't, not in general. We can only do it for a class of human semantics that C++ folks call "Zero-Cost Abstractions", the set of abstractions that can be completely erased without a trace by the time you get to the executable.
But otherwise, there is a fundamental cost to making code more readable by humans: making it less readable by the machines that will execute it. This is a reflection of the fundamental alienness of computers: what they find quite easy you find quite hard, and vice versa. Optimizing for humans means generality and ruthless hiding of details; optimizing for machines is all about special cases and ruthless exploitation of assumptions.
(Incidentally, C is not all it's cracked up to be. Generic containers, off the top of my head, resort to using void* pointers for data and function pointers for operations, which has a runtime cost besides being unsafe and error-prone. Templates in C++, on the other hand, can aggressively inline types and operations for you, as if you hadn't written generic code at all; no wonder templates are the poster boy for C++'s 0-Cost abstractions. Another example I hear often is how pointer semantics in C and C++ make it extraordinarily difficult for the compiler to optimize array and memory operations, whereas a language like Fortran makes it easier by not having pointers.)
For many things, I have found this easy language is C++.
I use JavaScript and C++ for different things, sometimes in the same day. (And python and PHP and others, but this is not relevant.)
Believe me, JavaScript can be a real head scratcher compared to C++.
And now for the purists: No, I don't use all features of C++, only the minimal necessary ones for the problem I have to solve. This ridiculous idea that you are not using C++ if you are not using every single language feature is what makes programs difficult to write and maintain.
>Why can't we have a language easy to read and maintain but also have the speed of C?
Swap out { } for Begin End, and make a few other changes, and you've got Pascal. Single-pass Pascal compilers have been faster (at compiling) than almost anything out there since Turbo Pascal 3.0 for MS-DOS.
Modern versions, such as Free Pascal, Delphi and Lazarus also deal with strings in a manner that totally avoids needing to manually manage memory. The GUI builders are awesome as well.
You can't just "transpile to C" to get "C speed". C speed comes from low overhead. Naive transpilations will just include that overhead, like garbage collection or many layers of pointer indirection, but written in C. The unsavory answer is that you must use fewer abstractions if you want fast code. Compilers just aren't good enough to compile all the abstractions away.
The CPU is the CPU regardless of what language you are running on it. C will still give you the best performance on a modern CPU.
If anything, C has an even bigger advantage on modern CPUs because it has easier access to things like vectorize/SIMD intrinsics. It is also easier to tweak your data dependencies to help the branch predictor.
> If anything, C has an even bigger advantage on modern CPUs because it has easier access to things like vectorize/SIMD intrinsics. It is also easier to tweak your data dependencies to help the branch predictor.
Have you seen the ridiculously complex optimizations that C compilers do to maybe turn some shitty for loop into vector instructions? C is a terribly bad fit for this use case, and hardware tries to get closer to C rather than the reverse.
C++, Rust, and even C# and Java have much better SIMD support than C has.
EDIT: It really doesn’t help C (and some of the listed languages as well) that they are very imperative. SIMD is exactly the place where pureness and some form of FP is much better at allowing these kinds of optimizations (a map of a pure function can be “trivially” optimized into vector instructions, while it is really hard to decide whether this for loop is safe to convert, and it is really up to the heuristics of the compiler. A bit of rearrangement can cause a failure to optimize, resulting in a huge drop in performance).
Yeah I think Zig in particular is trying to be just exactly a "better C".
I don't know if transpiling will get you there, because for instance if you're transpiling a dynamic language, you're going to have to output C that is essentially emulating all those dynamic language features, so it might be faster than say, the original Python, but it's not going to be as fast as a pure C implementation.
If you transpile something to C it doesn't mean it will be fast.
You can write slow C code (or transpile something to C that will be slow). The compilers are not the issue here.
> If you hold up a sign with, say, a multiplication, a CPU will produce the result before light reaches a person a few metres away.
The latency on multiplication (register input to register output) is 5-clock ticks, and many computers are 4GHz or 5GHz these days.
5-clock cycles at 5GHz is 1ns, which is 30-centimeters of light travel.
If we include L1 cache read and L1 cache write, IIRC it's 4 clock cycles for the read + 4 more for the write. So 13 clock ticks, which is almost 80 centimeters.
------------
DDR4 read and L1 cache write will add 50 nanoseconds (~250 cycles) of delay, and we're up to about 15 meters.
And now you know why cache exists; otherwise computers would be waiting on DDR4 RAM all day, rather than doing work.
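For anyone who wants to redo the sums above, here is a minimal Python sketch. It takes the cycle counts and the 5GHz clock from the comment as given and only converts them to light-travel distance in vacuum; the function name is made up for illustration.

```python
# Back-of-the-envelope check: how far light travels while the CPU works.
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the metre


def light_distance_m(cycles, clock_hz):
    """Distance light covers in vacuum during `cycles` clock ticks."""
    seconds = cycles / clock_hz
    return seconds * SPEED_OF_LIGHT_M_PER_S


CLOCK_HZ = 5e9  # the 5GHz clock assumed in the comment above

multiply_only = light_distance_m(5, CLOCK_HZ)           # 5-cycle multiply
with_l1 = light_distance_m(5 + 4 + 4, CLOCK_HZ)         # plus L1 read and L1 write
with_dram = light_distance_m(5 + 4 + 4 + 250, CLOCK_HZ) # plus ~250 cycles of DRAM delay

for label, metres in [
    ("multiply", multiply_only),
    ("multiply + L1", with_l1),
    ("multiply + L1 + DRAM", with_dram),
]:
    print(f"{label:>22}: {metres:6.2f} m")
```

The exact figures shift with the clock speed you plug in, but the orders of magnitude (tens of centimeters for registers and L1, tens of meters once DRAM is involved) stay put.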
Back in the days, an integer division took something like 46 clocks (original Pentium), and now on Ice Lake it's just 12 with a reciprocal throughput of 6. Multiply that by the clock speed increase and a modern CPU can "do division" about 300-400 times faster than a Pentium could. Then multiply that by the number of cores available now versus just one core, and that increases to about 2000 times faster!
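A rough sketch of that estimate in Python. The cycle counts are the ones quoted above; the clock speeds (66 MHz for the Pentium, 3 GHz for Ice Lake) and the 6-core count are assumptions picked for illustration, so treat the result as a ballpark only.

```python
def divisions_per_second(clock_hz, cycles_per_division, cores=1):
    """Rough peak rate: one division every `cycles_per_division` ticks per core."""
    return clock_hz / cycles_per_division * cores


# Cycle counts come from the comment above; clocks and core count are assumed.
pentium = divisions_per_second(66e6, 46)                 # assumed 66 MHz Pentium
ice_lake_core = divisions_per_second(3.0e9, 6)           # assumed 3 GHz, reciprocal throughput 6
ice_lake_chip = divisions_per_second(3.0e9, 6, cores=6)  # assumed 6 cores

print(f"one core:  {ice_lake_core / pentium:.0f}x")
print(f"all cores: {ice_lake_chip / pentium:.0f}x")
```

With these assumed numbers a single core lands inside the 300-400x range and six cores land near 2000x; different clock or core assumptions move the result proportionally.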
I used to play 3D games on Pentium-based machines and I thought of them as a "huge upgrade" from 486, which in turn were a huge upgrade from 286, etc...
Now, people with Ice Lake CPUs in their laptops and servers complain that things are slow.
And things are slow as we waste all that processing power on running javascript one way or another. And everything requires a slow blocking connection to the mainframe. Nowadays the “always connected” mindset is really slowing us down.
Ok fair enough, but the mindset is spreading: json (javascript) parsing is what caused GTA Online loading times to balloon and I dread playing Call of Duty online as it wants to download and install dozens of GBs every time I launch it.
It wasn’t json parsing per se, but a buggy roll-your-own implementation that used a function (sscanf iirc) with a surprisingly nontrivial complexity on a long string. Fun part is, if they just outsourced that load to javascript and its JSON.parse, they’d never encounter that quadratic slowdown. Javascript is a nice target to blame, but it isn’t the problem. CPUs got hundreds of times faster, javascript only divides it by N, which stays low and constant (at least) through decades. Do you really believe that if browsers only supported MSVC++-based DLLs without scripting, sites would run faster? That would be naive.
However there is definitely still less intrinsic optimisation from a dev perspective I think - people will iterate over the same array multiple times in different places rather than do it once.
I guess our industry has decided moving faster is better than running faster for a lot of stuff.
Part of the reason computers seem slower than they did is that most programs (and most programmers) only use one of those cores. Most of the reason, though, is that programmers buy new computers, and also that programmers only optimise code that’s slow on their computer.
Those instruction latencies are in addition to the pipeline-created latency. (They are actually the number of cycles added to the dependency chain specifically.) The mult port has a small pipeline itself of 3 stages (that's why 3 cycles latency). Intel has a 5 stage pipeline so the minimum latency is going to be 8 for just those two things.
Sorry, I dropped all the 1s in that message when i typed it (laptop keyboard is a little sketchy right now). That should have been 15 and 18. I think the recent Intel microarchs take 14 plus however long uop decoding takes (minimum 1), so 15-20 or close to that.
> The dependency chain length is what is normally intended as instruction latency.
Yes, the way I read the original post and others was that you actually get your response back in 3 cycles, which isn't correct. It doesn't get committed for a while (but following instructions can use the result even if it hasn't been committed yet). You're not getting a result in less than 20 cycles basically.
> Sorry, I dropped all the 1s in that message when i typed it
It makes sense now! :D
> You're not getting a result in less than 20 cycles basically
But the end of the pipeline is an arbitrary point. It will take a few more cycles to get to L1 (when it makes it out of the write buffer), a few tens more to traverse the L2 and L3 and hundreds to get to RAM (if it gets there at all). If it has to get to a human it will take thousands of cycles to get through the various busses to a screen or similar.
The only reasonably useful latency measure is what it takes for the value to be ready to be consumed by the next instruction, which is indeed 3-5 cycles depending on the specific microarchitecture.
> which is indeed 3-5 cycles depending on the specific microarchitecture
I assume you are talking about from fetch to hitting the store buffer? That would be the absolute min time before the data could be seen elsewhere, I would think. It can still potentially be rolled back, and that would be higher than reciprocal throughput and way too fast to sustain, but for a single instr burst, I'm not sure. So much happens at the same time. An L1 read hit will cost you 4 minimum, but all but 1 of that is hidden. You can't avoid the multiply cost of 3 (or 1 for an add). The decoding and uop cache hit, reservation, etc. will cost a few. I have no idea.
If you know of anything describing it in such detail, I would be completely curious.
This reminds me of that “todo” I wrote for myself a long time ago. These days processors come with bigger L1, L2, and L3 caches. Would it be possible for a program that works on a tiny bit of data (a few KB) to load it all up in the cache and provide ultimate response times?!
Are there any directives to the Operating System to say - “here keep this data in the fastest accessible L[1,2,3] please”?
> Are there any directives to the Operating System to say - “here keep this data in the fastest accessible L[1,2,3] please”?
I'm probably the worst person to explain this.
Long long ago, I took a parallel programming class in grad school.
It turns out the conventional way to do matrix multiplication results in plenty of cache misses.
However, if you carefully tweak the order of the loops and do certain minor modifications — I forget the details — you could substantially increase the cache hits and make matrix multiplication go noticeably faster on benchmarks.
Some random details that may be relevant:
* When the processor loads a single number M[x][y], it sort of loads in the adjacent numbers as well. You need to take advantage of this.
* Something about row-major/column-major array is an important detail.
What I'm trying to say is, it is possible to indirectly optimize cache hits by careful manual hand tweaking. I don't know if there's a general automagic way to do this though.
This probably wasn't very useful, but I'm just putting it out there. Maybe more knowledgeable folks can explain this better.
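One common version of the loop tweak described above is swapping the two inner loops so the innermost one walks along rows instead of down columns. A minimal Python sketch follows, with hypothetical function names. Pure Python lists hide most of the cache effect, so this only shows that the reordering computes the same product; the speed-up is something you would measure in C or similar with row-major arrays.

```python
def matmul_ijk(a, b):
    """Textbook order: the inner loop walks down a column of b."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]  # b[k][j] jumps a whole row per step
    return c


def matmul_ikj(a, b):
    """Reordered: the inner loop walks along a row of b and a row of c."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        row_c = c[i]
        for k in range(m):
            a_ik = a[i][k]
            row_b = b[k]
            for j in range(p):
                row_c[j] += a_ik * row_b[j]  # neighbouring elements, touched in order
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
result_ijk = matmul_ijk(a, b)
result_ikj = matmul_ikj(a, b)
print(result_ijk == result_ikj)
```

In a row-major layout the second version reads and writes adjacent memory in its innermost loop, which is the "loads the adjacent numbers as well" effect from the first bullet.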
There is a data access strategy called cache oblivious algorithms which aim to make it more likely to utilise this property without knowing the actual cache size.
I used that approach once on a batch job that read two multi-megabyte files to produce a multi-gigabyte output file. It gave a massive speed-up on a 32-bit Intel machine.
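To give a flavour of the cache-oblivious idea, here is a small Python sketch of a recursive matrix transpose. It is a textbook-style illustration, not the batch job described above; the function name and the `base` cutoff are assumptions, and the cutoff only limits recursion overhead rather than encoding any cache size.

```python
def transpose_recursive(src, dst, r0, r1, c0, c1, base=16):
    """Write the transpose of src[r0:r1][c0:c1] into dst by halving the longer side."""
    rows, cols = r1 - r0, c1 - c0
    if rows <= base and cols <= base:
        # Small enough block: copy it directly.
        for i in range(r0, r1):
            for j in range(c0, c1):
                dst[j][i] = src[i][j]
    elif rows >= cols:
        mid = r0 + rows // 2
        transpose_recursive(src, dst, r0, mid, c0, c1, base)
        transpose_recursive(src, dst, mid, r1, c0, c1, base)
    else:
        mid = c0 + cols // 2
        transpose_recursive(src, dst, r0, r1, c0, mid, base)
        transpose_recursive(src, dst, r0, r1, mid, c1, base)


n_rows, n_cols = 5, 7
src = [[i * n_cols + j for j in range(n_cols)] for i in range(n_rows)]
dst = [[None] * n_rows for _ in range(n_cols)]
transpose_recursive(src, dst, 0, n_rows, 0, n_cols, base=2)
print(dst[6][4])  # element that started at src[4][6]
```

Because the blocks keep halving, at some depth they fit in whatever cache the machine has, without the code ever knowing that size.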
> Are there any directives to the Operating System to say - “here keep this data in the fastest accessible L[1,2,3] please”?
Not for general purpose programs, because L1 caches change so quickly each year there is no point.
For embedded real-time processors, yes. For GPUs, yes. (OpenCL __local, CUDA __shared__).
This is because Microsoft's DirectX platform guarantees 32kB or something of __shared__ / tiled memory, so all GPU providers who want a DirectX11 certification are guaranteed to have that cache-like memory that programmers can rely upon. When DirectX12 or DirectX13 comes about, the new minimum specifications are published and all graphics programmers can then take advantage of it.
-------
No sane Linux/Windows programmer however would want these kinds of guarantees for normal CPU programs, outside of very strict realtime settings (at which point, you can rely upon the hardware being constant). Linux/Windows are designed as general purpose OSes.
DirectX 9 / 10 / 11 / 12 however, is willing to tie itself to the "GPUs of the time", and includes such specifications.
I don’t think you can generally control the cache with such granularity since modern processors do all sorts of instruction level parallelism and cache coherency voodoo
On CPUs you can't really force data to stay on the cache, but if you access it frequently and there is not too much load, it will stay there anyways.
Some architectures (e.g. GPUs) provide local "scratchpad" memories instead of (or in addition to) caches. These are separate uninitialized addressable memory regions with similar access times to an L2/L1 cache.
If the data is contiguous in memory and frequently accessed it will almost certainly make its way into L1 cache and be there for the life of the program.
If the data is not contiguous it could make the CPU's life much harder.
There's also the matter of program size (the amount of instructions in the actual program) and whether the program does anything which forces it to go lower cache levels or RAM.
There are intrinsics for software prefetching such as _mm_prefetch, but those are difficult to use in a way that actually increases your performance.
One of the claimed advantages of the array programming language "K" is that the interpreter is small enough for the hot paths to stay in the CPU cache. It's hard to Google but claims come from people/places like this thread: https://news.ycombinator.com/item?id=15908394
I ran across an animation once that showed graphically the time it takes light to travel between the planets and the sun. It's weird, but light doesn't seem that fast anymore.
The speed of light has really not kept pace with Moore's Law. Engineers have focused overly much on clock speed and transistor density and completely ignored C, and it's really beginning to show.
I recently read a science fiction short story on reddit, where humans had developed faster-than-light communication because they needed to reduce lag in networked games.
So, there's actually a proposal for "real-time" communication between galaxies. Just upload your consciousness to a computer and run it slow enough that a few hundred million years feels like a couple seconds.
Slow time is a thing in a few Egan books where the population (of effectively immortal post-humans) of some planets collectively decide to slow their internal clocks to allow some of their members to take interstellar trips without missing out on too much of their original life.
The trips themselves of course consist of transmitting the mind to be downloaded into a new body on the other side.
The idea is that you've already dismantled the stars and stockpiled all the energy in the universe, to use at your leisure. (i.e. everyone lives around black holes they can throw mass into to reap the Hawking radiation). Since you have control over the last non-entropic systems, you can tune how fast the candle burns.
And you're a mind running on a computer! You were already going to experience the heat death! This is just a scheme to get the most subjective time out of it as possible. (Running slower is more efficient.)
That would have a nice side benefit of making space exploration much more economical. Faster too if you can bioengineer higher G resistance at the same time. Maybe someday the outer planets will be colonized by tiny humans measured in millimeters, with 125mm humans darting around the various moons shot out of repurposed tank cannons, all laughing at the slow giants stuck down the gravity well on Earth.
The much more remarkable thing is to consider that the speed of light is also the speed of causality itself. It takes light from the sun about 8 minutes to reach Earth. If the sun suddenly disappeared, we'd still see it shining brightly in the sky, and the Earth would continue revolving around it - all for another 8 minutes until reality finally caught up to us. So we're already computing at a rate on the verge of the speed of reality itself.
It's interesting to consider this paired against how technologically primitive we ostensibly must be, given that digital computers didn't even exist 90 years ago.
Is the speed of light really the speed of causality? Would the effect of gravity (or the lack of it) reach Earth earlier than we'd perceive the lack of light?
Nope. All physical effects are bounded by the speed of light. (As far as anyone knows, anyway.)
The only weird one is quantum entanglement, but even then information transfer doesn't travel faster than light and that's about all I know on that subject.
In short, the only information you gain is about the outcome of the other side's measurement. They cannot introduce information into the particle, and thus can't transmit information. The only thing you learn is what you already knew: the other particle had a 50% chance of being in one state or another, with the added fact that it's now correlated to your particle with a 75% (depending on experiment) chance. This is information that didn't exist until that moment, so it couldn't have been sent out and reach you before you measure your particle, which would inform you with 75% certainty how your particle would act and break causality.
The universe expanding is a property of space itself. And "faster than light" is only because minuscule space expansion in any quantum of space adds up with distance, e.g. 1 picometer per kilometer per second, add enough trillions of kilometers and you get that faster than light expansion. There is no mass, particle or information moving THROUGH space faster than the speed of light, which is what the said limit concerns.
Well the other way to look at it is: gravity travels at maximum possible speed in this universe. Light in vacuum can also reach same maximum speed. I guess we would say the light travels with speed of gravity if we measured them in other order.
The thing that did for me is realizing that people on opposite sides of the United States can't play music together if it requires any rhythmic coordination, even with a true speed-of-light signal with no other sources of latency.
Interesting! Presumably last 12 measures if you're playing a 12-bar blues, etc.
Big downside is you're stuck playing to a metronome, which would be enough for me to skip it, but it depends on the kind of music you're playing.
I could imagine that if the music is rhythmically slow and vague and improvised, big latencies are OK, and actually might yield some pretty interesting creative results.
Another model I've thought about is to structure players in a rooted DAG, and players can hear only people upstream of them.
E.g., you could build an orchestra by having a conductor and section leaders in a room together (or at within very low latency of each other). Other players could hear the leaders and play along, and then an audience could hear everyone. You could also do something more complicated like build things out in linear or power-of-2 layers, where each layer can hear everything upstream of it, and therefore many players would get a partial sense of the orchestral effect.
This could work nicely for improvised music, too, with causality preserved.
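A toy Python sketch of that "rooted DAG" arrangement, to make the hearing rule concrete. The player names and the layout are made up, and the room with the conductor and section leaders is modelled as a single root node (an assumption, since people in one room hear each other).

```python
# Hypothetical ensemble: each player maps to the players they listen to directly.
LISTENS_TO = {
    "leaders_room": [],                 # conductor + section leaders, treated as one node
    "violin_2": ["leaders_room"],
    "cello_2": ["leaders_room"],
    "violin_3": ["violin_2"],
    "audience": ["violin_3", "cello_2"],
}


def audible(player, listens_to):
    """Everyone upstream of `player`, directly or through other players."""
    heard = set()
    stack = list(listens_to[player])
    while stack:
        upstream = stack.pop()
        if upstream not in heard:
            heard.add(upstream)
            stack.extend(listens_to[upstream])
    return heard


for player in LISTENS_TO:
    # Causality check: nobody ends up upstream of themselves.
    assert player not in audible(player, LISTENS_TO)
    print(player, "hears", sorted(audible(player, LISTENS_TO)))
```

The root hears nobody, later layers hear progressively more of the ensemble, and only the audience hears everyone.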
How does that work for the one playing ahead of everyone else? He just doesn’t hear anything? Or he hears his own music from 1 second ago? Or worse, other people’s music from 1 second ago.
It is actually the way to go with client-server programs, such as Jamulus. People from distant locations try to choose/run a server closer to their geographical (or, more properly, with a correction on how the fiber runs) midpoint.
If you're using a normal monitor, the bottleneck would be transferring the results of the calculation to the monitor, which commonly has a latency of 3ms or more. So when the monitor displays the calculations, the CPU has already moved on to other things :)
It takes 10-20ms for the pixels to transition on an LCD display. And on 60Hz, it's 8+/-8ms for the monitor to actually address the row with your information. Luckily, the CPU doesn't need to wait for the monitor. And the slowest part of the chain will almost always be getting it from your eyes to your hands (250ms+).
It can complete many multiplications in that time, especially if you factor in parallelism. An 8-core machine using AVX-512 could do a few thousand 32-bit multiplications in that time. Your GPU can do tens of thousands, maybe hundreds of thousands depending on the model.
Or another way to look at it - the computer can do an absolutely insane amount of math in the time it takes to roundtrip a single byte to the datacenter in US-West.
The more generic way I like to put it is that throughput has been improving exponentially for decades thanks to Moore's law, but latency hasn't changed much at all and has a hard limit due to the speed of light.
Hence the ratio between latency and compute has been changing exponentially. Even a linear or quadratic change would be dramatic, but exponential is something people just can't wrap their heads around. They're unable to really internalise it, in much the same way that in the early days of COVID people couldn't quite fathom how it is possible to go from 3-5 cases per day to tens of thousands.
HDD random I/O latencies are about 10 to 100x slower than a network hop. These days local SSD latencies are about 100x better than a typical network hop and this is just going to keep going. It'll soon be 1,000x better, then 10,000x, etc...
Any architecture using "remote storage" or "remote database calls" will be absolutely hamstrung by this. It'll be the equivalent of throwing away 99.99% or even 99.999% of the available performance.
People will eventually wise up to this and start switching over to distributed databases that run in the same VM/container as the application tier. So instead of "N" web servers talking to "M" database servers, it'll be N+M nodes with both components deployed into them.
Whatever argument can be made against this new architecture will become exponentially invalidated over time. Putting everything together is "too many GB of software to deploy"? Bzzt... we'll have 1 TB ram soon in typical servers. The CPU load of both together is too high? Bzzt.. the next EPYC CPUs will likely have 128 cores! Cache thrashing a problem? Bzzt... 1 GB and larger L3/L4 CPU caches are just around the corner.
Eventually you can't stuff enough computing in a small area (power density). Therefore you have to connect multiple CPUs spread out in space. The limit for many supercomputers is about how long it takes light or electrical signals to travel about 20 meters. Latency to first result is only part of the measurement that matters.
Huh, but then I'm pretty sure that there are some paths inside the CPU die that are long enough that speed of light is a consideration at these frequencies. Must require a lot of smart people to design these things, yet it only takes a bunch of junior developers to bog them down.
There is indeed the speed of propagation of electric potential taken into account, that is, how long it takes for the input of a logical gate or a logical subsystem to produce the output (that involves the propagation of electric potential through the chip's conductors). If your clock is too fast for the size of your subsystem, the result will not be correct at the output before the next cycle begins, so your system will just be bogus.
1. There's no real limit to how slow you can make code. So that means there can be surprising large speedups if you start from very slow code.
2. But, there is a real limit to the speed of a particular piece of code. You can try finding it with a roofline model, for example. This post didn't do that. So we don't know if 201ms is good for this benchmark. It could still be very slow.
As a front-end developer, I can't help but notice how much useless computation is going on in a fairly popular library: Redux. It's a store of items; if just one tiny item changes in the whole store, every subscriber of every item gets notified and a compare function is run to check whether it changed. Perhaps I'm misunderstanding something, and I don't mean to bash Redux (I'm sure there are well-deserved reasons it got popular), but to me that just sounds insane, and the fact that it got such widespread adoption perfectly reflects how little care is given to performance nowadays.
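To make the pattern being criticised concrete, here is a toy Python model of it. This is not Redux's actual implementation or API, just a hypothetical `Store` class that behaves the way the comment describes: one change, and every subscriber's selector is re-run and compared.

```python
class Store:
    """Hypothetical store: every subscriber is re-checked on every update."""

    def __init__(self, state):
        self.state = dict(state)
        self.subscribers = []  # each entry: [selector, callback, last_seen_value]
        self.comparisons = 0   # how many compare checks have run so far

    def subscribe(self, selector, callback):
        self.subscribers.append([selector, callback, selector(self.state)])

    def update(self, key, value):
        self.state = {**self.state, key: value}  # new state object, old one untouched
        for sub in self.subscribers:
            selector, callback, last_seen = sub
            current = selector(self.state)
            self.comparisons += 1
            if current != last_seen:
                sub[2] = current
                callback(current)


store = Store({f"item{i}": 0 for i in range(1000)})
notified = []
for i in range(1000):
    # Each subscriber only cares about its own item.
    store.subscribe(lambda s, k=f"item{i}": s[k], notified.append)

store.update("item42", 1)
print(store.comparisons, len(notified))
```

One item changed, one callback actually fired, but a thousand selector-and-compare checks ran to find that out. Whether that matters in practice depends on how cheap the checks are compared with the rendering work they avoid.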
The reason I don't use a high-end laptop and am not eager to upgrade is that I can relate to the average user of the software I develop. I've seen plenty of popular web apps feel really sluggish.
>The reason I don't use a high-end laptop and am not eager to upgrade is that I can relate to the average user of the software I develop.
Thank you so so much. It's insane how it feels like the speed of much of our software hasn't improved or even regressed despite the gigantic advancements made over the years. People really don't seem to care about this.
I had an argument about it with a senior colleague regarding some industry software.
He figured it wasn't worthwhile to improve the speed of some table fetching and calculations that people actually had to wait on since it would only amount to a bit more than a second or so on top of the regular slowness of it all.
A second that had been multiplied across at least 20 PCs, each going through it at least 100 times a day, on more than 260 days each year, over at least 10 years so far.
Turns out more than 5 million seconds is a lot of man-hours, which, whilst cheaper than ours, amount to many times what it would have taken to fix it.
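The multiplication behind that figure, as a quick Python check. The inputs are the lower bounds given in the comment; the variable names are just for illustration.

```python
pcs = 20
runs_per_day = 100
working_days_per_year = 260
years = 10
seconds_lost_per_run = 1

total_seconds = pcs * runs_per_day * working_days_per_year * years * seconds_lost_per_run
total_hours = total_seconds / 3600

print(f"{total_seconds:,} seconds = {total_hours:,.0f} hours")
```

That is roughly 1,400 working hours spent waiting, before counting anything above the "at least" figures.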
Hi, I believe I understand you. If you look at immutable data structures implemented using JS primitives, it will surely look terrible. However, there's a lot of benefit to using a FP approach like Redux.
It's much easier to reason about state updates if all you have is pure functions.
It allows you to avoid very annoying and hard to catch bugs. I've seen this personally, when replacing a spaghetti component with a straightforward `useReducer` hook.
Unfortunately, we don't really have a performant way to express this pattern in JS (or even in other languages?). You could use something like elm-lang, but it's not as widespread.
So from your post it follows that if a developer can reason about the state changes of their app without redux, they should do so if there are performance concerns. Right?
I say this as a webdev who wrote pure vanilla JS SPAs a decade ago, and someone who uses Redux on most projects today. So I know it’s totally possible to have performant mutable state management on a project that isn’t a mess - that’s how we always did stuff before redux.
Can SolidJS reason about collections of items being added and removed? My pseudo-reactive Qt apps do well with structs, but I often resort to recomputing the entire list of items when elements are added or removed (because QAbstractItemModel is hell to work with, and because my "move items" commands are not exposed to the GUI layer). Perhaps even diffing the item list would be faster than telling the GUI that all data was changed. (Though with <100 items, it really doesn't matter.)
> if a developer can reason about the state changes of their app without redux, they should do so if there are performance concerns. Right?
That is correct :)
However, I'm not sure how many developers will be able to maintain the project and keep the invariants implicitly ingrained in the codebase by the smart developer who can reason about mutable state changes.
I think developer speed is more important than optimising clock cycles unnecessarily. Generally writing to the DOM is much much slower than evaluating a few thousand expressions.
> I think developer speed is more important than optimising clock cycles unnecessarily.
Developer time is spent once.
Users will always have to pay the price of additional run time.
For. Each. Single. User. Always.
It scales!
Due to the scale of, e.g. slow front-ends, with millions of users, this takes a HUGE amount of time. Only to save a few hours or days to develop it better.
Having 1 million users each wait a single second is already 11 days. If they have to wait that single second for each interaction, it quickly adds up.
It is also bad for the environment due to scaled up inefficiency and resulting increase of power usage.
> Having 1 million users each wait a single second is already 11 days.
This will sound like a nitpick, but it's actually worse. 1 million users waiting a single second is 11,000,000 seconds, right? A day has 86,400 seconds. 11 million divided by 86k is 127.31.
That means a million users combined just spent 127 days and 8 hours because of the "just one second" delay.
Ah that's what happens when I don't have my cup of coffee in the morning. I went from 1 to 11 million in a typo and didn't even reason. So yeah, it was about 11.6 days, not 127. Guess I'll double check having my coffee next time I'm doing back-of-the-napkin math /facepalm
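For the record, the napkin math in Python, with both the intended input and the mistyped one. The helper name is made up.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def combined_wait_days(users, seconds_each):
    """Total waiting time across all users, expressed in days."""
    return users * seconds_each / SECONDS_PER_DAY


intended = combined_wait_days(1_000_000, 1)    # one million users, one second each
mistyped = combined_wait_days(11_000_000, 1)   # the accidental 11 million

print(f"{intended:.2f} days vs {mistyped:.2f} days")
```

So the original "already 11 days" was about right, and the 127-day figure comes purely from the extra factor of 11.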
Although I 100% agree with you, the problem is that these costs don't affect the original developer; it is an externality; a lot like carbon pollution. It's cheaper for the organisation to optimise for developer speed, even if the cost of that is borne by all the users.
I’m not claiming that we should not improve that single second, but summing it is a meaningless operation. It doesn’t matter for a user how many other users spent time on it as well.
Redux is not a new pattern. The pattern is many decades old. The reason we use it now is because computers have become fast enough that it's okay now. Nobody thinks this is a performant pattern; it's only "new" because of how terrible the performance is. This is offset by how easy it is to use. All of this also applies to React and Vue.
blaming popular webapps being sluggish after spending a whole paragraph on Redux is a bit of a non sequitur imo. performance issues are multicausal, i hope you can separate criticism of one library from emergent properties of complete products
Don't get me wrong, pandas is a nice library ... but the odd thing is, numpy already has, like, 99% of that functionality built in in the form of structured arrays and records, is super-optimised under the hood, and it's just that nobody uses it or knows anything about it. Most people will have never heard of it.
To me pandas seems to be the sort of library that became popular because it mimics the interface of a popular library from another language that people wanted to migrate from (namely dataframes from R), but that's about it.
Compounding this, is that, it is now becoming an effective library to do things, even if backward, because the network effect means that people are building stuff to work on top of pandas, rather than on top of numpy.
The only times I've had to use pandas in my personal projects was either:
a) when I needed a library that 'used pandas rather than numpy' to hijack a function I couldn't be bothered to write myself (most recently seaborn heatmaps, and exponentially weighted averages - both relatively trivial things to do with pure numpy, and probably faster, but, eh. Leftpad mentality etc ...)
b) when I knew I'd have to share the code with people who would then be looking for the pandas stuff.
> numpy already has, like, 99% of that functionality built in in the form of structured arrays and records
Respectfully, this is pretty wrong. Pandas does vastly more out of the box than numpy. Off the top of my head: I/O from over a dozen of data formats, joins/merges, sql queries directly to dataframes, sql-like queries on dataframes, index slicing by time, multi-indexes, much more ergonomic grouping/aggregation functions, ergonomic wrappers around common graphing use-cases, rolling windows.
I'm not even really a power user of it, so there's probably a zillion more things it does that numpy can't out of the box, and I don't wanna spend time writing and validating my own version when an implementation exists.
Pandas does a lot, and often times most of it isn’t needed. Basic functionality like Map, Reduce, GroupBy, InnerJoin, LeftJoin, CrossJoin, row or column generators, and transformations between columnar and row based data structures, are often needed but come with a heavy weight library that is not performant when it counts.
Because I needed these operations, I wanted to work with Numpy directly, and didn’t want to write custom implementations each time, I created a library to do it. It also has constructor methods for Python Dicts, any kind of Iterable, CSV, SQL query, pandas DataFrames and Series, or otherwise. As well as destructor methods to generate whatever you need when done. It tries its best to maintain the types you specify, and offers a means to cast as easily as possible. All functions return a single type to allow static type checking. And for performance, there is a “trust me I know what I’m doing” mode for extremely fast access to the data which achieves about a 10x speed up by skipping all data validation steps.
Everything it does outperforms pandas, except for the Joins. It does allow inequality joins and multiple join conditions, but the general solution used isn’t very fast. Anyone reading this who would be interested in improving this component would be welcome to contribute!
Article says at one point,
"We have reduced the time for the computation by ~119%!", which is impossible. If you reduce it by 100% it is taking zero time already.
People like to talk in percentages when it's obviously unclear what it means, and they frequently get it wrong.
It gets even better when people start switching between percentages and "percentage points" referring to a measure that's in percentages originally.
Most of those things are easier to communicate and harder to get wrong if you try speaking in a more natural way. Saying this is now "twice as fast" or "2.1x faster" is much clearer and can't go past zero :)
Similarly, I think it'd help to switch back from percentages to actual factors (119% = 1.19), and saying "we reduced the time for the computation by 1.19 of original time" would clearly show what's wrong (and saying "by 1.19x" would signal how it's a small reduction, so it's wrong as well).
Finally, I am 94.8% certain people will keep using percentages even where inappropriate, and with too much precision too!
I work primarily with optimizations and depending on context I will express them in how much time it shaves off one iteration ("this saves 1 ms!"), the change in frame rate ("went from 20-22 fps to a stable 26 fps"), or the ratio between before and after ("it's twice as fast", "only takes one third of the time it used to!", ...)
If I travel 25 miles in 1 hour, my speed was 25mph. If I go 100% faster, I'm going 50mph and get there in 30 minutes. If I go 200% faster, I'm going 75mph and get there in 20 minutes.
However, the original statement of "We have reduced the time for the computation by ~119%!" is still wrong-seeming, I agree. It should be "We have increased the speed for the computation by 119%" or "We have reduced the time for the computation by <WHATEVER>" :)
> We have reduced the time for the computation by ~119%!
You can say 100% faster when you are talking about speed. You can't say 100% faster when you are talking about duration. "Reduced the time" is talking about duration.
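The speed-versus-duration distinction is easy to check with a few lines of arithmetic. This is only a sketch, and it assumes the article's 119% was really meant as a speed increase, which is a guess from this thread rather than something the article states. The function names are made up.

```python
def time_reduction_pct(speed_increase_pct: float) -> float:
    """Percent of the original duration saved when speed rises by the given percent."""
    speedup = 1 + speed_increase_pct / 100  # e.g. 119% faster means 2.19x the speed
    return (1 - 1 / speedup) * 100


def travel_minutes(miles: float, mph: float) -> float:
    """Minutes needed to cover a distance at a constant speed."""
    return miles / mph * 60


# The 25-mile trip from the comment above, at 100% and 200% faster than 25 mph.
trip_at_50 = travel_minutes(25, 50)
trip_at_75 = travel_minutes(25, 75)

# If 119% was a speed increase, this is the matching reduction in time.
reduction_for_119 = time_reduction_pct(119)
```

Under that assumption the time saved comes out at a bit over 54%, and no speed increase, however large, can push the reduction to 100%.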
I don't think it's about banning words at all. It's about words making sense.
"What's cheaper? The price is." Now that just doesn't make any sense, since a price isn't cheap or expensive, it's high or low. The thing that is priced can be cheap or expensive, but that's not what's being said.
"What's faster? The speed is." Doesn't make sense either. Speed isn't fast, the speedy thing is. However, "What's faster? The acceleration is." is fine, because you can have slow or fast acceleration (I think?).
I'm an ESL speaker, so please do tell me if I'm wrong and how.
Isn't it crazy how the branch predictor is something like 99% correct? Which means a computer is almost deterministic, it almost knows the future. A tiny bit better and we wouldn't need to show up at the office.
Of course multiply this by the sheer number of calculations and even that little misprediction results in huge differences. The reality is actually quite sobering: a computer mostly calculates the same thing over and over.
That’s a realization that made me a better programmer.
I think when I was younger, I thought of programming as very open ended. I.e. I wanted to build abstract, general solutions which would be able to handle any future case.
Over time I realized the problem space is mostly quite well defined, and when I started thinking about programming as defining an assembly line for computations my results and time to solution improved.
When you execute a loop 1000 times it is only going to change branches once the loop finishes. If you always predict that the loop didn't finish, your branch predictor will correctly predict 99.9% of branches.
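The 99.9% figure can be reproduced with a toy model. This is a sketch of a static "always predict the loop continues" rule, not of how any real CPU's predictor works, and the function name is made up.

```python
def static_taken_accuracy(iterations: int) -> float:
    """Accuracy of always predicting 'the loop continues' for a single loop."""
    # The loop's closing branch is evaluated once per iteration: it is
    # taken (True) every time except the last, when the loop exits.
    outcomes = [True] * (iterations - 1) + [False]
    # The prediction is always True, so every taken branch is a correct guess.
    correct = sum(1 for taken in outcomes if taken)
    return correct / len(outcomes)


accuracy = static_taken_accuracy(1000)
```

The single misprediction is the loop exit, so the accuracy is (n - 1) / n for a loop of n iterations.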
I've been lightly banging the drum the last few years that a lot of programmers don't seem to understand how fast computers are, and often ship code that is just miserably slower than it needs to be, like the code in this article, because they simply don't realize that their code ought to be much, much faster. There's still a lot of very early-2000s ideas of how fast computers are floating around. I've wondered how much of it is the still-extensive use of dynamic scripting languages and programmers not understanding just how much performance you can throw away how quickly with those things. It isn't even just the slowdown you get just from using one at all; it's really easy to pile on several layers of indirection without really noticing it. And in the end, the code seems to run "fast enough" and nobody involved really notices that what is running in 750ms really ought to run in something more like 200us.
I have a hard time using (pure) Python anymore for any task where speed is even remotely a consideration. Not only is it slow even at the best of times, but so many of its features beg you to slow down even more without thinking about it.
I agree 100%. I wish every software engineer would spend at least a little time writing some programs in bare C and running them to get a feel for how fast a native executable can start up and run. It is breathtaking if you're used to running scripting languages and VMs.
Related anecdote: My blog used to be written using Jekyll with Pygments for syntax highlighting. As the number of posts increased, it got slower and slower. Eventually, it took about 20 seconds to refresh a simple text change in a single blog post.
I eventually decided to just write my own damn blog engine completely from scratch in Dart. Wrote my own template language, build graph, and syntax highlighter. By having a smart build system that knew which pages actually needed to be regenerated based on what data actually changed, I hoped to get very fast incremental rebuilds in the common case where only text inside a single post had changed.
Before I got the incremental rebuild system working, I worked on getting it to just do a full build of the entire blog: every post page, pages for each tag, date archives, and RSS support. I diffed it against the old blog to ensure it produced the same output.
Once I got that working... I realized I didn't even need to implement incremental rebuilds. It could build the entire blog and every single post from scratch in less than a second.
I don't know how people tolerate slow frameworks and build systems.
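For context, the kind of "only rebuild what changed" check described above can be sketched with content hashes. This is not the blog engine's actual design (that was written in Dart and its internals aren't described here). The function names and the in-memory manifest are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a source file's contents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def pages_to_rebuild(sources: dict, manifest: dict) -> list:
    """Names of sources whose content differs from what the last build recorded."""
    return sorted(
        name
        for name, text in sources.items()
        if manifest.get(name) != content_hash(text)
    )


def record_build(sources: dict, manifest: dict) -> None:
    """Remember the fingerprint of every source after a successful build."""
    manifest.clear()
    manifest.update({name: content_hash(text) for name, text in sources.items()})
```

A real build graph would also have to track which pages depend on which posts (tag pages, archives, RSS). When a full build takes under a second, that bookkeeping stops being worth it.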
Yeah, I've written static site generators in Go and Rust among other languages (it's my goto project for learning a new language). Neither needed incremental builds because they build instantly. The bottlenecks are I/O.
I've also worked in Python shops for the entirety of my career. There are a lot of Python programmers who don't have experience with and thus can't quite believe how much faster many other languages are (100X-1000X sounds fast in the abstract, but it's really, really fast). I've seen engineering months spent trying to get a CPU-bound endpoint to finish reliably in under 60s (yes, we tried all of the "rewrite the hot path in X" things), while a naive Go implementation completed in hundreds of milliseconds.
Starting a project in Python is a great way to paint yourself into a corner (unless you have 100% certainty that Python [and "rewrite hot path in X"] can handle every performance requirement your project will ever have). Yeah, 3.11 is going to get a bit faster, but other languages are 100-1000X faster--too little, too late.
Python is slow in many things like pure looping and arithmetic, even though there are workarounds to make that 1-10x slower rather than 100-1000X (eg. C-based implementations, including all the itertools stuff).
I am sometimes frustrated that I can't just loop over a string character by character and not get crappy performance, but the "problem" you (and me) are seeing in existing codebases is that Python is very inviting to beginners, and they are not frustrated with this because they don't know it :)
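To make the "loop in Python vs. let a C-based builtin do it" trade-off concrete, here is a small sketch. Both functions count vowels. The first walks the string one character at a time in the interpreter, and the second leans on `str.count`, which does its scanning inside CPython's C code. The function names are made up, and no timings are claimed here, so measure on your own data.

```python
def count_vowels_loop(text: str) -> int:
    """Character-by-character loop: every iteration runs in the interpreter."""
    total = 0
    for ch in text:
        if ch in "aeiou":
            total += 1
    return total


def count_vowels_builtin(text: str) -> int:
    """Five passes over the string, but each pass is a single builtin call."""
    return sum(text.count(vowel) for vowel in "aeiou")
```

Something like `timeit.timeit(lambda: count_vowels_loop(s), number=100)` against the builtin version on a long string is a quick way to see the gap on your own machine.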
But as you note, bottleneck is the I/O, and a program waiting for I/O in Python and I/O in C will wait the same time after the computation is done.
If you are writing software that can parallelize well independently (eg. web apps) and your memory pressure is not the most important thing, you simply run multiple Python processes to max out the CPU (this avoids the GIL unlike async Python). And you keep your dependencies low.
> even though there are workarounds to make that 1-10x slower rather than 100-1000X (eg. C-based implementations, including all the itertools stuff).
These only apply for specific problems, and very few applications are purely CSV parsing or purely matrix math operations. In the real world, you often spend more time marshaling your Python data to C than you save by doing your computation in C.
> But as you note, bottleneck is the I/O, and a program waiting for I/O in Python and I/O in C will wait the same time after the computation is done.
The bottleneck in a static site generator is I/O. The fact that Python, Ruby, etc based implementations take tens of seconds or more while Go and Rust finish instantly for an I/O bound problem is pretty damning.
> If you are writing software that can parallelize well independently (eg. web apps) and your memory pressure is not the most important thing, you simply run multiple Python processes to max out the CPU (this avoids the GIL unlike async Python).
The goal isn’t to saturate the CPU as much as it is to complete requests in a timely fashion. If it’s just some light translation between HTTP and database layers, Python is fine, but if you have to do anything computationally significant at all, it can range from “a huge pain” to “virtually impossible”. I gave the example earlier of a web service that was struggling to complete requests in even 60s (despite using Numpy under the hood where possible) while a naive Go implementation completed in hundreds of ms.
> The bottleneck in a static site generator is I/O. The fact that Python, Ruby, etc based implementations take tens of seconds or more while Go and Rust finish instantly for an I/O bound problem is pretty damning.
My point was that if this was the case, your Python code is probably suboptimal.
Sure, you are comparing against naive implementation as well, but if performance is a concern, don't do naive Python :)
> I gave the example earlier of a web service that was struggling to complete requests in even 60s (despite using Numpy under the hood where possible) while a naive Go implementation completed in hundreds of ms.
Yes, it's easy and sometimes even idiomatic to write non-performant Python code. Getting the most out of pure Python is hard and it means avoiding some common patterns.
Eg. simply using sqlalchemy ORM (to construct rich dynamic ORM objects) instead of sqlalchemy core (tuples) to get 100k+ rows from DB is 20x slower, and that's still 2x slower from pure psycopg (also tuples using basic types). There are plenty of examples like this in Python, unfortunately.
> My point was that if this was the case, your Python code is probably suboptimal.
> Sure, you are comparing against naive implementation as well, but if performance is a concern, don't do naive Python :)
I agree, and I'll go further: if performance could be a concern and you aren't certain that even optimized Python is up for the task, don't do Python. :)
I don't know how optimized these SSGs are, but given how frequently this complaint occurs and how popular they are, I would expect that someone would have tried to optimize them a bit. Even assuming naive implementations, tens of seconds versus tens of milliseconds for an I/O-bound task is pretty concerning.
> Yes, it's easy and sometimes even idiomatic to write non-performant Python code. Getting the most out of pure Python is hard and it means avoiding some common patterns.
It probably shouldn't be easy for someone to write non-performant Python code when they're trying desperately to write performant Python code. :)
> Getting the most out of pure Python is hard and it means avoiding some common patterns.
And even then, you're probably going to be coming in 10-100X slower than naive Go/Java/C#/etc unless your application happens to be a good candidate for C-extensions (e.g., matrix math) or if it really is I/O bound (a CRUD webapp). It honestly just seems better to avoid Python altogether than try to write Python without using "common patterns" (especially absent guidance about which patterns to avoid or how to avoid them).
> I agree 100%. I wish every software engineer would spend at least a little time writing some programs in bare C and running them to get a feel for how fast a native executable can start up and run. It is breathtaking if you're used to running scripting languages and VMs.
Conversely when 99.9% of the software you use in your daily life is blazing fast C / C++, having to do anything in other stacks is a complete exercise in frustration, it feels like going back a few decades in time
Conversely when 99.9% of the software you use in your daily life is user friendly Python, having to do anything in C/C++ is a complete exercise in frustration, it feels like going back a few decades in time
As a person who uses both languages for various needs, I disagree. Things which take minutes in optimized C++ will probably take days in Python, even if I use the "accelerated" libraries for matrix operations and other math I implement in C++.
Lastly, people think C++ is not user friendly. No, it certainly is. It requires care, yes, but a lot of things can be done in fewer lines than people expect.
I was a C++ dev in a past life and I have no particular fondness for Python (having used it for a couple of decades), and "friendliness" is a lot more than code golf. It's also "being able to understand all of the features you encounter and their interactions" as well as "sane, standard build tooling" and "good debugability" and many other things that C++ lacks (unless something has changed recently).
I delved into Python recently to work on some data science hobbies and a Chess program and it's frankly been fairly shit compared with other languages I use.
Typescript (by way of comparison with other non-low-level languages) just feels far more solid wrt type system, type safety, tooling etc. C# (which I've used for years) is faster by orders of magnitude and IMO safer/easier to maintain.
Python is a powerful yet beginner friendly language with a very gentle learning slope, but I would still take C++ tooling and debuggability any day over Python.
Nah man, I've spent way too much time trying to piece together libraries to turn core dumps into a useful stack trace. Similarly, as miserable as Python package management is, at least it has a package manager that works with virtually every project in the ecosystem. I actually really like writing C++, but there are certain obstacles that slow a developer down tremendously--I could forgive them if they were interesting obstacles (e.g., I can at least amuse myself pacifying Rust's borrow checker), but there's no joy in trying to cobble together a build system with CMake/etc or try to get debug information for a segfault.
You need to provide all of the libraries referenced by the core dump (at the specific versions and compiled with debug symbols) to get gdb to produce a useful backtrace. It's been a decade since I've done professional C++ development, so I'm a bit foggy on the particulars.
Glad to hear the 2022 C++ ecosystem is finally catching up on some regards, but how does it know which version of those dependencies to download, and how does it download closed source symbols?
Java and Go were both responses to how terrible C++ actually is. While there are footguns in python, java, and go, there are exponentially more in C++.
As a person who wrote Java and loved it (and I still love it), I understand where you're coming from, however all programming languages thrive in certain circumstances.
I'm no hater of any programming language, but a strong proponent of using the right one for the job at hand. I write a lot of Python these days, because I neither need the speed, nor have the time to write a small utility which will help a user with C++. Similarly, I'd rather use Java if I'm going to talk with bigger DBs, do CRUD, or develop bigger software which is going to be used in an enterprise or similar setting.
However, if I'm writing high performance software, I'll reach for C++ for the sheer speed and flexibility, despite all the possible foot guns and other not-so-enjoyable parts, because I can verify the absence of most foot-guns, and more importantly, it gets the job done the way it should be done.
I've seen a lot of bad C++ in my life, and have seen Java people write C++ like they would Java.
Writing good C++ is hard. People who think they can write good C++ are surprised to learn about certain footguns (static initialization before main, exception handling during destructors, etc).
I found this reference which I thought was a pretty good take on the C++ learning curve.
> I've seen a lot of bad C++ in my life, and have seen Java people write C++ like they would Java.
Ah, don't remind me Java people write C++ like they write Java, I've seen my fair share, thank you.
> Writing good C++ is hard.
I concur, however writing good Java is also hard. e.g. Swing has a fixed and correct initialization/build sequence, and Java self-corrects if you diverge, but you get a noticeable performance hit. Most developers miss the signs and don't fix these innocent looking mistakes.
I've learnt C++ first and Java later. I also tend to hit myself pretty hard during testing (incl. Valgrind memory sanity and Cachegrind hotpath checks), so I don't claim I write impeccable C++. Instead I assume I'm worse than average and try to find what's wrong vigorously and fix them ruthlessly.
The remark is rooted mostly in variable naming and code organization. I've seen a C++ codebase transferred to a Java developer, and he disregarded everything from the old codebase. Didn't refactor the old code, and the new additions were done Java Style. CamelCase file/variable/function names, every class in its own file with ClassName.cpp files littered everywhere, it was a mess.
The code was math-heavy, and became completely unreadable and un-followable. He remarked "I'm a java developer, I do what I do, and as long as it works, I don't care".
That was really bad. It was a serious piece of code, in production.
The biggest weakness of C++ (and C) is non-localized behavior of bugs due to undefined behavior. Once you have undefined behavior, you can no longer reason about your program in a logically consistent way. A language like Python or Java has no undefined behavior so for example if you have an integer overflow, you can debug knowing that only data touched by that integer overflow is affected by the bug whereas in C++ your entire program is now potentially meaningless.
Memory write errors (sometimes induced by UB) in one place of the program can easily propagate and later fail in a very different location of the program, with absolutely zero diagnostics of why your variable suddenly had a value out of possible range.
This is why valgrind, asan and friends exist. They move the error diagnostic to the place where error actually happened.
If your C++ program exhibits undefined behaviour, the compiler is allowed to format your entire hard drive. Or encrypt it and display a "plz pay BTC" message. That's called a vulnerability. Real and meaningful security checks have been removed as "dead code" because of signed integer overflow (which is undefined behaviour by default).
If anything, I would guess the gross misunderstanding sprouted somewhere between the specs and the compiler writers. Originally, UB was mostly about bailing out when the underlying platform couldn't handle this particular case, or explicitly ignoring edge cases to simplify implementations. Now however it's also a performance thing, and if anything is marked as UB then it's fair game for the optimiser — even if it could easily be well defined, like signed integer overflow on 2's complement platforms.
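For readers wondering what "well defined" overflow would even mean here, this is a small model of 32-bit two's complement wrapping. It is a Python model of the arithmetic only. It says nothing about what a C or C++ compiler actually emits, and the helper name is made up.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -(2**31)


def wrap_int32(value: int) -> int:
    """Reduce any Python int to the value a 32-bit two's complement register would hold."""
    value &= 0xFFFFFFFF                 # keep only the low 32 bits
    if value & 0x80000000:              # top bit set: the value is negative
        return value - 0x100000000
    return value


# Under wrapping semantics, one past the maximum lands on the minimum.
wrapped = wrap_int32(INT32_MAX + 1)
```

The complaint in this thread is that the language leaves the overflowing case undefined instead of specifying this result, even on hardware where the addition itself wraps.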
> If your C++ program exhibits undefined behaviour, the compiler is allowed to format your entire hard drive. Or encrypt it and display a "plz pay BTC" message.
No, it isn't. That's a completely made up fabrication. And if you had a compiler that was going to do that, then what the standard says or if there's undefined behavior is obviously not relevant or significant in the slightest.
The majority of the UB optimization complaints are because the compiler couldn't tell that UB was happening. It didn't detect UB and then make an evil laugh and go insane. That's not how this works.
Compilers cannot detect UB and then do things in response within the rules of the standard. Rather, they are allowed to assume UB doesn't happen. That's it, that's all they do. They just behave as though your source has no UB at all. As far as the compiler is concerned, UB doesn't exist and can't happen.
When a compiler can detect that UB is happening it'll issue a warning. It never silently exploits it.
> Real and meaningful security checks have been removed as "dead code" because of signed integer overflow (which is undefined behaviour by default).
Real and meaningful security checks have been removed because the security check happened after the values were already used in specific ways, not because of UB. The values were already specified in the source code to be a particular thing via earlier usage. UB is just the shield for developers who wrote a bug to hide behind to avoid admitting they had a bug.
Use UBSAN next time.
> even if it could easily be well defined, like signed integer overflow on 2's complement platforms.
Signed integer overflow is defined behavior, that's not UB. Also platform specific behavior is something the standard doesn't define - that's why it was UB in the first place.
It is kinda ridiculous it took until C++20 for this change, though
> > UB allows the compiler to format/encrypt your entire hard drive.
> No, it isn't. That's a completely made up fabrication.
Ever heard of viruses exploiting buffer overflows to make arbitrary code execution? One cause of that can be a clever optimisation that noticed that the only way the check fails is when some UB is happening. Since UB "never happens", the check is dead code and can be removed. And if the compiler noticed after it got past error reporting, you may not even get a warning.
You still get the vulnerability, though.
> UB is just the shield for developers who wrote a bug to hide behind to avoid admitting they had a bug.
C is what it is, and we live with it. Still, it would be unreasonable to say that the amount of UB it harbours isn't absolutely ludicrous. It's like asking children to cross a poorly mapped minefield and blame them when they don't notice a subtle cue and blow themselves up.
Also, UBSan is not enough. I ran some of my code under ASan, MSan, and UBSan, and the TIS interpreter still found a couple things. And I'm talking about pathologically straight-line code where once you test for all input sizes you have 100% code path coverage.
> Signed integer overflow is defined behavior, that's not UB.
The C99 standard explicitly states that left shift is undefined on negative integers, as well as signed integers when the result overflows. I had to get around that one personally by replacing x<<n by x*(1<<n) on carry propagation code.
> Also platform specific behavior is something the standard doesn't define - that's why it was UB in the first place.
One point I was making is, compiler writers didn't get that memo. They treat any UB as fair game for their optimisers. It doesn't matter that signed integer overflow was UB because of portability, it still "never happens".
> C is what it is, and we live with it. Still, it would be unreasonable to say that the amount of UB it harbours isn't absolutely ludicrous.
There's a lot of ludicrous stuff about C and I wouldn't recommend anyone use it for anything. Not when Rust and C++ exist.
But UB really isn't the scary boogie man. There could probably stand to be an `as-is {}` block extension for security checks, but that's really about it.
Granted, C is underpowered and I would like namespaces and generics. But from a safety standpoint nowadays, C++ is just as bad. Not only is it monstrously complex, it still has all the pitfalls of C. C++ may have been "more strongly typed" back in the day, but now compiler warnings have made up for that small difference.
Granted, C++ can be noticeably safer if you go RAII pointer fest, but then you're essentially programming in Java with better code generation and a worse garbage collector.
---
There's also a reason to still write C today: its ubiquity. Makes it easier to deploy everywhere and to talk to other languages. It's mostly a library thing though, and the price in testing effort and bugs is steep.
Well, I'll check who gets rid of all undefined overflows first. 2's complement is nice and dandy, but if overflow is still undefined that doesn't buy me much.
I've written a whole bunch of all of those languages, and they each occupy a different order of magnitude of footguns. From fewest to most: Go (1X), Java (10X), Python (100X), and C++ (1000X).
Most of those aren’t “footguns” at all, but rather preferences (naming conventions, nominal vs structural subtyping) and many others are shared with Python (“magical behavior”, Go’s structural subtyping is strictly better for finding implementations than Python’s duck typing) or non-issues altogether (“the Go compiler won’t accept my invalid Go code”).
The “forget to check an error” one is valid, but rare (usually a function will return data and an error, and you can’t touch the data without handling the error)—moreover, once you use Go for a bit, you sort of expect errors by default (most things error). But yeah, a compilation failure would be better. Personally, the things that really chafe me are remembering to initialize maps, which is a rarer problem in Python because there’s no distinction between allocation and instantiating (at least not in practice). I do wish Go would ditch zero types and adopt sum types (use Option[T] where you need a nil-like type), but that ship has sailed.
I’ve operated services in both languages, and Python services would have tons of errors that Go wouldn’t have, including typos in identifiers, missing “await”s, “NoneType has no attribute ‘foo’”, etc but also considerably more serious issues like an async function accidentally making a sync call under the covers, blocking the event loop, causing health checks to fail, and ultimately bringing down the entire service (same deal with CPU intensive endpoints).
In Go, we would see the occasional nil pointer error, but again, Python has those too.
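The "sync call inside an async function" failure mode is easy to reproduce. In this sketch a heartbeat task stands in for a health check and two handlers do the same 200 ms of waiting, one with `time.sleep` and one with `asyncio.sleep`. The handler and helper names are made up, and real services fail in messier ways than this.

```python
import asyncio
import time


async def heartbeat(ticks: list, stop: asyncio.Event) -> None:
    # Stand-in for a health check: record a timestamp every 10 ms.
    while not stop.is_set():
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)


async def blocking_handler() -> None:
    time.sleep(0.2)            # sync call: the whole event loop stalls


async def cooperative_handler() -> None:
    await asyncio.sleep(0.2)   # yields, so other tasks keep running


async def ticks_during(handler) -> int:
    """Count heartbeat ticks that land while `handler` is running."""
    ticks = []
    stop = asyncio.Event()
    task = asyncio.create_task(heartbeat(ticks, stop))
    await asyncio.sleep(0)     # let the heartbeat record its first tick
    start = time.monotonic()
    await handler()
    end = time.monotonic()
    stop.set()
    await task
    return sum(1 for t in ticks if start < t < end)


blocked = asyncio.run(ticks_during(blocking_handler))
healthy = asyncio.run(ticks_during(cooperative_handler))
```

With the blocking handler the heartbeat gets no chance to run until the handler returns, which is exactly how a health check ends up timing out. Both versions look identical to a type checker.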
I personally find C++ more friendly, just because of the formatting that python forces upon you.
But I do have to say that I never managed to really get into python, it always just felt like too much of a hassle, thus I always avoided it if possible.
The formatting python enforces is just "layout reflects control flow". It's really not any more difficult than that, and it's a lot better than allowing layout to lie about control flow.
To each their own, but Python's use of indenting for structure is why I never tried it. It just felt, to me, like it was solving one problem with another.
I think Go gets this right: it consistently uses braces for structure, but has an idiomatic reformatting tool that is applied automatically by most IDEs. This ensures that the format and indentation always perfectly matches the code structure, without needing to use invisible characters.
I didn't like it for years but then I kind of got into it for testing out machine learning and I found it kind of neat. My biggest gripe is no longer the syntax but the slowness, trying to do anything with even a soft performance requirement means having to figure out how to use a library that calls C to do it for you. Working with large amounts of data in native Python is noticeably slower than even NodeJS.
> Things which take minutes in optimized C++ will probably take days in Python, even if I use the "accelerated" libraries for matrix operations and other math
I’m gonna need an example because I do not believe this whatsoever.
I'd rather open the code and show what I'm talking about, however I can not.
Let's say I'm making a lot of numerical calculations which are fed from a lockless queue with atomic operations to any number of cores you want, where your performance is limited by the CPU cores' FPU performance and the memory bandwidth (in terms of both transfer speed and queries that bus can handle per second).
As I noted below, that code can complete 1.7 million complete evaluations per core, per second on older (2014 level) hardware, until your memory controller congests with all the requests. I need to run benchmarks on a newer set of hardware to get new numbers, however I seriously lack the time today to do so and provide you new numbers.
There are definitely operations you cannot speed up in Python as much as in other languages, unless you implement it in one of those other languages and interface it in Python.
That much is obvious from Python providing a bunch of C-based primitives in stdlib (otherwise they'd just be written in pure Python).
In many cases, you can make use of the existing primitives to get huge improvements even with pure Python, but you are not beating optimized C++ code (which almost has direct access to CPU vector operations as well).
Python's advantage is in speed of development, not in speed of execution. And I say that as a firm believer that the majority of the Python code in existence today could be much faster if only it were written with an understanding of Python's internal structures.
This is because numpy and friends are really good at matmuls.
As soon as you step out of the happy path and need to do any calculation that isn't at least n^2 work for every single python call you are looking at order of magnitude speed differences.
Years ago now (so I'm a bit fuzzy on the details) a friend asked me to help optimize some python code that took a few days to do one job. I got something like a 10x speedup using numpy, I got a further 100x speedup (on the entire program) by porting one small function from optimized numpy to completely naive rust (I'm sure c or c++ would have been similar). The bottleneck was something like generating a bunch of random numbers, where the distribution for each one depended on the previous numbers - which you just couldn't represent nicely in numpy.
What took 2 days now took 2 minutes, eyeballing the profiles I remember thinking you could almost certainly get down to 20 seconds by porting the rest to rust.
Have you tried porting the problem into postgres? Not all big data problems can be solved this way but I was surprised what a postgres database could do with 40 million rows of data.
I didn't, I don't think using a db really makes sense for this problem. The program was simulating a physical process to get two streams of timestamps from simulated single-photon detectors, and then running a somewhat-expensive analysis on the data (primarily a cross correlation).
There's nothing here for a DB to really help with, the data access patterns are both trivial and optimal. IIRC it was also more like a billion rows so I'd have some scaling questions (a big enough instance could certainly handle it, but the hardware actually being used was a cheap laptop).
Even if there was though - I would have been very hesitant to do so. The not-a-fulltime-programmer PhD student whose project this was really needed to be able to understand and modify the code. I was pretty hesitant to even introduce a second programming language.
That's definitely quite curious: I am sure pure Python could have been heavily optimized to reach 2 minutes as well, though. Random number generation in Python is C-based, so while the pseudo-random generators from Python's random module might be slow, it's not because of Python itself (https://docs.python.org/3/library/random.html is a different implementation from https://man7.org/linux/man-pages/man3/random.3.html).
Call overhead and loop overhead is pretty big in Python though. The way to work around that in Python is to use C-based "primitives", like the stuff from itertools and all the builtins for set/list/hash processing (thus avoiding the n^2 case in pure Python). And when memory is an issue (preallocating large data structures can be slow as well), iterators! (Eg. compare use of range() in newer Python with use of list(range())).
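A minimal sketch of the range() versus list(range()) point, assuming CPython 3. The variable names are made up for illustration, and sys.getsizeof only reports the container's own size, not the integers it refers to:

```python
import sys
from itertools import islice

n = 1_000_000

lazy = range(n)         # small fixed-size object, values produced on demand
eager = list(range(n))  # materializes a million references up front

lazy_size = sys.getsizeof(lazy)
eager_size = sys.getsizeof(eager)

# sum() runs its loop in C, so no Python-level loop is needed here.
total = sum(lazy)

# islice pulls only the first five items from a lazy generator,
# so the remaining 999,995 squares are never computed.
first_squares = list(islice((i * i for i in lazy), 5))

print(lazy_size, eager_size, total, first_squares)
```

The range object stays the same size no matter how large n gets, while the list grows linearly with it.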
I'm reasonably sure the PRNG being used in the python version came from numpy and was implemented in C (or other native code, not python). The problem was that the necessary control flow and varying parameters around it meant you had to call it once per value from python (and you had to generate a lot of values).
And if I recall correctly there was no allocation in the hot loop, with a single large array being initialized via numpy to store the values before hand. Certainly that's one of the first things I would think to fix.
I was strongly convinced at the time that there was no significant improvement left in python. With >99% of the time being spent in this one function, and no way to move the loop into native code given the primitives available from numpy. Admittedly I could have been wrong, and I'm not about to revisit the code now, since it has been years and it is no longer in use - so everything I'm saying is based off of years old memories.
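To illustrate the shape of that problem (not the original code, which isn't shown in this thread), here is a hypothetical sketch where each draw's distribution depends on the previous draw. The stdlib random module stands in for numpy's generator, and the exponential process is invented purely for illustration:

```python
import random

def dependent_samples(n, seed=0):
    """Draw n values where each draw's mean depends on the previous value."""
    rng = random.Random(seed)
    out = [0.0] * n  # preallocated once, no allocation in the hot loop
    prev = 1.0
    for i in range(n):
        # The rate parameter depends on the previous value, so the draws
        # cannot be requested as one batch: one native call per value.
        prev = rng.expovariate(1.0 / (1.0 + prev))
        out[i] = prev
    return out

samples = dependent_samples(10_000)
print(len(samples), min(samples) >= 0.0)
```

Each native call is cheap, but the interpreter overhead of the loop around it is paid once per value, which is the cost described above.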
Sure, numpy introduces its own set of restrictions. I was mostly referring to taking a different approach before turning to numpy, but it could very well be true.
In essence, doing what you did is the way to get performance out of Python when nothing else works.
> The problem was that the necessary control flow and varying parameters around it meant you had to call it once per value from python (and you had to generate a lot of values).
The code I've written and am still working on uses Eigen, which TensorFlow also uses for its matrix operations, so I'm not far off from these guys in terms of speed, if not ahead.
The code I've written can complete 1.7 million evaluations per core, per second, on older hardware, which is used to evaluate things up to 1e-6 accuracy, which is pretty neat for what I'm working on.
Because it is like saying you use a bash script to configure and launch a c++ application and saying it is a bash script. Python is not a high performance language, it isn't meant to be, and its strengths lie elsewhere. One of its great strengths is interop with c libs.
Your assertion was that numpy etc will be faster than something else despite being python:
> Try writing a matmul operation in C++ and profile it against the same thing done in Numpy/Pytorch/TensorFlow/Jax. You’ll be surprised.
No. When I write Tensorflow code I write Python. I don’t care what TF does under the hood just like I don’t care that Python itself might be implemented in C. Though I’ve got to say TF is quite ugly and not a good example of Python’s user friendliness. But that’s another topic.
That's a known and widely publicised trait of Python.
In the early days, the Python tutorial warned against building strings with "+" even though it works, because each "+" performed a new allocation and string copy.
What you were asked to do was use fast, optimized C-based primitives like "\n".join(list_of_strings) etc.
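A small sketch of the two approaches side by side. The function names are made up, and modern CPython can sometimes optimize repeated `+=` on strings in place, so the gap may be smaller than the old advice suggests:

```python
def build_with_plus(lines):
    """Concatenate with +=, which may allocate and copy on each step."""
    out = ""
    for i, line in enumerate(lines):
        if i:
            out += "\n"
        out += line
    return out

def build_with_join(lines):
    """Let str.join size the result once and copy each piece once, in C."""
    return "\n".join(lines)

lines = [f"row {i}" for i in range(1000)]
same = build_with_plus(lines) == build_with_join(lines)
print(same)
```

Both produce the same string; the difference is only in how many intermediate strings get created along the way.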
Basically, Python is an "ergonomic" language built in C. Saying how something is implemented in C at the lower level is pointless, because all of Python is.
Yes, doing loops over large data sets in Python is slow. Which is why it provides itertools (again, C-based functions) in stdlib.
And fortran. Which really doesn't matter that much as long as that doesn't leak to the users of numpy, and it doesn't really. The only issue is that it means if you're doing something that doesn't fit the APIs exposed by the native code (in a way where the hot loops are in native code) it's roughly as slow as normal python.
But it does for the argument of a language being fast, which is what we are talking about here. I don't think it is an appropriate argument to say "Python is fast, look at numpy", when the core pieces are written in C/Fortran. It is disingenuous, at least to me.
I like writing things in python. It honestly feels like cheating at times. Being able to reduce things down to a list comprehension feels like wizardry.
I like having things written in C/C++. Because like every deep magic, there's a cost associated with it.
> and at the end of the day LLVM compiles 30min and uses tens of GBs of RAM on average hardware
I mean, that's the initial build.
Here's my compile-edit-run cycle in https://ossia.io which is nearing 400kloc, with a free example of performance profiling, I haven't found anything like this whenever I had to profile python. It's not LLVM-sized of course, but it's not a small project either, maybe in the medium-low C++ project size: https://streamable.com/o8p22f ; pretty much a couple seconds at most from keystroke to result, for a complete DAW which links against Qt, FFMPEG, LLVM, Boost and a few others. Notice also how my IDE kindly informs me of memory leaks and other funsies.
Here's some additional tooling I'm developing - build times can be made as low as a few dozen milliseconds when one puts some work into making the correct API and using the tools correctly: https://www.youtube.com/watch?v=fMQvsqTDm3k
"10 compilers, IDEs, debuggers, package managers" what are you talking about? (Virtually) No one uses ten different tools to build one application. I don't even know of any C++-specific package managers, although I do know of language-specific package managers for... oh, right, most scripting languages. And an IDE includes a compiler and a debugger, that's what makes it an IDE instead of a text editor.
"and at the end of the day LLVM compiles 30min and uses tens of GBs of RAM on average hardware" sure, if you're compiling something enormous and bloated... I'm not sure why you think that's an argument against debloating?
>No one uses ten different tools to build one application.
I meant you have a lot of choices to make
Instead of having one strong standard which everyone uses, you have X of them which makes changing projects/companies harder, but for solid reason? I don't know.
>"and at the end of the day LLVM compiles 30min and uses tens of GBs of RAM on average hardware" sure, if you're compiling something enormous and bloated... I'm not sure why you think that's an argument against debloating?
I know that lines in repo aren't great way to compare those things, but
.NET Compiler Infrastructure:
20 587 028 lines of code in 17 440 files
LLVM:
45 673 398 lines of code in 116 784 files
The first one I built (restore+build) in 6mins and it used around 6-7GB of RAM
The second I'm not even trying because the last time I tried doing it on Windows it BSODed after using _whole_ ram (16GBs)
Compiling a large number of files on Windows is slow, no matter what language/compiler you use. It seems to be a problem with the program invocation, which takes "forever" on Windows. It's still fast for a human, but it's slow for a computer. Quite apt this comes up here ;-)
Source for claim: That's a problem we actually faced in the Windows CI at my old job. Our test suite invoked about 100k to 150k programs (our program plus a few 3rd party verification programs). In the Linux CI the whole thing ran reasonably fast, but the Windows CI took twice as long. I don't recall the exact numbers, but if Windows incurs a 50ms overhead per program call you're looking at roughly 1:20 (about one hour twenty minutes) more runtime at 100k invocations.
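The back-of-the-envelope figure can be checked with a quick sketch. The 50ms per-call overhead is the comment's own hypothetical, not a measured number:

```python
from datetime import timedelta

OVERHEAD_MS = 50  # hypothetical per-invocation overhead from the comment

def extra_runtime(invocations, overhead_ms=OVERHEAD_MS):
    """Total added wall time if every program launch costs overhead_ms."""
    return timedelta(milliseconds=invocations * overhead_ms)

low = extra_runtime(100_000)
high = extra_runtime(150_000)
print(low, high)  # 1:23:20 2:05:00
```

So the stated range of 100k to 150k invocations works out to somewhere between about 1h20 and a bit over 2h of pure launch overhead.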
Also I'm pretty sure I've built LLVM on 16GB memory. Took less than 10 minutes on an i7-2600. The number of files is a trade-off: You can combine a bunch of small files into a large file to reduce the build time. You can even write a tool that does that automatically on every compile (and keeps sane debug info). But now incremental builds take longer, because even if you change only one small file, the combined file needs to be rebuilt. That's a problem for virtually all compiled languages.
I can only guess, I am neither an LLVM nor an MSVC dev.
1. Compile times: If you have one file with 7000 LOC and change one function in that file, the rebuild is slower than if you had 7 files with 1000 LOC instead.
2. Maintainability: Instead of putting a lot of code into one file, you put the code in multiple files for better maintainability. IIRC LLVM was FOSS from the beginning, so making it easy for lots of people to make many small contributions is important. I guess .NET was conceived as being internal to MS, so fewer people overall, but newcomers probably were assigned to a team for onboarding and then contributing to the project as part of that team. In other words: At MS you can call up the person or team responsible for that 10000 LOC monstrosity; but if all you got is a bunch of names with e-mail addresses pulled from the commit log, you might be in for a bad time.
3. Generated code: I don't know if either commits generated code into the repository. That can skew these numbers as well.
4. Header files can be a wild card, as it depends on how they're written. Some people/projects just put the signatures in there and not too much detail, others put whole essays as docs for each {class, method, function, global} in there, making them huge.
For the record, by your stats .NET has 1180 LOC per file and LLVM 391 on average. That doesn't say a lot, the median would probably be better, or even a percentile graph. Broken down by type (header/definition vs. implementation). You might find that the distribution is similar and a few large outliers skew it (especially generated code). Or when looking at more big projects you might find that these two are outliers. I can't say anything definite, and from an engineering perspective I think neither is "suspicious" or even bad.
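The averages follow directly from the line and file counts quoted upthread. The second half of this sketch uses made-up file sizes to show why one generated file can make the mean misleading while the median barely moves:

```python
from statistics import mean, median

# (lines of code, files) as quoted earlier in the thread
repos = {
    ".NET Compiler Infrastructure": (20_587_028, 17_440),
    "LLVM": (45_673_398, 116_784),
}
loc_per_file = {name: loc / files for name, (loc, files) in repos.items()}
for name, value in loc_per_file.items():
    print(f"{name}: {value:.0f} LOC per file")

# Hypothetical file sizes: four ordinary files plus one generated outlier.
made_up_sizes = [100, 200, 300, 400, 50_000]
print(mean(made_up_sizes), median(made_up_sizes))
```

With the outlier included the mean is far above any ordinary file, which is why a median or percentile breakdown would say more about the two repositories.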
My gut feeling says 700 would be a number I'd expect for a large project.
I assume the parent was talking about the fragmentation in the ecosystem (fair point, especially regarding package management landscape and build tooling), but it's unclear.
> Is performance inversely proportional to dev experience?
No. I feel there is great developer experience in many high performance languages: Java, C#, Rust, Go, etc.
In fact, for my personal tastes, I find these languages more ergonomic than many popular dynamic languages. Though I will admit that one thing that I find ergonomic is a language that lifts the performance headroom above my head so that I'm not constantly bumping my head on the ceiling.
TCC is a fast compiler. So fast, that at one time, one could use it to boot Linux from source code! But there's a downside: the code it produces is slow. There's no optimization done. None. So the trade-off seems to be: compile fast but slow program, or compile slow but fast program.
The trade-off is more of a gradient: e.g. PGO allows an instrumented binary to collect runtime statistics and then use those to optimize hot paths for future build cycles.
I wish product designers took performance into consideration when they designed applications. Engineers can optimize until their fingers fall off, but if the application isn't designed with efficiency in mind (and willing to make trade-offs in order to achieve that), we'll probably just end up right back in the same place.
And a product which is designed inefficiently where the engineer has figured out clever ways to get it to be more performant is most likely a product that is more complicated under the hood than it would be if performance were a design goal in the first place.
Rather than bare C something like C++, Rust, or even Haskell would be better. C isn't the fastest, especially not with normal code. C++ templates get a bad rep, but if you want to go fast they are extremely hard to beat.
Also those languages show you don't actually have to give up modern features or even that much convenience in order to get blazing fast speeds.
At all my recent jobs, I grow frustrated with how slow running a single unit test is locally on a codebase. We are talking 5+ seconds for even the most trivial of trivial unit tests (say, purely functional arithmetic unit test).
And this is even with dynamic languages like Python (you see pytest reporting how your unit test completed in 0.00s, and wall time is 7s).
And then I get grumpy if they don't let me go and fix it because I am the only one who is that annoyed with this :D
How on earth are you getting 5 seconds for simple tests? Simple tests should be running in 8ms, and those are my 2015 numbers that I've been too lazy to update.
Have you worked on a recent idiomatic development setup (dockerised local development, top level imports of everything and plenty of setup at the top level too, people unfamiliar with how to manage .pyc files so they simply disable them...)?
Common libraries like requests or sqlalchemy take 300-500ms to import (eg. try `time python3 -c 'import requests'` and contrast just `time python3 -c ''` which is python startup overhead).
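The same comparison can be scripted. This is a rough sketch that uses the stdlib `json` module as a stand-in, since `requests` or `sqlalchemy` may not be installed; the helper name is made up and the numbers will vary by machine:

```python
import subprocess
import sys
import time

def wall_time(code):
    """Wall-clock seconds to start a fresh interpreter and run `code`."""
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", code], check=True)
    return time.perf_counter() - start

baseline = wall_time("")               # interpreter startup only
with_import = wall_time("import json")  # startup plus one import

print(f"startup: {baseline * 1000:.0f} ms")
print(f"startup + import: {with_import * 1000:.0f} ms")
```

For a per-module breakdown, `python3 -X importtime -c 'import requests'` (available since Python 3.7) prints the time spent in each imported module.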
As I said, tests run in sub 10ms, but from issuing pytest to completion it's usually 5-15s.
Ah, I see, so the setup time is very slow. I don't work in python much but I've worked in a few other languages with slow startup, and amortization is your friend. It's hard though when you have a small module with 'only' 300 tests and your test is 6ms of code that works out to 40ms once setup and teardown are included. I haven't had many opportunities to have the "well maybe you should be making bigger modules" conversation but I am ready for that moment to arise.
This is usually the point at which I pull out a 'watch' implementation, since the 5 seconds it's going to take me to switch windows and hit 'up' the right number of times counts too, if we're comparing apples to apples.
That said, one of the last times I had a unit testing mentor, I walked into a project that ran 3800 tests in about 7 seconds, and then started poking around trying to figure out who was materially responsible. (He didn't know much more than me from an implementation standpoint, but boy was he good at selling people on test quality.) If that had been 20 seconds it would have still been lovely, but it wouldn't have grabbed my attention quite as much.
While I'll take a bite at this, I think it's also fair to say how poorly portable C is. Can a mobile or web engineer quickly take some C code and use it in their stack somehow? I would guess not. While it's indeed an important lesson to see the speed of some of these 'close to the metal' languages, the question of how practical they are to use is a different question.
There is a class of C code that can be made extremely portable: pure computations. This allows you to write self contained code with zero dependencies, and if you're willing to give up on SIMD you can stick to fully conforming C99.
It's not applicable for everything, but we do have some niches where it comes in handy: cryptographic libraries (I've written one), parsers and encoders of all kinds, compilers…
For instance can a mobile or web engineer quickly take TweetNaCl or Monocypher and use it in their stack? Yes. They may need to write some bindings themselves, but if they can run C code at all it's fairly trivial.
This was about my experience switching from webpack to ESBuild for Javascript. Why do incremental builds if rebuilding the whole thing takes just 2s (as opposed to 90+ with webpack).
They are. They've just chosen to spend all their speed gains on more optimization passes and static analysis, to produce ever faster outputs than to produce an output faster.
Their fundamental model is one translation unit at a time, while developers decided that writing all library code in headers is a good idea. Which makes them parse and DCE literally kilometers of mostly irrelevant code again and again. You’re not wrong, but it’s not the complete point. C++ development is slow as a whole, and compilers/standards do nothing to fix that. It’s a kind of F1 engine in a tractor situation.
> while developers decided that writing all library code in headers is a good idea.
It wasn't developers who designed C++'s template model which requires generic code to be fully defined in header files.
Inheriting C's textual include file based "module" system and then bolting compile-time specialized generics is a choice the C++ committee made, not C++ users. It was probably the right choice given C++'s many very difficult constraints, but that's what directly leads to huge compile times, not dumb C++ users.
Please don't write programs in bare C. Use Go if you're looking for something very simple and fast-enough for most uses; it's even memory safe as long as you avoid shared-state concurrency.
Unqualified "fast enough" is pretty much exactly the problem being pointed out. Most developers have no idea what "fast" is let alone "fast enough". If they were taught to benchmark with a lower-level language and see what adding different abstractions costs, that would help a ton.
I would personally suggest C++ though because there is such a huge amount of knowledge around performance and abstraction in that community - wonderful conference talks and blog posts to learn from.
Go comes from a different school of compiler design where the code generation is decent in most cases, but struggles with calculations and more specific patterns. Delphi is a similar compiler. Looking at benchmarks, the performance is only a few times worse than optimized C. That's on par with the most optimized JITed languages like Java, while being overall a much simpler compiler. I feel it is fair to say 'good enough' in this situation.
It's not an "unqualified" claim, Go really is fast enough compared to the likes of Python and Ruby. I'm not saying that rewriting a Go program in a faster language (C/C++/Rust) can't sometimes be effective, but that's due to special circumstances - it's not something that generalizes to any and all programs.
You've obviously been burned by null pointers (probably not just once). And you think they are a problem, and you're right. And you think they are a mistake, and you could be right about that, too.
But they're not the only problem. Writing async network servers can be a problem, too. Go helps a lot with that problem. If for your situation it helps more with that than it hurts with nulls, then it can be a rational choice.
And, don't assume that go must be a bad choice for all programmers, in all situations. It's not.
Nothing wrong with any of these languages, especially C. It's been around since the early 70s and is not going anywhere. There's a very good reason it (and to an extent C++) is still the default language for doing a lot of things, since everyone understands it.
C and C++ both have excellent library support, perhaps the best interop of any language out there and platform support that cannot be beat.
That said, they're also challenging to use for the "average" (median) developer who'd end up creating code that is error-prone and would probably have memory leaks sooner or later.
Thus, unless you have a good reason (of which, admittedly, there are plenty) to use C or C++, something that holds your hand a bit more might be a reasonable choice for many people out there.
Go is a decent choice, because of a fairly shallow learning curve and not too much complexity, while having good library support and decent platform support.
Rust is a safer choice, but at the expense of needing to spend a non-insignificant amount of time learning the language, even though the compiler is pretty good at being helpful too.
> That said, they're also challenging to use for the "average" (median) developer who'd end up creating code that is error-prone and would probably have memory leaks sooner or later.
Many of the most highly credentialed, veteran C developers have said they can't write secure C code. Food for thought.
> Go is a decent choice, because of a fairly shallow learning curve and not too much complexity, while having good library support and decent platform support. Rust is a safer choice, but at the expense of needing to spend a non-insignificant amount of time learning the language, even though the compiler is pretty good at being helpful too.
Go doesn't have the strongest static guarantees, but it does provide a decent amount of static guarantees while also keeping the iteration cycle to a minimum. Languages like Rust have significantly longer iteration cycles, such that you can very likely ship sooner with Go at similar quality levels (time savings can go into catching bugs, including bugs which Rust's static analysis can't catch, such as race conditions). Moreover, I've had a few experiences where I got so in-the-weeds trying to pacify Rust's borrow-checker that I overlooked relatively straightforward bugs that I almost certainly would've caught in a less-tedious language--sometimes static analysis can be distracting and in that respect, harm quality (I don't think this is a big effect, but it's not something I've seen much discussion about).
There is unsecure code hidden in every project that uses any programming language ;)
I get what you're saying here, you're specifically talking about security vulnerabilities from memory related errors. I honestly wonder how many of these security vulnerabilities are truly issues that never would have come up in a more "secure" language like Java, or if the vulnerabilities would have just surfaced in a different manner.
In other words, we're constantly told C and C++ are unsafe languages, they should never be used and blah blah blah. How much of this is because of the fact that C has been around since the 1970s, so it's had a lot more time to rack up large apps with security vulnerabilities, whereas most of the new recommended languages to replace C and C++ have been around since the late 90s. In another 20 years will we be saying the same thing about java that people say about C and C++? And will we be telling people to switch to the latest and greatest because Java is "unsafe"? Are these errors due to the language, or is it because we will always have attackers looking for vulnerabilities that will always exist because programmers are fallible and write buggy code?
> In another 20 years will we be saying the same thing about java that people say about C and C++? And will we be telling people to switch to the latest and greatest because Java is "unsafe"?
As long as the vulnerability types that cause trouble in language B are a superset of those that cause trouble in language C, it makes sense to recommend moving from B to C for safety reasons.
This is true even if there is a language A that is even worse and, in the absence of language C, we recommended moving from A to B. Code written in A will be worse in expectation than code written in B, which in turn will be worse than code written in C.
> I honestly wonder how many of these security vulnerabilities are truly issues that never would have come up in a more "secure" language like Java, or if the vulnerabilities would have just surfaced in a different manner.
Memory safety vulnerabilities basically boil down to the following causes: null pointer dereferences, use-after-free (/dangling stack pointers), uninitialized memory, array out-of-bounds, and type confusion. Now, strictly speaking, in a memory-safe language, you're guaranteed not to get uncontrollable behavior in any of these cases, but if the result is a thrown exception or panic or similar, your program is still crashing. And I think for your purposes, such a crash isn't meaningfully better than C's well-things-are-going-haywire.
That said, use-after-free and uninitialized memory vulnerabilities are completely impossible in a GC language--you're not going to even get a controlled crash. In a language like Rust or even C++ in some cases, these issues are effectively mitigated to the point where I'm able to trust that it's not the cause of anything I'm seeing. Null-pointer dereferences are not effectively mitigated against in Java, but in Rust (which has nullability as part of the type), it does end up being effectively mitigated. This does leave out-of-bounds and type confusion as two errors that are not effectively mitigated by even safe languages, although they might end up being safer in practice.
It depends on what you mean by mitigated. Java mitigates null pointers by deterministically raising an exception (as well as out of range situations), but indeed it doesn’t handle them at compile time (though the latter can’t even be solved in the general case, and only with dependent types)
> There is unsecure code hidden in every project that uses any programming language ;)
Security isn't a binary :) Two insecure code bases can have different degrees of insecurity.
> I honestly wonder how many of these security vulnerabilities are truly issues that never would have come up in a more "secure" language like Java, or if the vulnerabilities would have just surfaced in a different manner.
I don't know how memory safety vulns could manifest differently in Java or Rust.
> In other words, we're constantly told C and C++ are unsafe languages they should never be used and blah blah blah. How much of this is because of the fact that C has been around since the 1970s, so its had a lot more time to rack up large apps with security vulnerabilities
That doesn't address the veteran C programmers who say they can't reliably write secure C code (that's new code, not 50 year old code).
> Are these errors due to the language, or is it because we will always have attackers looking for vulnerabilities that will always exist because programmers are fallible and write buggy code?
A memory safe language can't have memory safety vulnerabilities (of course, most "memory safe" languages have the ability to opt out of memory safety for certain small sections, and maybe 0.5% of code written in these languages is memory-unsafe, but that's still a whole lot less than the ~100% of C and C++ code).
Of course, there are other classes of errors that Java, Rust, Go, etc can't preclude with much more efficacy than C or C++, but eliminating entire classes of vulnerabilities is a pretty compelling reason to avoid C and C++ for a whole lot of code if one can help it (and increasingly one can help it).
First of all, you’re comparing “most PHP and JS programmers” with veteran C programmers, and secondly most PHP and JS programmers can write code which is secure against memory-based exploits.
It is easier to just pick an existing library and deal with security flaws, than trying to ramp up an ecosystem from scratch, unless one has the backing of a multinational pumping up development.
Yes. For some reason programming culture repeatedly fails to realise that if you want to group languages into two buckets by performance with one being "like C" and the other being "like Python" then all the languages you list (except maybe JS) belong in the "like C" bucket.
I mean, he just explained that after rewriting his program in Dart, it was fast enough? That's not really the point here.
On the other hand, I tried writing a Wren interpreter in Go and it was considerably slower than the C version. Even programming languages that are usually pretty fast aren't always fast, and interpreter inner loops are a weak spot for Go.
> I mean, he just explained that after rewriting his program in Dart, it was fast enough?
Yes, and that makes his C advocacy even less sensible. Dart is a perfectly fine language, even though it seems to be a bit underused compared to others.
I didn't advocate that anyone ship production code written in C.
I advocated that people write programs in C and run them to see how fast executables can startup and run.
(Dart isn't great for that because while its runtime performance is pretty fantastic, it does still take a hit on startup because it's a VM with a fairly large core library and runtime system.)
Spending "a little time writing some programs in C" is not the same as advocating that people write most of their code in C, or that you use it in production.
Maybe try reading Crafting Interpreters, half of which is in Java and half in C.
I upgraded a desktop machine the last time I visited my family. It was a Windows 7 computer that was at least 10 years old with 4GB of ram. They wanted to use it online for basic web browsing, so I thought I'd install Windows 10 for security reasons and drop in a modern SSD to upgrade the old 7200rpm drive to make it more snappy.
Well, it felt slower after the "upgrade". Clicking the start menu and opening something like the Downloads or Documents folder was basically instant before. Now, with Windows 10 and the new SSD there was a noticeable delay when opening and browsing folders.
It really made me wonder how it would be running something like Windows 98 and websites of the past on modern hardware.
I wonder if you'd have any more luck with that hardware putting Ubuntu Mate on it. For basic web browsing, it probably wouldn't matter much to your family whether it's running Windows or Linux.
Problem with Ubuntu is it doesn’t auto update and it’s very hard to get it to do that. Not sure it’s even possible to auto update major releases as well.
Every time I have installed Ubuntu for someone, I have come back years later and it’s still on the same version.
I am not sure about major release upgrades. But if you are on an LTS release, this should cover it for five years. And as much as I dislike snaps, they do auto updates too, so in 22.04 Firefox at least keeps up-to-date too.
Throw in more RAM and Windows 10 will likely feel snappier than Windows 7 did.
It's probable the old Windows 7 install was 32-bit while your fresh install of 10 would have defaulted to 64-bit. That combined with 10's naturally higher memory requirements means the system has less overhead to work with.
recently I've seen new laptops being shipped with 4GB. possibly with a slightly lighter (but not fully debloated) version of 10 (Home? Starter? Edu?)
I'm not sure if this is because Windows memory usage is a lot more efficient now, or if the newer processors' performances can cancel out the RAM capacity bottleneck, or if PC4-25600 + NVMe pagefiles are simply fast enough, or if manufacturers are spreading thinly during the chip shortage. but it's certainly an ongoing trend
Mother-in-law bought a machine with 4GB of RAM, which was fine before Windows 10. Now it spends all day doing page/sysfile swap from its mechanical hard drive. Basically unusable.
So here in my pocket is an 8GB stick of DDR3 sodimm for later.
32-bit PAE has been supported since Windows XP and initially allowed more than 4GB of RAM to be used, but driver issues made Microsoft put a soft cap at 4GB under this mode[0]. But 32-bit Win7 with PAE would surely have been able to use all of those 4GB fine.
Try Win-R and type "notepad", at a reasonably fast programmer's pace. It consistently loses "no" for me, sometimes more if it's feeling particularly slow.
This should involve absolutely zero disk reads or anything of the sort, it's a window that runs a command. And it used to work reliably in past years. It feels like keyboard input simply isn't buffered like it used to be. Calculator is even worse, as it loses input if you start typing the formula too soon. It used to be very easy for casual calculations; now I have to wait for the computer.
In a similar vein I installed Ubuntu on an older laptop that had been running Windows 10. I was shocked at how fast it was compared to Windows 10, it was night and day.
This is part of it - many things are "fast enough" that where you used to have caches that would display nearly instantly, now you don't have those - it reads from disk each time it needs to show the folder, etc.
This is very visible in any app that no longer maintains "local state" but instead is just a web browser to some online state (think: Electron, teams, etc). Disconnect the web or slow it down and it all goes to hell.
That's interesting, I cloned a Win10 installation on a HDD to a sata SSD a year or two back and the speed difference was considerable. Especially something like Atom that took minutes to open before was ready to go in like 10 seconds afterwards.
Somewhere around IIRC Win8 Microsoft must have gotten really lax about minimizing disk access. Windows started being slow as molasses on an HDD, even for stuff like opening the start menu.
This hurts performance a ton on SSDs, too, it's just less noticeable. Something that should happen so fast you can hardly measure how long it takes, takes... just long enough to notice, which may amount to 100x as long as it should take, but 100x a small number is still pretty small.
Yeah the change from a 7200 HDD to an SSD for those 10 year old machines provides a very considerable improvement. It goes from "unusable" to "moderate" performance for general web browsing and business duties.
I'm talking about Windows 10 on 4G C2Q or Phenom/Phenom II machines - they aren't fast but they're very usable with a SSD and GPU in place.
You're comparing 10 to 10, so of course an SSD will only help in that situation.
But if any parts of 10 are sufficiently badly coded compared to 7, that will overcome the drive. And some parts definitely are, especially in the start menu code.
10 years of malware definition updates. 10 years of countless security additions. Every operation needs to be checked for correction, memory safety etc.
I hope one day latency in general will be "back to normal".
I still remember how fast console based computing, an old gameboy or a 90's macintosh would be - click a button and stuff would show up instantly.
There was a tactility present with computers that's gone today.
Today everything feels sluggish - just writing this comment on my $3000 Macbook Pro and i can feel the latency, sometimes there's even small pauses. A little when i write stuff, a lot when i drag windows.
Hopefully the focus on 100hz+ screens in tech in general will put more focus on latency from click to screen print - now when resolution and interface graphics in general are close to biological limits.
I'm on an M1 Air (cheapest base model), and I use it largely for writing (also dev but I get that that's not your question).
- For native M1 apps like Pages, Sublime, or Highland there's no lag at all. For example, with Highland 2 from double-clicking a file to editing it is less than a second and there's no lag during use even with a 49,000 word book manuscript open.
- For x86 apps like the not-quite-latest Office there's a couple of seconds at first launch (for that session) whilst Rosetta does its x86 translation work, but after that it launches without lag for the remainder of that session and it stays snappy in use (snappy for Word that is).
- Native VS Code goes from launch to editing in under two seconds and never lags, even with something like side-by-side Markdown preview going.
- If you're using Vellum for publishing it's about 1.5 seconds from double-clicking a file to editing it.
That's very good to hear, I've been looking at the MacBook Air also because they're pretty much the kings when it comes to battery life for a handbag-sized laptop. I think the bigger MacBooks have slightly better battery life, but you can't really fit those in a smaller bag, you do kinda need a backpack or a laptop-specific bag for them.
> I've been looking at MacBook Air also because they're pretty much the kings when it comes to battery life for a handbag sized laptop.
Battery life is, indeed, impressive.
Last night I spent around 5 hours doing C# dev in VS Mac, with multiple projects being built every few minutes, cross-platform binaries for Intel Mac, Windows, and Linux being produced every half hour or so, plus Highland 2, Word 2016, and Vellum. With all that it used 28% battery across that 5 hours (and never got warm). On full brightness too (for my sins).
I know the question isn't about dev, but writing uses less resources and gives even better battery life so 18 hours (for example) is definitely possible.
The only issue I have is the keyboard. Far better than the 'broken' ones of a few years ago but I really wish they'd go for thicker machines and increase the travel. I've just got rid of my last ThinkPad and it's the one thing I miss.
Oh, and there is no longer a hotkey to control the backlight brightness; it's automatic. Which genuinely works perfectly except that it doesn't come on for your very first sign in at boot-up, so entering your password then can be tricky without ambient light (though after that you can use the fingerprint reader). It's a really strange UX flaw. Not related to your question, I know, but you don't say whether you're already on a Mac or switching so I wanted to be honest about this as it is really annoying but rarely mentioned.
I have an M1 Air that I'm typing on right now and have not had any sluggishness concerns besides when switching between Spaces. Even that is more of a visual stutter instead of actually lagging to the point the animation takes longer than usual. This is the first thin & light computer I've owned that I'm 100% happy with its performance.
Weird. I don't use Spaces (this is the multiple desktops thing, right?) but I've just tried it and it's not laggy at all for me. I turn on the reduce motion thing, so it fades between them rather than swiping, but neither feel laggy.
(I'm on an M1 Air and I think the performance is great)
Most flagship Android phones are >60hz and have been for a few years. Flagship iPhones and iPads are >60hz. Very nearly every gaming laptop is >60hz. Many new TVs are >60hz with inputs to match.
My guess is that few people have stopped to compare them. I've never knowingly seen a 100+hz screen in person, so I stopped by a local store. Sure enough, I could tell that the motion was smoother. Bought 2. After using those, I can feel my older monitors that I'm using to write this are choppy.
But do you notice the smoothness in the day to day basis or have you, in a way, crippled yourself, because now the majority of monitors feel choppy to you?
Sounds a bit like the, 'Never meet your heroes', thingy.
I 100% notice it but interestingly it doesn’t affect me on my laptop/desktop much since I use a mouse and scrolling is already not smooth. While mobile has smooth scrolling and a lot more animations/swipes.
Do you think that besides gaming there is really any need to move to higher than 60Hz on desktops and laptops?
My phone (POCO X3 PRO) allowed me to turn on 120Hz but when I do I don't notice any change except if I really look at it, like scrolling up and down very quickly while looking behind the phone I notice a difference, but otherwise I don't notice it, so I just have it turned off, should give more battery life.
True, it's probably just bleeding edge, but I've noticed several flagship phones have 90Hz, and the new iPad Pros have up to 120Hz "smooth scrolling", so it seems something will be happening x years down the line.
For me, there is far more latency on typical operations, but far less waiting for longer intensive operations like opening a program/tab or saving a file (bloat aside, some are guilty here).
I'd also prefer the sluggishness gone if I had my choice between the two.
It's not only a matter of 750ms instead of 200ms. I'm astonished every time I open some tool like Visual Studio, SAP Power Designer, or Libre Office that can stay for the most part of a minute on its loading screen.
What do those tools even do for that long? They can read enough data from the disk to overflow my computer's main memory a few times during it.
I heard optimization described this way: Sure, you think you need to tune the engine, but really, the first thing you need to do is get the clowns out of the car.
I remember a video of a guy running an old version of Visual C++ on an equally old version of Windows, in a VM on modern hardware, to try Windows development "the old way". It took about one frame to launch. One. Frame.
By the way, Apple isn't much better. Xcode takes around 15 seconds to launch on an M1 Max.
Not only does Visual Studio start up instantly in an older version of Windows running in a VM, debugger values update instantly there as well, something that Visual Studio can no longer do.
I really liked Win 2000 because of this feeling of speed. Most programs would simply "open" when you clicked their icon. There wouldn't be a loading screen. I remember getting frustrated because I could not look at the pretty splash screen that Excel had added because it would flash and disappear in milliseconds. And this was on hardware of that time.
Just based on memory, Visual C++ 6 was written using the good old Win32 API, which is just plain C code. Without access to the source code, I can assume that the object-oriented craze and XML fad had not corrupted that codebase. Superb software.
Visual C++ 7 was rewritten to use another SDK, likely based on .Net, and it was noticeably slower. The problem, as I see it, is people don't understand the cost of abstractions and intermediate layers, and add them gratuitously. This has been a trend ever since.
> Xcode takes around 15 seconds to launch on an M1 Max
Not really related to launch time but it’s hilarious how much faster Xcode is when working with Objective-C compared to Swift. I understand why, but it’s still jarring
Of video. Which probably was 30 fps. I mean, the splash screen just blinked for a barely noticeable split second before the main window appeared. You double click the shortcut, and it's already done launching before you realize anything. That's how fast modern computers are.
(actually, some things on the M1 are fast enough that I'm now getting annoyed at networking taking what feels like ages)
Why would you assume video is at 30fps? Geographic location? People not in the US (and a handful of other countries) would assume video framerate of 25fps.
Does the refresh rate of a computer monitor get referred to as frames? Usually, it's just the frequency, in units like 120Hz. Sorry for the conversation break, but I've just never heard app start-up times described with a framerate reference. It was just an unusual enough thing that I let my brain wander on it longer than necessary.
Oh ffs. First off, I'm not from the US. I've been there for less than a month combined. Secondly, if you do want to nitpick, at least do some research first. The video in question is 60 or 30 fps depending on the quality setting.
$ yt-dlp -F https://www.youtube.com/watch?v=j_4iTovYJtc
[youtube] j_4iTovYJtc: Downloading webpage
[youtube] j_4iTovYJtc: Downloading android player API JSON
[youtube] j_4iTovYJtc: Downloading player df5197e2
[info] Available formats for j_4iTovYJtc:
ID EXT RESOLUTION FPS │ FILESIZE TBR PROTO │ VCODEC VBR ACODEC ABR ASR MORE INFO
─────────────────────────────────────────────────────────────────────────────────────────────────────────────
sb2 mhtml 48x27 │ mhtml │ images storyboard
sb1 mhtml 80x45 │ mhtml │ images storyboard
sb0 mhtml 160x90 │ mhtml │ images storyboard
139 m4a audio only │ 46.85MiB 48k https │ audio only mp4a.40.5 48k 22050Hz low, m4a_dash
249 webm audio only │ 49.06MiB 51k https │ audio only opus 51k 48000Hz low, webm_dash
250 webm audio only │ 63.84MiB 66k https │ audio only opus 66k 48000Hz low, webm_dash
140 m4a audio only │ 124.33MiB 129k https │ audio only mp4a.40.2 129k 44100Hz medium, m4a_dash
251 webm audio only │ 125.02MiB 130k https │ audio only opus 130k 48000Hz medium, webm_dash
17 3gp 176x144 8 │ 56.70MiB 59k https │ mp4v.20.3 59k mp4a.40.2 0k 22050Hz 144p
160 mp4 256x144 30 │ 37.86MiB 39k https │ avc1.4d400c 39k video only 144p, mp4_dash
278 webm 256x144 30 │ 42.59MiB 44k https │ vp9 44k video only 144p, webm_dash
133 mp4 426x240 30 │ 84.31MiB 87k https │ avc1.4d4015 87k video only 240p, mp4_dash
242 webm 426x240 30 │ 70.03MiB 72k https │ vp9 72k video only 240p, webm_dash
134 mp4 640x360 30 │ 167.27MiB 174k https │ avc1.4d401e 174k video only 360p, mp4_dash
18 mp4 640x360 30 │ 352.24MiB 366k https │ avc1.42001E 366k mp4a.40.2 0k 44100Hz 360p
243 webm 640x360 30 │ 134.68MiB 140k https │ vp9 140k video only 360p, webm_dash
135 mp4 854x480 30 │ 294.98MiB 307k https │ avc1.4d401f 307k video only 480p, mp4_dash
244 webm 854x480 30 │ 233.37MiB 243k https │ vp9 243k video only 480p, webm_dash
136 mp4 1280x720 30 │ 653.31MiB 680k https │ avc1.4d401f 680k video only 720p, mp4_dash
22 mp4 1280x720 30 │ ~795.07MiB 808k https │ avc1.64001F 808k mp4a.40.2 0k 44100Hz 720p
247 webm 1280x720 30 │ 548.72MiB 571k https │ vp9 571k video only 720p, webm_dash
298 mp4 1280x720 60 │ 817.18MiB 850k https │ avc1.4d4020 850k video only 720p60, mp4_dash
302 webm 1280x720 60 │ 651.39MiB 678k https │ vp9 678k video only 720p60, webm_dash
And the units? Hz and FPS are generally interchangeable but FPS is more often used as a measure of how fast something renders while Hz is more often used for monitor refresh rates (a holdover from CRTs I guess).
Exactly. Users are subsidizing the software provider with CPU cycles and employee time.
Assume it costs $800 for an engineer-day. Assume your software has 10,000 daily users and that the wasted time cost is 20 seconds (assume this is actual wasted time when an employee is actively waiting and not completing some other task). Assume the employees using the software earn on average 1/8 of what the engineer makes. It would take less than 4 days to make up for the employee's time. That $800 would save about $80,000 per year.
Obviously, this is a contrived example, but I think it's a conservative one. I'm overpaying the engineer (on average) and probably under-estimating time wasted and user cost.
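For anyone who wants to rerun the figures above with their own inputs, here is a rough sketch. It is not the commenter's actual working: the function name and the two modelling choices (wages spread evenly over a 24-hour day, savings counted 365 days a year) are assumptions picked because they land close to the quoted "less than 4 days" and "about $80,000".

```python
# Back-of-the-envelope cost of users waiting on slow software.
# Assumptions (not stated in the comment): a day's pay is spread over
# 24 hours, each user loses the 20 seconds once per day, and the
# saving accrues 365 days a year.
SECONDS_PER_DAY = 24 * 60 * 60


def daily_waste_cost(users, wasted_seconds, user_day_rate,
                     paid_seconds_per_day=SECONDS_PER_DAY):
    """Money lost per day to all users waiting, at the users' pay rate."""
    wasted_days = users * wasted_seconds / paid_seconds_per_day
    return wasted_days * user_day_rate


engineer_day = 800.0             # cost of one engineer-day
user_day = engineer_day / 8      # users earn 1/8 of the engineer

daily = daily_waste_cost(10_000, 20, user_day)
payback_days = engineer_day / daily   # days until one engineer-day is repaid
yearly_saving = daily * 365

print(f"waste per day: ${daily:,.0f}")
print(f"payback: {payback_days:.1f} days")
print(f"saving per year: ${yearly_saving:,.0f}")
```

Passing `paid_seconds_per_day=8 * 60 * 60` instead (an 8-hour paid day) makes the daily waste three times larger, which fits the comment's point that the example is on the conservative side.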
Servers are expensive, too. Humans waiting on servers to process something is even more expensive. No software runs in a vacuum; someone is waiting on it somewhere.
Adding more servers doesn't generally make things faster (latency). It only raises capacity (bandwidth). It does, however, generally cost quite a bit on development. Just about the only thing worse than designing a complex system is designing a complex distributed system.
If you don't want to take the advice of running the numbers that's up to you.
E.g. if end user latency is 10ms (and it's not voip or VR or something) then that's fast enough. Doesn't matter if it's optimizable to 10 us.
If this is code running on your million CPU farm 24/7, then yeah. But always run the numbers first.
Like I said, the vast majority of code optimization opportunities are not worth taking. Some are, but only after running the numbers.
On the flip side optimizing for human time is almost always worth it, be it end users or other developers.
But run the numbers for your company. How much does a CPU core cost per hour of its lifetime? Your developers cost maybe $100, but maybe $1000 in opportunity cost.
Depending on what you do a server may cost you as much as one day of developer opportunity time. And then you have the server for years. (Subject to electricity)
Latency and throughput may be better solved by adding machines.
> Like I said, the vast majority of code optimization opportunities are not worth taking. Some are, but only after running the numbers.
Casey Muratori said it best: there are 3 philosophies of optimisation. You're talking about the first: actual optimisation where you measure and decide what to tackle. It's rarely used, and with good reason.
The second philosophy however is very different: it's non-pessimisation. That is, avoid having the CPU do useless work all the time. That one should be applied on a fairly systematic basis, and it's not. To apply it in practice you need to have an idea of how much time your algorithm requires. Count how many bytes are processed, how many operations are made… this should give a nice upper bound on performance. If you're within an order of magnitude of this theoretical maximum, you're probably good. Otherwise you probably missed something.
The third philosophy is fake optimisation: heuristics misapplied out of context. This one should never be used, but is more frequent than we care to admit.
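The byte-counting check described for the second philosophy can be sketched roughly like this. The function names and the 10 GB/s bandwidth figure are made up for illustration; plug in whatever your hardware can actually sustain.

```python
# Rough non-pessimisation check: compare a measured runtime against the
# time the hardware would need just to move the bytes involved.
# ASSUMED_BYTES_PER_SECOND is a hypothetical figure, not a measurement.
ASSUMED_BYTES_PER_SECOND = 10_000_000_000  # 10 GB/s, assumption


def lower_bound_seconds(bytes_processed,
                        bytes_per_second=ASSUMED_BYTES_PER_SECOND):
    """Best-case time if the program did nothing but stream the data."""
    return bytes_processed / bytes_per_second


def slowdown_factor(measured_seconds, bytes_processed,
                    bytes_per_second=ASSUMED_BYTES_PER_SECOND):
    """How many times slower than the best case the program actually ran."""
    return measured_seconds / lower_bound_seconds(bytes_processed,
                                                  bytes_per_second)


def probably_fine(measured_seconds, bytes_processed,
                  bytes_per_second=ASSUMED_BYTES_PER_SECOND):
    """True when the runtime is within one order of magnitude of the bound."""
    return slowdown_factor(measured_seconds, bytes_processed,
                           bytes_per_second) <= 10


one_gb = 1_000_000_000
print(probably_fine(0.75, one_gb))   # 0.75 s against a 0.1 s bound
print(probably_fine(30.0, one_gb))   # 30 s against a 0.1 s bound
```

The same shape works if you count operations instead of bytes; the point is only to have some number to compare the measurement against.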
> avoid having the CPU do useless work all the time
It's not worth an engineer spending 1h a year even investigating this, if it's less than 20 CPU cores doing useless work.
The break even for putting someone full time on this is if you can expect them to save about forty thousand CPU cores.
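Those two figures hang together if one engineer-hour costs about as much as 20 cores running for a year and "full time" means roughly 2,000 hours a year. The 2,000 is an assumption here, as is the function name; a sketch for plugging in your own ratio:

```python
# Break-even point for spending engineer time on saving CPU cores.
CORE_YEARS_PER_ENGINEER_HOUR = 20   # ratio implied by the comment
FULL_TIME_HOURS_PER_YEAR = 2_000    # assumption: ~50 weeks x 40 hours


def break_even_cores(engineer_hours_per_year,
                     ratio=CORE_YEARS_PER_ENGINEER_HOUR):
    """Cores that must be saved, year after year, to pay for the hours."""
    return engineer_hours_per_year * ratio


print(break_even_cores(1))                          # one hour a year
print(break_even_cores(FULL_TIME_HOURS_PER_YEAR))   # one full-time engineer
```

Change `ratio` to whatever your own hardware and salary costs work out to; the conclusion is very sensitive to it.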
YMMV. Maybe you're a bank who has to have everything under physical control, and you are out of DC floor space, power budget, or physical machines.
There are other cases too. Maybe something is inherently serial, and the freshness of a pipeline's output has business value. (e.g. weather predictions for tomorrow are useless the day after tomorrow)
But if you're saying that this second way of optimizing is that things should be fast for its own sake, then you are not adding maximum value to the business, or the mission.
Performance is an instrumental goal of an effort. It's not the ultimate goal, and should not be confused for it.
In the specific case of batch processing, I hear you. Machine time is extremely cheap compared to engineer time.
Then there are interactive programs. With a human potentially waiting on it. Someone whose time may be just as valuable as the engineer's time (morally that's 1/1, but even financially the difference is rarely more than a single order of magnitude). If you have as few as 100 users, shaving seconds off their work is quickly worth a good chunk of your time.
Machine time is cheap, but don't forget that user's time is not.
You should, however, not pessimize. People make cargo-cult architecture choices that bloat their codebase, make it less readable, and make it 100x slower.
Using actual numbers vetted by actual expenses in an actual company, if you can save 100 CPU cores by spending 3h a year keeping it optimized, then it is NOT worth it.
It is cheaper to burn CPU, even if you could spend one day a year making it max out one CPU core instead of 100.
It can be better for the business to cargo cult.
Not always. But you should remember that the point of the code is to solve a problem, at a low cost. Reducing complexity reduces engineer cost in the future and may also make things faster.
Put it this way: Would you hire someone at $300k doing nothing but optimizing your pipeline so that it takes one machine instead of one rack, or would you spend half that money (TCO over its lifetime) just buying a rack of machines?
If you wouldn't hire them to do it, then you shouldn't spend current engineers time doing it.
I wasn't talking about optimization! I was talking about non-pessimization, which includes not prematurely abstracting/generalizing your code.
I've seen people making poor decisions at the outset, and having code philosophies that actively make new code 100x slower without any clear gain. Over-generalization, 100 classes and subclasses, everything is an overridden virtual method, dogmatic TDD (luckily, nobody followed that.)
The dogma was to make things more complicated and illegible, 'because SOLID'.
Run the lifetime cost of a CPU, and compare it to what you pay your engineers. It's shocking how much RAM and CPU you can get for the price of an hour of engineer time.
And that's not even all! Next time someone reads the code, if it's "clever" (but much much faster) then that's more human time spent.
And if it has a bug because it sacrificed some simplicity? That's human hours or days.
And that's not even all. There's the opportunity cost of that engineer. They cost $100 an hour. They could spend an hour optimizing $50 worth of computer resources, or they could implement 0.1% of a feature that unlocks a million dollar deal.
Then having them optimize is not just a $50 loss, it's a $900 opportunity cost.
But yeah, shipped software like shrinkwrapped or JS running on client browsers, that's just having someone else pay for it.
(which, for the company, has even less cost)
But on the server side: yes, in most cases it's cheaper to get another server than to make the software twice as fast.
Not always. But don't prematurely optimize. Run the numbers.
One thing where it really does matter is when it'll run on battery power. Performance equals battery time. You can't just buy another CPU for that.
Yeah, it doesn't have a simple answer that works for all cases.
Say you need to do some data processing from format A to B. There's already a maintained codebase for converting from A to C, C to D, and a service that converts individual elements from D to A. All steps require storing back onto disk.
For a one-time thing it'll be MUCH cheaper to do it the naive way reusing existing high level blocks, and going to lunch (or vacation), and let it run.
For a recurring thing, or a pipeline with latency requirements, maybe it's worth building a converter from A to B.
Or… it could be cheaper to just shard A and run it on 20 CPUs.
Let's say you have the expensive piles of abstraction, creating huge waste. At my company one HOUR of engineer time costs about the same as 20 CPUs running for A YEAR.
This means that if you reduce CPU use by 20 cores, forever, then ROI takes a full year. Including debugging, productionizing, and maintenance you pretty much can't do anything in 1h.
Likely your A-to-B converter could take 1h of human time just in ongoing costs like release management.
And to your point about code readability: Sometimes the ugly solution (A-C-D-B) is the one with less code. If you needed the A->C, C->D, D->A components anyway, then writing an A->B converter is just more code, with its potential readability problems.
On the flip side of this: It's been a trend for a long time in web development to just add layers of frameworks and it's now "perfectly normal" for a website to take 10s to load. Like what the fuck, blogspot, how do you even get to the point where you realize you need a "loading" animation, and instead of fixing the problem you actually do add one.
Human lifetimes have been spent looking at just blogspot's cogs spinning.
We shouldn't let people obtain CS degrees until they've had to write at least one fairly-complex program on a platform with little enough RAM that the amount of code in the program starts to be something they have to optimize (because the program itself takes up space in memory, not just the data it uses, which is something we hopefully all know but rarely think about in practice on modern machines). Tens or low hundreds of KB of memory. Get 'em questioning every instruction and every memory allocation.
I'm only half-joking.
[EDIT] For extra lulz let them use a language with a bunch of fancy modern language features so they get a taste of what those cost, when they realize they can't afford to use some of them.
It's not far fetched. Microcontroller programming should not be seen as magic.
And microcontrollers will never get abundant capacity because smaller and more efficient means less battery, no matter the tech level.
So it's not like "everyone should know the history of the PDP-11" which I would disagree with.
During my schooling we built traffic lights and stuff on tiny machines, and even in VHDL, even though desktop machines were hundreds of MHz. They both have a place still.
Regarding chrome, browsers are basically operating systems nowadays. A standards compliant HTML5 parser is at the bare minimum millions of lines of code. Same for the renderer and Javascript engine.
That's true. I'm not saying a browser solves a small and simple problem. But on the other hand Chrome takes much more RAM than the operating system (including desktop environment).
Even after closing all tabs, since tabs (and extensions) are basically programs in this operating system.
Yeah, at 600MB/s, 50 seconds of loading is 15GB... So ok, it can't fill my RAM at HDD speeds, but no, none of those use anything near 15GB of memory at startup. (If they did, my question would be WTF are they doing with gigabytes of memory.) And well, loading from disk ought to be the bottleneck of any reasonable cache.
About pre-computing things (that's very likely the answer), the question is what things? Excluding Visual Studio, those are very plain GUI programs, that have a huge amount of options, but not anything near enough. And on the Visual Studio case, all the indexes and intelligence helpers are certainly cached to disk, as it's impossible to recalculate them at load time (the information just isn't there).
One thing those 3 have in common is that they have complete language emulation environments that are exposed to the user but are not related to their main function. Yet, language emulation environments start-up much faster than that, so they can only explain a small part of that time.
I work at a BigCorp that ships desktop software (but none of the above products) and network latency is (usually) pretty easy to extract out of the boot critical path. Blocking UI with network calls is a big no-no, and I expect any sizeable organization to have similar guidelines.
Work like in the OP's article is probably the most difficult - it's work that is necessary, cannot be deferred, but is still slow. So it requires an expert to dig into it.
Power Designer surely is phoning home but this isn't nearly slow enough to matter here. AFAIK Visual Studio phones in during the operation, and not on startup. Libre Office almost certainly isn't phoning anywhere.
I didn't include the slowest starting software that I know, Oracle SQL Developer, because it's clear that all the slowness is caused by phoning home, several times for some reason. But that's not the case for all of them.
EDIT: Or, maybe it's useful to put it another way. The slowest region in the world for me to ping is around Eastern Asia and Australia. Sometimes, I get around 1.5s round trip time for there. A minute has around 40 of those.
Network lag can be worked around with concurrent programming techniques--you don't even have to use a high-performance language to do it. The problem is that concurrent programming is far beyond what the typical Jira jockey can do--bosses would rather hire commodity drones who'll put up with Agile than put up with and pay for the kind of engineers who can write concurrent or parallel programs.
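To make the point about concurrency concrete, here is a minimal sketch in plain Python with `asyncio`. The "requests" are simulated with `asyncio.sleep`, and the names and latencies are invented for illustration; nothing here is taken from any of the products discussed.

```python
import asyncio
import time


async def fake_request(name, latency_s):
    """Stand-in for a network call: just waits, then returns its name."""
    await asyncio.sleep(latency_s)
    return name


async def fetch_all(latencies):
    """Start every request at once and wait for all of them together."""
    tasks = [fake_request(name, latency) for name, latency in latencies.items()]
    return await asyncio.gather(*tasks)


def timed_run(latencies):
    """Run the concurrent fetch and report how long it took in total."""
    start = time.perf_counter()
    results = asyncio.run(fetch_all(latencies))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    # Four hypothetical calls of 0.1 s each: 0.4 s if done one by one.
    latencies = {"license": 0.1, "updates": 0.1, "telemetry": 0.1, "news": 0.1}
    results, elapsed = timed_run(latencies)
    print(results, f"{elapsed:.2f}s")
```

Run one after another, the four waits would add up; run together, the total is close to the slowest single call. Taking the calls off the startup path entirely is a separate step on top of this.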
I use Visual Studio on an air-gapped machine with no (active) network cards (so Windows / winsock2 knows there is nothing that can respond and any connection should error out immediately) and it still takes almost a minute.
At least VS is just kinda slow, maybe it's the XML parser :D
The answers in this subthread had me think more: I am using a company provided Windows machine and a Linux virtual desktop for the same tasks. Startup times difference for many applications is night and day. Probably due to virus scan and MS OneDrive.
> And in the end, the code seems to run "fast enough" and nobody involved really notices that what is running in 750ms really ought to run in something more like 200us.
Nobody has created a language that is both thousands of times faster than Python and nearly as straightforward to learn and to use. The closest thing I know of might be Julia, but that has its own performance problems and is tied closely to its AI/ML niche. Even within that niche I'm certainly not going to get most data scientists to write their code in C or C++ (or heaven forbid Rust) to solve a performance impediment that they've generally been able to work around.
It's great that you've been able to switch to higher-performance languages, but not everyone can do that easily enough to make it worth doing.
The "iterate from notebook to production" process which is common everywhere but the largest data engineering groups rules out anything with manual memory management from becoming popular with data science work.
Some data scientists I know like (or even love) Scala, but that tends to blow up once it's handed over to the data engineers as Scala supports too many paradigms and just a couple DSs will probably manage to find all of them in one program.
We use Go extensively for other things, and most data scientists I've worked with sketching ideas in Go liked it a lot, but the library support just isn't there, and it's not really a priority for any of the big players who are all committed to Python wrapper + C/C++/GPU core, or stock Java stacks. (The performance also isn't quite there yet compared to the top C and C++ libraries, but it's improving.)
I love Scala and wish it was more popular. I've made peace with Java at this point as it slowly adopts my favorite parts of Scala, but I miss how concise my code was.
I think that's my argument. If a developer thinks C or C++ is really that difficult and they can only write effectively in Python, they're a shitty developer and the world seems to be jam packed with them.
As a long-time C# user who started life with coding for embedded systems with C, graduated to C++ business tiers, and then on to C#, my personal crusade has always been to show that it's very possible to make things go pretty fast with C#.
One of my favorite moments happened after my C#-based back-end company was acquired by an all-[FASTER LANGUAGE] company. We had to connect our platforms and hit a shared performance goal of supporting 1 billion events/month, which amounted to something like (IIRC) 380 per second. Our platform hit that mark running on 3 server setup w/2 Amazon Medium FE servers and a SQL backend. The other company's bits choked at 10 per second, running on roughly 50x the infra.
Poorly written and architected code is a bigger drag than the specific language in many cases.
If you're using an IDE (Rider or Visual Studio) and avoid the Enterprise frameworks, then it's much easier to use than Python. Tooling makes a huge difference, no more digging through the sometimes flakey Python documentation and cursing compatibility issues with random dependencies not supporting Apple Silicon.
I agree tooling makes a huge difference but I specifically said this with the understanding that you're using C# with Visual Studio. Some stuff will be easier in C#, but a lot of other stuff just isn't as easy as in Python.
At the risk of setting up a strawman for people to punch down, try comparing how easy it is to do the equivalent of something like this in C#, and feel free to use as much IDE magic as you'd like:
x = [t[1] for t in enumerate(range(1, 50, 4)) if t[0] % 3 == 0][2:]
Was it actually easier?
There's a million other examples I could write here, but I'm hoping that one-liner will be sufficient for illustration purposes.
Enumerable.Range(1,50).Where((x,i) => i % 4 == 0).Where(e => e % 3 == 0).Skip(1).Select(e => e+4)
Okay, so you might consider that last e+4 cheating and against the spirit, but I couldn't be bothered to spend money upgrading my linqpad to support the latest .net with Enumerable.Chunk which makes taking two at a time easier for the first part.
Edit: more in spirit:
Enumerable.Range(1,50).Where(e => e % 4 == 0 && e % 3 == 0).Skip(1).Select(e => e + 1)
If I understand dataflow's example correctly you don't need the Select at the end:
var x = Enumerable.Range(1,50)
.Where((num, index) => num % 4 == 1 && index % 3 == 0)
.Skip(2)
.ToArray();
That computes the same thing as their Python snippet: [25,37,49]. Of course, what this is actually computing is whether the number is congruent to 1 modulo 4 and 3 so it was a weird example, but here's how you'd really want to write it (since a number congruent to 1 modulo 4 and 3 is the same as being congruent to 1 modulo 12):
var x = Enumerable.Range(1,50)
.Where(num => num % 12 == 1)
.Skip(2)
.ToArray();
Rewriting that Python example to be a bit clearer for a proper one-to-one comparison:
y = [t for t in range(1, 50, 4) if t % 3 == 1][2:]
That enumerate wrapper was unnecessary. I don't recall a way, in LINQ, to generate only every 4th number in a range, but I also haven't used C# in a few years so my memory is rusty on LINQ anyways.
You're right, the maths simplifies it a lot. I rushed out a one-liner without much analysis, and eventually came to the same conclusion.
There's no Range method that takes (start, stop, step) but it's trivial enough to write one, it's a single for loop and yield return statement.
We can even trigger the python users by doing it in one line ;)
public static class CustomEnumerable { public static IEnumerable<Int32> Range(int start, int stop, int step) {for (int i = start; i < stop; i+=step) yield return i;}}
Try writing your function definitions on one line in python!
Yeah, that would work, throw it before the Where clause and change 49. Range here doesn't specify a stopping point, but a count of generated values (this makes it not quite the same as Python's range). So you'd want:
Enumerable.Range(0,13).Select(x => 4 * x + 1).Where((e, i) => i % 3 == 0).Skip(2)
And that's equivalent to the original, short of writing a MyRange that combines the first Range and Select. Still an awful lot of work for generating 3 numbers.
No, I'm suggesting that your original example was a great example of obfuscated Python. Even supposing that you wanted to alter the total number of values generated and the number of initial values to skip, you're doing unnecessary work and made it more convoluted than necessary:
def some_example(to_skip=2, total_count=3):
    return [n * 12 + 1 for n in range(to_skip, to_skip+total_count)]
There you go. Change the variable names that I spent < 1 second coming up with and that does exactly the same thing without the enumeration or discarding values. In a thread on how computer speed is wasted on unnecessary computation, it seems silly that you're arguing in favor of unnecessary work and obfuscated code.
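A quick way to settle whether these rewrites really are equivalent is to run them side by side. A small sketch that only uses the Python snippets already posted in this thread (the variable names `original` and `simplified` are mine):

```python
# The original one-liner from upthread.
original = [t[1] for t in enumerate(range(1, 50, 4)) if t[0] % 3 == 0][2:]

# The version without the enumerate wrapper.
simplified = [t for t in range(1, 50, 4) if t % 3 == 1][2:]

# The closed-form version: values congruent to 1 modulo 12.
def some_example(to_skip=2, total_count=3):
    return [n * 12 + 1 for n in range(to_skip, to_skip + total_count)]

print(original)  # [25, 37, 49]
assert original == simplified == some_example()
```

All three agree for these inputs. The closed form only stays equivalent while the step and the filter keep reducing to "1 modulo 12"; change either and the first two versions generalize more directly.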
What you're missing is that C# example works on any Enumerable. And it's very hard to explain how damn important and impressive this is without trying it first.
Yes, it's more verbose, but I can swap that initial array for a List, or a collection, or even an external async datasource, and my code will not change. It will be the same Select.Where....
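For comparison, Python can get a similar source-agnostic pipeline out of generators and itertools, although it is not the same thing as LINQ: an async source would need separate `async for` code, and nothing here gets translated into a database query. A minimal sketch, where `pipeline` is a hypothetical helper name of my own:

```python
from itertools import count, islice

def pipeline(source, skip=2, take=3):
    """Lazily filter, skip, and take from any iterable."""
    matching = (n for n in source if n % 12 == 1)
    return islice(matching, skip, skip + take)

# The same function works on a range, a list, or an endless generator.
print(list(pipeline(range(1, 50))))        # [25, 37, 49]
print(list(pipeline(list(range(1, 50)))))  # [25, 37, 49]
print(list(pipeline(count(1))))            # [25, 37, 49]
```

Because everything is lazy, the endless `count(1)` source terminates as soon as `islice` has taken its three values.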
> is that C# example works on any Enumerable. And it's very hard to explain how damn important and impressive this is without trying it first.
Believe me I've tried (by which I mean used it a ton). I'm not a newbie to this. C# is great. Nobody was saying it's unimportant or unimpressive or whatever.
> Yes, it's more verbose, but I can swap that initial array for a List, or a collection, or even an external async datasource, and my code will not change
Excellent. And when you want that flexibility, the verbosity pays off. When you don't, it doesn't. Simple as that.
> Excellent. And when you want that flexibility, the verbosity pays off. When you don't, it doesn't. Simple as that.
It's rarely as simple as that. For example, this entire conversation started with "At the risk of setting up a strawman for people to punch down, try comparing how easy it is to do the equivalent of something like this".
And this became a discussion of straw men :) Because I could just as easily come up with "replace a range of numbers with data that is read from a database or from async function that then goes through the same transformations", and the result might not be in Python's favor.
It's not "twice as long" in any syntactic sense, and readability is easily fixed:
Enumerable.Range(1,50)
.Where(e => e % 4 == 0 && e % 3 == 0)
.Skip(1)
.Select(e => e + 1)
That's very understandable, it's clear what it does, and if your complaint is that dotnet prefers to name expressions like Skip rather than magic syntax, we can disagree on what makes things readable and easy to maintain.
It's literally "twice as long" syntactically. 120 vs. 67 characters.
And again, you keep omitting the rest of the line. (Why?) What you should've written in response was:
var y = Enumerable.Range(1,50)
.Where(e => e % 4 == 0 && e % 3 == 0)
.Skip(1)
.Select(e => e + 1)
.ToArray();
Compare:
y = [t[1] for t in enumerate(range(1, 50, 4))
if t[0] % 3 == 0][2:]
And (again), my complaint isn't about LINQ or numbers or these functions in particular. This is just a tiny one-liner to illustrate with one example. I could write a ton more. There's just stuff Python is better at, there's other stuff C# is better at, that's just a fact of life. I switch between them depending on what I'm doing.
There's not a lot of difference if you use the query syntax in C# (assuming you add an overload to Enumerable.Range() to take the skip) - only no-one uses that because it's ugly. Also really nice that the types are checked + shown by tooling, as is the syntax.
I use Python a lot for scripting - what it lacks in speed of development/runtime it gains in being more accessible to amateurs and having less "enterprise" style libraries (particularly with cryptographic libraries, MS abstract way too much whilst Python just has thin wrappers around C). That makes Python a strong scripting language for me. PyCharm is really nice too.
For real work? C# is better as long as you have either VS or Rider. Really dislike the VS Code experience (these JS-based editors are slow and nowhere near as nice as Rider) so then I can understand why people would avoid it.
The ToArray is unnecessary, it's much more idiomatic dotnet to deal with IEnumerable all the way through.
The only meaningful difference in lengths is that C# doesn't have an Enumerable.Range(start, stop, increment) overload but it's easy enough to write one, and then it'd be essentially the same length.
"Unnecessary"? You can't just change the problem! I was asking for the equivalent of some particular piece of code using a list, not a different one using a generator. Sometimes you want a generator, sometimes you want an array. In either language.
This is a silly argument, you're asking for a literal translation of a pythonic problem without allowing the idioms from the other languages.
If you were actually trying to solve the problem in dotnet, you'd almost certainly structure it as the Queryable result and then at the very end after composing run ToList, or ToArray or consume in something else that will enumerate it.
Now even including the ToList it's now just four basic steps:
Range, Filter, Skip, Enumerate.
Those are the very basics, all one line if wanted. It doesn't get much more basic than that, and I'd still argue it's easier for someone new to programming to see what's going on in the C# than the python example.
edit: realised the maths simplifies it even further.
There's very little difference between the two as long as you're using modern versions of both and add your own functions to fill any API gaps and are using type hinting properly in Python. My C# tends to be "larger" because I use more vertical whitespace and pylint is rather opinionated.. :)
Where you can complain about C# - and I do - is where you're having to write (or work with) code which has been forced to stick to strict architectural and style standards. That makes code-bases which are very hard to understand for newbies and are verbose.
On the flip side, once you start doing anything even slightly interesting with Python you run into the crappy package management. The end result of which is lots of frustration getting projects working and a lot of time wasted on administration vs work.
But who compares Python with C#, they are not even in the same league? Python is a glorified bash scripting replacement with a mediocre JIT engine. Modern C# is faster than Go which is what it is competing against.
I was able to convert a couple of my data scientist colleagues over to using Scala (given that they were writing code for our Spark cluster it seemed like a no-brainer compared to Python or R). It's not thousands of times faster but it might be ten or a hundred times faster, and a lot of the time you can write the very same code aside from punctuation (and even that difference is smaller in Scala 3, although I don't think Spark has moved to that yet).
And yet, even with all the evidence that modern, heavily-bloated software development is AWFUL (constant bugs and breakage because no one writing code understands any of the software sitting between them and the machine, much less understands the machine; Rowhammer, Spectre, Meltdown, and now Hertzbleed; sitting there waiting multiple seconds for something to launch up another copy of the web browser you already have running just so that you can have chat, hi Discord)... you still have all the people in the comments below trying to come up with reasons why "oh no it's actually good, the poor software developers would have to actually learn something instead of copying code off of Stack Overflow without understanding it".
most numerical algorithms are loops over arrays, accumulators and simple arithmetic. this is where numba shines.
for the other cases, there's a python compatibility mode (on by default) that allows for use of arbitrary python.
the hard parts in numba are ensuring type inference works correctly and adding it to existing python environments that might have dependencies pinned at inconvenient versions or other drama associated with adding an entire llvm to your python environment.
also, there's the explosion of python versions cross numpy/mkl versions cross distributions cross bitwidths... but that's the nature of publicly shipping numerical code in python in general.
all that said, when it's all set up, numba can be quite elegant and simpler than cython.
Python _is_ slow, but even back in 2006 on a pentium 4 I had no problem using it with PyGame to build a smooth 60fps rtype style shooter for a coding challenge.
One just has to not do anything dumb in the render loop and it's plenty responsive.
Of course, if you're going to interactively process a 50mb csv or something... But even then pandas is faster.
Nah that's too general. A lot of website/app backends use Django or Fastapi and they work fine. Many more use PHP, also not a language famed for extreme performance.
It depends on the application. Personally I wouldn't use Python for a GUI (because I'd use JS/TS).
I'd be the first to complain about latency where it matters, but launching a Python program is perceptually instant (and significantly lower-latency than many nominally "faster" languages, IME).
> And in the end, the code seems to run "fast enough" and nobody involved really notices that what is running in 750ms really ought to run in something more like 200us.
At least with Chrome's V8, the difference is not that big.
Sure, it loses to C/C++, because it can't vectorize and uses orders of magnitude more memory, but at least in the Computer Language Benchmarks Game it's "just" 2-4x slower.
I remember getting a faster program doing large matrix multiplication in JavaScript than in C with -O1, because V8 figured out that I'm reading from and writing to the same cell, so optimised that out, which gave it an edge, because in both cases the memory bandwidth limited the speed of execution.
As for Electron and the like: half of the reason why they're slow is that document reflows are not minimized, so the underlying view engine works really, really hard to re-render the same thing over and over again.
It's not nearly as visible in web apps, because these in turn are often slowed down by the HTTP connection limit (hardcoded to six in most browsers).
Languages top out at around 50x, and that's the extreme of pure CPython to C.
For as many factors of magnitude as I am talking about, you have to be screwing up algorithms, networks, and a whole bunch of other things too.
Python and similar languages like Ruby really do make it easy to accidentally pile things on top of each other, but you can screw up in pure assembler with enough work put into it. Assembler doesn't stop you from being accidentally quadratic or using networks in a silly way.
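As a concrete illustration of "accidentally quadratic" (a toy example of my own, not taken from any code discussed here): de-duplicating with a list membership test rescans the list on every iteration, while a set makes each check roughly constant time.

```python
def dedupe_quadratic(items):
    """Keeps first occurrences; `x not in seen` scans a list, so O(n^2) overall."""
    seen = []
    for x in items:
        if x not in seen:
            seen.append(x)
    return seen

def dedupe_linear(items):
    """Same result, but membership is checked against a set: O(n) on average."""
    seen = set()
    out = []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

data = [i % 1000 for i in range(5000)]
assert dedupe_quadratic(data) == dedupe_linear(data)
```

Both versions look equally innocent in review, and the same mistake is available in any language; a faster language only moves the input size at which it starts to hurt.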
Except as a developer I lose lots of time if I have to wait long for my code (esp. Unit tests) to run. Having said that larger projects in C/C++ are often very slow to build (esp. if dependencies are not well defined and certain header files affect huge numbers of source files - a problem that doesn't exist with higher level languages).
But even if using a particular language and framework saves developer time, it rarely seems to translate into developers using that saved time to bother optimizing where it might really count.
I've not found that to be the case. The first draft might get done faster, but then I spend more time debugging issues in dynamic languages that only show up at runtime that the compiler would find in other languages. And then more time optimizing the code, adding caching, moving to more advanced algorithms, and rewriting parts in C just to get it to run at a reasonable speed when the naive approach I implement in other languages is fast enough on first try.
For most tasks, modern mid-level statically typed languages like C#, Go, Kotlin really are the sweet spot for productivity. Languages like Python, Ruby and JS are a false economy that appear more productive than they really are.
That's only an excuse if you're sociopathically profit-oriented. The program is developed orders of magnitude fewer times than it is run. Shitty performance, like pollution, is an externality that can be ignored but should not.
Shitty performance certainly is bad, but it is not an externality like emissions into the atmosphere. The fundamental difference is that the customer (and only the customer) is harmed by bad performance, while emissions harms everyone.
I'm not so sure. Emissions don't harm everyone instantly; they affect people disproportionately and only impact everyone over time as the effects accumulate. Sure, maybe bad performance only affects the customer initially, but can't you help but wonder what the cumulative opportunity cost of bad performance on civilization has been?
The predominant perception among nontechnical people is that computers are fundamentally unreliable and slow. It doesn't seem unreasonable to think that might be holding up the rate of innovation.
By this reasoning, though, there is no such thing as localized harm. The reason I can't abide by the idea that "there's no such thing as localized harm" is that, when you actually try to analyze nonlocal harms caused by personal decisions, you get swallowed up by the butterfly of doom.
It's the butterfly effect. For example, a lot of software that actually gets written is a net negative to society, even if it functions perfectly. So does making it more efficient actually benefit anybody? And a lot of other software is embedded in organizations that will add features to the software until it fails, expanding like an ideal gas to fill whatever space it's given, so even if you make it more efficient and less failure-prone, you're only really delaying the inevitable anyway. However, making a bureaucratic organization less efficient might not actually stop it; consider, for example, how the Social Security Card was originally engineered to be unusable as a national ID, but got used as one anyway, so now the United States not only has a national ID that most citizens didn't want, but we're stuck with a bad one. However, identity theft might actually be considered just another case of externalities, and if the bureaucrats had to eat the cost of easy-to-forge national IDs, this problem might have gotten fixed.
I think you can analyze nonlocal harms, but not using informal reasoning in a chatroom. There are too many possible interactions in the real world to fit them all in your head. You end up with an impossible-to-analyze infinite regress.
Instead, nonlocal harms should probably expect real-world measurements to prove that they actually exist and aren't entirely being washed out by the much larger effect sizes of unrelated phenomena.
Externalities is a concept in economic theory. It shows how net negative behaviour occurs, even when all actors act perfectly rationally and have perfect information (while optimizing for their own gain). Bad software simply does not map to this concept in the same way environmental damage does. In your example people use bad software, against their interest, despite better alternatives.
> any task that speed is even remotely a consideration for anymore
How do you know whether or not speed is a consideration?
Yes, OP delivered impressive efficiency gains. I'm sure he could improve the efficiency even more by dropping into pure Assembly.
But is it worth it?
The prime consideration is not execution speed but maintainability. The further that OP got away from pure Python, the more difficult to maintain the code became. That's a downside.
Now, OP describes an important technique because in the real world, you have a performance budget. Code needs to execute at speeds that return quickly enough to the user, or long execution is financially expensive (i.e. cloud computing resources), etc. But optimizing beyond what the budget requires is wasteful in terms of time needed to do the optimization as well as harmful in terms of negatively impacting future maintainability.
> The prime consideration is not execution speed but maintainability.
Why? And how did you measure this drop in maintainability? I'm asking because I see developers prioritize _perceived_ maintainability over _measurable_ things that matter to the user (like performance).
I don't really find python slow for what I do (typically writing UIs around computer vision systems) but also, several years back I made a microcontroller-based self-balancing robot. It was hard to debug the PID and the sensor, so I replaced it with a Pi Zero and the main robot loop ran in python- enough to read the accelerometer, compute a PID update, and send motor instructions- 100 times a second. If there was a problem (say, another heavy process, like computer vision, running on the single CPU) it would eventually not respond fast enough and the robot would fall over.
Most of the time it's not that you need a faster language, it's that you need to write faster code. I was working on a problem recently where random.choices was slow but I realized that due to the structure of my problem I could convert it to numpy and get a 100X speedup.
More important than the language is using the right tool for the job. If you are using the scientific Python stack correctly, you'll have a difficult time beating that with C++ for many applications, while producing way simpler and more maintainable code.
I felt this pretty viscerally recently. I did Advent of Code 2021 in python last year. My day job is programming in Python so I didn't really think about the execution speed of my solutions much.
As a fun exercise this year I've been doing Advent of Code 2020 in C, and my god it's crazy how much faster my solutions seem to execute. These are just little toy problems, but even still the speed difference is night and day.
Although, I still find Python much easier to read and maintain, but that may just be I'm more experienced with the language.
> Although, I still find Python much easier to read and maintain, but that may just be I'm more experienced with the language.
Python is definitely easier to read and maintain if you have loads of dependencies. C dependency management is a pain.
If you can read and write a little C, you should consider giving C#/Java/Kotlin/Swift a try. They're probably an order of magnitude slower than C if you write them in a maintainable style, but they're still much faster than Python. If you're doing stuff like web APIs then ASP.NET/Spring will perform very admirably without manually optimizing code, for example. You might find that these languages are C-like enough to understand and Python-like enough to be productive in. Or you might not, but it's worth a shot!
I personally believe that C is difficult if not impossible to properly maintain long term, at least not as much as the faster alternatives. On the other hand my experience with Python is that it's one of the slowest mainstream languages out there, relying heavily on C libraries to get acceptable performance.
Haha, what C/C++ web framework should I use instead of Django/Rails/JS-whatever? Performance is a consideration, but I'm not going to reinvent a bunch of packages because of it.
This kind of blanket comment that "scripting languages are too slow" makes it sound like you shouldn't use them for anything, but they are perfectly adequate for many tasks. I'm more likely to have network and DB slowdowns than problems with scripting languages.
There is a balance. Sure, there is inefficient code, but often it's because that code is accessing an I/O resource inefficiently, so the CPU and RAM speed of the host machine isn't the bottleneck no matter what dumb things the programmer does.
So you pretty much never need to reinvent or even use a hackerrank algorithm; you need to understand that the database compute instance has a fast CPU and lots of RAM too.
"I have a hard time using (pure) Python anymore for any task that speed is even remotely a consideration anymore. Not only is it slow even at the best of times, but so many of its features beg you to slow down even more without thinking about it."
Something that you do a lot? Fine, write it in C/C++/Rust.
It's something that costs thousands/millions of dollars of compute? Ok, maybe it's worth it for you to spend a month on, put your robe on, and start chanting in latin.
Bryan Cantrill has a video out there where he rewrote some C code in Rust and the benchmark was enough faster that he couldn't make sense of it. After much digging it turned out that it was because Rust was using a better data structure to represent the data, one that's difficult to get right in C.
In the end his test was comparing algorithms not compilers, but there is still something to that: we always make algorithmic compromises based on what is robust and what is brittle in our language of choice. The speed limits don't matter if only a madman would ever drive that fast.
But perfect performance isn't even the benchmark, it's "not ridiculously slow". This is what is meant by "Computers are fast, but you don't know it", you don't even know how ludicrously fast computers are because so much stuff is so insanely slow.
They're so fast that, in the vast majority of cases, you don't even need optimization, you just need non-pessimization: https://youtu.be/pgoetgxecw8
To be really fast, yes. Those are optimizations that allow you to go beyond the speed of just C and proper algorithms.
But C and proper algorithms are still fast - Moore's law is going wider, yes, and single-threaded advancements aren't as impressive as they used to be, but solid C code and proper algorithms will still be faster than it was before!
What's not fast is when, instead of using a hashmap when you should have used a B-tree, you instead store half the data in a relational database from one microservice and the other half on the blockchain and query it using a zero-code platform provided by a third vendor.
These things only net you one or two orders of magnitude (and give you very little or even negative power efficiency gain), or maybe 3 for the gpu.
This pales in comparison to the 4-6 orders of magnitude induced by thoughtless patterns, excessive abstraction, bloat, and user-hostile network round trips (this one is more like 10 orders of magnitude).
Write good clean code in a way that your compiler can easily reason about to insert suitable vector operations (a little easier in c++, rust, zig etc. than c) and it's perfect performance in my book even if it isn't saturating all the cores
I've always been tempted to make things fast, but for what I personally do on a day to day basis, it all lands under the category of premature optimization. I suspect this is the case for 90% of development out there. I will optimize, but only after the problem presents itself. Unfortunately, as devs, we need to provide "value to the business". This means cranking out features quickly rather than as performant as possible and leaving those optimization itches for later. I don't like it, but it is what it is.
> for what I personally do on a day to day basis, it all lands under the category of premature optimization
Another perspective on premature opt: When my software tool is used for an hour in the middle of a 20-day data pipeline, most optimization becomes negligible unless it's saving time on the scale of hours. And even then, some of my coworkers just shrug and run the job over the weekend.
I agree... for a business "fast" means shipping a feature quickly. I have personally seen the convos from upper management where they handwave away or even justify making the application slower or unusable for certain users (usually people in developing countries with crappy devices). Oh it will cost +500KB per page load, but we can ship it in 2 weeks? Sounds good!
Lots of businesses have nearly zero engineering in them and cobble together libraries and frameworks that they sell or rent as software.
On the other end of the spectrum you have companies hiring specialists at all points of the stack to squeeze out the last drops of performance, dedicated perf teams, etc. The latter also typically produce the tools that enable the former to function.
Performance is something that needs to be considered throughout the development cycle. If optimization happens at the end then it’s either a rewrite or a minor concern anyway because the building blocks like frameworks and libraries were already optimized. Or the software is just slow but still sells for other reasons.
Totally depends on the business. Most businesses just don't need ultra low ms response times.
Rewriting an app because it is too slow is a rather extreme approach. Most of the time it's just a small part of the application that needs optimization and not the entire app.
I'd argue that if the app experiences huge growth, then that's a good problem to have and a rewrite is in order.
This is not about ultra-low responses or anything. Performance is just as much part of an application’s architecture as security or usability are. You can’t add those things at the end, they need to be done iteratively.
So when you say you optimize at the end as needed, you get away with that because somebody already did that job for you.
A) The frameworks and libs are heavily optimized, so that developers deploying them will get the best possible performance just by idiomatic usage and connecting those libraries together.
+
B) The software itself is not technically challenging.
When A and B don’t hold, ignoring performance will get the project in big trouble.
Yup. We have gotten into the habit of leaving a lot of potential performance on the floor in the interest of productivity/accessibility. What always amazes me is when I have to work with a person who only speaks Python or only speaks JS and is completely unaware of the actual performance potential of a system. I think a lot of people just accept the performance they get as normal even if they are doing things that take 1000x (or worse) the time and/or space than it could (even without heroic work).
I think it's even stronger than a habit. When you're exposed to the typical "performance" of the web and apps for a decade or so, you may have forgotten about raw performance entirely. Young people may have never experienced it at all.
I once owned a small business server with a Xeon processor, Linux installed. Just for kicks I wrote a C program that would loop over many thousands of files, read their content, sort in memory, dump into a single output file.
I ran the program and as I ran it, it was done. I kept upping the scope and load but it seems I could throw anything at it and the response time was zero, or something perceived as zero.
Meanwhile, it's 2022 and we can't even have a text editor place a character on screen without noticeable lag.
Shit performance is even ingrained in our culture. When you have a web shop with a "submit order" button, if you'd click it and would instantly say "thanks for your order", people are going to call you. They wonder if the order got through.
Shit performance is what happens when every response to optimizations or overhead is immediately answered with "premature optimization is the root of all evil."
Or the always fun "profile it!" or "the runtime will optimize it" when discussing new language features and systems.
So often performance isn't just ignored, it's actively preached against. Don't question how that new runtime feature performs today or even dare to ask. No no no, go all in on whatever and hope the JIT fairy is real and fixes it. Even though it never is and never does.
There's a place for all the current tech, of course. Developer productivity can be more important at times. But it should be far more known what the tradeoffs are and rough optimization guides than there are.
I think the issue isn't even individual developers, it's indeed the runtime itself. Anything you build on top of it is laggy.
Take my simple example of reading a file, processing it in memory, writing output. A process that should be instant in almost any case.
An implementation of such process that is commonly used in the front-end world would be CSS compilation, where a SCSS file (which is 90% CSS) is compiled into normal CSS output. The computation being pretty simple, it's all in-memory and some reshuffling of values.
In terms of what is actually happening (if we take the shortest path to solve the problem), this process should be instant. Not only that, it can probably handle 1,000 such files per second.
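To give a feel for how little work the "shortest path" involves, here is a toy stand-in that only inlines `$name: value;` variables. It is nowhere near a real SCSS compiler (no nesting, mixins, or imports), and `compile_toy_scss` is a name made up for this sketch:

```python
import re
import time

VAR_DEF = re.compile(r"^\$([\w-]+):\s*(.+);$", re.MULTILINE)
VAR_USE = re.compile(r"\$([\w-]+)")

def compile_toy_scss(source):
    """Toy stand-in for SCSS: inline `$name: value;` variables, nothing else."""
    variables = dict(VAR_DEF.findall(source))
    body = VAR_DEF.sub("", source)  # drop the variable definition lines
    return VAR_USE.sub(lambda m: variables[m.group(1)], body).strip()

sample = "$brand: #336699;\n$pad: 4px;\n.btn { color: $brand; padding: $pad; }"
print(compile_toy_scss(sample))  # .btn { color: #336699; padding: 4px; }

# Time 1,000 in-memory "files"; the number depends entirely on the machine.
start = time.perf_counter()
for _ in range(1000):
    compile_toy_scss(sample)
elapsed = time.perf_counter() - start
print(f"1,000 toy files in {elapsed:.4f}s")
```

The timing line is there so the claim can be tried on whatever machine is at hand rather than taken on faith; a real stylesheet is larger and a real compiler does far more than this.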
Instead, just a handful of files takes multiple seconds. Possibly a thousand times slower than the shortest path. Because that process is a node package with dependencies 17 layers deep running an interpreted language. Worse, the main package requires a Ruby runtime (no longer true for this example, but it was), which then loads a gem and then finally is ready to discover alien life, or...do simple string manipulation.
To appreciate the absurdity of this level of waste, I'd compare it to bringing the full force of the US army in order to kill a mosquito.
It's in end user apps too, and spreading. Desktop apps like Spotify, parts of Photoshop, parts of Office365...all rewritten in Electron, React, etc.
I can understand the perspective of the lonesome developer needing productivity. What I cannot understand is that the core layers are so poor. It means that millions of developers are building millions of apps on this poor foundation. It's a planetary level of waste.
Hmm, I recently built a site on Zola, and rebuilding the whole blog (including a theme with around 10 files of Sass) compiles in a few dozen milliseconds, and around 1 second on a 15 year old Core 2 Duo. But then again this is compiled Rust calling into libsass, which (despite Rust's dependency auditing nightmare) compiles to low-overhead executables. And apparently libsass is now deprecated for Dart Sass which relies on JS or the Dart VM.
In my experience, one of the most common causes of slowness is IO when there should be none. I’ve managed to speed up some computations at my company by over 1000x by batching IO and keeping the main computational pathways IO-free.
The Java and Python runtimes, which have much better test coverage and higher correctness standards than most enterprise applications, shipped a broken sort method for decades because it was a few percent faster. Never mind that for some inputs the returned value wouldn't actually be sorted.
As an industry we're not qualified to even start caring about performance when our record on correctness is so abysmal. If you have a bug then your worst-case runtime is infinity, and so far almost all nontrivial programs have bugs.
Wouldn't "profile it!" be the exact opposite of ignoring performance wins? It tells you which optimizations will noticeably improve your performance and which are theoretical gains that made no difference to realistic workloads.
It's a dismissive answer. It'd be like if someone asked "why does 0.2f + 0.1f print 0.30000000001?" and getting back an answer of "use a debugger!" It's not strictly wrong, the debugger would provide you with the data on what's happening. But it doesn't actually answer the question or provide commentary on why.
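For contrast, here is roughly what an answer with the "why" in it might look like. This is a minimal sketch in Python, which uses 64-bit doubles rather than the 32-bit floats in the question, so the stray digits show up further to the right; the variable names are only for illustration.

```python
from decimal import Decimal
from fractions import Fraction

total = 0.1 + 0.2

# Neither 0.1 nor 0.2 has an exact binary representation, so each literal
# is rounded to the nearest double before the addition even happens.
stored_tenth = Decimal(0.1)      # the exact value actually stored for 0.1
as_ratio = Fraction(0.1)         # the same stored value as an exact ratio

print(repr(total))               # 0.30000000000000004
print(total == 0.3)              # False
print(stored_tenth)              # slightly more than 0.1

# The denominator is a power of two: a double can only hold sums of
# binary fractions, and 1/10 is not one of them.
d = as_ratio.denominator
print(d & (d - 1) == 0)          # True
```

The debugger would show the same digits, but only the explanation says where they come from.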
Similarly, the "profile it!" answer is often used when the person answering doesn't actually know themselves, and is just shutting down the discussion without meaningfully contributing. And it doesn't provide any commentary on why something performs like it does or if the cost is reasonable.
Well, performance is rarely the most important thing nowadays. What's preached against is not performance, but a performance-first attitude.
I agree it would be nice to value performance a bit more, but not at all costs, and depending on the use case and context of the application not necessarily as the priority over security, maintainability, velocity, reliability, etc.
> What's preached against is not performance, but a performance-first attitude.
That's what's preached against in theory. But in practice any performance discussion is immediately met with that answer. The standing recommendation is build it fully ignorant of all things performance, and then hope you can somehow profile and fix it later. But you probably can't, because your architecture and APIs are fundamentally wrong now. Or you've been so pervasively infested with slow patterns you can't reasonably dig out of it after the fact. Like, say, if you went all in on Java Streams and stopped using for loops entirely, which is something I've seen more than a few times. Or another example would be if you actually listen to all the lint warnings yelling at you to use List<T> everywhere instead of the concrete type. That pattern doesn't meaningfully improve flexibility, but it does cost you performance everywhere.
> What's preached against is not performance, but a performance-first attitude.
No, I can tell you this same record has been stuck on repeat since at least the mid 1990's. People want to shut down conversations or assign homework because it gets them out of having to think. Not because they're stupid (though occasionally...) but because you're harshing their buzz, taking away from something that's fun to think about.
This is a tangent but there are other, arguably better ways to give the user confidence the order took place in your example. You could show the line items to them again with some indicators of completion, or show the order added to an excerpt of their order history, where perhaps they can tap/click to view line items. Something like that is a bit more convincing than just thank-you text even with the delay, IMO, though it may be tougher to pull off design-wise.
In my SAAS app, we have a few artificial delays to ensure all "background tasks" that pop up a progress dialog take at least long enough to show that to the user.
I once did a progress dialog for a Windows app. It had a delay so it wouldn't even show until 0.5 seconds had elapsed - if the operation completed in that time, you never saw the pop-up. Once the dialog appeared it would stay on screen for at least a second so you wouldn't get freaked out by a sudden flash, even if the operation completed immediately after the 0.5 second delay.
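The timing rule described above can be sketched as a small pure function. The names `SHOW_DELAY`, `MIN_VISIBLE` and `dialog_window` are made up for this illustration and are not from any Windows API; an operation that finishes exactly at the 0.5 second mark is assumed to count as "never shown".

```python
from typing import Optional, Tuple

SHOW_DELAY = 0.5    # seconds before the dialog is allowed to appear
MIN_VISIBLE = 1.0   # once shown, keep it on screen at least this long


def dialog_window(op_duration: float) -> Optional[Tuple[float, float]]:
    """Return (show_at, hide_at) in seconds from the start of the operation,
    or None if the operation finishes before the dialog would appear."""
    if op_duration <= SHOW_DELAY:
        return None                       # fast operation: no pop-up at all
    show_at = SHOW_DELAY
    # Hide when the work is done, but never sooner than MIN_VISIBLE after showing.
    hide_at = max(op_duration, show_at + MIN_VISIBLE)
    return (show_at, hide_at)


print(dialog_window(0.3))   # None
print(dialog_window(0.6))   # (0.5, 1.5)
print(dialog_window(5.0))   # (0.5, 5.0)
```

The middle case is the interesting one: the work ends at 0.6 seconds, but the dialog stays up until 1.5 seconds to avoid the sudden flash.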
In my experience, this wouldn't be needed if the rest of the app ran at native speed. There would already be a natural delay that would be noticed by the user.
> I think a lot of people just accept the performance they get as normal even if they are doing things that take 1000x (or worse) the time and/or space than it could (even without heroic work).
Habit is a very powerful force.
Performance is somewhat abstract, as in "just throw more CPUs at it" / it works for me (on my top of the line PC). But people will happily keep on using unergonomic tools just because they've always done so.
I work for a shop that's mainly Windows (but I'm a Linux guy). I won't even get into how annoying the OS is and how unnecessary, since we're mostly using web apps through Chrome. But pretty much all my colleagues have no issue with using VNC for remote administration of computers.
It's so painful, it hurts to see them do it. And for some reason, they absolutely refuse to use RDP (I'm talking about local connections, over a controlled network). And they don't particularly need to see what the user in front of the computer is seeing, they just need to see that some random app starts or something.
I won't even get into Windows Remote Management and controlling those systems from the comfort of their local terminal with 0 lag.
But for some reason, "we've always done it this way" is stronger than the inconvenience through which they have to suffer every day.
Part of the problem is we use unintentionally vague terms like "performance." What does that mean? Bandwidth? Reliability? Scalability? Something we can fix later right? That's what all executives and—frankly—most engineers hear.
I only ever talk about "latency." Latency is time—you can't get latency back once you've spent it.
It's the downside to choosing boring tech. It costs believable dollars to migrate and unbelievable dollars to keep course. There is a happy medium, I believe, that is better than "pissing away the competitive edge."
> only speaks Python or only speaks JS and is completely unaware of the actual performance potential of a system
If you stick to only doing arithmetic and avoid making lots of small objects, javascript engines are pretty fast (really!). The tricky part with doing performance-sensitive work in JS is that it’s hard to reason about the intricacies of JITs and differences between implementations and sometimes subtle mistakes will dramatically bonk performance, but it’s not impossible to be fast.
People building giant towers of indirection and never bothering to profile them is what slows the code down, not running in JS per se.
JS, like other high-level languages, offers convenient features that encourage authors to focus on code clarity and concision by building abstractions out of abstractions out of abstractions, whereas performance is best with simple for loops working over pre-allocated arrays.
Agreed that switching to lower level languages gives the potential for many orders of magnitude of speedup. But the thing that was most enlightening was that removing pandas made a 9900% increase in speed without even a change to language. 20 minutes down to 12 seconds is a very big deal, and I still don't have to remember how to manage pointers.
I think that should be emphasized. The rest of the optimizations are entirely unneeded and added complexity to the code base. The next guy to work on this needs to be a cpp dev, but the requirements were only asking for 500ms which was more than met by the first fix. What is the payoff of this added performance, and at what cost?
I don’t believe orders of magnitude are achievable in general. Even python, which is perhaps the slowest mainstream language, clocks in at around 10x the runtime of C.
Sure, there will be some specialized program where keeping the cache manually small you can achieve big improvements, but most mainstream managed languages have very great performance. The slowdown is caused by the frameworks and whatnot, not the language itself.
That’s a tiny as hell microbenchmark though, where rust likely was able to vectorize even. The difference won’t be as drastic for larger (more meaningful) applications.
Python programs often need to use a lot of optimized C and C++ libraries to get anywhere near reasonable performance. I would be shocked if a webserver written in Python was only 10x slower than one written in C (or Go or Rust for that matter).
It's interesting to me that two of the top three comments right now are talking about gaining performance benefits by switching from Python to C when the actual article in the link claims he gained a speedup by pulling things out of pandas, which is written in C, and using normal Python list operations.
I would like to see all of the actual code he omitted, because I am skeptical how that would happen. It's been a while since I've used pandas for anything, but it should be pretty fast. The only thing I can think is he was maybe trying to run an apply on a column where the function was something doing Python string processing, or possibly the groupby is on something that isn't a categorical variable and needs to be converted on the fly.
> the actual article in the link claims he gained a speedup by pulling things out of pandas, which is written in C, and using normal Python list operations.
Well, he claims he did three things:
(1) avoid repeating a shared step every time the aggregate function was called,
(2) unspecified algorithmic optimizations,
(3) use Python lists instead of pandas dataframes.
(1) is a win that doesn't have anything to do with pandas vs python list ops, and (2) is skipped over without any detail but appears to be the meat of the change. Visually, it looks like most of the things the pandas code tries to do just aren't done in the revised code (it's hard to tell because some is hidden behind a function whose purpose and implementation are not provided). It's not at all clear that the move out of pandas was necessary or particularly relevant.
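Point (1) is easy to illustrate on its own. The article's code isn't available, so everything below (`initialise_weights`, the aggregate functions, the data) is a hypothetical stand-in, and the hoisted version assumes every group has the same length so the setup really can be shared.

```python
calls = {"init": 0}


def initialise_weights(n):
    """Stand-in for an expensive shared setup step (hypothetical)."""
    calls["init"] += 1
    return [1.0 / (i + 1) for i in range(n)]


def aggregate_repeating(groups):
    # Recomputes the shared weights for every single group.
    results = []
    for values in groups:
        weights = initialise_weights(len(values))
        results.append(sum(w * v for w, v in zip(weights, values)))
    return results


def aggregate_hoisted(groups):
    # Computes the weights once up front and reuses them for each group.
    if not groups:
        return []
    weights = initialise_weights(len(groups[0]))
    return [sum(w * v for w, v in zip(weights, values)) for values in groups]


groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

calls["init"] = 0
slow = aggregate_repeating(groups)
slow_calls = calls["init"]      # one setup call per group

calls["init"] = 0
fast = aggregate_hoisted(groups)
fast_calls = calls["init"]      # a single setup call

print(slow == fast, slow_calls, fast_calls)   # True 3 1
```

Neither version touches pandas, which is the point: the saving comes from doing the setup once, whatever the container.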
While I would certainly welcome awareness when it comes to performance it's not always useful to make something 1000x faster if it takes even as little as 25% longer to develop. Taking an extra day to make something take 1s instead of an hour is just not always worth it.
Though I will never understand webpages that use more code than you'd reasonably need to implement a performant lisp compiler and build the webpage in that (not that I'm saying that's what they should have done, I just don't understand how they use more code)
Developers are genuinely bad at watching themselves work. I've had any number of conversations with people who are being slowed down by things and just don't see it. If you take the roadblock away, a lot of them will start to notice, but most won't notice when it comes back, so recruiting people to help you keep things working is a challenge, and guard dogging things can be a significant time suck.
The thing I usually end up saying in situations like this is that, if the application doesn't 'work' for the developers, then soon it won't work for anybody.
For the average user, speeding up some things by 10 seconds will affect their lives more than you think, but it's not going to be the secret to happiness. But for some of these same workflows, the developers are running them over, and over, and over in a day, and cutting a few seconds off each iteration can add up quick. I've fixed build issues that saved team members 45 minutes a day. That sounds nice, but not earth shattering, until you look at the work flow and see 45 minutes is the difference between 4 attempts at fixing a hard problem in one day versus 5. That's not just time that's stress. "I have one more try at this and then I'm done for the day, having accomplished nothing."
The XKCD math on whether you should implement a time saving tool is off by at least an order of magnitude for most real world problems, because it doesn't account for team dynamics. It's written for and about people who don't stop and ask for help. The sooner a person finds their own solution to a problem I'm knowledgeable in, the lower the likelihood that I will get preempted. The three minutes it takes to help them costs me half an hour. Even with tricks to salvage a silver lining from such interactions, that's still expensive.
You also have some mental thresholds that multiply this effect even more. The difference between a 5 min build and a 30 min build isn’t just 25 mins. It’s the difference between I will only run this over lunch break, vs I will run this while fetching coffee. Add many other thresholds like short enough to still stare at progress bar vs alt-tab into Facebook and lose attention and waste another 10mins there, slow enough to only run overnight, etc.
Then there is the death by a thousand paper cuts effect. For smaller tasks like updating status in Jira, if this takes 30 seconds of clicking and waiting (far from a hypothetical scenario btw!), I’m simply going to say fuck it and not do it at all.
Agreed. But I’ll add another phenomenon here. A five minute build takes ten minutes, because once you start something else you’ve estimated will take five minutes, you quickly discover that it takes ten, or you forget that you were doing that other thing. So taking four minutes off of a build actually takes 8 minutes off of the expected round trip time.
And that’s not even counting the “what if it fails the first time” tax which can double it again. Especially if it fails 30 seconds in and you don’t check until the end of the expected time. That four minutes can go to fifteen minutes on a really bad day, and that bad day might be a production issue or just trying to get out the door for your anniversary dinner. These are the situations when the light bulb goes on for people.
It depends on how often you need to do the thing and how long it takes to do it. There’s an XKCD that’s just a chart of that.
Sadly any concept of performance seems to completely go out the window for most programmers once it leaves their hands; taking 2-3x longer to write a performant app in a compiled language would save a ton of time and cycles on users’ machines but Programmer Time Is Expensive, let’s just shit out an Electron app, who cares that it’s five orders of magnitude larger and slower.
That knowledge is often not required to earn a living, so it's not surprising to me at all. My only realistic advice for people lamenting the common lack of this knowledge is to teach it (so you feel like you're making a difference) or put yourself among people with similar interests. Making performance knowledge a requirement to earn a paycheck these days is going to take a hell of a lot of change.
As a hobby, I still write Win32 programs (WTL framework).
It's hilarious how quickly things work these days if you just use the 90s-era APIs.
It's also fun to play with ControlSpy++ and see the dozens, maybe hundreds, of messages that your Win32 windows receive, and imagine all the function calls that occur in a short period of time (ie: moving your mouse cursor over a button and moving it around a bit).
Linux windows get just as many (run xev from a terminal and do the same thing). Our modern processors, even the crappiest Atoms and ARMs, are actually really, really fast.
Vega64 can explore the entire 32-bit space roughly 1-thousand times per second. (4096 shaders, each handling a 32-bit number per clock tick, albeit requiring 16384 threads to actually utilize all those shaders due to how the hardware works, at 1200 MHz)
One of my toy programs was brute forcing all 32-bit constants looking for the maximum amount of "bit-avalanche" in my home-brew random number generators. It only takes a couple of seconds to run on the GPU, despite exhaustive searching and calculating the RNG across every possible 32-bit number and running statistics on the results.
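The back-of-envelope figure for the Vega 64 can be checked in a few lines. This only uses the numbers quoted above and the idealized assumption that every shader retires one 32-bit value per clock tick; real kernels doing actual work per value will be slower.

```python
shaders = 4096             # shader count quoted above
clock_hz = 1200e6          # 1200 MHz, as quoted above
values_32bit = 2 ** 32     # size of the 32-bit space

# Idealized: one 32-bit value per shader per clock tick.
values_per_second = shaders * clock_hz
sweeps_per_second = values_per_second / values_32bit

print(f"{sweeps_per_second:.0f} full sweeps per second")   # 1144 full sweeps per second
```

So "roughly a thousand times per second" is the right order of magnitude for the raw throughput, and a couple of seconds for an exhaustive search leaves room for on the order of a thousand operations per candidate.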
There was one place where a coworker had written a function that converted data from a proprietary format into a SQL database. On some data, this took 20 minutes on the test iPhone. The coworker swore blind it was as optimised as possible and could not possibly go faster, even though it didn't take that long to load from either the original file format or the database in normal use.
By the next morning, I'd found it was doing an O(n^2) operation that, while probably sensible when the app had first been released, was now totally unnecessary and which I could safely remove. That alone reduced the 20 minutes to 200 milliseconds.
(And this is despite that coworker repeatedly emphasising the importance of making the phone battery last as long as possible).
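The comment doesn't say what the O(n^2) operation was, and the fix there was to remove it outright, so the following is only a hypothetical illustration of how quadratic cost tends to hide in innocent-looking code: a "have I seen this already?" check against a list.

```python
def dedupe_quadratic(records):
    seen = []                    # list membership is a linear scan
    out = []
    for rec in records:
        if rec not in seen:      # O(n) per check, so O(n^2) overall
            seen.append(rec)
            out.append(rec)
    return out


def dedupe_linear(records):
    seen = set()                 # set membership is O(1) on average
    out = []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            out.append(rec)
    return out


sample = [3, 1, 3, 2, 1, 4]
print(dedupe_quadratic(sample))  # [3, 1, 2, 4]
print(dedupe_linear(sample))     # [3, 1, 2, 4]
```

Both return the same result, which is exactly why this kind of thing survives code review: it is correct, and it is fast on the small inputs the app started out with.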
It's a ~community language without the backing of an 800lb gorilla to offer up both financial and cheerleading support.
I love the idea of Nim, but it is in a real chicken-and-egg problem where it is hard for me to dedicate time to a language I fear will never reach a critical mass.
I've used Nim for about 2 years now. It's a wonderful language but it's desperately lacking a proper web framework and a proper ORM. If such a thing existed I would probably drop Elixir for Nim career-wise.
The dichotomy between "compiled/interpreted" languages is completely meaningless at this point. You can argue that Python is compiled and Java is interpreted. I mean one of our deployment stages is to compile all our Python files.
The thing that makes the difference isn't the compilation steps, it's how dynamic the language is and how much behind the scenes work has to be done per line and what tools the language gives you to express stronger guarantees that can be optimized (like __slots__ in Python).
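As a small illustration of the `__slots__` point: declaring the attribute set up front removes the per-instance `__dict__`, which is one of the "stronger guarantees" a class can give. The class names here are made up for the example.

```python
class PointDynamic:
    def __init__(self, x, y):
        self.x = x
        self.y = y


class PointSlots:
    __slots__ = ("x", "y")   # fixed attribute set, no per-instance __dict__

    def __init__(self, x, y):
        self.x = x
        self.y = y


p = PointDynamic(1, 2)
q = PointSlots(1, 2)

p.z = 3                      # fine: attributes live in a per-instance dict

try:
    q.z = 3                  # rejected: "z" is not declared in __slots__
    slots_allow_new_attr = True
except AttributeError:
    slots_allow_new_attr = False

print(hasattr(p, "__dict__"), hasattr(q, "__dict__"))   # True False
print(slots_allow_new_attr)                             # False
```

The trade is flexibility for a smaller, more predictable object layout; how much time or memory that saves in practice depends on the workload.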
No, compiling JIT like Java/.NET/V8 becomes real cpu machine instructions on the fly.
Compiling to .pyc files is still byte code that later the python interpreter will read in what is more or less a gigantic while(true)switch/case (operation_type). What you save with pyc compilation is just the text parsing of the source code.
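The "gigantic while(true) switch/case" shape looks roughly like this. It is a toy stack machine with made-up opcodes, not CPython's real instruction set or implementation, but the structure (fetch an opcode, branch on it, repeat) is the same idea.

```python
# Hypothetical opcodes for a toy stack machine; not CPython's real opcodes.
PUSH, ADD, MUL, HALT = range(4)


def run(program):
    stack = []
    pc = 0                                # index of the next instruction
    while True:
        op = program[pc]
        pc += 1
        if op == PUSH:
            stack.append(program[pc])     # operand follows the opcode
            pc += 1
        elif op == ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == MUL:
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == HALT:
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode {op}")


# (2 + 3) * 4
program = [PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, HALT]
result = run(program)
print(result)   # 20
```

Every instruction pays for the fetch and the branch before doing any useful work, which is the overhead a JIT removes by emitting machine code instead.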
How dynamic the language is does, however, affect how feasible it is to do JIT compilation, and here python has done itself a big disservice by simply being too dynamic. Some attempts at JITing it have been done before (pypy, ironpy), usually by nerfing the language a bit to get rid of the most dynamic parts - something that is probably for the better anyway.
I think it depends and could be either worse or better depending.
Some code is compiled more often than it is run, and some code is run more often than it's compiled.
If you can spend 100k operations per compilation to save 50k operations at runtime on average... That'll probably be a net positive for chromium or glibc functions or linux syscalls, all of which end up being run by users more often than they are built by developers.
If it's 100k operations at build-time to remove 50k operations from a test function only hit by CI, then yeah, you'll be in the hole 50k operations per CI run.
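Spelled out with the numbers above (the function name is just for illustration):

```python
BUILD_COST = 100_000    # extra operations spent per compilation (figure above)
RUN_SAVING = 50_000     # operations saved per run (figure above)


def net_operations_saved(runs_per_build: int) -> int:
    """Net operations saved per build, given how often the result is run."""
    return runs_per_build * RUN_SAVING - BUILD_COST


break_even_runs = BUILD_COST / RUN_SAVING

print(break_even_runs)               # 2.0
print(net_operations_saved(1))       # -50000
print(net_operations_saved(1000))    # 49900000
```

One run per build is the CI case (50k in the hole); anything run more than twice per build comes out ahead.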
All of this ignores the human cost; I don't really want to try (and fail) to approximate the CO2 emissions of converting coffee to performance optimizations.
Not all optimizations are more energy consuming. For an analogy, does using a car consume more energy than a bicycle? Yes. But using a bicycle does not consume more energy than a man running on foot.
I also found this disappointing. There’s supposedly a 100x speed up to be had going from something in pandas to something using plain python lists but I have no real idea what it is or why it might have produced a speed up. I can guess, but what’s the point of writing an article that just makes me guess at the existence of some hypothetical slow code?
And then shows some grouping and sorting functions using pandas.
Then he says:
"I replaced Pandas with simple python lists and implemented the algorithm manually to do the group-by and sort."
I think the point of the first optimization is you can do the relatively expensive group/sort operations without pandas, and improve performance. For the rest of the article it's just "algorithm_wizardry", which no longer deals with that portion of the code.
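Since the article doesn't show its replacement code, here is only a guess at the general shape of a manual group-by and sort with plain Python containers. The rows and names are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (group key, score)
rows = [("b", 3), ("a", 5), ("b", 1), ("a", 2), ("c", 4)]

# Group-by: one pass, appending each score to its key's list.
groups = defaultdict(list)
for key, score in rows:
    groups[key].append(score)

# Sort each group's scores, then order the groups by key.
grouped_sorted = [(key, sorted(scores)) for key, scores in sorted(groups.items())]

print(grouped_sorted)   # [('a', [2, 5]), ('b', [1, 3]), ('c', [4])]
```

No DataFrame or Series objects get created along the way, which is where the article's footnote says its time was going.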
We never get a good sense of how much time was actually saved with that change not least because the original function calls "initialise weights" inside every loop, the new function does not. It would have been interesting to see what difference that alone made.
The takeaway of the article, that computers are blindingly fast and we make them spend most of their time doing unnecessary work (and often sitting around waiting on I/O), is true of course.
I'm currently writing a utility to do a basic benchmark of data structures and I/O and it's been a real learning experience for me in just how fast computers can be, but also just how much a little bit of overhead or contention can slow things down, but that's better left for a full write up another day.
> We never get a good sense of how much time was actually saved with that change not least because the original function calls "initialise weights" inside every loop, the new function does not.
Good point. Further to your point, I would assume a library like pandas has fairly well optimized group and sort operations. It would not occur to me that pandas is the bottleneck, but the author does clarify in his footnote that pandas operations, by virtue of creating more complex pandas objects, can indeed be a bottleneck.
[1] Please don't get me wrong. Pandas is pretty fast for a typical dataset but it's not the processing that slows down pandas in my case. It's the creation of Pandas objects itself which can be slow. If your service needs to respond in less than 500ms, then you will feel the effect of each line of Pandas code.
It also doesn’t stop when you reach your destination so you have to jump and roll out. Get it wrong and you die. Questioning this method is widely frowned on.
I would say to the local farm, then you have to wait for the cow to be milked (like an external api call...). At the end you just reduced your journey time by 0.1%, and increased your code complexity by 100%.
On the other hand, to non-webdevs, webdevs are like an obese american woman trundling down the road on her mobility scooter, giving the evil eye to people overtaking her on foot as she takes a bite out of a hunk of RAMcheese.
My entire career, we never optimize code as well as we can, we optimize as well as we need to. Obviously the result is that computer performance is only "just okay" despite the hardware being capable of much more. This pattern repeats itself across the industry over decades without changing much.
The problem is that performance for most common tasks that people do (f.e. browsing the web, opening a word processor, hell even opening an IM app) has gone from "just okay" to "bad" over the past couple of decades despite our computers getting many times more powerful across every possible dimension (from instructions-per-clock to clock-rate to cache-size to memory-speed to memory-size to ...)
For all this decreased performance, what new features do we have to show for it? Oh great, I can search my Start menu and my taskbar had a shiny gradient for a decade.
I think a lot of this is actually somewhat misremembering how slow computers used to be. We used to use spinning hard disks, and we were so often waiting for them to open programs.
Thinking about it some more, the iPhone and iPad actually comes to mind as devices that perform well and are practically always snappy.
> I think a lot of this is actually somewhat misremembering how slow computers used to be
Suffice to say: I wish. I have a decently powerful computer now, but that only happened a few years ago.
> We used to use spinning hard disks, and we were so often waiting for them to open programs.
Indeed, SSDs are much faster than HDDs. That is part (but not all) of how computers have gotten faster. And yet we still wait just as long or longer for common applications to start up.
> the iPhone and iPad actually comes to mind as devices that perform well and are practically always snappy
Terribly written programs are perfectly common on iP* and can certainly be slow. But you're right, having a high-end device does make the bloat much less noticeable.
I recently ported some very simple combinatorial code from Python to Rust. I was expecting around 100x speed up. I was surprised when the code ended up running only 14 times faster.
Parts of Python are implemented in C, for example the standard for loop using `range`. So comparing Python's performance with other languages using just one simple benchmark can lead to unexpected results, depending on what the program does.
Did you use python specific functions like list comprehensions, or "classic" for/while loops? Because I've found the former to be surprisingly fast, while naive for loops are incredibly slow in python.
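If you want to check this on your own machine, a rough comparison could look like the sketch below. The function names and the workload are made up, and the timings vary by machine and interpreter version, so no numbers are claimed here.

```python
import timeit

N = 100_000


def squares_loop(n):
    # "Classic" loop: a method lookup and call to append on every iteration.
    out = []
    for i in range(n):
        out.append(i * i)
    return out


def squares_comprehension(n):
    # Same result built by a list comprehension.
    return [i * i for i in range(n)]


loop_time = timeit.timeit(lambda: squares_loop(N), number=20)
comp_time = timeit.timeit(lambda: squares_comprehension(N), number=20)

print(f"loop: {loop_time:.3f}s  comprehension: {comp_time:.3f}s")
```

Both functions return identical lists, so any difference you measure comes from how the interpreter executes the two forms.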
While I find this comment section fascinating and will read it top to bottom, I can't help but make an observation that such articles often comply with:
+-------------------------------------------------+
| People really do love Python to death, do they? |
+-------------------------------------------------+
I find that extremely weird. As a bystander who never relied on Python for anything important, and as a person who regularly had to wrestle with it and tried to use it several times, the language is non-intuitive in terms of syntax, ecosystem, package management, different language version management, probably 10+ ways to install dependencies by now, subpar standard library and an absolute cosmic-wide Wild West state of things in general. Not to mention people keep making command-line tools with it, ignoring the fact that it often takes 0.3 seconds to even boot.
Why a programmer who wants semi-predictable productivity would choose Python today (or even 10 years ago) remains a mystery to me. (Example: I don't like Go that much but it seems to do everything that Python does, and better.)
Can somebody chime in and give me something better than "I got taught Python in university and never moved on since" or "it pays the bills and I don't want to learn more"?
And please don't give me the fabled "Python is good, you are just biased" crap. Python is, technically and factually and objectively, not that good at all. There are languages out there that do everything that it does much better, and some are pretty popular too (Go, Nim).
I suppose it's the well-trodden path on integrating with pandas and numpy?
Or is it a collective delusion and a self-feeding cycle of "we only ever hired for Python" from companies and "professors teach Python because it's all they know" from universities? Perhaps this is the most plausible explanation -- inertia. Maybe people just want to believe because they are scared they have to learn something else.
I am interested in what people think about why Python is popular despite a lot of objective evidence that as a tech it's not impressive at all.
I've started using micropython to interact with embedded arm chips, it's a revelation to interact with hardware through a REPL instead of compiling, transferring, resetting, and writing print statements to serial...
This talk by the creator of micropython [0] gives his reasoning for why to implement python on microcontrollers despite it being hundreds of times slower than C. Starts @ 3:00
- it has nice features like list comprehension, generators, and good exception handling
- it has a big, friendly, helpful community with lots of online learning resources
- it has a shallow but long learning curve. It's easy to get started as a beginner, but you never get bored of the language, there's always more advanced features to learn.
- it has native bitwise operations
- has good distinction between ints and floats, and ints are arbitrary precision, so you're not restricted to the range of doubles or even long longs. (I'll add that built in complex numbers is a plus)
- compiled language, so it can be optimized to improve performance
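A few of the features from the list above, in one short sketch. This is standard Python; whether every line runs unchanged on a particular MicroPython port is an assumption, since ports can be built with some features disabled. All names are for illustration only.

```python
# List comprehension
evens = [n for n in range(10) if n % 2 == 0]


# Generator: values are produced lazily, one at a time
def countdown(n):
    while n > 0:
        yield n
        n -= 1


# Native bitwise operations, e.g. setting and testing a flag bit
reg = 0b0000_0001
reg |= 1 << 3                       # set bit 3
bit3_set = bool(reg & (1 << 3))

# Integers grow as needed instead of overflowing
big = 2 ** 100

# Built-in complex numbers
z = (1 + 2j) * (3 - 1j)

# Exception handling
try:
    1 // 0
except ZeroDivisionError as exc:
    error_name = type(exc).__name__

print(evens)                # [0, 2, 4, 6, 8]
print(list(countdown(3)))   # [3, 2, 1]
print(reg, bit3_set)        # 9 True
print(big.bit_length())     # 101
print(z)                    # (5+5j)
print(error_name)           # ZeroDivisionError
```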
Ha, that actually looks pretty cool and good tech. I'd wonder if that can even be called Python anymore but still, it looks to be very useful. Thanks for the tidbit!
Completely agree that objectively, Python has really bad underlying tech.
Emotionally though (once you have the environment set up), it’s just such a breeze to write it. It’s like executable pseudo code with zero boilerplate. You can focus purely on the algorithms and business logic. Compared to many other languages the line count is often 50-80%, even if you include type annotations! This doesn’t only apply to plain imperative code, using the dynamic features you can also turn it into your own DSL where needed.
Then there is obviously the huge eco-system around it, there is not a single service, file format or database that doesn’t have a good python library for it. While Go might have equally wide library choices, I wouldn’t be so sure about Nim. Go, on the other hand, has a lot of other wtfs even though it provides a lot of good fresh tech.
Would I use it for a big service with potentially lots of performance requirements? No. But there is no doubt why it’s so popular. For many applications where the outcome of the program is more important than the performance or environment, like glue code, simple intranet applications or exploratory coding, it is still the perfect choice. You also have to consider what it is replacing, often the alternative would be even worse; bash-scripts, Excel or Matlab.
Another way to put it is that it’s a very good Swiss Army knife that is good at everything but not best at anything.
> Emotionally though (once you have the environment set up), it’s just such a breeze to write it.
That's no different than the JS fans saying "hey, if you have the exactly right versions combo of Node.JS, npm and webpack, everything works fine!". I almost never manage to install anything via `pip install`. I just cross my fingers and am not at all surprised when it doesn't work. Not for a lack of trying, mind you, I've fiddled with separating Python 2 and 3 environments, tried a few package managers etc.
If I can't install Python + a package manager and I can't then install any random tool that proudly says "we're just a `pip install` away" in their GitHub repo then to me that's a failed ecosystem relying on inertia and Stockholm Syndrome. ¯\_(ツ)_/¯
> It’s like executable pseudo code with zero boilerplate.
Is it? I've seen some pretty horrible constructor boilerplate that managed to look almost as alien as Perl. Also who decided that methods starting with underscores is a solid convention lol. Of course bad programmers will manage to butcher every language, think we all agree on that, but as I get older I appreciate languages that don't allow bad coding in the first place (or realistically, limit it as much as they can).
> You can focus purely on the algorithms and business logic.
Not my observation. I know a few data scientists and a few people who just learned Python out of desperation related to how mundane and repetitive their jobs are. They all had to wrestle with OS-specific quirks -- that a stdlib absolutely should abstract away -- the same package management woes as me, and libraries they need subtly breaking because apparently they only work well in Python 3.6 and not after (one random example).
I get the sentiment and I wanted to believe that such a language existed a while ago but I am just not seeing it. Foot-guns and all, you know the old sayings.
> For many applications where the outcome of the program is more important than the performance or environment, like glue code, simple intranet applications or exploratory coding, it is still the perfect choice.
I am still looking for such a tech for myself as well because my main choices of languages aren't a super good fit for tinkering. But so far I haven't seen Python filling that niche. Too much minutiae to handle.
> You also have to consider what it is replacing, often the alternative would be even worse; bash-scripts, Excel or Matlab.
Yeah that's a real problem, absolutely. For now I always got away by just learning an insane amount of CLI / TUI tools and always assembling them together just enough with bash/zsh scripting to get my task done... but that approach has deficiencies as well, and will not work forever.
> Another way to put it is that it’s a very good Swiss Army knife that is good at everything but not best at anything.
To add to the analogy: it also starts getting rusty and doesn't work as reliably as before, but grandpa will let anyone replace it only over his dead body.
(And I semi-mockingly call people "grandpas" as a 42 year old who is supposed to be conservative but I find it extremely amusing how many 28-year olds I've met that are more conservative than me and my 69-year old mother.)
How many of the better languages have equal or better readability than Python? IMO, that's the #1 reason for its continued popularity. Python is not full of parentheses, like a lisp, nor is it full of semicolons and brackets for bookkeeping, like most other C-style languages.
If that's a blocker to productivity for anyone, I'd seriously question their programming prowess. Especially nowadays with LSP auto-formatting, snippet management, and such.
Syntax is a subjective preference. I didn't like that fact for the longest time but it's a fact regardless.
Doubtful that moving from vectorised pandas & numpy to vanilla python is faster unless the dataset is small (sub 1k values) or you haven't been mindful of access patterns (that is, you're bad at pandas & numpy)
Maybe it's been stated already by someone else here but I really hope that CO2 pricing on the major Cloud platforms will help with this.
It boils down to resources used (like energy) and waste/CO2 generated.
Software/System Developers using 'good enough' stacks/solutions are externalising costs for their own benefit.
Making those externalities transparent will drive a lot of the transformation needed.
How are we supposed to optimize coding languages when the underlying hardware architecture keeps changing? I mean, you don't write assembly anymore, you would write in LLVM. Optimization was done because it was required. It will come back when complete commoditization of CPUs occurs. Enforcement of standards and consistent targets allows for high optimizations. Just see what people are able to do with outdated hardware in the demo and homebrew scene for old game consoles! We don't need better computers, but so long as we keep getting them, we will get unoptimized software, which will necessitate better computers. The vicious cycle of consumerism continues.
I am amazed by the discussions below on computer performance vs. software inefficiency: I remember the same discussions and arguments about software running on 8088 vs 80286 vs 80386 vs i486 vs Pentium... and so on.
You could have had those discussions at any time since upgraded computers and microprocessors became compatible with the previous generation (i.e. the x86 and PC lines).
The point is that software efficiency measurement has never changed: it is human patience. The developers and their bosses decide the user can wait a reasonable time for the provided service. It is one-to-five seconds for non-real-time applications, it is often about a target framerate or refresh in 3D or real-time applications... The optimization stops when the target is met with current hardware, no matter how powerful it is.
This measure drives the use of programming languages, libraries, data load... all getting heavier and heavier when more processing power gets available. And that will probably never change.
Not sure about it? Just open your browser debugger on the Network tab and load the Google homepage (a field, a logo and 2 buttons). I just did: 2.2 MB, loaded in 2 seconds. It is sized for current hardware and 100 Mbps fiber, not for the actually provided service!
Bingo. It's not that software engineers are stupid, it's that they don't 'see' when they do something stupid and don't have a good mental model because of that lack of sight. Everyone figures out quickly to efficiently clean out their garage or other repetitive chores because it's personally painful to do it poorly and it's right in front of your nose. If only computers were more transparent and/or people learned and used profilers daily...
This afternoon, discussing with my boss, why issuing two x 64 byte loads per cycle is pushing it; to the point where l1 says no..
400GB of l1 bandwidth is all we have..
Is all we have..
I remember when we could move maybe 50KB/s.. And that was more than enough..
You also have to optimize for the constraints you have. If you're like me then development time is expensive. Is optimizing a function really the best use of that time? Sometimes yes, often no.
Using Pandas in production might make sense if your production system only has a few users. Who cares if 3 people have to wait 20 minutes 4 times a year? But if you're public facing and speed equals user retention then no way can you be that slow.
> If you're like me then development time is expensive. Is optimizing a function really the best use of that time? Sometimes yes, often no.
Almost always yes, because software is almost always used many more times than it is written. Even if you doubled your dev time to only get a 5% increase of speed at runtime, that's usually worth it!
(Of course, capitalism is really bad at dealing with externalities and it makes our society that much worse. But that's an argument against capitalism, not an argument against optimization.)
Nitpick: software optimization isn’t an example of an externality. Externalities are costs/benefits that accrue to parties not involved in a transaction.
> ex·ter·nal·i·ty: a side effect or consequence of an industrial or commercial activity that affects other parties without this being reflected in the cost of the goods or services involved
The buyer is an "other part[y]" from the seller's (edit: or better yet, developer, who might just be contracted by the ultimate seller...) perspective, and performance is basically impossible to quantify, and therefore to price.
Moreover, even if you want to limit externalities to being completely third-party... sure: Pollution. More electrical generation capacity needed.
A less efficient car engine means that the magnitude of the externality (in this case, the externalized cost, you can have externalized benefits as well) is larger.
It basically assumes all maths is finite and defined, then ignores how floating point arithmetic actually works, optimizing based purely on "what the operations suggest should work if we wrote them on paper" (alongside using approximations of certain functions that are super fast, while also being guaranteed inaccurate)
It reorders instructions in ways that are mathematically but not computationally equivalent (as is the nature of FP). This also breaks IEEE compliance.
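To make the "mathematically but not computationally equivalent" point concrete, here is a minimal Python sketch. The values are my own illustration, not output from any particular compiler; the point is only that regrouping a sum is valid on paper but changes the double-precision result, which is the kind of rewrite such a flag permits.

```python
# Floating-point addition is not associative, so regrouping a sum
# (valid on paper) can change the computed result.
a, b, c = 1e16, -1e16, 1.0

# The two large terms cancel exactly first, then 1.0 is added.
left_to_right = (a + b) + c

# Here 1.0 is lost to rounding when added to -1e16, before the cancellation.
regrouped = a + (b + c)

print(left_to_right, regrouped)
```

Both expressions are the same real-number sum, yet they differ by exactly the term that got rounded away.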
Python and Pandas are absolutely excellent until you notice you need performance. I say write everything in Python with Pandas until you notice something take 20 seconds.
Then rewrite it with a more performant language or cython hooks.
Developing features quickly is greatly aided by nice tools like Python and Pandas. And these tools make it easy to drop into something better when needed.
Yep, many (especially younger) programmers don't get the "feel" for how fast things should run and as a result often "optimize" things horribly by either "scaling out" i.e. running things on clusters way larger than the problem justifies or putting queuing in front and dealing with the wait.
The main optimization at that stage seems to be preallocating the weights. I don't know pandas but such a thing would have been possible without dropping any of the linalg libraries I do know how to use.
I doubt the author's C++ implementations beat BLAS/LAPACK, but since they're not shown I can only guess.
I've done stuff like this before but the tooling is really no fun, somewhere between 2 and 3 I'd just write it all in C++.
Changing the interface just to get parallelism out seems not great - give it to the user for free if the array is long enough - but maybe it was more reasonable for the non-trivial real problem.
I used to be a hardcore functional programming weenie, but over time I realized that to do high-performance, systems programming in an FP language means writing a bunch of non-idiomatic code, to the point that it's worth considering C (or C++ for STL only, but not that OOP stuff) instead unless you have a good reason (which you might) for a nonstandard language.
The problem isn't Python itself. Python has come a long way from where it started. The problem is people using Python for modules where they actually end up needing, say, manual memory management or heterogeneous high performance (e.g. Monte Carlo algorithms).
No, I think it is fair to call out mediocrity, even when it tries to pull the "disclaim exactly the set of specific applications it gets called out on" trick.
Sure, pandas often beats raw python by a bit, but come on, there's so much mediocrity between the two that I doubt they even had to cheat to find a situation the other way around.
People create accidentally quadratic code all the time. It's even easier in pandas because the feature set is so huge and finding the right way to do it takes some experience (see stackoverflow for a lot of plain loops over pandas dataframes).
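A rough illustration of "accidentally quadratic", in plain Python rather than pandas (the function names and data are made up for the example). The same shape of mistake shows up when a loop over rows does a linear search on every iteration:

```python
def common_ids_quadratic(left, right):
    # `x in right` scans the whole list on every iteration:
    # O(len(left) * len(right)) comparisons in total.
    return [x for x in left if x in right]


def common_ids_linear(left, right):
    # Build a set once; each membership test is then O(1) on average,
    # so the whole thing is roughly O(len(left) + len(right)).
    right_set = set(right)
    return [x for x in left if x in right_set]


left = list(range(0, 2_000, 2))   # even numbers
right = list(range(0, 2_000, 3))  # multiples of three

# Same answer either way; only the amount of work differs.
assert common_ids_quadratic(left, right) == common_ids_linear(left, right)
```

With a few thousand items nobody notices; with a few million, the first version is the one that "mysteriously" takes all afternoon.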
You would think that, wouldn't you? But every time I've worked on a Python code base I have torn out Pandas and replaced it with simple procedural code, getting at least an order of magnitude speedup.
Pandas is spectacularly slow. I don't understand how or why, but it is.
Slow Code Conjecture: inefficient code slows down computers incrementally such that any increase in computer power is offset by slower code.
This is for normal computer tasks-- browser, desktop applications, UI. The exception to this seems to be tasks that were previously bottlenecked by HDD speeds, which have been much improved by solid state disks.
It amazes me, for example, that keeping a dozen miscellaneous tabs open in Chrome will eat roughly the same amount of idling CPU time as a dozen tabs did a decade ago, while RAM usage is 5-10x higher.
And if you wrote your instructions in assembly, it would be even faster!
/s
Sorry for the rude sarcasm, but isn't this a post truly just about the efficiency pitfalls of Python? (or any language / framework choice for that matter)
Of course modern computers are lightning fast. The overhead of every language, framework, and tool will add significant additional compute however, reducing this lightning speed more and more with each complex abstraction level.
I don't know, I guess I'm just surprised this post is so popular, this stuff seems quite obvious.
I wonder if eventually the environmental impact is going to be a required consideration when building software.
For instance running unoptimised code can eat a lot of energy unnecessarily, which has an impact on carbon footprint.
Do you think we are going to see regulation in this area akin to car emission bands?
Even to an extent that some algorithms would be illegal to use when there are more optimal ways to perform a task? Like using BubbleSort when QuickSort would perform much better.
The return value on the function in C++ is of the wrong type :)
I agree though. I used these tricks a lot in scientific computing. Go to the world outside and people are just unaware. With that said - there is a cost to introducing those tricks. Either in needing your team to learn new tools and techniques, maintaining the build process across different operating systems, etc. - Python extension modules on Windows, for example, are still a PITA if you’re not able to use Conda.
When I have to explain the speed of a processor to a neophyte I always begin by avoiding using GHz unit which has the weakness of hiding the magnitude of the number, so I explain things in terms of billions of cycles each second.
As an example, with an ILP of ~4 instructions/cycle at 5GHz we get 20 billion instructions executed each second in a single core. This number is not really tangible but it shocks.
This is exactly what I was dealing with last year: a particular customer came to a meeting with the idea that developers have to be aware of making the code inclusive and sustainable... We told them that we must prioritize performance and the literal result of the operation (a transaction development from an integration).
Nothing really happened in the end, but it's a funny story in the office.
I have written some data wrangling software in pure C++. I would like to benchmark it against Pandas to see how the speed compares. Does anyone know if there is a good set of Pandas benchmarks that I can create a comparison to? Even better if it has an R comparison.
Believe me I do. This is why my backends are single file native C++ with no Docker/VM/etc. The performance on decent hardware (dedicated servers rented from OVH/Hetzner/Selfhost) is nothing short of amazing.
The fact that now AWS CPU cost is a constant consideration in software development is making developers use better algorithms and languages, a trend that seems the opposite of the 2010s.
If anything this is a testament to how slow python can be, and most importantly how easily it pushes you to write miserably unoptimized code.
It could be a bit overkill, but whenever I'm writing code on top of optimizing data structures and memory allocations I always try to minimize the use of if statements to reduce the possibility of branch prediction errors.
Seeing woefully unoptimized python code being used in a production environment just breaks my heart.
The CPU branch predictor is so many levels down that it will have almost no discernible effect on anything you might call a branch in Python code. Even a statement like "a = 1" likely executes a few tens if not a few hundred branches.
That is not to say aiming for generally unbranchy code is not a good thing - that often implies well designed code and well chosen data structures anyway.
Use C or C++ or Rust or even Java and you don't have to worry about any of this. You can just write the obvious thing with your normal set of tools and it will be good enough.
That's really cool but I somewhat resent the use of percentages here. Just use a straight factor or even better just the order of magnitude. In this case it's four orders of magnitude of an improvement.
Something all architecture astronauts deploying microservices on Kubernetes should try is benchmarking the latency of function calls.
E.g.: call a "ping" function that does no computation using different styles.
In-process function call.
In-process virtual ("abstract") function.
Cross-process RPC call in the same operating system.
Cross-VM call on the same box (2 VMs on the same host).
Remote call across a network switch.
Remote call across a firewall and a load balancer.
Remote call across the above, but with HTTPS and JSON encoding.
Same as above, but across Availability Zones.
In my tests these scenarios have a performance range of about 1 million from the fastest to slowest. Languages like C++ and Rust will inline most local calls, but even when that's not possible overhead is typically less than 10 CPU clocks, or about 3 nanoseconds. Remote calls in the typical case start at around 1.5 milliseconds and HTTPS+JSON and intermediate hops like firewalls or layer-7 load balancers can blow this out to 3+ milliseconds surprisingly easily.
To put it another way, a synchronous/sequential stream of remote RPC calls in the typical case can only provide about 300-600 calls per second to a function that does nothing. Performance only goes downhill from here if the function does more work, or calls other remote functions.
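A rough sketch of the arithmetic, for anyone who wants to poke at it. It times a do-nothing in-process call with `timeit` and then computes the ceiling for a sequential stream of calls from the latencies quoted above (1.5 ms and 3 ms). The function names are mine, and the in-process number includes Python interpreter overhead, so it will be well above the "10 CPU clocks" figure for compiled code.

```python
import timeit


def ping():
    """A function that does no computation."""
    return None


def max_sequential_calls_per_second(latency_seconds):
    # Each call must finish before the next one starts,
    # so the ceiling is simply 1 / latency.
    return 1.0 / latency_seconds


# Measure the average cost of one in-process call.
calls = 1_000_000
in_process_seconds = timeit.timeit(ping, number=calls) / calls
print(f"in-process call: ~{in_process_seconds * 1e9:.0f} ns")

# Latency figures taken from the comment above.
scenarios = [
    ("remote call, 1.5 ms", 0.0015),
    ("HTTPS+JSON with extra hops, 3 ms", 0.003),
]
for label, latency in scenarios:
    rate = max_sequential_calls_per_second(latency)
    print(f"{label}: at most {rate:.0f} calls/s")
```

Actually reproducing the cross-process, cross-VM and cross-AZ rows needs real infrastructure; this only covers the first row and the back-of-envelope ceiling.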
Yet, every enterprise architecture you will ever see, without exception has layers and layers, hop upon hop, and everything is HTTPS and JSON as far as the eye can see.
I see K8s architectures growing side-cars, envoys, and proxies like mushrooms, and then having all of that go across external L7 proxies ("ingress"), multiple firewall hops, web application firewalls, etc...
I think folks often make trade offs with their working requirements.
If you provide an end result response from your web app to a user's browser in 50ms-100ms (before external latency) then things like 200 microseconds vs 4 milliseconds have less of a meaningful difference. If your app makes a couple of internal service calls (over HTTP inside of the same Kubernetes cluster) it's not breaking the bank in terms of performance even if you're using "slow" frameworks like Rails and get a few million requests a month.
I'm not defending microservices and using Kubernetes for everything but I could see how people don't end up choosing raw performance over everything. Personally my preference is to keep things as a monolith until you can't and in a lot of cases the time never comes to break it up for a large class of web apps. I also really like the idea of getting performance wins when I can (creating good indexes, caching as needed, going the extra mile to ensure a hot code path is efficient, generally avoiding slow things when I have a hunch it'll be slow, etc.) but I wouldn't choose a different language based only on execution speed for most of the web apps I build.
Microservices are great when your "app" is actually 500 different apps - and the user could be none the wiser that they are talking to 500 different one-man applications. You probably need a few helper services in this world for common data access, authorization, sending notifications etc. in this environment - but these things might also be standard libraries.
When microservices go awry, it's often because one "service" has been broken up to meet some arbitrary org structure that will change in 6 months. In these cases the extra overhead of the microservices becomes additive to the user, and hitting latency budgets becomes exceptionally difficult. Costs increase, and in 12 months the team decries the nonsensical service boundaries.
Traditional enterprise companies are agglomerations of IT systems from a sprawling network of acquisitions, subsidiaries, and partners. They collect fad languages, architectures, and proprietary ecosystems from across decades of computing history. And then try to somehow make them all play with each other.
At least in our world we have the source code to all the services. They have explicit and intentional APIs. They're constructed from a small set of frameworks and speak an even smaller number of protocols. Our enterprise brothers have none of that. Screen scraping, retrofitting TCP/IP stacks onto things that never had them, patching binaries whose source is long gone, etc.
In my case, microservices are often asynchronous messaging applications serving hundreds, thousands, or _maybe_ tens of thousands of transactions per day. Message processing time matters much less to me than reliability and separation of concerns, generally. Kubernetes is great for this.
It's a different world if I have to deal with synchronous user response time.
> in the typical case can only provide about 300-600 calls per second to a function that does nothing
This is a provocative framing but I'm not sure it makes sense. Functions aren't resources; they don't have throughput or utilization. It would be bad if a core could only call the function 300-600 times per second, but that is why we have async programming models, lightweight threads, etc. So that the core can do other stuff during the waiting-on-IO slices of the timeline. Which, as you mention, dominate.
It would also be bad if a user had to wait on 300-600 sequential RPCs to get back a single request, but like... don't do that. Remote endpoints are not for use in tight loops. There are cases where pathological architectures lead to ridiculous fanout/amplification, but even then we are usually talking about parallel tasks.
There is overhead to doing things remotely vs. locally. But the waiting isn't the interesting part. It's serialization, deserialization, copying, tracking which tasks are waiting, etc. A lot of performance work goes on around these topics! Compact and efficient binary wire protocols, zero-copy network stacks, epoll, green threads, async function coloring schemes, etc. The upshot of this work is also, as is typical in web/enterprise backend world, not so much about the latency of individual requests (those are usually simple) but about the number of concurrent requests/users you can serve from a given hardware footprint. That is normally what we're optimizing for. It's a different set of constraints vs. few but individually expensive computations. So of course the solution space looks different too.
Being fair, for many of the things that it is worth using a microservice for, you should already have some sort of dominant factor to the call that would more than justify the added latency of the remote call. Be it a database read/write or some other heavy calculation.
Granted, this is exacerbated when architectures don't make a good division between control/compute/data planes.
Control plane, which is exposed to users, should almost certainly be limited to a single (or handful, at most) microservice calls. Preferably to the fastest storage mechanism that you have, such that what latency it does add is minimized entirely.
Is it though? This goes back to my point of architects and developers having internalised thoroughly outdated rules of thumb that are now wrong by factors of tens of thousands or more.
This is not a simple problem to solve efficiently using traditional RDBMS query APIs because they're all rooted in 1980s thinking of: "The network is fast, and this is used by human staff doing manual data entry into a GUI form."
Let's say you're writing an "app" that's given a list of, say, 10K numbers to check. You have a database table in your RDBMS of choice with a column of "banned phone numbers". Let's say it is 100 million numbers, so too expensive to download in bulk.
How would you do this lookup?
Most programmers would say, it's an easy problem to solve: Make sure there is a unique index on that column in the database, and then for each row in the input run a lookup such as:
SELECT 1 FROM BadNumbers WHERE PhoneNumber = @numbertocheck
So simple. So fast!
Okay, that's 10K round-trips on the network, almost certainly crossing a firewall or two in the process. Now it'll take minimum of 1 millisecond per call, more like 2ms[1], so that's at least 20 seconds of wait time for the user to process mere kilobytes of data.
Isn't that just sad? A chunk of a minute per 100KB of data.
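The back-of-envelope numbers, as a tiny Python sketch (the helper name is made up; the latencies are the 1-2 ms quoted above plus the 15 ms failover case from the footnote):

```python
def sequential_wait_seconds(round_trips, latency_ms):
    # Each lookup waits for the previous one to return,
    # so the per-call latencies simply add up.
    return round_trips * latency_ms / 1000


lookups = 10_000
for latency_ms in (1, 2, 15):
    wait = sequential_wait_seconds(lookups, latency_ms)
    print(f"{lookups} lookups at {latency_ms} ms each: {wait:.0f} s")
```

None of that time is spent doing work; it is all waiting on the wire.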
Like I'm saying, nobody has internalised just how thoroughly Wrong everything is top-to-bottom. The whole concept of "send a query row-by-row and sit there and wait" is outdated, but it's the default. It's the default in every programming language. In every database client. In every ORM. In every utility, and script, sample, and tutorial. It's woven throughout the collective consciousness of the IT world.
The "correct" solution would be for SQL to default to streaming in tables from the client, and every such lookup should be a streaming join. So then the 100KB would take about 5 milliseconds to send, join, and come back, with results coming back before the last row is even sent.
PS: You can approximate this using table-valued parameters in some RDBMS systems, but they generally won't start streaming back results until all of the input has arrived. Similarly, you can encode your table as JSON and decode it on the other end, but that's even slower and... disgusting. The Microsoft .NET Framework has a SqlBulkCopy class but it has all sorts of limitations and is fiddly to use. But that's my point. What should be default case is being treated as the special case because decades ago it was.
[1] If you're lucky. But luck is not a strategy. What happens to your "20 seconds is not too slow" app when the database fails over to the paired cloud region? 1-2 ms is now 15 ms and so those 10K round trips will cost two and a half minutes.
While I agree that databases could absolutely be improved to make streaming query results as described better, that isn't a limiting factor here IMO.
I'd tackle that problem by batching my queries to the database into some logical batch size and send them as table valued parameters. If I had 10k phone numbers to check and minimum latency matters, why not batch into queries of 500-1000 values per-query? That cuts down time to first response, while reducing the network roundtrips.
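A minimal sketch of the batching idea in Python. Big caveat: this uses an in-memory SQLite database and a parameterised `IN (...)` list as a stand-in, since SQLite has no table-valued parameters, so it only shows the shape of the approach and not real network latency. `BadNumbers` and `PhoneNumber` come from the query above; `chunked`, `find_banned` and the toy data are made up for illustration.

```python
import sqlite3


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def find_banned(conn, numbers, batch_size=500):
    """Return (set of banned numbers found, number of queries issued)."""
    banned, queries = set(), 0
    for batch in chunked(numbers, batch_size):
        # One query (one round trip on a real network) per batch,
        # instead of one per phone number.
        placeholders = ",".join("?" * len(batch))
        rows = conn.execute(
            "SELECT PhoneNumber FROM BadNumbers "
            f"WHERE PhoneNumber IN ({placeholders})",
            batch,
        )
        banned.update(row[0] for row in rows)
        queries += 1
    return banned, queries


# Toy stand-in for the real table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE BadNumbers (PhoneNumber TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO BadNumbers VALUES (?)",
    [(f"555-{n:04d}",) for n in range(0, 10_000, 7)],
)

to_check = [f"555-{n:04d}" for n in range(10_000)]
banned, queries = find_banned(conn, to_check)
print(len(banned), "banned numbers found in", queries, "queries")
```

With 10K inputs and a batch size of 500, that is 20 round trips instead of 10,000; the batch size is the knob that trades time to first result against total round trips.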
The issue with taking this out of the database is that you lose consistency. I don't know about your industry, but I don't think mine would be terribly happy if I was using stale data to validate my Do Not Call/Email list. Now there are some situations where you can just update your list of numbers nightly/weekly/monthly, etc. If you don't need any concurrency or other guarantees, you might as well save the time/resources on your DB server.
If I were to rebuild that Python script application today, to try and match 100,000 records against 10 million, and do it as a database-driven micro-architecture solution, I’m not sure I could come up with one that returns results faster, even using up probably a million times more clock cycles.
Yes, that is a good explainer on the horrors of single value lookups to a database. It isn't the only way to do that though, as explained in my other post.
I absolutely agree that a DB (even an extremely efficient one) is going to use many more clock cycles to return results there than a local data structure + application. No questions asked.
But how is that list kept up to date? If a user wants to be placed on the list, do you do that in real time to your local data structure? Do you wait for a batch process to sync that data structure with your source of truth?
I'm just saying that a simple program like that will be faster because it lacks a lot of the features that people would consider necessary in today's world.
The database is an amazing general tool that can be used to tackle whole classes of problems that used to require specialized solutions.
I cringe every time a senior developer thinks he’s more clever by using an @Annotation, a JoinPoint, and Spring’s Aspect-Oriented Programming to solve issues. Not only does it appear in the stack trace as 10 method calls, but we’ve also forbidden GOTO, and yet senior developers keep implementing the @COMEFROM [1] instruction. Both slow and impossible to debug.
Okay, but the proportion of overhead is what matters here.
If you have a CPU or memory size bottleneck and a parallelizable workload, it makes plenty of sense to split the work across multiple machines and coordinate them over the network. If your job was going to take 20 minutes for 1 machine to do and you can fan it out to 100 machines and accomplish the same in 12 seconds per machine plus an extra fraction of a second in communication overhead, that’s a huge win in total latency. The overhead doesn’t matter.
If you have a trivial workload that can be handled quickly on one machine, but you unnecessarily add extra network hops and drop your potential throughput by 100x, then it’s a huge loss.
A ping server that makes a bunch of international RPC calls before replying is the worst case scenario for overhead.
The phrase "goes downhill" means things only get worse. The OP is suggesting that since the ping is already slow, a function that does something is going to be much worse.
But for most systems you don't want performance to get worse. You want to make performance get better.
The ping is just being used to measure the foundations of the system, and the OP is pointing out that you shouldn't expect to see a 'high performing' system when the ping has shown the foundations are broken.
If you fix the ping, then the system is automatically fixed.
For years I stuck with MATE, Xfce4, LXQT, etc. to get optimal performance on old hardware but nothing can top a tiling window manager.
With Nixos I switch between Gnome 40 (I do like the Gnome workflow) and i3 w/ some Xfce4 packages, but lately on my older machine the performance of Gnome (especially while running Firefox) is so sluggish in comparison that I may have switched back permanently now.
where I work, every frontend dev has a 64gb ram/2tb ssd/multicore laptop to develop web pages...everything is lightning fast apparently!...so they never do performance engineering of any kind