Popcount CPU instruction (2019) (vaibhavsagar.com)



On the PS3 SPU cores, the best "does everything" instruction was the "rotate and mask" family, which was the underlying implementation of all rotates, shifts, and bitfield extraction instructions. That architecture is still my favourite years on - every register was vectorised by default, and the instruction set was a minimalistic one that could achieve a lot of unusual operations that were incredibly useful for graphics or signal processing work. Often you couldn't find an exact instruction to do what you wanted, but if you studied the documentation long enough, you'd find one that generalised it and actually did do what you wanted, as well as gracefully handling loads of other edge cases you hadn't thought of - for instance the various instructions to create shuffle masks, so that lots of byte/word/long re-ordering operations could be done with the single shuffle instruction, which could in turn be used to emulate unaligned memory access using aligned-only memory accesses.

Not exactly weird, but on the PowerPC side, "eieio" is my favourite name for an instruction on any architecture - without fail, every time I come across it, I start humming Old McDonald!


> "rotate and mask" series of instructions

Sounds a lot like the PowerPC's rlwinm instruction.

The PowerPC 600 series, part 5: Rotates and shifts

https://devblogs.microsoft.com/oldnewthing/20180810-00/?p=99...


The SPU is significantly derived from PowerPC, so that's quite expected.


There's also a gnu error code called EIEIO: https://www.gnu.org/software/libc/manual/html_node/Error-Cod...


Macro: int ED

    “?.” The experienced user will know what is wrong. 
Macro: int EGREGIOUS

    “You really blew it this time.” You did what? 
Macro: int EIEIO

    “Computer bought the farm.” Go home and have a glass of warm, dairy-fresh milk. 
Macro: int EGRATUITOUS

    “Gratuitous error.” This error code has no purpose.


And apparently also

> EGREGIOUS

and

> EGRATUITOUS

and... lots of stuff ^^


There are similar instructions on PowerPC (rlwinm) and bfi/ubfx instructions on ARMv8.


Not a whole lot of asm experience here - what does it mean for a register to be vectorized?


SIMD. Instead of having one value in one register, you have several values per register. (usually a small power of 2. SSE is 4 floats per register. AVX512 is 16 floats per register.)


And, when you perform an operation on a register, like an add, the operation works on all of the values in that register simultaneously in a single instruction.
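
A rough illustration in C using SSE intrinsics (x86 rather than the SPU, but the idea is the same): a single instruction adds four floats at once.

  #include <immintrin.h>

  /* one _mm_add_ps adds all four float lanes in a single instruction */
  __m128 add4(__m128 a, __m128 b) {
      return _mm_add_ps(a, b);    /* {a0+b0, a1+b1, a2+b2, a3+b3} */
  }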


Continuing the tradition of ridiculously-named instructions, when PowerPC went 64-bit, it got a rotate-and-mask instruction named `rldicl` - Rotate Left Doubleword Immediate then Clear Left.


Do you happen to have a link to said docs?


It looks like my memory is playing tricks and the combined instruction that does everything was just on the PPU (the PowerPC part) and not the SPU.

But the SPU ISA docs are available in many places; I just googled and found this mirror: http://www.ece.stonybrook.edu/~midor/ESE545/SPU_ISA_v12.pdf - the various shifts start on page 117, but they're simpler than I remember - shift left in various forms, rotate-and-mask (i.e. shift right), and rotate.

But, even looking at the instructions, it's not immediately obvious that every instruction is operating on a 128-bit vector - e.g. shl operates on 4 32-bit fields at once, and shlh operates on 8 16-bit fields at once, so a single instruction can even shift each field by a different amount. Almost all instructions operate this way; only a few use just the first field in a vector, e.g. conditional branches (brz, brnz, brhz, brhnz). The other rotate instructions are interesting - you can do a bitwise rotate across the entire 128-bit vector, rotate by bytes (very useful for memory accesses), and then there's the wonder instruction shufb (shuffle bytes, see page 116), which allows each byte of the result to be pulled from a specific byte in one of 2 vectors, or a constant (0, 0x80, 0xFF), which lets you do all manner of things, for example byte-order swapping or merging different values together.
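
For readers more used to x86: the closest widely-available analogue to shufb is SSSE3's pshufb. A sketch of a per-lane byte-order swap with it (my own example, not SPU code):

  #include <tmmintrin.h>

  /* each byte of the control vector names the source byte for that result byte */
  __m128i bswap32_lanes(__m128i v) {
      const __m128i ctrl = _mm_setr_epi8(3, 2, 1, 0, 7, 6, 5, 4,
                                         11, 10, 9, 8, 15, 14, 13, 12);
      return _mm_shuffle_epi8(v, ctrl);
  }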

So, you have this really powerful instruction set that's vectorised on almost every instruction, and then you have a massive 128 registers per core coupled with 32KB local cache memory which is single cycle access. It's an absolute joy to program for on anything which will fit into the 32KB local cache (which includes your program). The only downside is that anything that doesn't fit into 32KB requires you to start a DMA request to copy a chunk to or from main memory, and wait for it to complete and/or continue doing other work in the background. With suitable workloads operating on 10KB or less, many people used this in a true double-buffered fashion, so the next set of data was transferring while you were processing the previous set, but this made lots of tasks quite complicated and many developers found it too hard to make good use of the SPU, and so it was often sadly under-utilised. One interesting effect though, is that DMA'ing your buffers took a similar time to a cache miss on the PowerPC cores, so actually even a naive approach that always triggered a small DMA from main memory in parallel with other operations often wouldn't see much latency and might still even be quicker than running the same task on the PowerPC cores where there was less opportunity for doing other work during a cache miss.

Another interesting feature of this CPU is that you can do pre-emptive branch prediction so that the instruction cache is already full when a branch is taken, meaning that a branch can execute in a single cycle. Coupled with indirect branches (i.e. using a register value), you can give the CPU a branch hint in advance of the end of a loop as to whether it'll terminate or not, and also avoid stalls. In some cases this could outperform the x86 branch prediction that was common at the time, which was just maintaining a cache of the last target of recent branches.

Overall, SPU programming was tough to master, but once you got something working the performance was phenomenal. I think that's why I liked it so much - because the sense of accomplishment was so tangible once you'd battled through it and emerged the other side with some amazing code. But nowadays, the power it provided, that was unheard of at the time (7 cores at 3GHz), isn't anything special over a mid-range i9.


> So, you have this really powerful instruction set that's vectorised on almost every instruction, and then you have a massive 128 registers per core coupled with 32KB local cache memory which is single cycle access

SPU local store is 256KB, not 32KB. It's not single cycle access either, it's multiple cycle latency (6, if I recall correctly); it's still wicked fast though, it's about the same as L1 cache access on the PPU.


Hmmm, just checked and you're right! I've not touched the SPU for well over a decade now, and it appears I've forgotten a lot more than I realised.

You're right about the latency too, but because it was a pipelined architecture all instructions had some latency, so while 6 sounds high, it wasn't really significant. Now I'm thinking about it (memories of re-arranging instructions manually, before I shifted from assembler directly to using GCC and intrinsics and letting the compiler worry about the interleaving), I also realise I'd forgotten the odd/even cycle split where you would pair instructions of different types (roughly ALU and non-ALU) together so they could execute concurrently.



> it has in fact been present in CPU architectures since at least 1961

Actually, it first appeared in an electronic computer in February 1951, in the Ferranti Mark I, which was the first commercially available electronic computer, slightly ahead of Univac I.

"Population count" is a name that was first used in Cray-1 (1976).

In Ferranti Mark I, this instruction was named "sideways add" and this was one of a few instructions that had been included in its instruction set at the request of Alan Turing, so he is the original author of this "weird" instruction.

In CDC 6600 (August 1963), the name of this instruction was “sum of 1’s”.


Indeed, Knuth writes in TAOCP Vol 4A (in section "7.1.3. Bitwise Tricks and Techniques", previously pre-fascicle 1a: https://cs.stanford.edu/~knuth/fasc1a.ps.gz):

> Early machine designers provided fullword bitwise operations in their computers primarily because such instructions could be included in a machine’s repertoire almost for free. Binary logic seemed to be potentially useful, although only a few applications were originally foreseen. […] The Manchester Mark I computer, built at about the same time, included not only bitwise AND, but also OR and XOR. When Alan Turing wrote the first programming manual for the Mark I in 1950, he remarked that bitwise NOT can be obtained by using XOR (denoted ‘≢’) in combination with a row of 1s. R. A. Brooker, who extended Turing’s manual in 1952 when the Mark II computer was being designed, remarked further that OR could be used “to round off a number by forcing 1 into its least significant digit position.” By this time the Mark II, which was to become the prototype of the Ferranti Mercury, had also acquired new instructions for sideways addition and for the position of the most significant 1.

The "sideways add" operation (aka population count, sum of binary digits, number of 1s, …) finds mention in all five (so far) volumes of TAOCP, and he even uses a special notation νx for the sum of the bits of x. It has always been of interest (from later in the same volume):

> The first textbook on programming, The Preparation of Programs for an Electronic Digital Computer by Wilkes, Wheeler, and Gill, second edition (Reading, Mass.: Addison–Wesley, 1957), 155, 191–193, presented an interesting subroutine for sideways addition due to D. B. Gillies and J. C. P. Miller.


In case you are curious, here is the algorithm from that 1957 text book (2nd edition): https://archive.org/details/programsforelect00wilk/page/190/...

I never did sit down and figure out the notation.

It isn't in the 1951 (1st edition), at https://archive.org/details/preparationofpro0000maur


And in 1955 you had Arithmetic Operations in Digital Computers. https://archive.org/details/arithmetic_operations_in_digital...


> this instruction was named "sideways add"

An early instance of what's now more commonly called a "horizontal" operation in the context of SIMD.


which are slow.

If you wanna count the number of occurrences of a certain character in a string, you ultimately have to use popcount on an ordinary register; there's no corresponding SIMD instruction (at least in <= AVX2). Something along the lines of:

    __builtin_popcount(_mm256_movemask_epi8(_mm256_cmpeq_epi8(haystack, needle)));
So, if you wanna avoid horizontal reductions, you'd better sum the result of _mm256_cmpeq_epi8 and effectively keep track of 32 counters of 1-byte integers.

Meaning that you have an outer loop over the string, and a fixed-length inner loop of 255 iterations in which you accumulate counts, and after the inner loop a horizontal sum.

Then you go from 1 horizontal operation per iteration to 1 horizontal operation per 255 iterations.
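
A rough sketch of that pattern (my own illustration, not from the article): accumulate 0/1 byte counters with _mm256_sub_epi8, and widen them with _mm256_sad_epu8 before they can overflow.

  #include <immintrin.h>
  #include <stdint.h>

  /* assumes len is a multiple of 32 and data is 32-byte aligned */
  uint64_t count_byte_avx2(const uint8_t *data, size_t len, uint8_t needle) {
      __m256i needle_vec = _mm256_set1_epi8((char)needle);
      __m256i total = _mm256_setzero_si256();            /* four 64-bit partial sums */
      size_t i = 0;
      while (i < len) {
          __m256i counters = _mm256_setzero_si256();     /* 32 one-byte counters */
          size_t block_end = (len - i > 255 * 32) ? i + 255 * 32 : len;
          for (; i < block_end; i += 32) {
              __m256i chunk = _mm256_load_si256((const __m256i *)(data + i));
              __m256i eq = _mm256_cmpeq_epi8(chunk, needle_vec); /* 0xFF where equal */
              counters = _mm256_sub_epi8(counters, eq);          /* subtracting -1 adds 1 */
          }
          /* one horizontal step per <= 255 iterations: sum bytes into 64-bit lanes */
          total = _mm256_add_epi64(total, _mm256_sad_epu8(counters, _mm256_setzero_si256()));
      }
      uint64_t lanes[4];
      _mm256_storeu_si256((__m256i *)lanes, total);
      return lanes[0] + lanes[1] + lanes[2] + lanes[3];
  }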


With AVX512, you can use masking and the vpopcntb instruction to avoid moving values between GPRs and vector registers until the very end:

  uint64_t count_occurrences(const __m512i* vec_string, size_t num_vecs, char target) {
      __m512i count_vec = _mm512_setzero_si512();
      __m512i target_vec = _mm512_set1_epi8(target);
      __m512i increment_vec = _mm512_set1_epi8(1);
      for (size_t i = 0; i < num_vecs; ++i) {
          __mmask64 cmp_mask = _mm512_cmpeq_epu8_mask(target_vec, vec_string[i]);
          /* 0 or 1 per matching byte; widen to per-lane 64-bit sums before accumulating */
          __m512i ones = _mm512_maskz_popcnt_epi8(cmp_mask, increment_vec);
          count_vec = _mm512_add_epi64(count_vec, _mm512_sad_epu8(ones, _mm512_setzero_si512()));
      }
      return _mm512_reduce_add_epi64(count_vec);
  }


There is actually such an instruction, psadbw (_mm_sad_epu8). It's not any noticeably faster for the scenario you describe since it only affects the outer loop, but it does avoid needing the popcnt instruction since psadbw only requires SSE2.


> at the request of Alan Turing

Well, I can imagine there might actually have been a connection to cryptography then. Not NSA related though.


Ah yes, pop count. Back in my days as a CPU logic designer on refrigerated, walk-in, scientific mainframes, I was working on a vector processor that was intended to be a competitor to Cray-1, but sank due to poor marketing and a collapse in oil prospecting... but I digress. A few of us on the team had come from previous CPUs at other companies, and argued with the architecture committee that we should have a vector pop count instruction in order to get some interest from No Such Agency. Shot down, of course.

When serial No. 1 was in initial benchmarking, two guys named Frank and George with No Last Names, who lived somewhere near Fort Meade, said they would buy at least one if we added vector pop count.


Popcount eventually gets added to every architecture that lacks it, always at great expense. You would think people would have learned by now.

And, no, not typically just because the spooks want it.


Why, then?


Because there are numerous order-of-magnitude optimizations it enables, that are useful far beyond cryptanalysis. People mention spooks because the spooks have often had access to enough money to force the issue.


I've read this story before but can't remember where


I suspect it played out multiple times at different companies.


If you want to learn more about Hamming Codes from a mathematical perspective, I recommend 3Blue1Brown's YouTube series:

> How to send a self-correcting message (Hamming codes) - https://youtu.be/X8jsijhllIA

> Hamming codes part 2, the elegance of it all - https://youtu.be/b3NxrZOu_CE

Relatedly, there's Ben Eater's video, if you want a perspective on low-level computer hardware:

> What is error correction? Hamming codes in hardware - https://youtu.be/h0jloehRKas


At the risk of veering into crackpot territory, this operation is extremely profound and shows up in an absurd number of places.

For example, imagine you like the number three, so you want to make a three letter alphabet (ternary, which has the optimal radix economy amongst integers) — the simplest way to do it is to connect the 3rd row of Pascal’s triangle (1, 2, 1) to the popcount of two bits. And lo, you have trits! Coincidentally, this maps 4D onto 3D in such a way that there are two different spins to the number one: 01 or 10. Hmmm…where have we seen “spin” before?

It’s interesting to entertain the possibility that simple properties of the simplest possible strings of the simplest possible languages could explain all kinds of physical phenomena. And why wouldn’t they?

Also, HAMTs are incredible, so if the only thing you could do with popcount was make HAMTs, it would already be incredible.

Such a seemingly silly and mundane thing, how many hands are raised, how many coins flipped heads, how many voted for this or that candidate, I guess it’s just a tally function, but damn if I didn’t find myself astounded by the usefulness of popcount earlier this year. Not surprising to see it on hacker news. Great article, and timely.

Now, will someone please tell me how popcount is involved in neural coding?

Maybe it’s literally everywhere, even inside our own minds. Neurons vote for the presence of features, tally the votes, you can simplify your data substantially by not giving a bleep about whose hands are raised and instead only care about how many of them you can see.

That’s also why the second law of thermodynamics is, as Arieh titled his book, plain common sense. The more coins you flip, the more ways you can flip the same popcount total (a dim event), even though there might be many different combinations of heads and tails in various possible coin flip sequences (specific event).

So, yeah. Cheers to this one!


Why was there no standard operator for this in C since the 1970s? You got various bit shifts and bitwise logic operators, so why not this? std::popcount has existed in C++ since C++20, but that took a while if this instruction has been in CPUs that long.


Because almost nobody had a computer that supported it when C was created. That article mentions 3 CPU architectures that had the instruction before 1980:

  1961: IBM Stretch
  1964: CDC 6000
  1975: Cray-1
After that, it mentions further support in 2005, with Sparc.

Reading https://en.wikipedia.org/wiki/IBM_7030_Stretch, a grand total of 9 IBM Stretch machines were sold.

In comparison, the CDC 6600 was a wild success, with “100+” sales (https://en.wikipedia.org/wiki/CDC_6600)

The Cray-1 scores similarly with “over 100” (https://en.wikipedia.org/wiki/Cray-1)

C is from 1972 and started life on the PDP-7, and was soon ported to the PDP-11, of which “around 600,000” were sold (https://en.wikipedia.org/wiki/PDP-11; I guess most of these were sold in the late 1970s and early 1980s, but I think it’s a safe bet they were almost immediately more common than those 3 machines)


At least GCC and Clang can recognise most sane implementations of popcount and replace them with the instruction.
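
For example, a loop like this one (Kernighan's trick; whether it collapses to a single popcnt depends on the compiler version and on flags such as -mpopcnt or -march=native) is the kind of pattern that gets recognised:

  unsigned count_bits(unsigned long long x) {
      unsigned n = 0;
      while (x) {
          x &= x - 1;   /* clear the lowest set bit */
          n++;
      }
      return n;
  }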


Maybe because it hasn't been in x86 CPUs that long. Like Intel added it in 2009 or so. And there were intrinsics for it so I'm guessing it wasn't a high priority.


We do have `stdc_popcount()` now with C23


Isn't it stdc_count_ones?


To add to some of the uses of popcount, it is also used in chess engines. Since a chess board has 64 squares, many engines use 64 bit unsigned integers to represent the whole board, mapping each square to a bit, which are called bitboards. Then, each combination of piece type and color is given its own bitboard, for example, a bitboard to represent the positions of white rooks. The squares attacked by a piece can similarly be represented by a bitboard.

This in turn allows for the use of something called magic bitboards, which are essentially compressed lookup tables for the squares to which a rook or a bishop can move for every possible disposition of the board. Having all these possibilities precomputed speeds up move generation significantly.

Popcount is used, for example, to know the amount of pieces of a certain type on the board, the number of squares to which a piece can move, or the amount of pieces attacking a given square.
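
A minimal sketch of the idea (the bitboard layout and names here are my own, using the GCC/Clang builtin):

  #include <stdint.h>

  typedef uint64_t Bitboard;   /* one bit per square: a1 = bit 0 ... h8 = bit 63 */

  /* material count for one piece type, e.g. pass the white-rook bitboard */
  static inline int piece_count(Bitboard pieces) {
      return __builtin_popcountll(pieces);
  }

  /* mobility: number of squares a piece can move to, given its attack bitboard */
  static inline int mobility(Bitboard attacks, Bitboard own_pieces) {
      return __builtin_popcountll(attacks & ~own_pieces);
  }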


> You might be wondering, like I was, if there’s more to this instruction, but that’s all it does! This doesn’t seem very useful, right?

It seems very useful. I've used it for several algorithms including DNA analysis and vectorized math.


It’s of course useful; the question is whether it’s useful enough to warrant its own instruction.


Whether it’s shift, mask, multiply, Kernighan’s method, or something else, it’s going to be multiple instructions and tens or hundreds of cycles to do this in software.

pop count instructions take a handful of cycles to run (~10 on ARM, ~3 on Intel?). It’s one of those things that silicon can do very well.


Question is whether performance-critical code that would benefit from a popcount instruction is common enough to warrant inclusion.

As long as it isn't a bottleneck in common software, a few shifts/masks/adds/integer multiplies or whatever are very quick on modern CPUs - often 1 cycle, and often more than one such instruction can execute in parallel per clock.


I work in cheminformatics, and wrote one of the documents cited by Sagar.

The answer is "yes and no".

My area of focus is a part of molecular similarity. The overall idea is that molecules which are similar tend to have similar functionality. There are many approximate ways to estimate molecular similarity. The one I focus on maps chemical substructures ("features" or "descriptors") to 1 or several bits in a bitstring of length ~1024 bits, called a fingerprint.

We use Jaccard similarity (called Tanimoto similarity in my field) of two fingerprints as a proxy for molecular similarity, computed as popcount(A & B) / popcount(A | B). Since popcount(A) and popcount(B) can be pre-computed, this ends up being popcount(A & B) / (popcount(A) + popcount(B) - popcount(A & B)). If the fingerprints are grouped by popcount then this boils down to computing popcount(A & B), plus some tiny amount of integer math.
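
A minimal sketch of that inner computation (my own illustration; the 16 x 64-bit word layout for a ~1024-bit fingerprint is an assumption):

  #include <stdint.h>

  #define FP_WORDS 16   /* 1024-bit fingerprint as 16 x 64-bit words (assumed layout) */

  /* Tanimoto/Jaccard similarity with popcount(A) and popcount(B) precomputed */
  double tanimoto(const uint64_t *a, int popcnt_a,
                  const uint64_t *b, int popcnt_b) {
      int and_count = 0;
      for (int i = 0; i < FP_WORDS; i++)
          and_count += __builtin_popcountll(a[i] & b[i]);
      int denom = popcnt_a + popcnt_b - and_count;
      return denom ? (double)and_count / denom : 1.0;   /* two empty fingerprints */
  }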

This can be used to find the k-nearest matches to a single query, to cluster the fingerprints, to identify diverse fingerprints, and so on.

These methods are bounded by two things: 1. the time needed to compute popcount(A & B), and 2. the memory bandwidth.

The CPU bottleneck really is the popcount calculation. At https://jcheminf.biomedcentral.com/articles/10.1186/s13321-0... I compared different popcount implementations and found POPCNT about 2-3x faster than the fastest pure-C popcount.

However, POPCNT on Intel processors isn't all that fast. Rather, when I was really focused on this 5+ years ago, Intel processors only had one execution port that could handle POPCNT, so I could only get one 8-byte popcount per cycle. (Some AMD processors have several such ports, but I never tried one of those CPUs.)

Instead, Wojciech Mula, Nathan Kurz and Daniel Lemire showed that AVX2 instructions were even faster than sequential POPCNT instructions because AVX2 could handle more things in parallel. See my table for numbers.

For small bitstrings (I think 512 bits was the boundary) POPCNT is still the fastest.

With AVX2 it's fast enough that memory bandwidth becomes the limiting issue, and I need to start worrying about cache locality.


Since Intel Ice Lake and AMD Zen 4, the Intel and AMD CPUs with AVX-512 support (or AVX10 in the future) have the VPOPCNT instruction, which works on short vectors of up to 512-bit length.

With VPOPCNT it is easy to accelerate any POPCNT dependent algorithm to speeds far beyond of what is possible with any other instructions.


Yes, I certainly expect so!

In my 2019 paper, which I linked to earlier, I predicted "The VPOPCNTDQ instruction in the AVX-512 instruction set computes a 512-bit popcount, which should be faster still." because at the time I didn't have access to such hardware.

While those are easier to find now, I haven't revisited the issue because my experiments back then strongly suggested my code is now limited by memory bandwidth, not popcount evaluation performance.

I also don't know how many of my customers deploy on VPOPCNT-capable hardware.

My bitvectors have a 1-bit density of 5%, so I think the real next step is to look into something like BLOSC, where I store the data in compressed form, using a compression format which supports on-the-fly decompression faster than a memory read, then do the popcount on that transient data. 75% compression would need a 4x faster popcount.

I've tried to use BLOSC for this, but wasn't quickly able to figure out how to integrate it with my code, and realized it would likely require some pretty big breaking changes that I couldn't really justify, so I've been putting it off for years.


I'd wager also yes.

> As long as it isn't a bottleneck in common software, a few shifts/masks/adds/integer multiplies or whatever are very quick on modern CPUs - often 1 cycle, and often more than one such instruction can execute in parallel per clock.

One instruction takes less cache space than multiple instructions. At worst that can be the difference between a hot loop fitting in cache and not fitting.


New Intel/AMD CPUs do a register-based popcount in a single clock.


Used to be three cycles.

Unfortunately, the original AMD64 back in 200x lacked a popcount, so most software built for PCs even today lacks any instances of the instruction. Means to get the instruction generated are finicky, non-portable, and often result unexpectedly in a function call, instead. E.g., without a "-m" option, Gcc and Clang will turn "__builtin_popcount()" into a function call. Likewise, "std::popcount()" and "std::bitset<>::count()". Always use at least "-mavx".


It is broadly useful for bit vectors and integer algorithm performance engineering, showing up all over the place. One advantage the instruction has is fast constant-time performance, whereas algorithms composed from ALU primitives either have performance that is highly sensitive to the bit distribution or have consistent but relatively slow performance. It also uses less space in the instruction cache.

The non-classical bit-wise instructions that I find indispensable are popcount and PDEP/PEXT. For some codes, these can turn very messy integer work into a tight, elegant algorithm. Where performance matters, I would not want to live without them.
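
As one small illustration of PEXT (a sketch; assumes BMI2 hardware and compiling with -mbmi2): it gathers the bits selected by a mask into the low end of the result, something that otherwise takes a shift-and-mask loop.

  #include <immintrin.h>
  #include <stdint.h>

  /* compact every even-position bit of x into a contiguous 32-bit-wide value */
  uint64_t even_bits(uint64_t x) {
      return _pext_u64(x, 0x5555555555555555ULL);
  }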


> It’s of course useful; the question is whether it’s useful enough to warrant its own instruction

I argue: yes absolutely.


Why wouldn't it be? The cost is pretty trivial in the context of a modern core, though I can see the argument against in 1961. The biggest cost is in using up an instruction encoding, though when looking at that you have to compare the usefulness of popcount with whatever would replace it.


I recently used popcount for an audio-fingerprinting project [0] to compute the difference in binary "audio fingerprints". With AVX2 and 4 cores, my desktop managed 448 billion bit-diff calcs per second! [0] https://kenschutte.com/phingerprint/


Bit population count is useful for many things. MMIX has SADD ("sideways add", which is also a bit population count), and also has MOR ("multiple OR") and MXOR ("multiple XOR"), which are also useful (for many purposes, including endianness conversion (even PDP-endian), reversing the bits in a byte, hashing, and others) but which I had not seen on other instruction sets.

In GNU C, you can use the function __builtin_popcount to compute the bit population count.


It's a very useful instruction. Go uses it extensively in its core runtime and the compiler itself. I assume other languages do too if they care about performance.

Examples:

1. Growing a slice.

2. Runtime's heap bitmap.

3. Page allocation.

4. malloc.

5. Register allocation.


I would love to ask Gregory R. Travis what he did with it...

   The 6600 certainly had population count; I used it. Note that it was about
   the SLOWEST instruction in the whole instruction set. I'll have to dig out
   the COMPASS manuals when I get home and find out just HOW slow.

   Parity was never added to the 6000 line, that came with the 7000 line.
  
   Why did I use it? Well, I could tell you but I'd have to kill you


I've used popcnt in prod in the last 10 years. We were looking for similar images in large sets, using a perceptual hash (64 bits, with at least 60 bits the same considered a match).

Tried a few different solutions. Fancy vector databases. Clustering algos to reduce search space. A really neat custom O(n) algo that searched the set of possible solutions instead of the intersection of n^2 possible matches.

The old-fashioned N^2 solution ended up being fastest and simplest, and could handle quite large Ns, when the core of the loop was optimized with popcnt (well, xor -> popcnt). Computers are fast.
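
The core of that inner loop is roughly this (a sketch based on the description above, with the 60-of-64-bits threshold expressed as a Hamming distance of at most 4):

  #include <stdint.h>

  /* two 64-bit perceptual hashes "match" if they differ in at most 4 bits */
  static inline int hashes_match(uint64_t a, uint64_t b) {
      return __builtin_popcountll(a ^ b) <= 4;
  }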


Haskell added a function popCount in https://hackage.haskell.org/package/base-4.18.0.0/docs/Data-... some years ago to allow easy use of this operation where present.


Another use of popcount: generating binomial random variates corresponding to n coin flips in early games. Rather than looping n times, flipping coins and counting heads, you generated a new random number, masked off the bottom n bits, and hit ’em with popcount.
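
Something along these lines (a sketch; rand64() stands in for whatever random source those games used, and n is assumed to be at most 64):

  #include <stdint.h>

  extern uint64_t rand64(void);   /* hypothetical 64-bit PRNG */

  /* number of heads in n fair coin flips */
  int flip_coins(int n) {
      uint64_t mask = (n >= 64) ? ~0ULL : ((1ULL << n) - 1);
      return __builtin_popcountll(rand64() & mask);
  }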


Hamming distance (and popcnt) is also used in gradient descent scenarios.

Eg. here's part of fuzzing engine code, where it tries to produce an input which matches some form of variable check inside the fuzzed code:

https://github.com/google/honggfuzz/blob/1f09e3bfb0785d9b311...

  register uint8_t v = ((sizeof(Arg1) * 8) - __builtin_popcount(Arg1 ^ Arg2));
  uint8_t prev = ATOMIC_GET(globalCovFeedback->bbMapCmp[pos]);
  if (prev < v) {
      ...


What's weird about popcount?


I'd guess to the "Surely my computer is a 1:1 mapping of the C programming language" people the expectation is that their CPU will have two increment instructions, two decrement instructions, and nothing like popcount. After all, C doesn't have a popcount operator.

I have this terrible feeling that somewhere down the line we're going to get a generation of people who assumed the computer is a 1:1 mapping of Python -- a friend who teaches CS for a major university explained that they're going to teach Python as first language for Computer Science undergraduates starting 2024.


> a friend who teaches CS for a major university explained that they're going to teach Python as first language for Computer Science undergraduates starting 2024

I think this is significantly more common than you think it is.

Also, Python has an extremely rich type system, can be completely duck-typed as well, and covers effectively everything you would need to know about in a CS degree, at least in the first semester or two, other than manual memory management - which is significantly easier to learn when you have a firm grasp of other aspects of programming.

There are multiple approaches to teaching; starting with Python and then moving lower down later isn't necessarily a bad move. Who says starting with assembly/C and moving up is a good move?


Oxford and Cambridge both begin with an ML, the same way I was taught last century (although of course it's a different ML today, I believe one of them uses Ocaml).

Unavoidably I am biased, but I nevertheless feel confident that this is the correct way to teach programming to CS students in particular. One reason to use an ML, and especially to use a less common ML, is to create a level playing field. Teaching Python, Java (as they do today at his institution), or C (yes that happens) means some of the fresh students have months, maybe years, even occasionally decades of prior practical experience in the "first language" before day one. These students will find most, or even all, of the exercises trivial and so they won't actually learn what you were teaching, which may surprise them too, they're also likely to be disruptive, after all they're bored because this is easy.

Now, for just a "level playing field" you could choose almost anything unpopular these days. Pascal? Scheme? Fortran? But that's not the only reason. MLs have a nice type system, which is great because probably simultaneously on the theory side of their course your students are learning about types, and Hindley–Milner gets you from a theory to a real world, from why this is a good idea, to huh, the machine did exactly what I meant, that's neat. If your colleagues are teaching them a sane approach to types, but you're teaching a language which says YOLO, everything is just a bag of bits don't worry about it, that's not a cohesive message. Likewise for function composition. It'd be nice if the first language you're teaching has actual bona fide mandatory TCO, so that when the nice elegant explanation in the theory class shows tail calls, the result in your practical class isn't a room full of stack overflow errors (or worse).

This is reasonable for a First Language because you're expecting to teach these students several languages. The same doesn't apply to say, an optional module for Geographers that gets them enough Python to use a GIS competently. But we are (hopefully) not relying on Geographers to invent the next Python.


I don't know if I agree with ML as the first language to teach students.

The syntax is vastly different from most of the "C like" languages out there and any functional style code is miles away from what actually gets executed by the CPU.

The fact that you even mentioned TCO means you are aware of that, since it is a compiler-level optimisation, effectively replacing your recursion with an imperative set of instructions simulating a loop, or even precomputing everything in the compiler.

Python also does not treat types as "a bag of bits": it very much has a strong concept of types, and you can make them explicit using annotations.

What would be the difference between running a compiler vs running something like mypy? You as the programmer still had to reason about the code all the same.

I would absolutely also argue that multiple languages should be taught since in the real world that will be expected of you, so you might as well get a head start.

Why not make the languages taught be languages you are very likely to use? I would be all for teaching C/C++, JS/TS, Python, C#/Java, Go, and then sure, add in an ML-style language as well.

The arguments for Python -> C -> ML are pretty strong to me: you can get a very, very broad exposure to a large collection of programming paradigms at a ton of abstraction levels using just that. But now we are delving into territory where I begin to argue about what should be taught, so let's leave it there.


I think you're making roughly the same mistake I described in the grand-parent comment, where people confuse the C abstract machine (roughly a caricature of the PDP-11) for their very different real computer. Yes, C works very like this abstract machine, as well it might, but, that's not how your computer actually works and so if you tell students that's what's really going on you're presenting them a fiction.

If you don't like TCO because it's "effectively replacing your recursion with an imperative set of instructions simulating a loop" then you're really going to hate the fact that the machine doesn't have loops anyway; the machine just has several flavours of go-to and computed go-to, and none of the languages you're proposing to teach instead provide these hard-to-use primitives (no, C's goto is not the full go-to complained of in the famous letter).

What you wrote will be transformed, whether it's tail calls, a for-each loop, or the C-style pointer iteration - none of these are actually what the machine does. We should not prefer to write the code that most closely resembles what it'll get transformed into. Knowing how the transformation works is interesting, probably enough to justify showing some of it in class in an undergraduate degree but the purpose of writing high level languages like C is to express ourselves more clearly, something we certainly do want these students to learn - yes it needs to be transformed into machine code, but more important is that humans, including us and often future maintenance programmers can figure out what is intended.


> I think this is significantly more common than you think it is.

For context: it was the first language taught in my CS degree in 2014. They switched to Python from Java due to its pedagogical advantages - namely, students being able to focus more on the high-level algorithmic semantics than on language design minutiae (Java) or memory management / poor diagnostics (C).

I wasn't a fan of the decision at the time, but I certainly see why they did it in hindsight!


We could make a slow human friendly machine code, show how it abstracts electronic components and then get into physics and chemistry.

It would be as fascinating as unproductive.


Maybe you’ll feel better knowing that popcount has been standardized in C23


Thanks, I didn't know about that, so I just read a summary. It's kind of sad that even the C standard library is now obliged to put prefixes on names for namespace reasons, but yeah, it is good.


To be fair, it’s no different in C++, even though C++ has namespaces. One still has to write “std::foo()” in headers in order not to pollute the global namespace, and therefore in template code, anyway.

As long as they stop adding locale and global/thread-state dependent functions I’ll be quiet :)


Regarding locale, let's hope that the text conversion API makes it in the next version so we can all work in unicode and be happy.


Having Python as a first programming language isn't a big disadvantage. The core elements of program optimization are still there: making sure you don't accidentally write O(n^2) code, etc. Even if the memory management is opaque, you end up with an idea of values vs references, and stacks and heaps, just from some surprising edge cases in the language.


> a friend who teaches CS for a major university explained that they're going to teach Python as first language for Computer Science undergraduates starting 2024.

This is extremely widespread. My institution does it, as do many others I know of. I would not be surprised to find that Python is the most common initial programming language right now.


When I was in college (2001-2006) it was Java. I’m pretty sure the school already switched to Python for intro CS within the past few years.


What would be so terrible?

Some people will be naturally curious, as always, and they will go deeper.

And even languages like Javascript influenced the design of CPUs and even specific instructions. Nothing is "pure" in the computing field


> 1:1 mapping of the C programming language

I was pretty surprised the first time I discovered the C compiler could "reverse engineer" a "logical or of two shifts" to generate a single rotate instruction.
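
i.e. a pattern like this (a sketch; GCC and Clang generally recognise this form and emit a single rotate on x86):

  #include <stdint.h>

  uint32_t rotl32(uint32_t x, unsigned k) {
      return (x << (k & 31)) | (x >> (-k & 31));   /* no UB even when k % 32 == 0 */
  }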


Idiom recognition. Some compilers had idiom recognition before C existed, because in some of the early languages the literal interpretation of the common idioms is incredibly expensive to perform, but a compiler can recognise that actually it can take massive shortcuts.

For example: sort this list of numbers into ascending order, then take that sorted list and get the last number from it. With idiom recognition the compiler can say: oh, I'll just walk the list once, tracking the maximum as I go, that's much faster. In C this seems odd - well, if I meant that I'd write that - but I think APL is an example where the list expression is the idiomatic thing programmers will write, even though they want the fast thing.


> I have this terrible feeling that somewhere down the line we're going to get a generation of people who assumed the computer is a 1:1 mapping of Python -- a friend who teaches CS for a major university explained that they're going to teach Python as first language for Computer Science undergraduates starting 2024.

Probably better than starting with C with all its footguns and "optimistic" approach to guessing what programmer meant...


Well there have been Java bytecode & Forth optimized (if not directly executing) cpus. So you never know. ;-)


Heh you underestimate (or perhaps understate) the predictability of stupidity... somewhere down the road you'll have kids running for-loops on top of GB-large LLM models :D

Now, seriously - it gets me thinking a lot that a certain generation is going to be the last to be interested in assembly, but actually there are always people fascinated by the beauty of assembly language.


I'm more worried about Python being assumed to be the same as every other language than about it being assumed to be the same as the CPU.


The only thing weird about it is that, even though it was already implemented in the first commercial computer made with vacuum tubes, and even though its hardware implementation is cheap and can be much faster than a software implementation, it was omitted from most later computers - with the exception of the most expensive ones, which were typically used mainly by government agencies - until AMD decided to implement it in AMD Barcelona (2007).

Once AMD had it, Intel followed soon in Nehalem (2009), and now most CPUs have it.

The article is somewhat misleading about population count being available in ARM NEON since 2005 (i.e. since Cortex-A8).

The 64-bit Armv8-A ISA, which was introduced in 2012, has the CNT instruction, which is indeed a population count instruction.

The 32-bit Armv7-A ISA has the VCNT instruction, which is a variant of population count, but which is much less useful than the 64-bit POPCNT provided by Cray 1, AMD Barcelona, Intel Nehalem or Armv8-A, because it works only on 8-bit words, so executing the equivalent of a 64-bit POPCNT requires a sequence of several instructions.


CNT on arm64 counts 8 or 16-bit elements too.


That was my initial reaction as well. Counting bits arises quite naturally in many places.


Nothing. It's clickbait.


If you want an example where popcount is very useful, check out this challenge from Cryptopals: https://cryptopals.com/sets/1/challenges/6


I still don't know of any RISC-V implementation that supports the POPC (and related) instructions. Would like to be pointed at any.

It is generally a bad idea to field an instruction-set architecture that turns common, brilliant optimizations into gross pessimizations.


The bit manipulation [0] extension has been ratified for a while now and is part of the RVA22 application extension profile [1].

You can already buy SOCs that support it, e.g. vision five 2 and star64.

Interestingly, the RISC-V vector extension [2] has its own popcount instructions for vector registers/register masks. This is needed because the scalable architecture doesn't guarantee that a vector mask can fit into a 64-bit register, so vector masks are stored in a single LMUL=1 register. This works really well, because with LMUL=8 and SEW=8 you get 100% utilization of the single LMUL=1 vector register.

Another interesting thing is that the vector crypto extension will likely introduce an element-wise popcount instruction [3].

[0] https://github.com/riscv/riscv-bitmanip

[1] https://github.com/riscv/riscv-profiles

[2] https://github.com/riscv/riscv-v-spec

[3] https://github.com/riscv/riscv-crypto/blob/master/doc/vector...


Thank you. It has generally been hard to discover exactly which extensions any given implementation supports, with usually most of any comprehensive documentation in Chinese. I hope for that to change.


The JH7110, the SoC used in the first SBC produced in a large-scale run, has popcnt as part of the related bit manipulation extensions.

This is important, as it means that the first generation board to reach hordes of developers and enthusiasts is popcnt-capable.


When I look up the jh7110, I find it has four "RV64GC" cores, which AFAICT does not call for any B-extension instructions. E.g.:

https://www.cnx-software.com/2022/08/29/starfive-jh7110-risc...

https://doc-en.rvspace.org/JH7110/PDF/JH7110_Datasheet.pdf

Where do you find B-extension instructions documented?

Usually we want it citing at least an RVA22U64 profile, which I do not find anywhere in these materials.


"RV64GC cores" is a really vague description of the cores.

SiFive U74 21G1 is a much more precise description. It contains many extensions beyond mere RV64GC.

And it implements not just the 2019 version of the unprivileged spec but also a version of the privileged one. This is not implied by merely stating "RV64GC".

The specification for SiFive's U74 21G1 version[0] will tell you about the details of the supported extensions.

To save you the work of interpreting the document, I'll tell you it is actually Zba and Zbb, while missing the Zbc and Zbs extensions.

Zbb being the one that has CPOP and related instructions, thus JH7110 has them.

0. https://starfivetech.com/uploads/u74_core_complex_manual_21G...


i used popcount for visual art purposes in http://canonical.org/~kragen/sw/dev3/trama.html, both because it's extremely cheap to compute and i was interested in patterns that were extremely cheap to compute, and because it figures prominently in the simplest definition of the hadamard-walsh matrix: popcount(row bitand col) bitand 1, which in the stack language of that program is xy&#1&

epilepsy warning, although the page doesn't flash much initially when loaded, it will probably flash a lot if you start playing with it


It says that the popcount instruction is not very useful, but I've needed it quite a lot in my time doing embedded work, encoding/decoding, and special algorithms for storage in e.g. game engines.


Title checks out. I don’t believe a word of it.


simd and branchless


Reading the title, I thought it was about Debian's popularity contest "popcon": https://popcon.debian.org/



