Amusingly (to me at least), there's also an SSE instruction for non-reciprocal square roots, but it's so much slower than the reciprocal square root that calculating sqrt(x) as x * 1/sqrt(x) is faster, assuming you can tolerate the somewhat reduced precision.
I dunno about Intel and AMD, but ARM and RISC-V use lookup tables for rsqrt. Unlike AMD and Intel, those tables are precisely defined in their respective specs.
I don't recall the coprocessor having either reciprocal or reciprocal square root? I didn't do much Intel until later in my career, though, so I might be missing something.
Both _mm_rcp_ps (rcpps) and _mm_rsqrt_ps (rsqrtps) are only good for about half the bits.
Yeah, I think DRAM is almost certainly the future, just in terms of being able to afford the memory capacity to fit large models. Even Cerebras using a full wafer only gets up to 44 GB of SRAM on a chip (at a cost over $2M).
An interesting twist is that this DRAM might not need to be a central pool where bandwidth must be shared globally -- e.g. the Tenstorrent strategy seems to be aiming to use smaller chips that each have their own memory. Splitting up memory should yield very high aggregate bandwidth even with slower DRAM, which is great as long as they can figure out the cross-chip data flow to avoid networking bottlenecks.
Seems like the abstract's claims of speed and energy efficiency relative to an RTX 3090 are for the GPU running at a batch size of 1. I wonder if someone with more experience can comment on how much throughput gain is possible on a GPU by increasing batch size without severely harming latency (and what the power consumption change might be).
And from a hardware cost perspective the AWS f1.2xlarge instances they used are $1.65/hr on-demand, vs say $1.29/hr for an A100 from Lambda Labs. A very interesting line of thinking to use FPGAs, but I'm not sure if this is really describing a viable competitor to GPUs even for inference-only scenarios.
The FPGA being used is, I believe, one of the lowest-specced SKUs.
AWS instance prices are more of a supply/demand/availability thing; it would be more interesting to compare from a total cost of ownership / perf-power-area perspective.
The article also gives reason to be skeptical of the quoted "10 fatalities out of an estimated 3.65 million jumps in 2023". If we count 28 known fatalities at this one facility from 1983 to 2021, we get around 0.75 fatalities per year.
In other words, we would expect that 14 facilities with death counts similar to the one in the article would equal the total US fatalities for a year. The USPA dropzone locator [1] lists 142 facilities, so if we take everything at face value then this facility is ~10x worse than the average for USPA members.
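Spelling that arithmetic out (all figures as quoted above), e.g. in R:

    # Back-of-envelope check using the numbers quoted above
    per_year <- 28 / (2021 - 1983)   # ~0.74 fatalities per year at this facility
    n_equiv  <- 10 / per_year        # ~14 such facilities would account for the US total
    142 / n_equiv                    # ~10x worse than the average USPA dropzone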
> But I'd bet it's less than $200/jump worth of risk
In this case at least, it seems that this specific facility is higher risk than that. And with a lack of legally mandated reporting requirements, I'd say the onus is on a facility to prove safety once it's averaging a death every 1.3 years.
> so if we take everything at face value then this facility is ~10x worse than the average for USPA members.
The issue is that I would expect at least a factor of 10 typical variation in the number of yearly jumps done at different facilities, so it’s hard to conclude anything without getting at least a rough guess of how many jumps they are doing. (The article correctly notes that the inability to find this number publicly is a real problem.)
The specifications page [1] gives a bit more context. I think minimum buy is about a half rack, which includes at least 16 64-core CPUs, 16 TiB of RAM, and 465.75 TiB of NVMe SSD storage. Playing around a bit with the Dell server configurator tool, it seems like that is going to come in a rough ballpark of $1MM as stated in a sibling comment.
I do not purchase hardware, but $1MM is way above what I would have expected. Going to Dell, the most expensive pre-built rack mount starts at ~$30k. Assuming 16 of those only gets you to $480k. Throw in an extra premium for the rack itself + small company margins still leaves me reaching to get to that price point.
Price delta is out of the box cloud orchestration value (imho). Most large enterprises would struggle to build this themselves (Mesos->OpenShift->Kubernetes/Tanzu/etc), so you’re paying for turnkey cloud on prem. Probably save in the long run considering public cloud margins.
An enterprise CIO doesn't want a hobby project (attempting to cobble together internal cloud orchestration and infra); they want to be able to show immediate business value. You charge what the market will bear. I've seen many companies with thousands of employees, spending millions or even tens of millions a month on public cloud providers, just flail, unable to get to steady state post-transformation (even after years of trying). This is made for those folks, especially with VMware self-inflicting harm under Broadcom's recent strategy decisions.
I mean…you could also just get a z/VM system and have a few LPARs on it and just use Ansible for orchestration. Why wouldn’t an enterprise CIO just go for a mainframe system?
"just" is doing a lot of heavy lifting here, I'm not the target customer base for one of these, but if they can deliver a server rack that teams can plug in, turn on, and start deploying workloads to it in the same way they currently deploy to public clouds with familiar tooling, that seems extremely valuable to me.
It's going to depend on how well they manage to pull off the magic trick of "little or no configuration and maintenance required". If things start breaking in hard to diagnose ways, it's going to be just another broken appliance that requires expensive maintenance, and companies will be questioning why they didn't DIY it in the first place.
If there is one company that has made 'make it easy to debug issues' their core philosophy, it's them.
It's almost all open software; that helps a lot. They add a minimal amount of firmware, rather than the many, many millions of lines of firmware that are usually around. And most of the stuff they added is Rust on a microkernel. (Check out the talk I linked top-level to see some of their low-level debugging infrastructure.)
Too bad they can't (yet) get open firmware into the NIC, the SSDs, and some of those other places (time for an Oxide-like company that makes P4-driven NICs). But nobody else can really offer that either.
The only real issue for them is that Illumos is the host OS. It's open source and stable of course, and has good debugging tools. But in terms of industry experience, people with deep knowledge of the system are harder to find compared to Linux.
They of course also add some complex software on top that will have to work properly: moving VMs, distributed storage, and so on.
Full DIY is pretty damn hard; you need a serious team to pull that off. Dell VxRail/VMware is the more reasonable competition. I think VMware going full Broadcom mode will make them more interesting. Buying into that ecosystem isn't that appealing right now.
Getting the same performance and features out of a mainframe will be considerably more expensive, I would guess. And in addition to that, you are buying into an incredibly closed ecosystem where prices only go up from there.
You are also paying for a bunch of stuff you don't need. Most people just don't need to hot-swap a CPU, or to turn these single-socket 128-core machines into one gigantic 4096-core machine.
Simply moving virtual machines off and restarting or replacing a sled is enough for the vast majority of use-cases.
This is still pretty much commodity single socket server platforms, just with more sane and open firmware and a sane open source software stack.
Are you sure you're comparing equivalent memory and storage specs? I needed to go into the customization menus in the Dell configurator to spec something equivalent, where prices started going up quite rapidly.
For example "3.2TB Enterprise NVMe Mixed Use AG Drive U.2 Gen4 with carrier" is $3,301.65 each, and you'd need 10 of those to match the Oxide storage spec -- already above the $30k total price you quoted. Similarly, "128GB LRDIMM, 3200MT/s, Quad Rank" was $3,384.79 each, and you'd need 8 of those to reach the 1TiB of memory per server Oxide provides.
With just the RAM and SSD cost quoted by Dell, I get to $60k per server (x16 = $960k), which isn't counting CPU, power, or networking.
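Spelled out (list prices as quoted above, rough rounding):

    # Per-server and per-rack totals from the Dell list prices above
    ssd <- 10 * 3301.65   # ~$33k of NVMe per server
    ram <-  8 * 3384.79   # ~$27k of LRDIMMs per server
    ssd + ram             # ~$60k per server, before CPU, power, or networking
    16 * (ssd + ram)      # ~$960k for 16 servers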
I agree these costs are way way way higher than what I'd expect for consumer RAM or SSD, but I think if Oxide is charging in line with Dell they should be asking at least $1MM for that hardware. (At least compared to Dell's list prices -- I don't purchase enterprise hardware either so I don't know how much discounting is typical)
Edit: the specific Dell server model I was working off of for configuration was called "PowerEdge R6515 Rack Server", since it was one of the few I found that allowed selecting the exact same AMD EPYC CPU model that Oxide uses [1]
> For example "3.2TB Enterprise NVMe Mixed Use AG Drive U.2 Gen4 with carrier" is $3,301.65 each
That’s the pricing for people who don’t know to ask for real pricing — it’s an absolute joke. I don’t know how much extra margin gets captured here, but it’s a lot.
Even in teeny tiny volumes, Dell will give something closer to real pricing, and a decent heuristic is that it’s at least 2x cheaper.
This is a real SSD. Dell likely buys this brand and others:
And they keep the margin to have money for R&D. I kind of get it because it’s low volume for now but I don’t necessarily see the appeal of being an early adopter here.
Thanks for the link, and very good to know. I've always struggled to find component prices for Kioxia drives and higher-capacity RAM sticks so it's good to see I can finally look these up on serversupply when I'm curious.
$480k + switches + management + support + virtualization licenses + integration - it adds up. It will also probably take you at least 3x as long. I can think of lots of examples where this premium for an Apple-like UX is totally worth it.
One of the wildest R features I know of comes as a result of lazy argument evaluation combined with the ability to programmatically modify the set of variable bindings. This means that functions can define local variables that are usable by their arguments (i.e. `f(x+1)` can use a value of `x` that is provided from within `f` when evaluating `x+1`). This is used extensively in practice in the dplyr, ggplot, and other tidyverse libraries.
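A minimal sketch of the mechanism (toy names here, not any particular library's API):

    # f() evaluates its argument lazily in an environment where it has
    # injected its own binding for x; everything else still resolves at the call site.
    f <- function(expr) {
      e <- new.env(parent = parent.frame())
      e$x <- 10
      eval(substitute(expr), e)
    }

    y <- 5
    f(x + y)   # 15 -- x comes from inside f, y from the caller's scope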
I think software engineers often get turned off by the weird idiosyncrasies of R, but there are surprisingly unique (arguably helpful) language features most people don't notice, possibly because most of the learning material is data-science focused and doesn't emphasize the bonkers language features that R has.
I saw a funny presentation where Doug Bates said something like: "This kind of evaluation opens the door to do many strange and unspeakable things in R... for some reason Hadley Wickham is very excited about this."
In Dyalog APL you can set the index origin with ⎕IO←0 (or 1) and there are many ways in which this can bite you. In Lua, and I think Fortran, you can specify the range of array indices manually.
One of the stranger behaviours for me is that R allows you to combine infix operators with assignment, even though there are no implemented instances of it in R itself. For example:
    `%in%<-` <- function(x, y, value) { x[x %in% y] <- value; x }
    x <- c("a", "b", "c", "d")
    x %in% c("a", "c") <- "o"
    x
    #> [1] "o" "b" "o" "d"
Or slightly crazier:
    `<-<-` <- function(x, y, value) paste0(y, "_", value)
    "a" -> x <- "b"
    x
    #> [1] "a_b"
Antoine Fabri and I created a package that uses this behaviour for some clever replacement operators [1], but beyond that I don't see where this could be useful in real practice.
That sounds like asking for trouble. Someone coming from any other programming language could easily forget that expression evaluation is stateful. Better to be explicit and create an object representing an expression. Tell me, at least, that the variable is immutable in that context?
The good news is that most variables in R are immutable with copy-on-write semantics. Therefore, most of the time everything here will be side-effect-free and any weird editing of the variable bindings is confined to within the function. (The cases that would have side effects are very uncommonly used in my experience)
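For example (a toy function, just to illustrate the semantics):

    # Modifying an argument inside a function only touches the function's copy
    bump <- function(v) { v[1] <- 99; v }
    a <- c(1, 2, 3)
    bump(a)   # [1] 99  2  3
    a         # [1]  1  2  3 -- the caller's vector is unchanged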
It's crazy how literally R takes "Everything's an object." While parentheses can be treated like syntax when writing code, it's actually a function named `(`.
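For instance:

    # Parentheses are themselves a function named `(`
    `(`(1 + 2)   # 3, exactly the same as (1 + 2)
    get("(")     # .Primitive("(")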
Of course, playing with magic sounds fun until you remember you're trying to tell a computer to do a specific set of steps. Then magic looks more like a curse.
Asking out of a lack of experience with R: how does such an invocation handle the case where `x` is defined with a different value at the call site?
In pseudocode:
    f =
      let x = 1 in     # inner vars for f go here
      arg -> arg + 1   # function logic goes here

    # example one: no external value
    f (x+1)   # produces 3 (arg := (x+1) = 2; return arg + 1)

    # example two: x is defined in the outer scope
    let x = 4 in
    f (x+2)   # produces 7 (arg := (x+2) = 6; return arg + 1)? Or 4 if the inner x wins, as in example one?
If the function chooses to overwrite the value of a variable binding, it doesn't matter how it is defined at the call site (so inner x wins in your example). In the tidyverse libraries, they often populate a lazy list variable (think python dictionary) that allows disambiguating in the case of name conflicts between the call site and programmatic bindings. But that's fully a library convention and not solved by the language.
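Concretely, one way a function could implement this (toy code, not what the tidyverse actually does):

    f <- function(arg) {
      e <- new.env(parent = parent.frame())
      e$x <- 1                        # binding supplied by f itself
      eval(substitute(arg), e) + 1
    }

    f(x + 1)   # 3 -- no x at the call site, so f's x = 1 is used
    x <- 4
    f(x + 2)   # 4 -- f's x = 1 shadows the caller's x = 4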
Well the point is that the function can define its own logic to determine the behaviour. Users can also (with some limits) restrict the variable scope.
A lot of the time you're not actually using what is passed to the function, but instead the name of the argument passed to the function (f(x) instead of f('x')), which helps the user with their query (dplyr) or configuration (ggplot2).
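For example (a toy version of that pattern):

    # Capture the *name* of the argument rather than its value
    label <- function(x) deparse(substitute(x))

    height <- c(1.2, 1.5)
    label(height)   # "height" -- the expression text, not c(1.2, 1.5)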
> I think software engineers often get turned off by the weird idiosyncrasies of R
That was at least true when I was looking at it. I didn't get it, but the data guys came away loving it. I came away from that whole experience really appreciating how far you can get with an "unclean" design if you persist, and how my gut feeling of good (with all the heuristics for quality that entails) is really very domain specific.
I had a colleague at Google who used to say: "The best thing about R is that it was created by statisticians. The worst thing about R is that it was created by statisticians."
To my knowledge AlphaGo models never became meaningfully available to the public, but 8 years later the KataGo project has open source, superhuman Go AI models freely available and under ongoing development [1]. The open source projects that developed in the wake of AlphaGo and AlphaZero are a huge success story in my mind.
I haven't played Go in a while, but I'm kind of excited to try going back to use the KataGo-based analysis/training tools that exist now.
I'm sorry about negative experiences and/or regrets other commenters might have about their vaccinations. Measuring the risk/reward profile of vaccines seems far from simple, particularly in cases like this where the large benefits (no cancer) and risks (autoimmune problems) may both be quite rare for any individual. It is too bad if the study didn't fully capture possible risks in this case, and hopefully follow-up studies and monitoring can help better describe the risk profile.
It's worth noting the benefits of HPV vaccination do seem to be quite real, though. In the US, >20% of the female population has a high-risk HPV infection [1], and cervical cancer runs at ~12k new cases and ~4k deaths a year [2]. A follow-up study found women vaccinated before age 17 had about 88% reduction in cervical cancer, with around 53% for women vaccinated at 17-30 years of age [3] (presumably later-vaccinated women had a high chance of already having an HPV infection so the vaccine wouldn't be useful).
I think potentially saving >3.5k lives and >10k cervical cancer cases annually in the US is a pretty good return if we can get widespread HPV vaccination, though of course we should also work hard to study and minimize vaccine side-effects. I'm similarly hopeful of news about EBV as a cause of multiple sclerosis [4], which is another situation where preventing a widespread infection might prevent rare but serious illnesses.
Strong second for wishing they had tried physically testing some model output. A model that makes outputs AlphaFold thinks look like Cas is a very different thing from a model that makes functional Cas variants.
For design tasks like in this paper, I think computational models have a big hill to climb in order to compete with physical high-throughput screening. Most of the time the goal is to get a small number of hits (<10) out of a pool of millions of candidates. At those levels, you need to work in the >99.9% precision regime to have any hope of finding significant hits after multiple-hypothesis correction. I don't think they showed anything near that accurate in the paper.
Maybe we'll get there eventually, but the high-throughput techniques in molecular biology are also getting better at the same time.
You are correct that it is dangerous to rely on the results of one model being an oracle for another model, but extremely good models (say F=ma) are used that way all the time.
Sadly even SSE vs. AVX is enough to often give different results, as SSE doesn't have support for fused multiply-add instructions which allow calculation of a*b + c with guaranteed correct rounding. Even though this should allow CPUs from 2013 and later to all use FMA, gcc/clang don't enable AVX by default for the x86-64 targets. And even if they did, results are only guaranteed identical if implementations have chosen the exact same polynomial approximation method and no compiler optimizations alter the instruction sequence.
Unfortunately, floating point results will probably continue to differ across platforms for the foreseeable future.
Barring someone putting an "is AVX available?" runtime check inside their code, binaries are generally compiled targeting either SSE or AVX, not both. You can reasonably expect that the same binary thrown against multiple architectures will have the same output.
This, of course, doesn't apply if we are talking about a JIT. All bets are off if you are talking about javascript or the JVM.
That is to say, you can expect that a C++ binary blob from the Ubuntu repo is going to get the same numbers regardless the machine since they generally will target fairly old architectures.
> -ffp-contract=fast enables floating-point expression contraction such as forming of fused multiply-add operations if the target has native support for them
> The default is -ffp-contract=off for C in a standards compliant mode (-std=c11 or similar), -ffp-contract=fast otherwise.
I would have expected that to be a bug in the documentation. Why would they turn FMA contraction off for standards-compliant C mode but not for standards-compliant C++ mode?
It defaults to off for standards-compliant mode, which in my mind was the default mode, since that's what we've used everywhere I have worked in the last 15 years. But of course that's not the case.
In any case, according to the sibling comment, the default is 'fast' even in std-compliant mode in C++, which I find very surprising. I'm not very familiar with that corner of the standard, but it must be looser than the equivalent wording in the C standard.