> it may be possible to achieve a 100× energy-efficiency advantage
Running the math on a machine with 8x A100 (enough to run today's LLMs), that would be 300 W × 8 GPUs / 100 = 24 W.
This is within striking distance of IoT and personal devices. I'm trying to imagine what a world would look like where generative text models are commoditised to the point where you can either generate text locally on your phone, or generate GBs of text in the cloud.
I have to admit it's very hard to make any sort of accurate prediction.
> 100× energy-efficiency advantage for running some of the largest current Transformer models, and that if both the models and the optical hardware are scaled to the quadrillion-parameter regime, optical computers could have a >8,000×
Maybe I interpreted that incorrectly but I thought it's saying a 100x advantage for current large Transformer models, and 8000x advantage for future quadrillion-parameter models? I didn't include those because I suppose that size of model is quite a few years away. Admittedly this is only based on the abstract...
Need to compare this with custom silicon like Apple will be shipping.
They already have the Neural Engine chip which can run Stable Diffusion, but eventually you could imagine casting a specific model instance to an ASIC (say GPT-3.5 or -4, today).
If most devices are replaced within a year or two then you get a pretty good cadence for updating your Siri model (and even more incentive for users to upgrade hardware).
You don't, because of the scaling law they say they've identified. If optical energy per MAC operation scales as 1/d, we know two things: 1) there is no electronic architecture possible that can catch it, and 2) bigger models give optical networks a bigger energy advantage.
It's possible to have a temporary lead because of constant factors, but as long as an electronic circuit has to expend a unit of energy per MAC, you'll always be able to specify a model big enough that an optical network will beat it.
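To see why constant factors only delay the crossover, here's a toy sketch (the per-MAC constants are made-up placeholders for illustration, not figures from the paper): electronic energy per MAC stays flat, while the claimed optical scaling falls as 1/d, so any flat line is eventually undercut.

    # Toy illustration only: assumed constants, not numbers from the paper.
    E_ELEC_PER_MAC = 1.0e-12   # pretend a good electronic accelerator spends ~1 pJ/MAC
    OPT_COEFF      = 1.0e-9    # pretend optical energy per MAC is OPT_COEFF / d

    for d in (1_000, 10_000, 100_000, 1_000_000):   # d = Transformer width
        e_opt = OPT_COEFF / d
        print(f"d={d:>9,}  electronic={E_ELEC_PER_MAC:.0e} J/MAC  "
              f"optical={e_opt:.0e} J/MAC  advantage={E_ELEC_PER_MAC / e_opt:,.0f}x")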
1) this is a research device and a theoretical scaling law; it’s not been proven.
> We conclude that with well-engineered, large-scale optical hardware, it may be possible to achieve a 100× energy-efficiency advantage
Emphasis on may.
2) in the real world, constant factors matter (as you allude to). For example if an ASIC gets a 1000x speedup (optimistic; we saw this for BTC) it might be the better choice for this generation, but start to lose next gen and beyond. If an ASIC only gets 100x or lower then it’s not favorable this gen.
So sure, this tech might win in the long term, but I wasn’t making any categorical claims, just noting that there are multiple horses we need to track.
It would be quite foolish to dismiss custom silicon solutions based on this paper.
You can run it on either now (for example, MochiDiffusion allows you to pick https://github.com/godly-devotion/MochiDiffusion#compute-uni...). Anecdotally, the GPU seems to be faster on an M1 Max or better, while the ANE is a touch faster on anything smaller, and more power-efficient in general.
It will also be at least 100× (physically) larger, because optical wavelength is ~1000 nm vs the ~10 nm of electronic gate size. So much for personal devices.
True but I think the difference is smaller than one might expect from pure element size.
IIRC one reason we don't already have fully 3D chips is because of the heat dissipation. Reducing 2400 W to 24 W means the heat is much more tractable, which means it can be closer to volumetric than planar.
Consider a 1 cm × 1 cm × 1 mm chip: with 1 μm³ elements, that's 1e11 per chip; with (10 nm)³ elements limited to one layer because of heat, 1e12.
Yes this is still a factor of x10, and chips are a few layers because while heat is a problem it's not a total blocker, but it's still much less than the 100^3 ratio a simple scale-up would result in.
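If anyone wants to check the arithmetic, here it is spelled out (same assumptions as above, nothing from the paper):

    die_volume = 1e-2 * 1e-2 * 1e-3        # 1 cm x 1 cm x 1 mm, in m^3
    die_area   = 1e-2 * 1e-2               # one 1 cm x 1 cm layer, in m^2

    optical_elements    = die_volume / (1e-6) ** 3    # 1 um^3 elements, full 3D -> ~1e11
    electronic_elements = die_area / (10e-9) ** 2     # (10 nm)^2 footprint, 1 layer -> ~1e12

    print(f"{optical_elements:.0e} optical vs {electronic_elements:.0e} electronic")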
Maybe, but then we need to make sure light does not seep into neighboring cells; we will need metal shields for that, and then the heat has to dissipate somewhere... maybe they could solve this in the future.
What will happen is that such a thin tube will not be able to confine the electromagnetic wave within its boundary, so most of the wave's field will propagate outside the tube and quickly diffract on any bumps it encounters.
>I'm trying to imagine what a world would look like where generative text models are commoditised to the point where you can either generate text locally on your phone, or generate GBs of text in the cloud.
Dead internet theory for one. Scalable spear phishing and scams. Scalable automated offensive hacking. SEO far worse than anything possible today. Mass manipulation campaigns.
Social interaction would also be strange. Every messenger and dating app able to automatically reply and suggest sophisticated messages.
I/O is already meeting those performance levels on today's technology.
You can get a hundred dollar SSD that has a read speed around 7 gigabytes per second using just over five watts. That will fill up 8x80GB in a minute and a half of load time. If your energy budget is 24 watts then install four and make it 20-25 seconds.
As far as cost goes, I don't know what the proposed chip would cost, but $400 of SSDs is about 0.5% of that pile of GPUs, and SSDs will only get cheaper.
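For anyone following along, here's the load-time math spelled out (assumed numbers for a ~$100 consumer NVMe drive, roughly as above):

    model_bytes = 8 * 80e9     # 8 x 80 GB of weights
    read_bps    = 7e9          # ~7 GB/s sequential read per SSD
    power_w     = 5            # ~5 W per drive under load

    for drives in (1, 4):
        print(f"{drives} drive(s): ~{model_bytes / (drives * read_bps):.0f} s "
              f"to load, ~{drives * power_w} W")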
I have recently written a paper on understanding transformer learning via the lens of coinduction & Hopf algebra. https://arxiv.org/abs/2302.01834
This ties really nicely to the photonic DSP as convolution is fundamentally composition of convolutive systems. This is a generalized convolution, not the standard one.
The learning mechanism of transformer models was poorly understood; however, it turns out that a transformer is like a circuit with feedback.
I argue that autodiff can be replaced with what I call Hopf coherence which happens within the single layer as opposed to across the whole graph.
Furthermore, if we view transformers as Hopf algebras, one can bring convolutional models, diffusion models and transformers under a single umbrella.
I'm working on a next-gen Hopf algebra based machine learning framework. The beautiful part is that it ties nicely to the PL aspect as well.
Look at the diagram. Do you see the path going through the middle? And the paths at the top and bottom (they are generalized convolutions)? Well, a Hopf algebra "learns" by updating its internal state in order to enforce an invariance between the middle path and the top and bottom paths.
Allow me to restate it: it's an algebra that "learns". Reading about Hopf algebras is trippier than dropping acid.
I have a personal dislike of special-purpose hardware. I didn't really realize how much I disliked the idea until I woke up to GPUs and off-CPU processing and the complexity of dealing with this "thing" bolted onto the side of a beautiful ISA, to do "other things". Which of course is entirely normal; it's always been so. Bell and Howell discussed a wheel of life about the smarts to run a printer in mainframe days, winding up with the printer moving to its own dedicated CPU and being networked, and then you repeat the wheel over the network controller.
I first met this with a Floating Point Systems co-processor on a DEC-10. It was highly problem-specific, only a single DECUS tape FORTRAN compiler talked to it (IIRC), and you had this giant freezer-sized box (it was lime green, unlike the blue DEC-10 livery) sat next to the CPU doing... mostly nothing. It was basically idle almost all the time, occupying floorspace, consuming power, and making the DEC-10 more complex to operate.
Same-same with 3DES processing cards, on-card Ethernet TCP, you name it: these things absolutely do improve the world, but at a cost of complexity.
So the idea of designing optical interference/coherence/diffraction "engines" to bolt onto the side of a beautiful ISA, and winding up with something morally like a GPU which has to do complex TCAM-like memory and weird protocols on the bus, all to get what? Some context-specific speedup of... text processing?
If this winds up down in the VLSI level on-chip, as an alternate compute element inside the ALU/ISA space, with its own bus path, and simply integrates more tightly into the ISA I'd be fine.
Wait, so your thesis would've led to no FPU, GPU, hw TCP, etc.? We just slowly and linearly chug along with a classic 8086 beautiful ISA?
Well, the beautiful (not so beautiful these days) ISA is always there for you, while we grab a library to use a suitable specialized processor and run our shit 1000x faster for 1/100 the power.
Yes. It's a naive world view, I fully admit. Bolt-on co-processors are just a fact of life. Mostly, we now run in abstractions "above" this layer of concern.
I feel like this about the TPU and other chips bundled into phones too. I wonder if we're going to wind up with "photoshop specific" bolt on architectures.
So at what point does something become special purpose? Is L1, L2 or L3 cache special purpose? What about I/O chips? What about the firmware inside Ethernet cards, hard drives or the video chips (not even talking about GPUs)?
I'm not trying to be combative, but it seems completely arbitrary where one would draw a line, and taken to the extreme one is essentially left with a CPU and possibly some RAM that might be some ideal computing machine, but can't even talk to the outside world.
On-chip, to me, means integrated. But with multi-core, shared facilities on the die are sometimes a bit of a bottleneck (I don't do this for a living, that's how I understand it).
So the whole 'hyperthreading is cool, 2x the CPU' thing: it turns out that as long as you're, e.g., integer-only, it's true. If you invoke FP, it's not: they stall.
L1 cache in some instances is now big enough some people's code never leaves it.
Nobody who lives above a compiler should ever have to worry about this. I live above scripts. I don't really worry about this. I just hark back to when I ran these boxes, on raised floors, remembering how... distasteful adjunct computing equipment was. It was completely asynchronous, required its own resets, interfered with cabling, had effects on the bus (Unibus!) you didn't entirely understand; it made life "harder" for operations.
I remember a rather interesting "application specific" thing that happened with Risc PCs in the UK. Because at that particular point in history Acorn and the RISC-OS ecosystem were having to coexist with the Windows ecosystem, people wanted to run their DOS and Windows applications on their Risc PCs. The 80386 and 80486 were bolted onto daughterboards and sold as special-purpose add-on processors, just to run those applications.
They're a sort of necessary evil. They definitely complicate things; look at the debates over Tanenbaum's work writing the OS for secure-enclave/BIOS interactions. And, like protection rings, they come with questions about bugs and subverted state.
I have worked on systems doing HSM like activity which depended on vendor specific TPM. They're a pain.
I think the better way to look at it is that the norm is a very heterogeneous architecture with all sorts of purpose-built accelerators. However, while Moore’s law was in play, general-purpose compute ate all the accelerators (hence integrated FPUs, SIMD, etc). Now that it’s dead, we’re going to go back, and this is likely the new normal unless we somehow unlock a new era of exponential growth (e.g. optical compute). However, at this time that seems less likely, so we’re likely to see more of these accelerators. The challenge of course is scale: specialized accelerators by definition have smaller market niches, which in turn limits how cheap they can be and the creativity of what people think to do with them (imagine what computing would look like if only supercomputers had floating point units to support numerical simulations).
Amdahl's law always resonated for me for this reason: you only gain speedup equal to the portion of the total work that can be accelerated.
From that perspective, speedup is less about hyperefficient design of a brand new thing, and more about recognizing large portions of existing workload that are amenable to acceleration.
More successful accelerators have been born of the latter approach (superscalar/hyperthreading, GPUs once OpenGL/DirectX dominated, fixed-function video decode hardware) than the former (Itanium/VLIW)*.
Or in simpler form, never start a value proposition with "First, rewrite all your software..."
* ML is an odd duck, as it's somewhat co-evolving with its own accelerators?
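For readers who want the Amdahl's law point above in numbers, a minimal sketch:

    def amdahl_speedup(p, s):
        """p = fraction of work the accelerator touches, s = speedup on that fraction."""
        return 1.0 / ((1.0 - p) + p / s)

    # Even an enormous accelerator barely helps if it only covers half the workload:
    print(amdahl_speedup(p=0.50, s=1000))   # ~2.0x overall
    print(amdahl_speedup(p=0.95, s=1000))   # ~19.6x overall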
Even a single CPU operating like that is an illusion. Your ISA is controlling all sorts of stations within the CPU, and getting good performance often means understanding hazards between the parallel processing occurring across different pipeline stages and stations.
General-purpose hardware can never be as efficient as special-purpose hardware, unfortunately. All computation has a structure to it that is defined by the way the data flows into the operations, which in turn suggests some optimal physical organization. So you first perform the computation on the general purpose hardware you already have, until you have to do it so often it becomes advantageous to design and build a more optimized structure for it.
If you're looking at deep learning, the end game is probably hardware a bit like the brain: a hundred billion specialized, low-clocked coprocessors crammed into a small space with only local communication with their neighbours.
Isn’t the solution better “mechanical” and language integrations for “off-cpu compute”?
CUDA and subsequent libs arguably provided this and caused a pretty massive acceleration in the amount of GPU-utilising code that got written. Now, CUDA might still be pretty awful to write, but if general-purpose GPU APIs continue to evolve and refine, we might be able to get to a point where something like an LLVM compiler can auto-generate efficient GPU code for matrix ops, much like how it can auto-vectorise certain patterns now.
I think AMD and Intel have encouraged this with OpenCL and other efforts. Nvidia hasn't, and since Nvidia/CUDA is pretty much the standard for AI and GPU acceleration....
I don’t understand this “stick to your guns(CPU)” mentality.
With Moore’s law overshadowing any ideas about software optimization, the mantra of “wait until the CPU is twice as fast in two years” has dominated the industry for long enough.
Not everything needs to fit in an idealistic CPU and we should allow for more complex computational machinery.
Let’s allow for new hardware, new programming paradigms, new ways of thinking.
This is kind of ahistorical. It worked as long as Moore's law worked. The special-purpose h/w bought speedup for "now" and within a tick-tock cycle was overtaken by the general-purpose CPU, unless there was countervailing investment in that special-purpose hardware: which happened to be basically the GPU and almost nothing else.
FPGAs and ASICs were tuned to do highly specific tasks, and then wound up being much the same: prove it in an FPGA, make an ASIC, then see it move on-die as the VLSI matures.
Now that Moore's law has bottomed out, special-purpose processing done asynchronously makes more sense.
I just don't like it, the same way angry old men growl at the moon at night.
There was only a very narrow time window where a single-CPU-without-special-purpose-h/w dominated.
The 8-bit era and before relied heavily on co-processors. Sometimes they were the same or similar to the main CPU - e.g. the Commodore 64 floppy drive had a CPU as powerful as the main CPU of the machine, and you could run code on it - but it took special effort to target them. The most successful 16-bit home computers had co-processors and special-purpose hw (e.g. the Amiga had its copper and blitter, but some models also had a 6502-compatible core controlling the keyboard, and many SCSI controllers at the time had full CPUs - mine had a Z80).
The PC got cheap by relying on the CPU to be able to control most things in an age where special-purpose hardware was expensive due to low volumes (the Amiga "only" sold about 5 million across multiple models), but it didn't take long before a typical PC started getting micro-controllers and full-on CPUs everywhere (e.g. in your hard drive [1]) because CPU speed has mostly not gotten fast enough.
Just for a short window of maybe a decade it got fast enough to outpace the cost of low volume special purpose HW.
I guess you don't like arms or legs then. We should all look like amorphous gelatinous blobs of undifferentiated flesh. Look, I have my own quirks - but you gotta squash the ones that make no sense.
I'll admit, I'm entirely out of the loop when it comes to the technical aspects of how these large transformer models actually "work". Luckily, statements like "optical computers could have >8000x energy efficiency advantage" are well within my capacity to understand.
That said... Can anyone parse what's being done here? How/why is it more efficient? Is this essentially the Analog Computer of these giant transformer models?
Finally, how can I build one with fiber optic cable and LEDs?
The key idea is that you can use physical effects such as attenuating light to implement multiplication in a way that uses much less energy than the fairly complex arrangement of digital logic necessary to implement the same operation. This is roughly the method they use in the paper.
I've seen a lab bench prototype of a different implementation, there are a lot of engineering problems to solve but as the paper points out the potential payoff is big.
Edit: The other key point is that one of the expensive components in transformers is effectively a giant matrix multiplication which implies many, many individual multiplications.
The optical stuff is still research territory and too early to be comparable to the current digital silicon. Lots of open questions about how to implement multiplication, implement activation functions, where to store weights, how to move data in and out, manufacturability of all of this, etc. Basically it's all questions.
To get some intuition about the promise imagine being able to implement the weights of a layer (fully connected layer is essentially vector matrix multiplication) as a 2D hologram and the compute as pushing an image from an OLED display through to an image sensor. Multiplication as attenuation in the hologram, summation is just the accumulation of charge on the sensor side. Everything happens all at once and the number of photons required is potentially very small. An actual working implementation would be both more clever and more practical. The potential to do every multiply in parallel for almost no energy is so attractive that I expect people to chip away at this problem for the foreseeable future.
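A toy numerical analogue of that OLED-through-hologram picture, just to make the "multiply as attenuation, sum as charge accumulation" idea concrete (this is only the math, not how the paper's hardware is actually built, and real hardware has to handle signs and noise):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, size=8)        # input "pixel" intensities (non-negative, like light)
    W = rng.uniform(0, 1, size=(4, 8))   # transmissivities in [0, 1], one per weight

    attenuated = W * x                   # each input attenuated by its weight
    y = attenuated.sum(axis=1)           # each detector accumulates charge across its row

    assert np.allclose(y, W @ x)         # identical to an ordinary matrix-vector product
    print(y)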
To run an AI model, you do not need a general-purpose computer, you don't need a powerful CPU capable of executing a wide variety of code. You mostly need massive amounts of matrix multiplication, and some simpler operations.
You also don't need very high precision: many models perform well even using 8-bit floats. This makes it possible to ditch the whole digital approach and implement analog circuits, which, while sometimes less precise, are massively simpler and more energy-efficient.
So they built a mostly-analog device specialized for running ML models, and used optics instead of electronics, which makes certain things much faster, on top of the simplification.
There are known attempts to use electronic approaches, such as using flash-like structures that store charge as an analog value to hold the model weights, and to do addition and multiplication right inside the cells.
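To put a number on the reduced-precision point above, a quick (and entirely illustrative) experiment rounding weights and activations to ~8-bit levels:

    import numpy as np

    rng = np.random.default_rng(1)
    W = rng.normal(size=(256, 256))
    x = rng.normal(size=256)

    def quantize(a, bits=8):
        scale = np.abs(a).max() / (2 ** (bits - 1) - 1)
        return np.round(a / scale) * scale

    y_full = W @ x
    y_q    = quantize(W) @ quantize(x)
    print(np.linalg.norm(y_full - y_q) / np.linalg.norm(y_full))   # ~1% relative error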
There are lots of fun things you can do with optical circuits. One of the most fascinating to me was running multiple threads of execution through the same circuits at the same time at different wavelengths of light! I would speculate that there are similar sorts of things going on in optical transformers, but honestly trying to get deep enough into both the ML architecture AND the optical architectures implementing them is probably a month of spare time that I don't have at the moment :-).
I know much less about the large transformer models than I do about the optical hardware but the hardware ideas have been floating around since the 1980s. The development of lasers-on-a-chip made it all a lot more realizable. E.g.
"On-chip optical matrix-vector multiplier for parallel computation" (2013)
Figure 1 has a nice visual of the vector-matrix computation in terms of the diode laser array, the multiplexer, the "microring modulator matrix" and the resulting output, the new (output) vector detection system.
> "We have designed and fabricated a prototype of a system capable of performing a multiplication of a M × N matrix A by a N × 1 vector B to give a M × 1 vector C. The mathematical procedure of MVM can be split into multiplications and additions, which is reflected in our design. Figure 1 shows a schematic of the architecture we propose. The elements of B are represented by the power of N modulated optical signals with N different wavelengths (λ1, λ2, …, λN), generated by N modulated laser diodes, either alone or together with N Mach-Zehnder modulators. These signals are multiplexed, passed through a common waveguide, and then projected onto M rows of the modulator matrix by a 1 × M optical splitter. Each element aij of matrix A is represented physically by the transmissivity of the microring modulator located in the ith row and the jth column of the modulator matrix. Each modulator in any one row only manipulates an optical signal with a specific wavelength."
That was 10 years ago, no idea what current state-of-the-art is.
I'm not sure how to reconcile the purported gains with the fact that matrix multiplies are empirically the most heavily accelerated primitive [1] on current-gen hardware and that the "digital ops" shown here aren't even a blip on the "fraction of total compute" in Figure 6. Sure, they're very small in terms of FLOPs, but they take up a disproportionate amount of time being bandwidth-bound. Intuitively, adding another hop off-chip and A/D or D/A conversion doesn't sound great, and I wonder if that's why this work sticks to efficiency over end-to-end throughput. Given that GPUs today mostly trade efficiency for clock rate and speed (think about how a single GPU can be at > 300W TDP), how much efficiency could we gain by simply inverting that tradeoff?
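To illustrate the bandwidth-bound point with rough numbers (FP16 operands, sizes of my own choosing; the ratio is the point, not the exact figures):

    d, bytes_per_elem = 4096, 2

    matmul_intensity    = (2 * d**3) / (3 * d**2 * bytes_per_elem)   # ~1365 FLOPs/byte
    pointwise_intensity = (d**2) / (2 * d**2 * bytes_per_elem)       # 0.25 FLOPs/byte

    # Matmuls have huge arithmetic intensity (compute-bound); the small
    # pointwise "digital ops" barely reuse data, so they sit on the
    # bandwidth-bound side of the roofline.
    print(matmul_intensity, pointwise_intensity)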
I haven't read this paper yet but I'm familiar with the general work. The aspect that everyone ignores is that, yes, linear transformations like matrix operations or Fourier transforms are incredibly fast in optics, but the nonlinearity is the sticking point. While optical propagation is nonlinear, you need very high intensities. The elephant in the room is that the linear operations rely on parallelism, i.e. they split the optical power up into multiple paths, so each path has very low intensity and thus exhibits low nonlinearity. The solution has been that everyone simply uses optical-to-electrical conversion and does the nonlinearity digitally (or sometimes in analog electronics). That sort of works for one layer, but completely falls apart for multiple layers; it is neither cost- nor energy-efficient to have hundreds or possibly thousands of A/D converters.
It's interesting because of the scaling law. No matter how much acceleration matrix multiplication gets on an electronic circuit, its energy usage is always going to scale as O(n^2.something). The implication here is that the energy usage by doing it optically is O(1). At least, that's how I read "We found that the optical energy per multiply-accumulate (MAC) scales as 1/d where d is the Transformer width". The best you can hope for is to stay on the right side of the constant factors (which, currently, the GPU world is).
IIRC from my undergrad research (8 years ago) there's still an issue of I/O interconnects bounding the usefulness of optical computers, and I believe memory and GPU RAM are also the current primary bottleneck for training, i.e. not compute. Still cool for the next wave of data center or edge compute, I suppose.
Wildly tangential (given what the article actually means by "Optical Transformer"), but I was thinking about this the other day and what I really want is an Audial Transformer.
The main reason for this is I want something I can hum a song to and it tells me what the song is. We have Shazam, but AFAIK that works almost entirely using Fourier transforms / audio signals in frequency space, and so doesn't work well if it's not an actual snippet of the song. But it would be cool if we could train an audio model like we've trained language models to "fill in the gaps" for me. Potentially there could be some interesting applications where you sing an entirely new song and the model could "fill in the gaps" with instruments, etc.
Google audio search not only does speech to text, but can give you the song if you hum. I was with some friends who tried it outside in SF and it was a noisy environment on the sidewalk, and it was able to identify the song from their hums. Magic.
Something I have pondered for quite some time since I experimented with pitch-shifting devices for my guitar, everything from an SB Live! sound card to various digital effects pedals - they all introduce insane amounts of warbling the further from the original tuning you go.
Would this sort of analog-digital circuitry or something similar possibly work to build a pitch shifter with high enough resolution to effectively allow full audio range pitch shifting of source audio without warbling? Of all the things I've tried, the SBLive! was actually the best, holding semi-stable 2 full steps down on my guitar. Everything else was fairly nasty after a full step.
This is layman speculation, but… The human brain is notoriously efficient; neural networks are modeled on the brain so they’re probably doing something like what the brain does; the brain uses electrical signals and these optical transformers use light signals;
Maxwell’s work unifies electricity and light. It all seems to add up - the general process is “take advantage of an energy field to cheaply perform calculations on a mass scale”, brains build a biological/chemical structure to use the electrical field, optical transformers will build a silicon structure to use the light field.
My understanding is that modern neural networks aren’t much like the brain, but machine learning researchers like the way it sounds to suggest that they are.
ML researchers I've spoken to strongly dislike the way people suggest similarities between biologically-inspired systems and neural networks. The earliest papers did use our understanding of the neuron as inspiration for some of the ideas, but these approaches have far more similarities with statistical analysis than they do to neuroscience. At the end of the day, NNs are just nonlinear function approximation, and if I had my way we would have called it that from the start.
Choosing positioning like "NNs are similar to the brain" or even calling the field "machine learning" makes it harder to speak objectively about the research we perform, makes it hard to understand the sort of explanatory power and limitations of the models we fit, and makes it harder for users to understand how the system works. It's like how quantum mechanics researchers have to deal with readers who misunderstand what linear observable operators are and conflate it with their own ideas about conscious observers or use it as evidence of God's presence or whatever.
This! I was listening to a radio interview with a famous NN researcher and everything was going great until the interviewer asked him a question that only a neurologist could answer, and he kept talking in exactly the same way, answering as though he was a neurologist! I was embarrassed for him. It seemed to me that the correct and useful response would have been something like what you just said.
NNs are inspired by the brain, it's wrong to suggest they're not and also wrong to suggest brains are _merely_ NNs.
Activation functions in NNs map to action potentials/postsynaptic potentials, weights map to synaptic strengths/connection patterns between neurons, dropout to synaptic pruning/apoptosis, and convolutional layers to the receptive fields of our visual cortex.
Certainly our brains are more complicated. Temporal spike timing, dendritic signals, synaptic plasticity, computation in the synapses themselves, lots more I can't recall: none of these concepts are currently modelled in our NN architectures.
NNs start from the simple premise of interconnected neurons. As the field develops and we find what works or what doesn't, perhaps we will take more bits of inspiration from the brain.
And there’s research coming out that suggests that forward propagation can implement the same thing comparatively cheaply without much loss in accuracy, and is likely the next step. I don’t think we really know enough to conclude things either way, and we don’t know if brains never did back propagation or evolved it away.
Expanding on this history for readers here who are just catching up on this stuff, modern feedforward NNs descended from the Perceptron paper by Rosenblatt and friends from 1957. This was the first paper that really tried to position "learning machines" based on accumulated stimulus activations, and it built off of a previous paper from 1943, "A Logical Calculus of Ideas Immanent in Nervous Activity." So yes it's true that the field borrowed ideas from the brain and it was positioned that way from the start, though Perceptrons totally weren't a very similar implementation to what Nature herself designs...
A Perceptron model is just a dot product: one multiply+add, summed to one output scalar value. Modern nomenclature might consider Perceptrons to just be "one channel of a single layer," or like "one fully-connected layer with a single output." In fact, that's where the old-fashioned name for neural networks came from: "Multi-layer Perceptrons (MLPs)" are a bunch of these arranged on top of each other. The older 1943 work showed that in some sense lots of possible arrangements are roughly equivalent to each other, so it made sense for Rosenblatt and friends to start from the simplest model with the understanding that the field would build from there. (Our feedforward model ancestry is just one branch of the tree rooted in this work; there's a completely separate genealogy for other possible arrangements of modeling components, like recurrent Boltzmann machines that simply haven't been so well studied/understood... :-)
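For the curious, the whole 1957-style machine fits in a few lines; it learns anything linearly separable (like OR below) and, as comes up a couple of comments down, famously cannot learn XOR. (My own toy sketch, not Rosenblatt's original formulation.)

    import numpy as np

    def train_perceptron(X, y, epochs=20, lr=0.1):
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(epochs):
            for xi, target in zip(X, y):
                pred = 1.0 if xi @ w + b > 0 else 0.0   # dot product + threshold = one Perceptron
                w += lr * (target - pred) * xi
                b += lr * (target - pred)
        return w, b

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    w, b = train_perceptron(X, y=np.array([0.0, 1.0, 1.0, 1.0]))    # OR is linearly separable
    print([1.0 if xi @ w + b > 0 else 0.0 for xi in X])             # [0.0, 1.0, 1.0, 1.0]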
The generality of Perceptrons made room for a lot of interpretive flexibility. Right from the start, news reporters saw the theory, attended academic conferences, and started writing about it. The NYTimes had an article in 1958 that spoke of the Perceptron as "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."
In fact there was a very famous kerfuffle between Rosenblatt (of the Perceptron paper) and Minsky & Papert (two other AI researchers). The latter two published a spicy book in 1969 saying that a single Perceptron wasn't general enough because linear models can't even learn an XOR function. Everyone who read this book misinterpreted it as a takedown of Perceptrons; this book inadvertently dried up interest, choked funding across the board, and ultimately caused the first AI winter, which didn't get resolved until well into the '80s when these ideas started to be revitalized. You can read about the spilled tea here, it's quite fascinating: https://doi.org/10.1177%2F030631296026003005
I think the cause of the first AI winter had more to do with some of the early researchers suggesting they'd have solutions to problems within a few months to years that are just now becoming tractable. In retrospect it doesn't seem like they could've gone much faster than they did. New ideas mattered but increased computing speeds probably mattered more. It's not like we would have been able to get dramatically better results than those researchers did by taking what we know now and running it on the same hardware they had back then.
I do not know with what hardware GPT-5 will be trained, but GPT-6 will be trained on something with similar levels of efficiency as an OPU (and probably distributable via consumer-accessible hardware)
If the time between successive GPT releases halves after each successive release, we'll have to do what Bungie did with the third game in the Marathon trilogy and call it GPT-∞.
And then the post-singularity open source release, GPT-ℵ₁
GPT-5 level models could very well be distributed, training on millions of internet-connected iPhones. This is all pure speculation; we don't even know what GPT-4 will look like.
There are significant differences, because proof-of-work crypto mining is inherently a digital logic operation, so it's not amenable to photonic analog computing used for, e.g., approximate matrix multiplication and summation used in neural nets.
That said, photonic gates for digital circuits hold some promise for power efficiency and speed in their own way. For many of the parameters which apply to crypto mining, it's not so clear whether photonic digital circuits will significantly outperform the best near-future electronic transistors though.
(Source: I worked on a photonic crypto circuit design based on 4-wave (non-linear) mixing. Non-linearity is required for digital logic. Even though 4-wave mixing is energy efficient, the design was less efficient and performant than I'd initially hoped from the high spatial parallelism and basic THz figures. The principles still hold a lot of promise.)
It's funny how ideas come back in ~30-year cycles. The energy efficiency of optical computing has been discussed quite extensively; e.g. David Miller at Stanford wrote a science paper laying out the fundamental problems with optical nonlinearities for computing, i.e. photons are bosons and therefore don't like to interact, and optical nonlinearities are very inefficient compared to electronic ones. So in that sense FWM is not efficient.
Now for some specific operations optical signal processing might still make sense, for example if you can take advantage of the bandwidth and inherent parallelism.
Seems unlikely. The paper is about performing certain computations in the analog domain, trading precision for power efficiency. Hashing requires lossless computation.