Bresenham's Circle Drawing Algorithm (2021) (funloop.org)
107 points by RojerGS 15 days ago | 45 comments



Do note that Bresenham’s family of algorithms (and much of the venerable Computer Graphics Principles and Practice) are from a bygone era where computers executed 1 instruction per cycle without pipelining or prediction.

These days processors prefer to draw shapes by having a coarse-grained pass that conservatively selects tiles of interest and then brute-force evaluates each pixel in each tile independently of all others.

Instead of minimizing total work, the goal is to maximize pipelined parallelism while skipping over unnecessary work in large blocks.
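
Roughly, that structure looks like this (a plain-Python sketch with a hypothetical fill_circle_tiled helper, drawing a filled disc; a real GPU runs the per-pixel pass in parallel rather than in nested loops):

    def fill_circle_tiled(cx, cy, r, tile=8):
        # Coarse pass: keep only tiles whose nearest point to the centre
        # lies inside the disc; everything else is skipped in one big block.
        # Fine pass: evaluate every pixel of a kept tile independently.
        pixels = []
        r2 = r * r
        for ty in range(cy - r, cy + r + 1, tile):
            for tx in range(cx - r, cx + r + 1, tile):
                nx = min(max(cx, tx), tx + tile - 1)   # nearest tile point to centre (x)
                ny = min(max(cy, ty), ty + tile - 1)   # nearest tile point to centre (y)
                if (nx - cx) ** 2 + (ny - cy) ** 2 > r2:
                    continue                           # tile can't touch the disc, skip it whole
                for y in range(ty, ty + tile):
                    for x in range(tx, tx + tile):
                        if (x - cx) ** 2 + (y - cy) ** 2 <= r2:
                            pixels.append((x, y))
        return pixels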


> a bygone era where computers executed 1 instruction per cycle without pipelining or prediction

1 instruction per cycle? What luxury bygone era did you grow up in?

Wikipedia tells me the algorithm is from 1962 on an IBM 1401 (https://en.wikipedia.org/wiki/Bresenham's_line_algorithm#His...

That definitely didn’t have many single cycle instructions. Skimming https://ibm-1401.info/A24-6447-0_1401_1460_Instruction_and_T..., I couldn’t find any.

Certainly, in the era of 8-bit CPUs like the Z80 and 6502, programmers would have waxed lyrical about “1 instruction per cycle”.

Actually, did any CPU ever “execute 1 instruction per cycle without pipelining or prediction” (or, slightly looser “had fixed time instructions without pipelining or prediction”)?

RISC introduced fixed time instructions, but also pipelining.


ADSP-218x family of Harvard-architecture DSPs. Each 25 ns clock cycle = 1 MAC, 1 data fetch, and 1 instruction fetch. All instructions 1 cycle long. And many other gizmos, like bit-reversed addressing for in-place FFTs and a separate register bank for IRQ handlers. And all laid out by hand.


no, you're right, you almost always need pipelining to get one instruction per clock cycle

but there are a lot of cpus out there—maybe the majority now—that are pipelined microarchitectures that get one instruction per cycle without much or any prediction. avrs, most of the cortex-m* (all?), most modern implementations of old slow isas like the z80 and 8051, etc. big processors like your laptop cpu and cellphone cpu are of course superscalar, but they are a tiny minority of all processors. even inside the cellphone case, they're outnumbered by in-order scalar microcontrollers

without prediction, of course, you have a pipeline bubble every time you have a branch, so you never quite hit 1 ipc with scalar execution. but it's usually pretty close, and even with prediction, sometimes you miss. and usually if you have branch prediction, you also have a cache, because ain't nobody got time to share one triflin memory bus between instructions and data

so pipelining gets you from, say, 3 clocks per instruction down to 1.2 or so. then prediction gets you from 1.2 down to, say, 1.02


I remember the bragging point on the RTX2000 in the very late eighties was "a MIP per megahertz".


my limited experience with stack machines is that a stack mip is about half a register-machine mip :-(

consider this simple assembly-language subroutine i wrote in october (http://canonical.org/~kragen/sw/dev3/tetris.S, screencast at https://asciinema.org/a/622461):

            @@ Set sleep for iwait to r0 milliseconds.
            @@ (r0 must be under 1000)
            .thumb_func
    waitis: ldr r2, =wait           @ struct timeval
            movs r3, #0             @ 0 sec
            str r3, [r2]            @ .tv_sec = 0
            ldr r1, =1000           @ multiplier for ms
            mul r0, r1
            str r0, [r2, #4]        @ set .tv_usec
            bx lr

            .bss
    wait:   .fill 8                 @ the struct timeval
            .text
these are all perfectly normal register-machine instructions; you could translate them one-to-one to almost any register machine. on a few of them you could drop one, writing something like str #0, [wait]. the whole function is straight-line execution, a single basic block, and seven instructions long. this is almost a best case for a stack machine; it looks like this:

    lit(0)      ; load immediate constant
    lea(wait)   ; load address of wait onto stack
    !           ; store 0 in wait
    lit(1000)   ; another constant
    *           ; multiply argument by constant
    lea(wait)   ; load address again
    lit(4)      ; offset of .tv_usec
    +           ; calculate address of wait.tv_usec
    !           ; store product in tv_usec
    ret         ; return from subroutine 
that's 10 instructions, about 50% longer than the 6 or 7 of the two-operand register machine. but the typical case is worse. and basically the reason is that the average number of stack manipulations or constant pushes that you need to do to get the right operands on top of the stack for your memory accesses and computational operations is roughly 1. sometimes you'll have a * or ! (store) or + that's not preceded by a stack manipulation or a load-immediate or a load-address operation, and that time the stack machine wins, but other times it'll be preceded by two or three of them, and that time the stack machine loses

so it averages out to about two stack instructions per register-machine instruction. call and return is faster on the stack machine, but passing arguments and return values in registers on the register machine can take away most of that advantage too

the rtx-2000 did some tricks to sometimes do more than a single stack operation per cycle, but it didn't really escape from this

this doesn't necessarily mean that stack machines like the rtx-2000 are a bad design approach! the design rationale is that you get very short path lengths, so you can clock the design higher than you could clock a design that had register-file muxes in the critical path, and you also avoid the branching penalty that pipelines impose on you, and you use less silicon area. mainstream computation took a different path, but plausibly the two-stack designs like the rtx-2000, the novix nc4000, the shboom, the mup21, and the greenarrays f18a could have been competitive with the mainstream risc etc. approach. but you do need a higher clock rate because each instruction does less

i don't remember if dr. ting wrote a book about the rtx-2000, but he did write a very nice book about its predecessor the nc4000, going into some of these tricks: https://www.forth.org/OffeteStore/4001-footstepsFinal.pdf


"oh these new-fangled kids what with their superscalar processors. 100 instructions in a cycle, phooey! Back in my day, it was dang gummed 100 cycles for each instruction, and gosh darnit we liked it! Now that was just for the add instruction, a division, well some say they never figured out how many cycles that was because it took too long. I had an onion in my belt which was the style at the time"

Probably meant 'cycle' in the sense of instruction cycle, rather than clock cycle.


Even then, few CPUs execute all instructions in the same amount of time. Division, for example, still takes longer than a register move on most architectures. Multiplication, historically, did too.

As a sibling comment points out, DSPs can be an exception. I’m far from an expert on them, but CPUs that try to run all instructions in the same fixed amount of time used to accomplish that by avoiding complex addressing modes and by omitting the really slow instructions (division and instructions that push/pop multiple registers onto/from the stack being prime examples).

IIRC, some of them also accomplished it by running some of the simpler instructions slower than necessary (raw speed isn’t as important as being hard real time in many applications).


> few CPUs execute all instructions in the same amount of time. Division, for example, still takes longer than a register move on most architectures

easy solution, as you allude to, don't have a division instruction! arm doesn't, 8048 doesn't, sparcv7 doesn't, even the cdc 6600 and cray-1 didn't, and even risc-v finally got zmmul: https://wiki.riscv.org/display/HOME/Zmmul. it's not just dsps

the big issue with complex addressing modes is i think fault handling. if your nice orthogonal add instruction updates two registers as it reads operands and a third register with the sum of the operands, what do you do if you get a page fault on the second operand? if the os can service the page fault, how does it restart the instruction?

as you point out, real-time latency is also an issue. on older arm chips you have to be careful not to ldm or stm too many registers in a single instruction so as not to damage interrupt latency. newer arm chips can restart the ldm/stm

i'm no expert on the area either


> don't have a division instruction! arm doesn't, 8048 doesn't

Heh, that's an understatement. The 8048 doesn't even have a subtract instruction, much less divide.


cpl add cpl was good enough for my daddy and it's good enough for me


I think it depends. I had Dr. Bresenham as my graphics instructor (he taught for a while after he retired from IBM), and the class used Borland Turbo Pascal, which for its time was fast. Not as fast as raw assembly, but faster than Borland's Turbo C, which had just come out.

So far as 1 instruction per cycle - Wikipedia says the 80286 (the top dog PC processor of the time) could execute 0.21 "typical" instructions per clock. Optimized code was up to 0.5 instructions per clock. And that agrees with my memories.

Today, I would try to use parallelism if I could. With lots of conditions though - is the process being applied to the image trivially parallelizable, will most/all of it fit in cache, etc. Trying to parallelize Bresenham's algorithms, though, would be futile - when drawing circles you can reflect into the different quadrants (big savings), but the algorithm itself is going to be pretty serial because it has to keep track of the error term as it draws.
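
For reference, the serial loop looks roughly like this (a sketch of one common eight-way-symmetry midpoint/Bresenham variant, not necessarily the article's exact derivation; duplicates at the octant boundaries aren't filtered):

    def circle_points(cx, cy, r):
        x, y, err = 0, r, 1 - r
        pts = []
        while x <= y:
            # One octant is walked serially; the other seven come by reflection.
            for px, py in ((x, y), (y, x), (-x, y), (-y, x),
                           (x, -y), (y, -x), (-x, -y), (-y, -x)):
                pts.append((cx + px, cy + py))
            x += 1
            if err < 0:
                err += 2 * x + 1          # midpoint still inside: keep y
            else:
                y -= 1
                err += 2 * (x - y) + 1    # midpoint outside: step y inward
            # each error update depends on the previous step, which is what
            # makes the loop hard to parallelize
        return pts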


It's actually not difficult to vectorize Bresenham algorithms, at least the line algorithm. You just have to preroll the algorithm for each of the lanes and then adjust the steps and error factors so each lane steps 4-8 pixels ahead at a time interleaved. I've done this for the floor type 1 render in Ogg Vorbis, which is defined in terms of a Bresenham-like algorithm.
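
For a shallow line the prerolling looks something like this sketch (NumPy lanes standing in for SIMD, hypothetical names, and a truncating Bresenham variant rather than the actual Vorbis code):

    import numpy as np

    def line_lanes(x0, y0, x1, y1, lanes=4):
        dx, dy = x1 - x0, y1 - y0
        assert 0 < dx and 0 <= dy <= dx
        # Preroll: exact y and error term for the first `lanes` pixels.
        xs = x0 + np.arange(lanes)
        ys = y0 + ((xs - x0) * dy) // dx
        err = ((xs - x0) * dy) % dx
        out = []
        for _ in range((dx + lanes) // lanes):
            out.append(np.stack([xs, ys], axis=1))
            xs = xs + lanes                 # every lane jumps `lanes` pixels in x
            err = err + lanes * dy
            ys = ys + err // dx             # how many y steps the jump crossed
            err = err % dx
        pts = np.concatenate(out)
        return pts[pts[:, 0] <= x1]         # drop lanes that ran past the end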


oh wow, this is awesome, thanks! i never realized!


It's also from an era when floats were rather expensive.


I still picture them as expensive. Things like trig functions are still very expensive.


I learned relatively recently that trig functions on the GPU are free if you don’t use too many of them; there’s a separate hardware pipe so they can execute in parallel with float adds and muls. There’s still extra latency, but it’ll hide if there’s enough other stuff in the vicinity.


The CUDA documentation tells me that there are more performant but less precise trigonometric functions: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....

Do you know if that hardware pipeline works only for these intrinsic variants?


Yep, these intrinsics are what I was referring to, and yes the software versions won’t use the hardware trig unit, they’ll be written using an approximating spline and/or Newton’s method, I would assume, probably mostly using adds and multiplies. Note the loss of precision with these fast-math intrinsics isn’t very much, it’s usually like 1 or 2 bits at most.


I couldn't find much information on those. I assume that they don't include range reduction?

I’m not totally sure but I think fast math usually comes with loss of support for denormals, which is a bit of range reduction. Note that even if they had denormals, the absolute error listed in the chart is much bigger than the biggest denorm. So you don’t lose range out at the large ends, but you might for very small numbers. Shouldn’t be a problem for sin/cos since the result is never large, but maybe it could be an issue for other ops.

Just for your information: when calculating trig functions, you first modulo by 2 pi (this is called range reduction). Then you calculate the function, usually as a polynomial approximation, maybe piecewise.
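
A toy Python version of those two steps, just to make the shape concrete (a Taylor polynomial here purely for illustration; real libraries use minimax polynomials and more careful reduction):

    import math

    def sin_approx(x):
        x = math.remainder(x, 2 * math.pi)   # range reduction into [-pi, pi]
        x2 = x * x
        # 7th-order polynomial, evaluated Horner-style
        return x * (1 - x2 / 6 * (1 - x2 / 20 * (1 - x2 / 42)))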

But if it supports larger inputs, it must be doing range reduction, which is impressive for low-cycle ops. It must be done in hardware.

It doesn't surprise me regarding denorms. They're really nice numerically but always disabled when looking for performance!


Oh that range reduction. :) I’m aware of the technique, but thanks I did misunderstand what you were referring to. I don’t know what Nvidia hardware does exactly. For __sinf(), the CUDA guide says: “For x in [-pi,pi], the maximum absolute error is 2^(-21.41), and larger otherwise.” That totally doesn’t answer your question, it could still go either way, but it does kinda tend to imply that it’s best to keep the inputs in-range.

Terms like "range reduction" will definitely be loaded differently in different fields, so my bad.

Yeah, maybe they don't, by the sounds of it.

I don't do much on GPUs nowadays, but I still find this stuff interesting. I'm definitely going to have to do a deeper dive.

Thanks heaps for the info!


Yes, but here it's about avoiding multiplication (and division).

I suspect on a modern processor the branches (ie "if"s) in Bresenham's algorithm are gonna be more expensive than the multiplications and divisions in the naive algorithm.


Bresenham is easy to implement branchlessly using conditional moves or arithmetic. It also produces a regular pattern of branches that is favorable for modern branch predictors that can do pattern prediction.
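
A minimal sketch of the branchless form for a shallow line (the if on the error term becomes arithmetic on a 0/1 mask, which a compiler can lower to a conditional move; an assumed variant, not quoted from the article):

    def line_branchless(x0, y0, x1, y1):
        dx, dy = x1 - x0, y1 - y0            # assumes 0 <= dy <= dx
        err = 2 * dy - dx
        x, y = x0, y0
        pts = []
        for _ in range(dx + 1):
            pts.append((x, y))
            mask = int(err > 0)              # 0 or 1; no data-dependent branch needed
            y += mask
            err += 2 * dy - 2 * dx * mask
            x += 1
        return pts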


Nah, while cycles/instruction were indeed fixed in those days (and for some time yet to come), it was not necessarily 1 cycle but rather depended on the instruction.


Indeed. I used this algorithm on OS/2 1.0 as part of a GUI (OS/2 did not have a GUI until 1.1). That was on an 80386. MASM came with a nice ring-bound A6 book which summarised the instruction set, including timings. I seem to remember that 3 cycles was normal for a short instruction, but many were considerably longer.


In particular, his line-drawing algorithm is easily beaten by fixed-point methods, which also have the advantage of a very short and non-branchy inner loop:

https://news.ycombinator.com/item?id=9954975
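
For example, a 16.16 fixed-point sketch of a shallow left-to-right line (the general idea, not the exact code in the linked comment):

    def line_fixed_point(x0, y0, x1, y1):
        dx, dy = x1 - x0, y1 - y0            # assumes 0 < dx and 0 <= dy <= dx
        step = (dy << 16) // dx              # slope as a 16.16 fixed-point increment
        yf = (y0 << 16) + (1 << 15)          # start at the pixel centre to mimic rounding
        pts = []
        for x in range(x0, x1 + 1):
            pts.append((x, yf >> 16))        # inner loop: one add, one shift, no branch
            yf += step
        return pts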


"yes, but" there's still places that can take advantage of such algorithms today, namely microcontrollers. I think some scripting languages may even apply here, although many interpreters do some level of compilation/optimization instead of serial execution.


Sounds like your algorithm is from a bygone era where power was not important ;)


If you can get the work done fast and hurry the processor back into a low-power sleep state ASAP, it can actually be power efficient too!


1 instruction per cycle? No, that's only possible with pipelining.

We are talking about instructions which took 8-12 cycles to complete.


usually this is correct, but there are some exceptions. most instructions on a non-pipelined but synchronous stack machine like the mup21 take a single cycle, for example

even with a register file, it isn't really inherent that you need to decode inputs, do alu operations, and write outputs in separate clock cycles; you can do all of that in combinational logic except writing the outputs, and you can even decode which register to write the output to. it just means your max clock rate is in the toilet

for that kind of thing a harvard architecture is pretty useful; it allows you to read an instruction in instruction memory at the same time you're reading or writing data in data memory, instead of in two separate cycles


Got an example?


> Note that if F(x,y)=0, then the point (x,y) is exactly on the circle. If F(x,y)>0, then the point is outside of the circle, and if F(x,y)<0 then the point is inside of it. In other words, given any point (x,y), F(x,y) is the distance from the true circle line [my emphasis].

This last is not quite true. The exact distance from the circle, call it G(x,y), is the corresponding difference of square roots, i.e.,

  import math

  def G(x, y, r):
    return math.sqrt(x * x + y * y) - math.sqrt(r * r)
and G(x,y) isn't just the square root of F(x,y), and indeed doesn't behave monotonically with respect to F(x,y).

It's an interesting property of Bresenham's algorithm, one that I've never seen even stated, let alone proved, in the literature: this doesn't matter, and the algorithm is indeed exact in the sense that it always chooses the next point based on which is truly closest to the circle... despite using an error function that is only an approximation.


My computer graphics professor back in the early 80's was Prof. Glen Bresenham, but not The Bresenham. It was a lot of fun being at SIGGRAPH back then and watching people freak upon reading his name badge. He'd let them believe for a bit, and then explain it's not him. Al Acorn was one that freaked, and that was fun.


alcorn?


Allan Alcorn is an American electrical engineer and pioneering computer scientist, best known for creating Pong, one of the first video games.


yeah, i know him. not acorn


What I found most amazing about Bresenham during my Amiga days is how you can use it to zoom and shrink bitmaps.
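
Roughly, the same error-term stepping drives a source index against a destination index, so there's no multiply or divide per pixel. A hypothetical single-row sketch:

    def scale_row(src, dst_len):
        # Step the source index with a Bresenham-style error term:
        # source pixels are repeated (zoom) or skipped (shrink) as needed.
        src_len = len(src)
        out, err, si = [], 0, 0
        for _ in range(dst_len):
            out.append(src[si])
            err += src_len
            while err >= dst_len:
                err -= dst_len
                si += 1
        return out
scale_row(list("abcde"), 3) picks source indices 0, 1, 3; a whole bitmap just applies the same stepping to the rows and then to the columns.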


This is easy; anti-aliasing is tougher. I used this algorithm in the '90s to come up with one for drawing ellipses. I've never had to do a disc or filled ellipse, with or without outline, but that would be interesting, too. Line widths > 1 would be interesting as well.


Feels like simply using the parametric form of the circle equation would be way easier

For t in 0 to 2 pi, in whatever small steps you want, draw_pixel(x = a + r*cos(t), y = b + r*sin(t)), where a and b are the x and y coordinates you want for the centre of the circle and r is the radius.

The derivation of this form is pretty simple if you know basic trig. https://publish.obsidian.md/uncarved/3+Resources/Public/Unit...
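
Something like this sketch (step count chosen as roughly one per pixel of circumference, which is an assumption; too few steps leaves gaps, too many redraws pixels):

    import math

    def circle_parametric(a, b, r):
        steps = max(8, math.ceil(2 * math.pi * r))   # ~one step per pixel of arc
        pts = set()
        for i in range(steps):
            t = 2 * math.pi * i / steps
            pts.add((a + round(r * math.cos(t)), b + round(r * math.sin(t))))
        return pts
The catch, as noted elsewhere in the thread, is that this needs a sin/cos (or at least multiplies) per step, which is exactly the work Bresenham avoids.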


A cool write up with an approachable explanation of algorithm optimization! I wonder how long it would take me to arrive at this algorithm left to my own devices...



