Hacker News new | past | comments | ask | show | jobs | submit login
MCS-48: The quest for 16-bit division on the 8-bit CPU which can’t divide (mattmillman.com)
39 points by noexani on Oct 30, 2020 | hide | past | favorite | 29 comments



To me, this screams “use a lookup table”. We’re talking, maybe, 100 different “Celsius” values.

If there’s no memory for that, do multiplication by 9/5 the way it’s done in Bresenham’s line algorithm (https://en.wikipedia.org/wiki/Bresenham's_line_algorithm). If you start at zero Celsius = 32 Fahrenheit, that’s 50 steps, at most. Might not be (much) smaller than a lookup table, though.

Precomputing the steps for each degree Celsius difference and storing each in a single bit (a one degree change in Celsius will increase degrees Fahrenheit by either 1 or 2) might be the most memory-efficient approach. For a 100 degree Celsius range that’s a 50 bits table (starting at the center of the range). Edit: even that may be too large. That table will repeat every 9 bits, so one byte almost is enough. It might be a struggle to make that a net gain, given the more complex conversion code, though.


It depends on the application. In this case a lookup table would consume quite a bit more program memory (there isn't any spare), and the division routine is needed for other tasks, so it makes sense to use it for that conversion too. Speed is not an issue.


I just calculated, there are 1800 possible Celsius values. in this application. That would be quite some lookup table.


You can very likely compress the table and/or interpolate between entries. It's not like that those entries are totally random.


You can actually do that as well with the division itself. But you need a kind of fancy interpolation, I forget the name, used for rational functions. Polynomials won't do.

Edit: I think this is either Restrictive Pade's approximation or barycentric rational interpolation. Or both combined.

Normally instead division is approximated with Newton-Raphson iteration. I think 6 steps is enough for 16 bit.


It's a simple linear function, and the delta Y has enough digits to overcome significant rounding error in comparison to measurement error.

In this application no rounding is needed either since the difference between rounding and truncation is also not significant.

You don't need the computer to do any of your multiplication or store any table.

You can start from a sufficiently low reference point, then add delta Y as you increment X.

Xref + deltaX = Yref + deltaY


A loooooong tmie ago (circa... 1992?) I had the same problem with an Hitachi embedded processor. Except our application was hard real-time (controlling the fuel injector of a Ford 150 that was modified to run on natural gas).

We ended-up splitting the division code over multiple loops of the core system so that it would not exceed the allotted time when the engine was running at maximum RPM. It meant that the value that required division would only update at something like 1/6th of the data acquisition speed of all other values, but hey, it worked.

Maybe I still have the source code on some floppy disk somewhere...

Edit: I remembered wrong, it was a GMC truck: https://www.sae.org/publications/technical-papers/content/92...

Edit #2: found a picture of our truck: https://live.staticflickr.com/8487/8186513796_bd60fd3fd2_b.j...


I did something similar for a high speed (10k rpm) stepper motor driver. We were emulating a legacy system and needed to do a 24 bit/16 bit divide, which took a long time. Even worse was that when the speeds increased, we had less time to do the computation. I came up with a trick: when the denominator changed to above a certain value, I knew the output's MSB would be zero, so I switched to a more efficient division routine that computed fewer bits. It was enough to get us smooth performance without running out of time.


I don't know why but the idea of working on an engine management system is incredibly appealing to me. How did you get into the field? I unfortunately have no background or formal training in embedded systems so I fear I have no real possibility of success in this dream


I didn't, it was an extra-curicular university project.


Get a job working in embedded systems. Preferably in the right industry.

Like any programming job, but without formal experience you'll need more (and interesting) informal experience, ie side projects.

Start with something simple, like an STM32Discovery board. Blink the LEDs. Respond to the button.

Get a logic analyzer, like a Saleae ($400 for the Logic8) or Digilent Digital Discovery ($200, worse software IMO) or something similar. Get a bench power supply. Get an oscilloscope (Rigol DS1054Z is cheap ($400 or so) and hackable. Don't get a PC-based one, get one with a UI similar to what you'll use in professional work.) Get a couple of cheap USB-UART adapters, maybe a BusPirate, an ST-Link, and a hand full of STM32BluePill boards.

Get an SPI flash memory dev board, and write code to read/write that external flash. Use the STM32's built-in SPI peripheral. Use the logic analyzer to decode the signals & for debugging.

Get an I2C sensor dev board, and write code to read the sensor value. Use the STM32's built-in I2C peripheral.

Get FreeRTOS running on your Discovery board. Make tasks for your sensor reads, LED control, maybe log data to the SPI chip. Show that it's all hard real-time.

From there, maybe add more inputs & outputs. Making a 3D printer or CNC router from scratch would be impressive. Or an RC car, or some sort of robot, or a drone. Maybe go to a more capable SoC, like a BeagleBone Black board or Raspberry Pi. The product I work on (an eletronic logging device for the trucking industry) was initially prototyped using a BeagleBone Black. (This post is my own opinion, and should not be taken as official statements of my employer.)

Or go to a less capable chip, like a PIC16F18345 ($12 dev boards) and see what you can do with <=1024 bytes of RAM.

Definitely learn to build your code without using the vendor IDE. It'll save you a lot of pain. Makefiles (or CMake, or whatever build system you prefer) + writing linker scripts + pointing your preferred IDE (or editor) at that is a much better way.

Espressif's ESP32 and ESP8266, Nordics NRF52 parts, and some others have built-in networking (BlE, WiFi, Cell, etc). Those can be good for all sorts of projects.

There's some industry bias against Arduino, it's seen as overly abstract and not something "real" programmers use. If you get an Arduino, it can be a great starting place, but also get an ICSP (In Circuit Serial Programmer) and learn to use the ATMEGA328p chip without the Arduino bootloader+libraries. It's a fine chip and a fine platform, but not necessarily great for resume building.

Basically look on Adafruit and Sparkfun and the like, figure out what sorts of sensors & outputs look interesting, and start making things to use them. The product lines I mentioned happen to be popular among hobbyists, which means less chance of getting stuck without community help available.


Never divide by a constant! Dividing by 5 can be done as a multiplication by 0.3333 hex, i.e. a multiplication by 3 * 17 * 257 followed by a right shift by 16. If you need more precision, just add more factors. Multiplication in turn is done by a hardcoded sequence of shifts and adds.


Also, rearrange the calculation. (Celsius * 9) / 5 can trivially be rearranged to Celsius * (9 / 5)

You can implement this as a fixed point multiply by the constant 1.8.

You are already using base 10 fixed point, so it simply becomes a multiply by 18, which can be simplified to a few shifts and adds:

(Celsius << 4) + Celsius + Celsius

Don't forget to add 32 and the algorithm requires a 16bit shift and three 16 bit adds.


>Also, rearrange the calculation.

Yep, for the nand2tetris course, we had division, but it was in the OS, and a really expensive operation. I wrote a program that needed to know which third of a line a point was on. (i.e. identify 4 on a 15-length line as being in the first third).

I naively wanted to do something like "if x < L / 3", but rearranged it to "if x * 3 < L". (Of course, the language only supported integers and not fractions or floating point, so I needed to do that anyway.)


But then you are in base 100 fixed point. Converting it back to base 10 is what requires a division (but it's still by a constant though).


Op is already in base 100 fixed point. I guess because that's what the sensor returns?

Though it might be smart to convert it to another format. Four nibbles of BCD would be ideal and the MCS-48 looks to have native support for it.

Thought you would need a custom version of shift by 4 that handles BCD.

Or simplify the equation to just adds:

    C2 = C + C
    C4 = C2 + C2
    C8 = C4 + C4
    C16 = C8 + C8
    Res = C16 + C2


If he's in base 100 and you multiply by 18 it's the same story: you are now in base 1000 and need a division to go back.

Converting from binary to BCD also needs division by constants, so again it's more or less the same.


Oh, right. I'm an idiot who forgot about the fixed point divide.

And yes, at some point you are going to need to do a divide by constant 10. It's unavoidable if you want a base 10 display. The option of using BCD to do calculations just lets you do that divide upfront, rather than at the end.


Also, this optimization is performed by any half-decent compiler. Of course, if you're on embedded, you may not have a half-decent compiler, or indeed an compiler at all.


IIRC, the DS18B10 (like its cousins) does not return celsius multiplied by 10 (i.e., not '153' for 15.3°C), but returns a 7 bit integer celsius value plus a sign plus a few fractional bits -- by default 1 (i.e, a +-7.1 bit format), but maybe this device has a few more bits in a secondary register -- I remember using one that had four sub-celsius bits, i.e., would output 1/16 celsius steps (a +-7.4 bit value). In any case, not a value multiplied by 10, but a celsius unit with binary fractional bits.

Am I wrong? Looking for DS18B10 data sheet always gives me the DS18S20. Hmm, dunno.

Anyway, assuming a sign + 7 bits integer + 4 bits fractional format, you can calculate inside 16 bit signed without overflow with a 3-bit shifted 9/5 constant without overflowing, up to beyond 100°C. With a bit of shifting, the following is quite accurate:

  int16_t celsius2fahrenheit(int16_t c)
  {
    int16_t f = 0;
    f += c << 4;
    f -= c;
    f -= c >> 1;
    f -= c >> 3;
    f += c >> 5;
    f += 32 << 7;
    f >>= 3;
    return f;
  }
Accuracy:

  C=0    (int: 0),    F=32      (exact: 32,    int: 512)
  C=15.3 (int: 244),  F=59.4375 (exact: 59.54, int: 951)
  C=37   (int: 592),  F=98.625  (exact: 98.6,  int: 1578)
  C=100  (int: 1600), F=212.062 (exact: 212,   int: 3393)


You are correct. the conversion to Celsius * 10 is done earlier on, because it is easy to optionally perform this conversion, and it's easy to render the various formats (positive/negative/DP,no DP) of it to the 7-segment display banks.

Also worth considering that I originally had no interest in fahrenheit (I live in the UK). This was done as an extra later on. Didn't want to poo the original implementation taking fahrenheit straight off the sensors.


OK, I understand. The approach works even better if you multiply by 10:

  int16_t celsius2fahrenheit10(int16_t c)
  {
    int16_t f = 0;
    f += c << 4;
    f -= c;
    f -= c >> 1;
    f -= c >> 3;
    f += c >> 5;
    f += 320 * 8;
    f >>= 3;
    return f;
  }
This is really accurate:

  C=0    (int 0)    F=32   (int: 320,  exact: 32)
  C=15.3 (int 153)  F=59.5 (int: 595,  exact: 59.54)
  C=37   (int 370)  F=98.6 (int: 986,  exact: 98.6)
  C=100  (int 1000) F=212  (int: 2120, exact: 212)


It's really interesting to see that it could be done like that, but, that pseudo code you've given me would translate to a heck of a lot of MCS-48 instructions! 16-bit subtraction in particular is really expensive. Potentially not much gain there?


This is the same code in MCS48 assembly. Or so I hope.

  ; INPUT:  r1:r0  celsius * 10
  ; OUTPUT: r3:r2  fahrenheit * 10

  MOV A, r0
  ADD A, r0
  MOV r2, A
  MOV r4, A
  MOV A, r1
  RLC A
  MOV r3, A
  MOV r5, A
  MOV A, r4
  ADD A, r4
  XCH A, r5
  RLC A
  MOV r4, A
  MOV A, r2
  ADD A, r5
  XCH A, r3
  ADDC A, r4
  MOV r2, A
  MOV A, r5
  ADD A, r5
  XCH A, r4
  RLC A
  MOV r5, A
  MOV A, r3
  ADD A, r4
  XCH A, r2
  ADDC A, r5
  MOV r3, A
  MOV A, r1
  ADD A, r1
  MOV A, r1
  RRC A
  MOV r5, A
  MOV A, r0
  RRC A
  MOV r4, A
  MOV A, r5
  ADD A, r5
  MOV A, r5
  RRC A
  XCH A, r4
  RRC A
  MOV r5, A
  MOV A, r2
  ADD A, r5
  XCH A, r3
  ADDC A, r4
  MOV r2, A
  MOV A, r4
  ADD A, r4
  MOV A, r4
  RRC A
  XCH A, r5
  RRC A
  MOV r4, A
  MOV A, r3
  ADD A, r4
  XCH A, r2
  ADDC A, r5
  MOV r3, A
  MOV A, r5
  ADD A, r5
  MOV A, r5
  RRC A
  XCH A, r4
  RRC A
  MOV r5, A
  MOV A, r4
  ADD A, r4
  MOV A, r4
  RRC A
  XCH A, r5
  RRC A
  MOV r4, A
  MOV A, r2
  ADD A, r4
  XCH A, r3
  ADDC A, r5
  ADD A, 10
  MOV r2, A
  ADD A, r2
  MOV A, r2
  RRC A
  XCH A, r3
  RRC A
  MOV r2, A
  MOV A, r3
  ADD A, r3
  MOV A, r3
  RRC A
  XCH A, r2
  RRC A
  MOV r3, A
  MOV A, r2
  ADD A, r2
  MOV A, r2
  RRC A
  XCH A, r3
  RRC A
  MOV r2, A


But you do have CPL -- that does not help? Anyway, without subtraction, you could do the very same with one additional add:

  int16_t celsius2fahrenheit(int16_t c)
  {
    int16_t f = 0;
    f += c << 1;
    f += c << 2;
    f += c << 3;
    f += c >> 2;
    f += c >> 3;
    f += c >> 5;
    f += 320 * 8;
    f >>= 3;
    return f;
  }


If you really wanted to support any denominator for a 16-bit division in a space efficient way, it seems like you could just subtract the denominator from the numerator in a loop, counting iterations until it's less than the denominator. I can't imagine it's more than a dozen assembly instructions, even on a crappy architecture. Worst case (divide by 2), it takes 32767 iterations, and you even get the remainder out of it.

With this example, say 100C. Convert to deciC and offset, say, 25 degrees to make it unsigned. 1250. Multiply by 9. 11250. Divide by 5 takes 2250 iterations. I am unfamiliar with the MCS-48, but Google says it can be clocked at 11Mhz. It would take maybe 40ms to do the divide?


I didn’t appreciate the complexity here until trying to write CHIP-8 games, and I grew to appreciate libraries like:

https://github.com/jdeeny/chip8-multiply


Your MCS-48 stuff is really cool. I don't have anything to add to the division topic, but I'v been working on a reverse-engineering project on an MCS-48 based engine management computer from the Porsche 944 Turbo (https://jhnbyrn.github.io/951-KLR-PAGES/).

You probably have the official User's Manual already, but there's a link to it there if you don't. I found it essential in addition to the assembly code manual.

I haven't found much '48 stuff out there so I'm looking forward to reading all your stuff on this!


I have done some MCS-48 reverse engineering too. It's really useful to look at the snippets in that manual as you'll find them in disassembled code, making it quite a bit easier to understand.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: