Prince Of Persia Code Review (2013) (fabiensanglard.net)
217 points by undefined1 on March 24, 2019 | hide | past | favorite | 49 comments



About the disk format: the Amiga native disk format was just one huge sector per track. That is how the Amiga fitted 880K on the same disk that MS-DOS filled with 720K and MacOS with 800K.

One of the custom chips on the Amiga had hardware for the track decoding. So the Amiga could read the disk while performing other operations; the data would come in via DMA (direct memory access) from the co-processor.

(MacOS used variable-speed, that is how it fitted more sectors on outer tracks and achieved 800K.)
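The capacity arithmetic can be checked in a few lines. This is a sketch, assuming the standard sectors-per-track figures for each format (11 for Amiga, 9 for 720K MS-DOS, and five 16-track speed zones of 12/11/10/9/8 sectors for the Mac's zoned 800K GCR layout), which the comments don't spell out:

```python
# Back-of-the-envelope check of the three capacities mentioned above.
SECTOR = 512
TRACKS, SIDES = 80, 2

amiga = TRACKS * SIDES * 11 * SECTOR                  # 11 sectors/track, no gaps
msdos = TRACKS * SIDES * 9 * SECTOR                   # 9 sectors/track
mac   = SIDES * 16 * (12 + 11 + 10 + 9 + 8) * SECTOR  # zoned CLV, 5 zones

print(amiga // 1024, msdos // 1024, mac // 1024)      # 880 720 800
```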


> One of the custom chips on the Amiga had hardware for the track decoding

If memory serves correctly, the raw data was pulled in from disk to memory DMA-style, but the actual decoding could be done either using the blitter or the 68k.

True about performing operations whilst loading from disk though; I coded a trackloader demo that was loading the next part of the demo whilst the current part was ongoing, using the 68k to decode so that the blitter could be hammered for graphic effects!


The MFM encoding used 2 raw bits for each bit of data to be stored. As such, people figured out how to use the Blitter to do just that. I'm not sure if the OS did it; it was probably not the case in the 1.x days.


I don't know either when it did what, but when Amigas got faster CPUs IIRC there were patches (or newer Kickstarts?) to move the decoding to the CPU because the blitter was slower than the CPU then.


> One of the custom chips on the Amiga had hardware for the track decoding.

sadly no, it was just deserializing head magnetic flux data (encoded as pulses) into a raw bytestream; you still had to compute the actual data out of it. What's worse, Commodore shipped the same Double Density chip, outdated even in 1987, for the whole life of the Amiga.


Yeah, constant angular velocity (hub motor RPM is fixed) vs. constant linear velocity (linear speed over the disk is fixed). If memory serves, the 800K Mac and Commodore 1541 disks were actually Z-CLV or Zoned-CLV -- where the drive speed changes every "n" tracks. The reason you do ZCLV is that it reduces the complexity of the controller hardware (fewer taps on the clock divider).

The fun trick is, with a drive controller which works based on "time between pulses" (DiscFerret, Catweasel, etc) you can do all of the above with a CAV disk drive just by tweaking the read/write clock. Same trick probably works for discrete drive controllers like the 765 and 1771 (I say probably because I haven't checked the datasheet!).


The native disk format was actually broken into sectors, see:

http://amigadev.elowar.com/read/ADCD_2.1/Devices_Manual_guid...

It read full tracks at a time, but you can see from the doc there were 11/22 sectors with no inter-sector gaps; still, they are separate 512-byte sectors which are addressed in the file system structures.


That's just a clash of nomenclature between the logical and physical layout. It had logical sectors, but on disk, it was written as one contiguous sector.


Yes, it read the whole track, starting from wherever the head was, then the floppy device handler figured where each logical block was, based on the magic sync word $4489 (which is not supposed to be output by the default MFM encoding) and track/sector IDs embedded in the track's bitstream.


Not even "Not supposed to be" -- it can't be. $4489 decoded is a valid byte, but with one of the clock transitions 'blanked'.

If you decode $4489 via MFM then re-encode that byte, you'll get a 1-bit difference. This is why it works as a sync marker: even if you wrote that byte in the data area of the sector, it wouldn't encode the same way because of the missing clock :)

It's a common trick on radio systems too -- a short burst of data which can't be obtained through normal encoding processes (invalid FEC bits, flipped parity, etc).
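The one-bit difference is easy to verify. Here is a minimal Python sketch of the MFM rule (a clock bit is 1 only when the data bits on either side of it are both 0), assuming the data bit preceding the byte is 0; function names are illustrative:

```python
# Why the Amiga sync word $4489 can't come out of a normal MFM encoder.

def mfm_encode(byte, prev_bit=0):
    """Encode one byte as 16 MFM bits, MSB first: each data bit is
    preceded by a clock bit that is 1 only when the previous and the
    current data bits are both 0."""
    out, p = 0, prev_bit
    for i in range(7, -1, -1):
        d = (byte >> i) & 1
        clock = 1 if (p == 0 and d == 0) else 0
        out = (out << 2) | (clock << 1) | d
        p = d
    return out

def mfm_decode(word):
    """Recover the 8 data bits (every second bit) of a 16-bit MFM word."""
    byte = 0
    for i in range(7, -1, -1):
        byte = (byte << 1) | ((word >> (2 * i)) & 1)
    return byte

sync = 0x4489
data = mfm_decode(sync)        # 0xA1
reencoded = mfm_encode(data)   # 0x44A9 -- one clock bit set that $4489 blanks
print(hex(data), hex(reencoded), hex(sync ^ reencoded))
```

Re-encoding the decoded byte yields $44A9, exactly one bit away from $4489: the "missing clock" the parent describes.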


I owned an Amiga 500 and at a certain point I bought a HDD for it (20 MB, if I remember correctly) => later I switched to an IBM-compatible (386SX with a HDD) and, subjectively, I thought that the HDD of the IBM-compatible was slower than the one I had previously used on the Amiga => is there any HW reason why the Amiga HDD might have been faster than the IBM HDD, or did I just imagine it?


It might have been a SCSI drive/controller, but it might have also been that you used FastFileSystem, which was pretty efficient.

One of the HD manufacturers of the time, either Conner or Maxtor, was known to use Amiga 3000 machines to test their SCSI disks, because of the CPU/chipset/OS/FS combination. Not many systems were able to extract 90%+ of theoretical throughput like the A3000 could.


Thx - it was just some sort of standard HDD for Amiga 500 - cannot remember anything specific about it... .

Can it be that the Amiga 500 had in its case a place for a HDD? (I don't remember using external cases, but maybe I have forgotten...)


The A590 was a drive that plugged on the left side of the A500. It usually came with a 20MB SCSI disk inside...

There was no room in the A500 for a 3.5" drive. Later, the A1200 and A600 had a built-in IDE controller and an internal bay, but for a 2.5" disk.


The 590 had a SCSI controller with DMA, but only for the external SCSI connector. The internal drive was 8-bit XT-IDE and dog slow. So if it was faster, it must be chalked up to a better OS, a better filesystem, and I would assume a 68000 at 7 MHz was faster than a 4.77 MHz 8088 at shuffling data around. Especially since the memory data bus was 16 bits wide on the Amiga but only 8 bits wide on the PC. (Right?!)


The A590 had both a SCSI and an XT header for internal drives (on the right side):

http://amiga.resource.cx/photos/photos/a590-5.jpg

I only ever saw them with SCSI drives in person, but they also sold XT units. Maybe you saw the latter?


If they sold 'em with XT-IDE inside, I was probably unfortunate enough to have one of those. Still beat swapping floppies. Thanks to the HDD I could compile C programs without switching floppies all the time. IIRC I had the compiler and the headers on one floppy and the linker etc. on another. Eventually I made DICE (Matt Dillon yo!) fit on one floppy. Or was it an early Lattice C?? Can't remember. :-P By that time I had bought a 486SX for uni, so I reluctantly used Turbo C instead. Turbo C was just too good to pass up...


Yes, you're right - it was the A590, thx!


Part of it may also have been that even expansions that used non-SCSI drives tended to offload most of the processing from the CPU. My A2000 SCSI card had a Z80 on it, for example.


Amiga was very likely a SCSI drive vs IDE


   ORG directives were really just hints. 
   There was no operating system and no 
   linker/loader on Apple II: The developer 
   had to "somehow" manage to transfer the 
   instructions from floppy disc to the 
   intended location. 
Well, no. .ORG directives had nothing to do with any of that. They were (and are) used to tell the assembler how to fix up absolute references such as JMPs, JSRs, and loads/stores to variables defined in the code.

As he points out, when building complex programs on the Apple II, you had to keep track of where to load modules yourself, precisely because there was no metadata in the object file. Nothing prevented you from loading a module compiled at one ORG at a different address... well, nothing except the fact that it would probably crash.


I think that you are misinterpreting that statement. The .ORG directive is indeed somewhat of a hint to the assembler (or rather, obviously, a directive) telling it which memory location it should assemble the code for. It is a hint in the sense that it doesn't necessarily reflect the actual location of the code at run-time, because it is separate from linking.


You could write 6502 code to be completely relocatable, but at a cost of space and a more complex assembler.


The 6502 can only do short relative branches (+127/-128 bytes), so that would be pretty complex indeed.


Typically you'd patch the relocatable routine before inserting it at the desired memory location, simply overwriting the location-dependent instructions like jsr and absolute jmp with new operands. This isn't very complex, especially if the relocator doesn't need to be position independent itself.

There are some factors that can simplify relocation: only allow relocation to page boundaries (i.e. $xx00 addresses) and possibly keeping the routines shorter than a page.

For example, here's a single page, page boundary constrained relocator for a routine with two patch points:

    ; x = page
    relocate_routine
        stx rel_sta+2

        ; patch
        stx patch1+2
        stx patch2+2

        ldy #0
    copy
        lda routine_template,y
    rel_sta:
        sta $0000,y
        iny
        bne copy
        rts

    ; some routine assembled for any origin $xx00
    routine_template
        bne continue
    patch1
        jmp done
    continue
        ...
    patch2
        jsr print
        ...
    done
        rts
    print
        ...
        rts


If the relocator is not position independent then the solution is not completely relocatable. I don't even know how you would get access to the program counter without trashing some memory.


Not completely relocatable, but in practice this means that you have some memory reserved at a specific point in memory for the relocator and the template.

The relocator can be made position independent by using indirect stores; you never need direct access to the program counter for this, though it gets a bit more complex. If you did want to read the program counter, I think the simplest way is the jsr instruction, which puts the PC (plus a fixed offset) on the stack.

The idea is that instead of using a "template" and referring directly to that, you load the relocatable code where you want it, load that location in some zero page vector and in the relocator you do indirect+relative stores based off of that vector:

        lda page ; the page that the routine was loaded to
        sta $ff
        lda #0
        sta $fe
        jsr relocate ; wherever you've loaded the relocator
        ...

    ; vector $fe-$ff = routine location, $fe must be 0
    relocate
        lda $ff

        ; load the lsbyte of the offset of the first patch in the routine into y
        ; in reality, would be calculated at the point of assembling the routine template
        ldy #<2+patch1-routine_start
        sta ($fe),y

        ldy #<2+patch2-routine_start
        sta ($fe),y

        rts
As you can see, the relocate routine is now completely position independent. The relocated routine has the same constraints (has to be loaded to the beginning of a page). We use the y register to load the offset within a page and $ff as the page counter. It's also not much more complex to write the relocator for routines spanning multiple pages, but it comes with a bunch more considerations that I think would take much more code to demonstrate (e.g. cross page jumps and calls).


The problem is that the only way to read the program counter is through jsr or an interrupt and these both can’t do relative jumps.

If you start with that information and you allow patching you can of course relocate the code on the fly, but I wouldn’t really call that position independent code generation by a compiler. It’s more like you’re implementing your own linker.


> The problem is that the only way to read the program counter is through jsr or an interrupt and these both can’t do relative jumps.

No, that's not a problem in this case since you don't need to know the value of the program counter to relocate the routine. You only need to know the relative offsets of the patch points from the start of the routine, which are static. The relocator itself is position independent, though, since it never uses absolute or absolute indexed addressing. No jumps or subroutine calls, just zero page loads and vectored writes.

That said, if you want to know the program counter, you can use a tiny, position independent subroutine to save it:

    savepc
        ; copy return address to $fe-$ff
        pla
        sta $fe
        pla
        sta $ff
        pha
        lda $fe
        pha

        ; increment $fe-$ff by 1 so that it contains the
        ; address of the next instruction after the jsr
        inc $fe
        bne done
        inc $ff
    done
        rts

    ...

    jsr savepc
    lda #0 ; $fe-$ff contains address of this instruction after savepc call
That has no use in this relocation technique, however.

> If you start with that information and you allow patching you can of course relocate the code on the fly, but I wouldn’t really call that position independent code generation by a compiler.

Neither would I, since it's assembled, but that was never the question, as far as I understood. Quoting the post you first responded to: "You could write 6502 code to be completely relocatable, but at a cost of space and a more complex assembler." I demonstrated that with an example of a position independent relocator that patches an otherwise position dependent routine so that it can be relocated to the start of any page.

And yes, if you've built the module that you need to relocate, you have the information required to patch it. You could even join the position independent relocator with the relocatable routine and have it patch itself out with a jump to the routine after relocation.

> It’s more like you’re implementing your own linker.

Maybe a small portion, but specifically, it's a position independent code relocator, exactly what was on topic.

EDIT: I think we may be confusing two different problems here. One is relocating a routine and producing position independent code. That is what I've demonstrated. The other is the problem of knowing the location of the routine at the call site.

In the simple case, the call site of the relocator typically knows what portion of memory needs to be position adjusted. That is what I have assumed so far.

In a more complex case, e.g. a position independent/relocated routine needs to call another position independent/relocated routine without knowing its location. In order to do that you can put the address of each subroutine in a table and use it as a trampoline. The routines calling into each other now only need to know the index of the routine they want to call in the table. The table itself can be located anywhere and be moved around freely.

    call ; $aa-ab = table location, y = subroutine index
        lda ($aa),y
        sta jump+1
        iny
        lda ($aa),y
        sta jump+2
        dey
    jump
        jmp $ffff

    ...
    ldy #11 ; index of PRINT routine
    jsr call
This can also be made position independent via stack trickery or indirect jump. The relocator for each routine could be responsible for populating this table.


Self-modifying code was not that unusual and that's what Apple's own DOS did to relocate itself.


> the 6502 is a simple 16 bits processor

I don't think ability to address 64k means the CPU is 16 bit. The 6502 is a very 8 bit processor AFAIK. The main registers are 8 bit, most operations work on 8 bit registers.


I think it's best to go with the "external data bus size", which is 8-bit on the processors typically known as "8-bit processors" (6502, Z80, i8080, 6809 etc...). Those processors can read one 8-bit byte in one memory access. For instance, even though the Z80 has 16-bit registers, can load 16-bit values from memory (and even has some 16-bit arithmetic operations), a 16-bit memory access takes 2 separate memory accesses through the 8-bit data bus.
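A toy model of that point: loading a 16-bit value through an 8-bit data bus still costs two bus cycles, even if the register receiving it is 16 bits wide. Everything here (the `bus_read`/`load16` names, the access log) is illustrative, not any real emulator's API:

```python
# Z80-style 16-bit load over an 8-bit data bus (little-endian: low byte first).
memory = bytearray(65536)
memory[0x1234] = 0x78
memory[0x1235] = 0x56

accesses = []                   # log of bus cycles

def bus_read(addr):
    accesses.append(addr)       # each call models one 8-bit bus cycle
    return memory[addr]

def load16(addr):
    lo = bus_read(addr)
    hi = bus_read(addr + 1)
    return (hi << 8) | lo

value = load16(0x1234)
print(hex(value), len(accesses))   # 0x5678, fetched in 2 bus cycles
```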

By that definition the 68000 and 8086 are 16-bit processors (which makes sense); however, the stripped-down variants 68008 and 8088 would be considered 8-bit processors... and this is where this simple model breaks down ;)


"bitness" is defined by the natural data size of common operations.

The Z80 is 8-bit as most of its internal operations work over this data size and its registers are 8-bit. It can perform some 16-bit operations over register pairs, but this is less natural to the design (based upon the 8-bit 8080).

The 8088 is 16-bit as its registers are naturally 16-bit, even though there are also ways to address them as 8-bit halves. The distinction between registers that are 8-bit but can be paired (Z80) and 16-bit but can be halved (8086/8088) may seem arbitrary on its own, but looking into the instructions available and how they perform makes it more obvious which side a chip sits on.

How much can be read at once is not the right differentiator, as it is essentially an interface issue. The internals of the 8088 are the same as those of an 8086, apart from those at the interface which break the 16-bit requests into 8-bit ones, they just have to wait longer for the signal that data has arrived/left.

Another common one that can confuse is the 80386SX: these are 32-bit chips (essentially the same design as the original 80386, latterly referred to as the 80386DX) with a 16-bit data bus (so 32-bit requests are split, as seen with 16-bit ones on the 8088) and a 24-bit address bus (limiting them to 16 MB of physical memory, like 80286 chips). These limited bus sizes allowed them to be used on motherboard designs originally intended for older 80286 units.

Other confusions come from specialist instructions: floating point operations using larger (40- & 80-bit) special purpose registers, SIMD instructions (such as those in the MMX set, introduced on 32-bit Pentium lines, operated over 64-bit collections of smaller values), etc.

Sometimes it works the other way around: the original Pentium lines (up to the Pentium 4) were 32-bit chips even though they had a 64-bit data bus (this made loading data into the L1 cache faster, rather than marking the chip out as 64-bit overall) and some 64-bit instructions.


I presume he's referring to the address space? It's a bit confusing, indeed.


Sorry, you read the comment one second before I updated it; I clarified how, AFAIK, a 16-bit address space does not make it a 16-bit processor.


Right :)

I think when the Atari Jaguar was released they used the same kind of logic to market it as a 64-bit system since technically some part of it was using 64-bit processing, but the CPUs were still 32-bit.

It's just nitpicking though, I thoroughly enjoy Fabien's writing and I can highly recommend the couple of books he's written about reverse engineering Id Software games. On HN people seem to confuse "hacking" with "growth hacking" a lot of times, and Fabien's writing is thankfully on the right side of that fence.


and the "ST" in Atari ST stood for "sixteen/thirty-two"

referring to the 16-bit 68000 processor and ... 32-bit something else, I forget :)


The 68000 CPU has a 32-bit address space [1], 32-bit internal data bus, and 32-bit registers, but a 16-bit external data bus and 16-bit ALU, so it's sometimes seen as a bit of a hybrid and sometimes referred to as a 16/32 bit CPU.

[1] but only 24 address lines, which led to "interesting" bugs when people had written code that used the top byte of addresses for other purposes and their code was then run on a 68020 or higher.
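Those "interesting" bugs are easy to sketch. This is a hypothetical illustration (the `bus_24bit`/`bus_32bit` names and the tag value are made up) of the "top byte free" trick that broke on the 68020:

```python
# On a 68000, only 24 of the 32 address bits leave the chip, so the top
# byte of a pointer is ignored -- and programs stashed flags there.
def bus_24bit(addr):
    return addr & 0x00FFFFFF    # 68000/68010: top 8 bits never hit the bus

def bus_32bit(addr):
    return addr & 0xFFFFFFFF    # 68020+: all 32 bits are decoded

ptr = 0x0001F000
tagged = (0x7F << 24) | ptr     # tuck a tag into the "unused" top byte

assert bus_24bit(tagged) == ptr     # works fine on a 68000
assert bus_32bit(tagged) != ptr     # points somewhere else on a 68020
```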


Indeed, the Nintendo Entertainment System (NES) also has a 16-bit address space, and it is one of the most famous "8 bit systems" around.


In fact the NES uses a 6502-derived core (the Ricoh 2A03/2A07 CPUs contain a licensed 6502 core with some minor tweaks)



While nice, it doesn't go very far, yet.

It'd have been more interesting to link it once more complete.


I lost interest. I don't think i will ever finish it :P!


Damn. What a tease.


It's been at this incomplete state for 6 years.


I had so much fun playing this in college on a Mac Plus in our little computer room (closet)!


(2013)


Boy, this makes me feel old.


You were also expecting 1989 as the year? :)



