Friday, August 22, 2014

Pushing the limits

Today I'm not going to write about ultra fast Hadoop clusters or insanely scalable, globally-distributed databases. This time I want to take you on the journey in the totally opposite direction, straight into the rabbit hole.

Let's say you want to display a nice looking rainbow on the screen of you TV, like this one:
Because a static picture is a little boring, you also want to animate it, so that the rainbow moves up (or down) the screen. The question is: what hardware do you need for this and what will be the software requirements?

What if I tell you that all you need is a computer with 1.8 MHz CPU, 48 kilobytes of RAM and the program that takes 29 (twenty nine) bytes? Yes, bytes. Less than the number of letters in this sentence. If you don't believe me, you can download a zip archive containing the software from this location. Inside you will find a file rainbow.xex which you can run on any Atari 800/800XL emulator, or even on the real hardware if you still have one (you can use a device like SIO2SD to transfer files to the Atari). I have run it on a real Atari 800XL connected to a plasma TV through a composite video lead, and it works perfectly.

How is that possible? The secret lies in talking directly to hardware. 8-bit computers have very limited resources, so any software layer which is not absolutely necessary is just a waste of memory and CPU cycles. A simple example is a boolean value, which can have one of two states: true or false. Any computer in the world, no matter how simple or complicated, needs only one bit to understand boolean values: 1 for true and 0 for false. Since a byte consists of 8 bits, you can store 8 different boolean values in one byte. Yet, in today's software world, the Java boolean primitive takes entire byte, and the simplest Boolean class:
public final class Boolean
{
    public static final Boolean TRUE = new Boolean(true);
    public static final Boolean FALSE = new Boolean(false);
    private final boolean value;

    public Boolean(boolean value) {
        this.value = value;
    }

    public boolean booleanValue() {
        return value;
    }
}
compiles in Java 7 to 451 bytes of bytecode, not to mention it requires a few hundred megabytes runtime environment to load. This is the cost of having programmer friendly, enterprise grade, programming tools and languages.

Going back to the example, the algorithm to make Atari display the rainbow is quite simple:

1. Initialize graphic mode.
2. Read current scanline from Atari display chip - Antic. Antic creates image on a TV screen drawing it line by line, from left to right, 50 times (PAL) or 60 times (NTSC) per second. The horizontal resolution of an Atari screen is 240 pixels (including borders), which means that the scanline number can have values from 0 (first line) to 239 (last line).
3. If the line number is 0, increase the counter, which is used to calculate the colour palette shift. Increasing the counter on every new frame adds the animation effect to the rainbow picture.
3. Wait for Antic to finish drawing the current line on the screen (we don't want to change color in the middle of the line).
4. Change background colour to the value obtained from Antic (Atari has a total palette of 256 colors) summed up with the current shift value.
5. Repeat from point 2.

This algorithm can be easily translated into a C program:
#include <atari.h>

#define SDMCTL (*((unsigned char*)0x22F))

int main() {
  unsigned char line, counter = 0;

  SDMCTL = 0; // disable display in BASIC screen area
  while (1) {
    line = ANTIC.vcount;  // read current screen line
    if (!line) counter++; // increase color shift on new frame
    ANTIC.wsync = line;   // block CPU until vertical sync
    GTIA_WRITE.colbk = line + counter; //change background color
  }
  return 0;
}
To compile this program on a PC, you can use a cross-platform cc65 compiler:
cl65 -t atari -o rainbow.xex rainbow.c
However, the resulting executable is 1104 bytes long. Let's make the compiler to output the assembly code to see the CPU instruction list it produced:
cc65 -t atari rainbow.c
The assembly source code in the resulting rainbow.s file looks like this:
    jsr     decsp1
    lda     #$00
    jsr     pusha
    ldx     #$00
    lda     #$00
    sta     $022F
L0008:  ldx     #$00
    lda     $D40B
    ldy     #$01
    sta     (sp),y
    ldy     #$01
    ldx     #$00
    lda     (sp),y
    jsr     bnega
    jeq     L000E
    ldy     #$00
    ldx     #$00
    lda     (sp),y
    pha
    clc
    adc     #$01
    ldy     #$00
    sta     (sp),y
    pla
L000E:  ldy     #$01
    ldx     #$00
    lda     (sp),y
    sta     $D40A
    ldy     #$01
    ldx     #$00
    lda     (sp),y
    jsr     pushax
    ldy     #$02
    ldx     #$00
    lda     (sp),y
    jsr     tosaddax
    sta     $D01A
    jmp     L0008
    ldx     #$00
    lda     #$00
    jmp     L0002
L0002:  jsr     incsp2
    rts
Quite long, isn't it? If you understand the assembly language, you will probably notice a few problems here. First, the program jumps a few times to some strange subroutines (like pusha, bnega, tosaddax, etc.), which does not seem necessary. Second, it uses a lot of reads and writes to the stack (although the algorithm uses only two local variables (line and counter). Third, some instructions are completely unnecessary (like LDX, which loads data to never used CPU index register X).

Fortunately, the cc65 compiler provides some optimizations: you can use "‑Oi" and "‑Os" command line switches to inline all subroutines and get rid of jsr instructions, you can also make local variables to be stored under fixed memory address instead of the stack through "‑Cl". Also, using "‑Or" switch enables register variables, which is particularly interesting, because it does not make CPU use physical registers (since MOS 6502 processor used in the 8-bit Atari has only one general purpose register, called accumulator), but it makes variables to be stored in first 256 bytes of physical memory, called zero page. The advantage of using zero page is using less CPU cycles to access it: 8-bit memory addresses (from 0 to 255) are decoded faster by 8-bit processing unit than 16-bit addresses (from 0 to 65535 - the memory limit in Atari 800XL is 64KB).

Let's see what happens if we turn the optimizations on:
cc65 -Cl -Osir -t atari rainbow.c
The resulting assembly code is now much smaller:
    lda     #$00
    sta     L0004
    sta     $022F
L000A:  lda     $D40B
    sta     L0003
    lda     L0003
    bne     L0010
    lda     L0004
    clc
    adc     #$01
    sta     L0004
L0010:  lda     L0003
    sta     $D40A
    lda     L0003
    clc
    adc     L0004
    sta     $D01A
    jmp     L000A
To produce a working executable, we have to assemble and link it:
ca65 rainbow.s && cl65 -t atari -o rainbow.xex rainbow.o
Still, the resulting code is 630 bytes long. This is because the compiler still includes some Atari specific code from the standard library. We can work around this using an assembler (like xasm or MADS). But first, we need to clean up the assembly code a bit:
    org $2000  ; start address

    lda #$00  ; load accumulator with 0
    sta $cb   ; store it in memory
    sta $022f ; disable display in BASIC screen area
loop
    lda $d40b ; read current screen line into accumulator
    sta $cc   ; store it in zero page
    lda $cc   ; load current line number from memory
    bne skip  ; if it's not zero jump to code at 'skip' label
    lda $cb   ; load counter from memory into accumulator
    clc       ; clear carry flag
    adc #$01  ; add 1 to accumulator
    sta $cb   ; store counter in memory
skip
    lda $cc   ; load current line number from memory
    sta $d40a ; block CPU until vertical sync
    lda $cc   ; load current line number from memory
    clc       ; clear carry flag
    adc $cb   ; add counter to the line number
    sta $d01a ; change background colour
    jmp loop  ; repeat
I added some comments to make clear what's going on in the code. So, let's assemble it:
xasm rainbow.s /o:rainbow.xex
Now rainbow.xex is only 45 bytes long! Two first bytes are the executable file header ($FFFF) and the next two are memory address the program should be loaded into ($2000). The rest is pure CPU instruction list.

But we can still make it smaller, using some knowledge the compiler does not have. First, we don't need to init the counter stored at $cb memory address, because we don't care about it's initial value (it gets increased infinitely anyway), so we can get rid of "sta $cb" instruction. Second, we don't need "lda $cc" right after "sta $cc" and "sta $d40a", because the "sta" instruction doesn't change the accumulator state, so there's no need to reload it. We can also replace the whole procedure of loading counter from memory into accumulator, adding 1 and storing it back in memory, with only one instruction - "inc $cb" - which increases the memory value by 1. Finally, we don't need "clc" to clear the carry flag (the flag indicating that a mathematical operation resulted in value which does not fit in 8 bits) before adding line number and counter, because Atari colour palette can have only 256 values, so it doesn't matter if the addition result fits in 8 bits or not.

After making the changes, the code now looks as follows:
    org $2000 ; start address

    lda #$00  ; load accumulator with 0
    sta $022f ; disable display in BASIC screen area
loop
    lda $d40b ; read current screen line into accumulator
    sta $cc   ; store it in zero page
    bne skip  ; if it's not zero jump to code at 'skip' label
    inc $cb   ; increase counter in memory by 1
skip
    lda $cc   ; load current line number from memory
    sta $d40a ; block CPU until vertical sync
    adc $cb   ; add counter to the line number
    sta $d01a ; change background colour
    jmp loop  ; repeat
After assembling it with xasm, the resulting executable is now only 33 bytes long. However, we can still cut the instruction list down. You may notice that there is no need to load and store accumulator at $cc memory address, because its state doesn't change between "sta $cc" and "lda $cc" - so we can safely remove those instructions:
    org $2000 ; start address

    lda #$00  ; load accumulator with 0
    sta $022f ; disable display in BASIC screen area
loop
    lda $d40b ; read current screen line into accumulator
    bne skip  ; if it's not zero jump to code at 'skip' label
    inc $cc   ; increase counter in memory by 1
skip
    sta $d40a ; block CPU until vertical sync
    adc $cc   ; add counter to the line number
    sta $d01a ; change background colour
    jmp loop  ; repeat
As we got rid of another couple of bytes, rainbow.xex is now exactly 29 bytes after assembling.

Lessons learned:
1. Compilation with default compiler options sucks.
2. A smart compiler can do a really great job optimizing code.
3. There is always a way for a programmer to optimize it even more.
4. Programming old computers is fun.