
6502 101
This document introduces the 6502 architecture, addressing modes, and instruction set.

A number of coding samples are scattered throughout this document to illustrate the basics of coding in 6502. Unless otherwise stated, you will get the most out of each example by understanding it completely before moving on to the next section.

Alternatively, you may want to skim through this entire document, and then go through it again more slowly.

Introduction to Assembly

"Assembly" refers to a programming language that represents exact CPU instructions. Assembly-language programs are platform-specific by their very nature (every different CPU has its own instruction set, and thus its own Assembly language), which is different from "universal" languages such as Basic or C, which can be compiled on many different platforms. Then again, Basic and C (and other such languages) are not compiled "directly", but are rather translated into the Assembly language of the target platform's CPU.

Assembly has a number of advantages (as well as disadvantages) in comparison to Basic or C:

The 6502's internal architecture

The 6502 is an 8-bit processor with a 16-bit address space.

The 6502 handles words (16-bit values) in LSB-first (Least-Significant Byte first, also called little-endian) format; that is, the low byte comes first. The word $1234 would be stored as $34,$12. When dealing with numbers on a per-byte basis this is not important, but keep it in mind when manipulating things such as addresses and pointers.
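
As a small sketch of this byte order, here is one way to store the word $1234 in two consecutive memory locations. The label ptr and the address $10 are hypothetical, chosen only for illustration; LDA and STA are explained in the instruction set section.

```asm
ptr = $10		;hypothetical zero-page location for the word

	LDA #$34	;low byte of $1234
	STA ptr		;low byte goes at the lower address ($10)
	LDA #$12	;high byte of $1234
	STA ptr+1	;high byte goes at the next address ($11)
```

Reading memory from $10 upward now gives $34,$12, which is how the CPU expects to find a 16-bit address or pointer.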

The 6502 has a number of internal registers which are not part of the memory map (no 6502 instruction can reach them through a memory address), but are an actual part of the CPU. These include the Accumulator, the X- and Y-Index Registers, the Stack Pointer, the Program Counter, and the Processor Status Register.

There are a few areas of memory ($0000-$FFFF) that have a special function under the 6502:

Addressing Modes

6502 Operations need operands to work on. There are seven methods for the CPU to acquire these operands. These methods are called Addressing Modes.
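
As a rough illustration, this is how a few commonly used addressing modes look in source code. It is a sketch rather than a complete list, and the addresses and the label table are made up for the example.

```asm
table = $0300		;hypothetical address of a data table

	LDA #$05	;immediate: the operand is the value $05 itself
	LDA $10		;zero page: the operand is the byte at $0010
	LDA $0200	;absolute: the operand is the byte at $0200
	LDA table,X	;absolute indexed: the byte at (table + X)
	LDA ($10),Y	;indirect indexed: take the address stored at $10,$11, add Y, read that byte
	INX		;implied: no operand; the instruction itself names the register
```

The same mnemonic (LDA here) assembles to a different opcode for each mode, so the way you write the operand tells the Assembler which mode you mean.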

6502 Instruction Set

Up to this point I have covered the mechanics behind 6502 Assembly. Now I will cover the actual instruction set, grouped by function.

The load and store instructions (LDA, LDX, LDY, STA, STX, STY) are the basic instructions in 6502 Assembly, and LDA is the most often used of them. They are used for memory transfers: a load copies a value from memory into a register, and a store copies a register's contents into memory. To Store a register's contents is synonymous with "Writing" it.
If you are accustomed to Basic or C, you may be used to assigning a value to a variable like this:
	num = 5
In 6502 Assembly, this is a two-step process:
	LDA #$05	;load A with the value $05
	STA NUM		;store A at "num"
In practice, you would have assigned a value to the label "NUM" yourself. When the Assembler encounters this label later, it substitutes its value. Suppose you declared "NUM = $00" at the beginning of the code; the Assembler would assemble the STA NUM line as STA $00.

Practice:
The register pair $2006 and $2007 is used to access VRAM. VRAM is the video memory internal to the PPU. VRAM contains the pattern tables (VRAM $0000-$1FFF), name and attribute tables (VRAM $2000-$2FFF), and Palette data (VRAM $3F00-$3F1F). In short, it contains all the information (except sprites) needed to generate the display. To access VRAM, you:
  1. Write the 16-bit address to access, high byte first, to $2006.
  2. Write the data to $2007.
So, for example, suppose we wanted to write the value $0F to VRAM $3F00 (i.e., set the background colour to Black). You would do this by:
	LDA #$3F	;the high byte of the address
	STA $2006	;write it to $2006
	LDA #$00	;the low byte of the address
	STA $2006	;write it to $2006
	LDA #$0F	;the data to be written
	STA $2007	;write it to $2007
Note: For reference, each time you write a value to (or read a value from) $2007, the VRAM address is automatically incremented (by one, unless the PPU has been configured through $2000 to step by 32 instead).
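
The auto-increment means the address only has to be set once for a run of consecutive bytes. As a sketch, assuming the PPU is stepping the address by one, the following writes four palette entries starting at VRAM $3F00. The four colour values are only examples.

```asm
	LDA #$3F	;high byte of the VRAM address
	STA $2006
	LDA #$00	;low byte of the VRAM address
	STA $2006	;VRAM address is now $3F00
	LDA #$0F
	STA $2007	;written to VRAM $3F00, address becomes $3F01
	LDA #$00
	STA $2007	;written to VRAM $3F01
	LDA #$10
	STA $2007	;written to VRAM $3F02
	LDA #$30
	STA $2007	;written to VRAM $3F03
```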

ADC (Add with Carry) and SBC (Subtract with Carry) should be self-explanatory. When you ADC, the value of the Carry flag is also added. This is to enable you to perform multi-byte arithmetic with ease. When you SBC, the OPPOSITE of the Carry flag is also subtracted from A; again, this is to enable easy multi-byte arithmetic.
Before performing ADC, you will want to clear the carry flag first; this is done by the CLC instruction. Before performing SBC, you will want to set the carry flag first; this is done by the SEC instruction.
To perform multi-byte additions, you CLC once at the beginning, and then add from the lowest byte up, letting the carry ripple from each byte into the next. Multi-byte subtractions work the same, except you SEC instead of CLC. Remember to subtract in the right order (2-1 is not 1-2).

Practice:
Add one to the value at $00 (low) and $01 (high)
	CLC		;clear carry flag
	LDA $00		;get low byte to modify
	ADC #$01		;add $01
	STA $00		;store result
	LDA $01		;get high byte to modify
	ADC #$00		;adds zero, as well as any carry
	STA $01		;store result
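
For comparison, here is a sketch of the matching subtraction: subtract one from the same 16-bit value at $00 (low) and $01 (high). Only the SEC and the SBC instructions differ from the addition above.

```asm
	SEC		;set carry flag (means "no borrow" for SBC)
	LDA $00		;get low byte to modify
	SBC #$01	;subtract $01
	STA $00		;store result
	LDA $01		;get high byte to modify
	SBC #$00	;subtracts zero, as well as any borrow from the low byte
	STA $01		;store result
```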

Increment/Decrement is a useful way to update X/Y or a memory location's value. The arithmetic operation does not affect and is not affected by the carry flag; to INX when X = $FF will simply result in X=0, regardless of the carry flag's status, and it will not affect the carry flag.
You may notice that there is no Increment or Decrement instruction that operates on the Accumulator. This oversight was rectified in a later version of the 6502 called the 65C02 (which was not used in the NES), but for NES programming you'll have to deal with it manually, using ADC or SBC.

Practice:
A simple 16-bit countdown loop.
	LDY #$00		;clear Y
label_a:	LDX #$00		;clear X
label_b:	DEX		;decrement X
	BNE label_b	;(if result not zero, then branch to label_b)
	DEY		;decrement Y
	BNE label_a	;(if result not zero, then branch to label_a)
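
Since INC does not touch the carry flag, a multi-byte increment has to use the zero flag instead. A minimal sketch, adding one to the 16-bit value at $00 (low) and $01 (high); the label no_wrap is made up for the example:

```asm
	INC $00		;increment the low byte
	BNE no_wrap	;result not zero: the low byte did not wrap, so we are done
	INC $01		;low byte wrapped from $FF to $00: increment the high byte
no_wrap:		;execution continues here
```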

JMP, JSR, RTS and RTI are used to control program flow on a larger scale. JSR and RTS are a symbiotic pair (they work together); you would use them to take a "detour" to a particular routine (JSR) and return later (RTS). This has other advantages: you can JSR to a given routine from *anywhere* in the program, so you can reuse one routine and only have to define it once.
JMP and JSR are used to set the PC. Which one you use depends on whether you intend to return or not. Use JMP if you do not intend to return; use JSR if you do.
RTI is similar to RTS, but it first pulls the processor status (P), and then PC. It also doesn't increment PC after pulling it like RTS does, but that's not important for the purpose of understanding the general function of RTI/RTS. You would use RTI to end an interrupt (NMI or IRQ/BRK) routine.

Practice:
Write a series of values to the BG palette colour register ($3F00 of VRAM).
values:	.db $0f,$00,$10,$30,$10,$00,$0f	;these are the values to write

fadeinout:	LDX #$00			;reset index register
fadeinout_2:	LDA values,X		;read from the database using X
		JSR write_bg_colour	;write it to the BG register
		INX			;increment index register
		CPX #$07		;(compare X to $07 (there are seven entries))
		BNE fadeinout_2		;(loop back if not done)
		RTS			;done- return

write_bg_colour:	LDY #$3F			;use Y to set the VRAM address
		STY $2006
		LDY #$00
		STY $2006
		STA $2007		;..and store the read value to $2007
		RTS			;..and return
NOTE: In the above routine, the subroutine ('write_bg_colour') was actually completely unnecessary and just slows it down; you could replace the JSR with the body of the subroutine (sans RTS). HOWEVER, a subroutine becomes much more efficient once you reuse it. You could, for instance, have one set of values to fade in, and another to fade out, and have both loops JSR to the same routine that actually writes the value. These sorts of tricks help keep your code compact and (as in the above example) help make it more "readable" (so you or others don't get lost when looking over the source code).

The branch-on-condition instructions (BPL, BMI, BVC, BVS, BCC, BCS, BNE, BEQ) refer to the flags in the Processor Status Register to determine whether or not to branch.
All branch-on-condition instructions use what is called "Relative" addressing. The operand is a signed 8-bit value that indicates the displacement from the start of the *next* instruction (assuming the condition is met and the branch takes place). If the displacement value is in the range of $00-$7F, it is simply that many bytes forward; if not, then it is a backward branch, and you can find the reverse displacement by:
  1. Inverting the bits (1 -> 0, 0 -> 1)
  2. Adding one
This means that you cannot branch farther than 127 bytes forward or 128 bytes backwards (counted from the start of the next instruction). If you find you need to branch farther, use the opposite branch-on-condition to skip over a JMP to the destination instead (or, if it's a loop, take a portion of the loop and move it somewhere else, and reroute to it using a JSR (be sure to RTS at the end of the section)).
Fortunately, there's no need to calculate the displacements yourself ("Now you tell me!"): you will type up your code using labels. When you want to branch, you specify the label as the destination, and the Assembler will calculate the displacement automagically (but will report an error if it's too far to branch).
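
To make the displacement rule concrete, here is a sketch based on the inner countdown loop shown earlier, followed by the opposite-branch-plus-JMP trick for destinations that are out of range. The opcode bytes in the comments are the standard 6502 encodings; far_away and its address are hypothetical and stand for any destination too distant for a branch.

```asm
far_away = $C000	;hypothetical destination, too far away for a branch

	LDX #$00
label_b:	DEX		;1 byte ($CA)
	BNE label_b	;2 bytes ($D0 $FD): the next instruction starts at label_b+3,
			;so the displacement is -3, which is $FD
			;check: invert $FD -> $02, add one -> $03 (3 bytes backward)

	DEY		;some operation that sets or clears the zero flag
	BEQ skip	;opposite condition: hop over the JMP when the result IS zero
	JMP far_away	;reached only when the result is not zero (a long "BNE far_away")
skip:			;execution continues here when the result is zero
```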

Practice:
Wait for VBlank on an NES unit.

On the NES, the PPU is in constant action. It is, generally speaking, doing one of two things: drawing the current frame, or waiting for the display device's screen-drawing mechanism to return to the top of the display to begin drawing the next frame. The time spent waiting between frames is critical. This is known as the "Vertical Blanking" period, or VBlank. What makes it so important is that a game can safely write values to the display registers without fear of glitching the display, since the PPU isn't drawing anything. If a game attempts to update the display while the PPU is drawing the frame, it will interfere with the PPU's addressing lines and glitch the display. (If you know what you're doing, you can systematically write certain values to certain registers (such as the scroll registers) mid-frame to create special effects such as wavy effects, but that's beyond the scope of this document.)
Anyhow, when the PPU has finished drawing the current frame, it will raise a flag in the PPU Status Register ($2002) to indicate this. The game programmer can use this flag to stall the program until VBlank has begun, to ensure all writes to VRAM are done safely.
The PPU will raise the highest bit (bit 7) of $2002 when VBlank begins.
wait_vblank:	BIT $2002		;(see next section for an explanation of this)
		BPL wait_vblank		;branches on negative sign clear

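CMP, CPX and CPY work by subtracting the addressed value from the associated register (A/X/Y) and updating the processor status register accordingly. A/X/Y are *not* modified by this action. Also, the Carry flag does not affect the comparison but will be altered accordingly, as if a real subtraction (SBC with the carry set beforehand) took place. Thus, if you LDA #$05 then CMP #$04, the carry is SET (subtraction uses the opposite of the carry, so a set carry means no borrow was needed). In general, the carry ends up set if the register is greater than or equal to the addressed value, and the zero flag is set if the two are equal.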
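
As a sketch of how the flags left by a comparison are typically used, the following sorts a value into "less than", "equal to" or "greater than" $0A (treating it as unsigned). The labels num and result and their addresses are hypothetical.

```asm
num    = $00		;hypothetical location of the value to test
result = $01		;hypothetical location for the outcome (0, 1 or 2)

	LDA num
	CMP #$0A	;flags are set as if A - $0A were computed; A is unchanged
	BEQ is_equal	;zero flag set: num = $0A
	BCC is_less	;carry clear: num < $0A
	LDA #$02	;otherwise: num > $0A
	JMP done
is_equal:	LDA #$01
	JMP done
is_less:	LDA #$00
done:	STA result
```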
BIT is a bit complex (no pun intended). It will perform a bitwise logical AND between A and memory, but modify neither. If the result of the AND is zero, the zero flag is set. The upper two bits of the addressed memory are copied into the processor status register: bit 7 goes to the Negative flag and bit 6 goes to the Overflow flag.
Now to explain the BIT $2002 / BPL wait_vblank routine above: as I've said, the highest bit of $2002 is set when the PPU enters VBlank. When you BIT $2002, it ANDs A with $2002 (but modifies neither), and copies the upper two bits of $2002 to the processor status. The upper bit of $2002 is copied to the upper bit of P (the Negative flag). Since we want to wait until it's set, we want to branch to the start of the stall loop when it's NOT set, right? The BPL instruction will do just that.
Moving along... the AND, ORA and EOR instructions will perform a logical operation between each bit of A and the associated bit of the addressed memory, and return the result to A. Here is a table showing which inputs return what:
LOGIC	BIT 1	BIT 2	RESULT	METHOD
AND	1	1	1	If bit 1 AND bit 2 are set, return 1, else return 0.
	1	0	0
	0	0	0
ORA	1	1	1	If either bit is set, return 1, else return 0.
	1	0	1
	0	0	0
EOR	1	1	0	If one OR the other, but not both, is set, return 1, else return 0.
	1	0	1
	0	0	0
(Note: Comparing 1 to 0 is the same as comparing 0 to 1, so I have not bothered to list both)

AND is useful for masking bits. Here's an alternate way to wait for VBlank, which is less efficient but easier to explain:
wait_vblank:	LDA $2002		;get PPU status
		AND #%10000000	;mask-out all but the highest bit
		BEQ wait_vblank	;result is zero while the VBlank bit is clear, so keep waiting
ORA is useful for force-setting bits. Suppose you wanted to set the next-to-highest bit of a memory location:
	LDA num		;get the value at the location
	ORA #%01000000	;set the next-to-highest bit and don't touch the rest
	STA num		;store it
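
The mirror image of this is force-clearing a bit, which is done with AND and a mask that has a zero in the position to clear. A minimal sketch, with num as a hypothetical label:

```asm
num = $00		;hypothetical location of the value

	LDA num		;get the value at the location
	AND #%10111111	;clear the next-to-highest bit and don't touch the rest
	STA num		;store it
```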
EOR is useful for inverting bits. Suppose you wanted to take a negative number ( >= $80) and make it positive. This is done by inverting all the bits and adding one:
	LDA num
	EOR #%11111111	;invert every bit
	CLC
	ADC #$01	;(there's no Increment for A)
	STA num

In all of the Shift and Rotate instructions (ASL, LSR, ROL, ROR), the bit that is moved "out" of the byte is copied to the Carry flag. What makes them different is what is moved "in". In the Shift (ASL/LSR) instructions, a zero is moved in. In the Rotate (ROL/ROR) instructions, the previous value of the Carry flag is moved in instead.
It should be noted that shifting left is the same as multiplying by two, and shifting right is the same as dividing by two (with the carry set if it was an odd number).
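
As a sketch of putting this to work, the following multiplies a small value by ten, using 10n = 8n + 2n. It assumes the value is 25 or less so the result still fits in one byte; the labels num and temp are hypothetical, and some Assemblers want the accumulator form written as plain "ASL" rather than "ASL A".

```asm
num  = $00		;hypothetical location of the value (assumed to be 25 or less)
temp = $01		;hypothetical scratch location

	LDA num
	ASL A		;A = num x 2
	STA temp	;remember num x 2
	ASL A		;A = num x 4
	ASL A		;A = num x 8
	CLC
	ADC temp	;A = num x 8 + num x 2 = num x 10
	STA num		;store the result
```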

Practice 1 of 2:
Divide a 32-bit number ($00 (low) - $03 (high)) in half (shift all 32 bits right one bit)
	LSR $03		;shift right the highest byte (low bit -> carry)
	ROR $02		;rotate right (carry -> $02 -> carry)
	ROR $01		;..and for $01
	ROR $00		;..and for $00
Practice 2 of 2:
16-bit multiply with 32-bit product
Note: You do not need to review this to get started with NES development. This is a fairly advanced technique.
($00,$01) x ($02,$03) = ($04-$07) [in low-high order]
Note: To make the code below more readable, I will use labels instead of addresses for each number:
num1(.lo/.hi) x num2(.lo/.hi) = res(.0/.1/.2/.3)
(e.g., num1.lo is the low byte of num1; res.0 is the low byte of res, followed by res.1, res.2, and res.3)
I've also written the labels in lower-case to distinguish them.
This is what is called "shift-add" multiplication. It is a very efficient method of multiplication. Naive repeated-addition multiplication is based on the principle that 7n = n+n+n+n+n+n+n, whereas shift-add multiplication is a binary method that reduces it to its most basic elements, and is based on the principle that 7n = 4n + 2n + 1n. It's harder to grasp, but once you do, it'll hit you like a brick wall: "oh, that's all there is to it?". Take some time to study it. Don't worry if you don't get it right away; it took me a while to figure it out too.
	LDX #$00	;reset index and store zeroes too!
	STX res.0	;clear result field
	STX res.1
	STX res.2
	STX res.3
mult:	LSR num1.hi	;get next bit of multiplicand ($00,$01)
	ROR num1.lo
	BCC no_add	;if bit not set, skip additions
	CLC		;perform 16-bit addition
	LDA res.2
	ADC num2.lo
	STA res.2
	LDA res.3
	ADC num2.hi
	STA res.3
no_add:	ROR res.3	;rotate result field right
	ROR res.2
	ROR res.1
	ROR res.0
	INX		;increment counter
	CPX #$10
	BNE mult
	RTS
That routine can be easily modified to work with larger numbers (for instance, 32-bit multiply with 64-bit product). Note that it shifts num1 away as it works, so num1 is left at zero when the routine returns.

Note that the transfer instructions (TAX, TAY, TXA, TYA) do not SWAP their associated registers, but copy one to the other. The value in the destination register will be overwritten. This is similar to:
    LDA spirit_of_final_fantasy
    LDY money
    LDX idealism,Y
    TXA
After which, A will contain Final Fantasy VIII.
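
On a more serious note: because a transfer only copies, a real swap needs somewhere to park one of the values. A minimal sketch that swaps X and Y, at the cost of overwriting A; the label temp is hypothetical:

```asm
temp = $00		;hypothetical scratch location

	TXA		;A = X (X itself is unchanged)
	STY temp	;park Y in memory
	LDX temp	;X = old Y
	TAY		;Y = old X (still sitting in A)
```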

The Push/Pull instructions should be self-explanatory.
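
Still, one common use is worth a sketch: preserving registers across a routine. The NES's 6502 can only push A (PHA) and P (PHP) directly, so X has to go through A, and values must be pulled in the reverse of the order they were pushed. The label routine and its throwaway body are made up for the example.

```asm
routine:	PHA		;push A
	TXA		;copy X into A...
	PHA		;...and push it
	LDA #$00	;the body is now free to change A and X
	LDX #$00
	PLA		;pull the saved X value (last pushed, first pulled)
	TAX		;..and put it back in X
	PLA		;pull the saved A value
	RTS		;A and X are back to what the caller had
```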

TXS and TSX make manipulation of the Stack possible. A good idea at or near the start of a 6502 program would be to reset the stack pointer, like so:
    LDX #$FF    ;stack pointer should point to the top of the stack
    TXS

The flag instructions (CLC, SEC, CLI, SEI, CLD, SED, CLV) should be self-explanatory. There is no SEV (Set Overflow) command despite the existence of CLV, but hell if I know why you'd need it. ;)

CLC and SEC are important in arithmetic operations, as described in that section.

The NES' 6502 does not support Decimal mode. No great loss as far as I'm concerned. Still, I usually CLD before running code, just to be sure.

Practice:
Quick initialization code

This is not "initialization" in the sense of activation, but simply a few commands to be run at the start of the program, to ensure all actors are in place before you begin filming:
    SEI     ;set Interrupt-Disable
    CLD     ;deactivate Decimal mode
    LDX #$FF
    TXS
    INX     ;X = zero
    STX ...     ;use this to clear out whatever you need

NOP is self-explanatory: it does nothing. Its use is limited: you could use it if you were hacking code in a pre-existing ROM and wanted to remove a part of the code, or to add weight to a slowdown routine. Its most highfalutin use is in careful CPU timing loops.
stall:	LDY #$00
stall_a:	LDX #$00
stall_b:	NOP
	DEX
	BNE stall_b
	DEY
	BNE stall_a
	RTS
(Alternatively, you could create a dynamic (variable) stall effect by LDY'ing a number (the higher the number, the more it stalls) and JSR'ing to "stall_a" instead of "stall".)

BRK is the hardest to define. I won't cover it here because you simply won't need it.

So, there's your crash course in 6502 Assembly. This document is by no means complete, but should be more than enough to keep you busy for a while. ;) However, except for BRK, I have covered the entire 6502 Instruction set!

Here's some food for thought: if making a full-size NES game is writing a novel, then you've just learned the alphabet.
