Appendix C. ARM Assembler Tutorial

Table of Contents

Register Overview
Register Explanations
Link Register
Subroutine Calls
Syntax
In Practice
Immediate Values
Instruction Timings
Compares with Branches

Register Overview

The ARM processor has 31 general-purpose registers, including a program counter, and 6 status registers. All registers are 32 bits wide, but only 12 of the 32 bits are implemented in the status registers. Of these, 15 general-purpose registers (R0 to R14), one or two status registers and the program counter are visible at any time.

The general-purpose registers can be divided into two categories, unbanked (R0 to R7) and banked (R8 to R14). The unbanked registers refer to the same physical registers in all processor modes whereas the physical register of a banked register depends on the specific processor mode.

Register Explanations

Although they are general-purpose, some of the registers have a special use in applications. For example, the program counter is a general-purpose register like any other and can be referred to as R15.

As specified in the ARM calling standard, the first four registers, R0-R3, are temporary registers used for passing parameters to subroutines. If a subroutine takes more than four parameters, the rest are passed on the stack. These registers therefore cannot be expected to keep their contents across subroutine calls.

The rest of the unbanked registers, R4-R7, should always be saved by a subroutine before it uses them. The same goes for the banked registers R8-R12. Their contents are expected to be preserved across subroutine calls, with the exception of R12. The calling standard designates R12 as the intra-procedure-call scratch register (IP), so GCC uses it as a temporary register and its contents are not preserved across subroutine calls.

Registers R13, R14 and R15 have special uses: R13 is the stack pointer, R14 the link register and R15 the program counter. Of these, only the program counter has restrictions on its use. The stack pointer and link register can be used as general-purpose registers at any time, but their contents should be saved first.

R13, R14 and R15 also have predefined alternative names. The stack pointer can be referred to as SP, the link register as LR and the program counter as PC.

Link Register

The link register has a special use in the ARM architecture. In each processor mode, the mode's own version of LR holds the return address of a subroutine; if an exception occurs, the appropriate exception mode's version of LR is set to the exception return address. When a subroutine call is performed by a BL or BLX instruction, LR is set to the subroutine return address. The subroutine returns by copying LR to the program counter. This is normally done in one of the following two ways:

  • By executing either of these instructions:
    MOV PC, LR
    BX LR
    
  • On subroutine entry, storing LR to the stack with other registers:
    STMFD SP!, {<registers>, LR}
    
    and loading it straight to the program counter:
    LDMFD SP!, {<registers>, PC}
    

Subroutine Calls

According to the ARM calling standard, when a subroutine is entered, the parameters passed to it have been stored in registers R0-R3 and on the stack. If the subroutine is going to use registers other than R0-R3, their contents should be saved before use. Multiple registers can be saved with a single instruction, which should be used whenever more than two registers have to be stored.

Example 1: the subroutine uses registers R0-R6, so registers R4-R6 should be stored on entry:

STMFD SP!, {R4-R6}	; Stores registers R4-R6 into stack

and restored on exit:

LDMFD SP!, {R4-R6}	; Loads values from stack into registers R4-R6
MOV PC, LR		; Returns from subroutine

Example 2: the subroutine uses registers R0-R6 and the link register, so registers R4-R6 and the link register should be stored on entry:

STMFD SP!, {R4-R6, LR}	; Stores registers R4-R6 and LR into stack

and restored on exit:

LDMFD SP!, {R4-R6, PC}	; Loads values from stack into registers R4-R6 and program counter
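
The effect of the STMFD/LDMFD pair in Example 2 can be sketched with a small Python model of a full-descending stack. This is only an illustration, not ARM code: the functions `stmfd`, `ldmfd` and `run_example`, the dictionary-based memory and the start values (SP = 0x8000, LR = 0x1234) are all assumptions made for the sketch.

```python
# Toy model of the full-descending stack used by STMFD/LDMFD.
# All Python names here are illustrative; they are not part of any ARM tool.

REG_NAMES = [f"R{i}" for i in range(13)] + ["SP", "LR", "PC"]
REG_INDEX = {name: index for index, name in enumerate(REG_NAMES)}


def stmfd(regs, memory, reglist):
    """Model STMFD SP!, {reglist}: lowest-numbered register at lowest address."""
    ordered = sorted(reglist, key=REG_INDEX.get)
    regs["SP"] -= 4 * len(ordered)            # the stack grows downwards
    for slot, name in enumerate(ordered):
        memory[regs["SP"] + 4 * slot] = regs[name]


def ldmfd(regs, memory, reglist):
    """Model LDMFD SP!, {reglist}: reload registers and release the stack space."""
    ordered = sorted(reglist, key=REG_INDEX.get)
    base = regs["SP"]
    for slot, name in enumerate(ordered):
        regs[name] = memory[base + 4 * slot]
    regs["SP"] = base + 4 * len(ordered)


def run_example():
    """Replay Example 2 with made-up register contents."""
    regs = dict.fromkeys(REG_NAMES, 0)
    regs.update(R4=4, R5=5, R6=6, SP=0x8000, LR=0x1234)
    memory = {}
    stmfd(regs, memory, ["R4", "R5", "R6", "LR"])    # subroutine entry
    regs.update(R4=0xAA, R5=0xBB, R6=0xCC, LR=0xDD)  # the body clobbers them
    ldmfd(regs, memory, ["R4", "R5", "R6", "PC"])    # exit: saved LR lands in PC
    return regs
```

In this model, `run_example()` ends with R4-R6 holding their original values, SP back at 0x8000 and PC holding the saved return address, which is why the single LDMFD both restores the registers and returns.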

Syntax

In the ARM architecture, almost all instructions can be executed conditionally, and many can be chosen to update or not to update the status register. Many different addressing modes are also available. ARM assembler instructions mostly take these forms:

<opcode>{<cond>}{S} <Rd>, <Rn>, <addressing_mode>
<opcode>{<cond>}{S} <Rd>, <addressing_mode>

where Rd is the destination register, Rn is the first source register, <cond> is an optional condition code and S is the flag that makes the instruction update the status register.

Examples:

  • Increment R0 by one:
    ADD R0, R0, #1
    
  • Increment R1 by one and put the result in R0:
    ADD R0, R1, #1
    
  • Multiply R1 by ((2^8)+1), put the result in R2, update the status register and load R0 with zero if the addition overflows:
    ADDS R2, R1, R1, LSL #8
    MOVVS R0, #0
    
  • Load a 32-bit value from the memory pointed to by R3 into R0, add R1 shifted left by the value in R2 to R0, load a 16-bit value from the memory pointed to by R4 into R5 and then increment R4 by the value in R6:
    LDR R0, [R3]		; Address in R3 has to be word (4-byte) aligned
    ADD R0, R0, R1, LSL R2
    LDRH R5, [R4], R6	; Address in R4 has to be halfword (2-byte) aligned
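
What the S suffix and the shifted operand do in the ADDS example above can be sketched in Python. The helper names `lsl` and `adds` are invented for this sketch, and the model covers only a 32-bit addition and its four condition flags.

```python
MASK32 = 0xFFFFFFFF


def lsl(value, amount):
    """Logical shift left, truncated to 32 bits."""
    return (value << amount) & MASK32


def adds(rn, operand2):
    """Model a 32-bit ADDS: return (result, flags)."""
    total = rn + operand2
    result = total & MASK32
    flags = {
        "N": result >> 31,                 # result is negative when read as signed
        "Z": int(result == 0),             # result is zero
        "C": int(total > MASK32),          # unsigned carry out of bit 31
        # Signed overflow: both operands differ in sign from the result.
        "V": (((rn ^ result) & (operand2 ^ result)) >> 31) & 1,
    }
    return result, flags


# ADDS R2, R1, R1, LSL #8 with R1 = 3 gives 3 * 257 and no overflow.
small, small_flags = adds(3, lsl(3, 8))

# With R1 = 0x40400000 the addition overflows, so MOVVS would execute.
big, big_flags = adds(0x40400000, lsl(0x40400000, 8))
```

Note that V reports signed overflow of the addition only. Bits that the LSL #8 shifts out of the top of R1 are lost without setting any flag.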
    

In Practice

Some considerations are important when using ARM assembler. The architecture has some oddities, and instruction timings also have to be taken into account.

Immediate Values

The ARM architecture has an unusual way of encoding immediate values in operations. An immediate value is stored in the instruction itself as an 8-bit constant and a 4-bit rotation field; the constant is rotated right by twice the field value, so the rotation is always an even number of bits (0, 2, 4, 6, ..., 26, 28, 30). Not every 32-bit value can therefore be used as an immediate. When a value cannot be encoded, one can load it into a register from the literal pool and use the register in place of the immediate value:

  • Load R0 with hex 1000:
    MOV R0, #0x1000
    
  • Load R0 with hex FFFFFFFF:
    MOV R0, #0xFFFFFFFF	; Not encodable as written; assemblers typically emit MVN R0, #0
    
  • Load R0 with hex 1004:
    LDR R0, =0x1004		; MOV R0, #0x1004 is illegal
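
Whether a given constant fits the 8-bit-plus-rotation scheme can be checked mechanically. The following Python sketch tries all sixteen rotations; the function names `rotate_left` and `encode_immediate` are made up for this example.

```python
MASK32 = 0xFFFFFFFF


def rotate_left(value, amount):
    """Rotate a 32-bit value left by amount bits."""
    amount %= 32
    return ((value << amount) | (value >> (32 - amount))) & MASK32


def encode_immediate(value):
    """Return (imm8, rotate_field) if value is a valid immediate, else None.

    The encoded value is imm8 rotated right by 2 * rotate_field, so rotating
    the target left by the same amount must leave at most 8 significant bits.
    """
    value &= MASK32
    for rotate_field in range(16):
        imm8 = rotate_left(value, 2 * rotate_field)
        if imm8 <= 0xFF:
            return imm8, rotate_field
    return None
```

In this sketch `encode_immediate(0x1000)` returns `(1, 10)`, that is, the constant 1 rotated right by 20 bits, while `encode_immediate(0x1004)` and `encode_immediate(0xFFFFFFFF)` both return `None`. The bitwise inverse of 0xFFFFFFFF is zero, which is encodable, and that is what makes the MVN form possible.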
    

Loading from the literal pool can cause a cache miss. Where possible, it should be avoided by using multiple instructions to build the value in a register:

  • Load R0 with hex 1004:
    MOV R0, #0x1000
    ORR R0, R0, #4
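
One simple way to find such a MOV/ORR sequence is to peel off 8-bit windows that start at an even bit position, working upwards from the lowest set bit. The Python sketch below does this greedily; `split_into_immediates` is a hypothetical helper, and the split it finds is not guaranteed to be the shortest possible one because it ignores windows that wrap around bit 31.

```python
MASK32 = 0xFFFFFFFF


def split_into_immediates(value):
    """Split a 32-bit value into chunks that are each a valid immediate."""
    value &= MASK32
    chunks = []
    while value:
        lowest = (value & -value).bit_length() - 1   # index of the lowest set bit
        start = lowest & ~1                          # round down to an even position
        chunk = value & ((0xFF << start) & MASK32)   # 8-bit window at that position
        chunks.append(chunk)
        value &= ~chunk                              # remove the bits just covered
    return chunks
```

For 0x1004 this returns `[0x4, 0x1000]`. One chunk goes into the MOV and the other into an ORR; since ORR is commutative over the chunks, the order used in the listing above is equally valid. At most four chunks are ever needed with this method.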
    

Instruction Timings

Almost all instructions in the ARM architecture take a single clock cycle, but some have a result delay: the processor stalls if the next instruction tries to use the result of the current one.

Loading a value from memory has a result delay of one cycle, so the instruction immediately following a load should not use the loaded value:

LDR R0, [R1]
ADD R0, R0, #1		; 1 cycle stall
SUB R2, R2, R3		; Instructions take 4 cycles in total

The following order should be used instead:

LDR R0, [R1]
SUB R2, R2, R3
ADD R0, R0, #1		; Instructions take 3 cycles in total
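
The cycle counts in the two listings can be reproduced with a very small Python model. It assumes exactly the rule stated above (one cycle per instruction, plus one stall cycle when the instruction right after an LDR reads the loaded register) and nothing else about the pipeline. The tuple format and the name `count_cycles` are inventions of this sketch.

```python
def count_cycles(program):
    """Count cycles for a list of (mnemonic, destination, sources) tuples."""
    cycles = 0
    previous = None
    for mnemonic, destination, sources in program:
        cycles += 1                                   # base cost of the instruction
        if previous is not None:
            prev_mnemonic, prev_destination, _ = previous
            # Load-use hazard: the value is not ready for the very next instruction.
            if prev_mnemonic == "LDR" and prev_destination in sources:
                cycles += 1
        previous = (mnemonic, destination, sources)
    return cycles


stalling = [
    ("LDR", "R0", ("R1",)),
    ("ADD", "R0", ("R0",)),        # uses R0 straight after the load
    ("SUB", "R2", ("R2", "R3")),
]

reordered = [
    ("LDR", "R0", ("R1",)),
    ("SUB", "R2", ("R2", "R3")),   # independent work fills the delay slot
    ("ADD", "R0", ("R0",)),
]
```

Here `count_cycles(stalling)` gives 4 and `count_cycles(reordered)` gives 3, matching the comments in the listings.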

The multiplication and multiplication/add instructions suffer from the same kind of result delay, only longer. The result delay of a multiplication is between 1 and 3 clock cycles, depending on the bit pattern of the multiplier operand. This should be taken into account when using multiplication instructions, just as when loading a value from memory.

Compares with Branches

Unlike most architectures, ARM can execute almost every instruction conditionally, instead of having to compare and then branch according to the result. One should try to take advantage of this. For example, the following sequence:

MOV R0, R1		; Copy R1 into R0
CMP R0, #4		; Compare R0 against the value 4
BEQ label		; If R0 is 4 jump to label
MOV R2, #0		; Otherwise load R2 with zero
B done			; and skip the load at label
label
MOV R2, #1		; Load R2 with one
done

can be written more efficiently without any branch:

MOV R0, R1		; Copy R1 into R0
CMP R0, #4		; Compare R0 against the value 4
MOVNE R2, #0		; If R0 is not 4, load R2 with zero
MOVEQ R2, #1		; If R0 is 4, load R2 with one
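
How CMP and the condition suffixes interact can be sketched in Python. The model below computes the four flags the way a 32-bit subtraction would and then evaluates a few condition codes; the names `cmp_flags`, `CONDITIONS` and `select` are hypothetical, and only EQ, NE, GE and LT are included.

```python
MASK32 = 0xFFFFFFFF


def cmp_flags(rn, operand2):
    """Flags set by CMP rn, operand2 (a subtraction whose result is discarded)."""
    rn &= MASK32
    operand2 &= MASK32
    result = (rn - operand2) & MASK32
    return {
        "N": result >> 31,
        "Z": int(result == 0),
        "C": int(rn >= operand2),                        # set when there is no borrow
        "V": ((rn ^ operand2) & (rn ^ result)) >> 31,    # signed overflow
    }


CONDITIONS = {
    "EQ": lambda flags: flags["Z"] == 1,
    "NE": lambda flags: flags["Z"] == 0,
    "GE": lambda flags: flags["N"] == flags["V"],
    "LT": lambda flags: flags["N"] != flags["V"],
}


def select(r1):
    """Model CMP R0, #4 followed by MOVNE R2, #0 and MOVEQ R2, #1."""
    flags = cmp_flags(r1, 4)
    r2 = None
    if CONDITIONS["NE"](flags):
        r2 = 0            # MOVNE R2, #0
    if CONDITIONS["EQ"](flags):
        r2 = 1            # MOVEQ R2, #1
    return r2
```

Exactly one of the two conditional moves takes effect for any input, so `select(4)` gives 1 and every other value gives 0, with no branch involved.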