Overview

Toe is an instruction set for a 64-bit little-endian CPU. In many respects, it is similar to the 64-bit subset of ARMv8 or AMD64. Compared to those, it is simpler and more regular, and has a higher code density. Its instructions are smaller, but it requires more instructions to express the same code.

I named it "Toe" because one of my previous 64-bit instruction sets was called "Leg", which was a play on "ARM", and because ARM designed an instruction set called "Thumb" which, like "Toe", has 16-bit instructions.

Toe is missing a number of features, including long arithmetic, floating-point arithmetic, vector instructions, and synchronization primitives. These could be added, possibly at the expense or removing something else.

Implementation

While it would be possible to make an in-order CPU that executes the Toe instruction set, it would not be fast compared to (say) an ARM Cortex A53. There is some scope for a compiler to statically reorder Toe instructions such that neighbouring instructions are independent, but much of the freedom has been removed in order to make the code denser. I will therefore say that Toe is designed on the assumption that the CPU will reorder the instructions to find parallelism and hide latency.

The state of the machine is defined after every 16-bit instruction, so it would be possible to make a machine that literally executes the individual instructions. It would be better, however, to merge some of the simpler instructions into the following instructions so that they don't have to be executed at all. This applies to:

Immediate constants, which can be inserted into the subsequent instruction that uses them.
MOV instructions, which can be handled using register renaming.
Conditional and unconditional jumps to the immediately following instruction.

Registers

There are 24 general-purpose registers R0 to R23, and eight short-term registers S0 to S7. General purpose registers can be written in any order. In contrast, the short-term registers hold the results of the most recent instructions, i.e. they are used as FIFO storage.

There is a program counter PC, which holds the address of the current instruction. It is always even.

There are four condition flags: N, Z, C and V. These are set by AND, ADD and SUB instructions, and are preserved by all other instructions. They are tested by conditional branch instructions.

There is a constant continuation flag Q, which is cleared by exery instruction except #n..., which sets it.

Instruction set

Instructions are always 16 bits long, and I will name the bits with hexadecimal digits.

Notation

To describe the instructions, I will use some special notation:

x:y is bits x to y (inclusive) of the instruction.
z[x:y] is bits x to y (inclusive) of z.
[addr] is the contents of memory starting at addr.
sxt(x) is x sign-extended.
uxt(x) is x zero-extended.
even(x) is x or its bitwise inverse, whichever is even.
Ss is a short-term register encoded in 5:7.
Rr is a general-purpose register encoded in 0:4.
Sr is a short-term register encoded in 0:2.
RSr is either Rr or Sr; in the latter case 3:4 is 11.
Fr is a unary operation encoded in 0:2.
w is a width: b, h, s or d, meaning 1, 2, 4 or 8 bytes.
v is a restricted width: s or d only.

Ubiquitous behaviour

The following should be prepended to the meaning of every instruction:

HERE <- PC
PC <- PC + 2

The following should be appended to the meaning of every instruction (RESULT is the 64-bit value computed by the instruction):

S7 <- S6
S6 <- S5
S5 <- S4
S4 <- S3
S3 <- S2
S2 <- S1
S1 <- S0
S0 <- RESULT

The following should be appended to every instruction except #n...:

Q <- 0

Decoding

If f is 0, the instruction is a dataflow instruction.
Instructions where e:f is 10 are reserved.
If d:f is 110, the instruction is a constant continuation.
If c:f is 1110, the instruction is a direct jump of some sort.
If b:f is 11110, the instruction is an immediate constant.
If b:f is 11111 and 8:a is not 111, the instruction is a conditional branch.
If 7:f is 111111110, the instruction is an indirect jump of some sort.
If 7:f is 111111111, the instruction is a privileged instruction.

Dataflow

Dataflow instructions are used to read and write registers, to perform arithmetic, and to access memory. They are distinguished by f being 0. The general encoding is as follows:

Bits    Name    Meaning
0:4     r       A general-purpose register number. Used as a source and maybe
                as a destination.
5:7     s       A short-term register number. Used as a source only.
8:d     op      Determines which operation to perform.
e:e     mode    Determines the sources and destinations.

If the bits 3:4 are 11, then r does not encode a general-purpose register number, and the instruction changes its meaning. Bits 0:2 are then available to encode a short-term register number, or something else.

Modes

There are four modes, distinguished by bit e and by whether r encodes a general-purpose register number. All modes perform the same operation; they differ only in their operands and their destination register(s).

e       3:4     Written     Meaning
0       not 11  op Ss, Rr   RESULT <- Rr; Rr <- op(Ss, Rr)
0       11      op Ss, Sr   RESULT <- op(Ss, Sr)
1       not 11  op Rr, Ss   RESULT <- op(Rr, Ss)
1       11      op Fr, Ss   RESULT <- op(Fr(S0), Ss)

Binary operations

The available op options are:

8:d     Written     Meaning of op(x, y)     Notes
000000  MOV         x
000001  AND         x & y
000010  OR          x | y
000011  XOR         x ^ y
000100  LSR         uxt(x >> y)
000101  SEL         S0 != 0 ? x : y
000110  RLSR        uxt(y >> x)
000111  RSEL        S0 != 0 ? y : x
00100v  ASR.v       sxt(x >> y)
00101v  RASR.v      sxt(y >> x)
00110v  SL.v        x << y
00111v  RSL.v       y << x
01000v  MUL.v       x * y
01001v  ADD.v       x + y
01010v  SUB.v       x - y
01011v  RSUB.v      y - x
011xxx
1000ww  LD.w       uxt([x + w * y])
1001ww  LS.w       sxt([x + w * y])
1010ww  AD.w       x + w * y
1011ww  ST.w       x + w                   side-effect [x] <- y
11xxxx

Instructions ending .v have a 32-bit and a 64-bit version, written as .s and .d resepctively. The 32-bit version treats bit 31 as the sign bit, and clears the upper 32 bits of the result. Some operations, including MOV, AND, OR and XOR, do not have separate versions.

Instructions ending .w have 8-bit, 16-bit, 32-bit and 64-bit versions, written as .b, .h, .s and .d respectively. In their description, w means 1, 2, 4, or 8 respectively, and [address] means the bytes from address to address + w - 1 inclusive. The address calculation is done using 64-bit arithmetic, and RESULT is 64 bits wide, regardless of w.

Most asymmetric operations have a counterpart that performs the same operation but with the operands swapped. The two opcodes differ in bit 9, and in most cases their mnemonic differs by the prefix R. The ability to swap the operands is useless if they are both short-term registers, but it is useful otherwise.

Some operations, including SEL, use S0 as an additional operand.

Setting the condition flags

TODO: Define that MUL sets flags. Similar to ADD.

ADD.w x y sets the condition flags as follows:

N is set to the sign bit of RESULT.
Z is set if RESULT == 0
C is set if the addition carries, i.e. if the unsigned value of RESULT differs from the sum of the unsigned value of x and the unsigned value of y.
V is set if the addition overflows, i.e. if the signed value of RESULT differs from the sum of the signed value of x and the signed value of y.

SUB.w x y sets the condition flags as follows:

N is set to the sign bit of RESULT.
Z is set if RESULT == 0
C is clear if the subtraction borrows, i.e. if the unsigned value of RESULT differs from the difference of the unsigned value of x and the unsigned value of y.
V is set if the subtraction overflows, i.e. if the signed value of RESULT differs from the difference of the signed value of x and the signed value of y.

AND x y serves as both a 64-bit and a 32-bit operation. It sets the condition flags as follows:

N is set to the bit 3f of RESULT.
Z is set if RESULT == 0.
C is set to the bit 1f of RESULT.
V is set if RESULT[0:1f] == 0.

All other operations preserve the condition flags. AD.b might be useful as a version of ADD.d that does not affect the flags. RSUB.w similarly does not affect the flags.

Unary operations

When the first operand is S0 (the result of the previous instruction), there is an opportunity to apply a unary function to it before using it. This is indicated by Fr in the decoding table above. This mechanism is also used to provide immediate constants and PC-relative addressing, and some simple arithmetic operations.

If Q is clear, then the previous instruction was not a constant continuation, and the available Fr options are:

0:2     Written     Meaning of Fr(S0)
000     #0          0
001     #1          1
010     #2          2
011     #3          3
100     #4          4
101     BIT         S0 & 1
110     INC         S0 + 1
111     NOT         ~S0

If Q is set, then the value of S0 is from a constant continuation, and the available Fr options are:

0:2     Written     Meaning of Fr(S0)
000     #n          logic_immediate(S0)
001     #n          S0 << 8
010     #n          S0 << 16
011     #n          S0 << 24
100     label       HERE + even(S0)
101     [label]     [HERE + even(S0)]
110     PR_n        Performance counter `n`
111     #n          ~S0

Logic immediate constants

The logic_immediate() unary function decodes the bottom 13 bits of S0, which must come from a #n... instruction, and constructs a useful bitmask.

The idea is similar to ARMv8's encoding of immediate constants for logic instructions, but differs in the details. Toe can encode a few bitmasks that ARMv8 can't.

S0[0:c]         Test    Meaning
nyyyyyyxxxxxx   y > x   (n ? -1 : 0) ^ 0x0000000000000001 * ((1 << y) - (1 << x))
n0yyyyy1xxxxx   y > x   (n ? -1 : 0) ^ 0x0000000100000001 * ((1 << y) - (1 << x))
n00yyyy11xxxx   y > x   (n ? -1 : 0) ^ 0x0001000100010001 * ((1 << y) - (1 << x))
000yyyy01xxxx           0x0101010101010101 * ((y << 4) + x)

Note that only 5776 of the 8192 possible 13-bit patterns encode a bitmask, and that all these bitmasks are distinct. Other 12-bit patterns should not be used. 0 and -1 cannot be encoded as a logic immediate.

Performance counters

The available performance counters are as follows:

Name    Meaning
PR_0    Clock cycles since reset
PR_1    Instructions retired since reset
PR_2    Nanoseconds since reset
PR_3    A 64-bit random number

Constant continuation

Small immediate constants are encoded as part of the instruction that uses them. Larger immediate constants require one or more constant continuation instructions before the instruction that uses them. A constant continuation instruction is distinguished by d:f being 110.

It is possible to assemble a constant continuation instruction explicitly using the assembly language syntax #n... (in which n is the 13-bit constant). However, it is more usual to allow the assembler to construct the additional instructions. If you write a single assembly language instruction that includes an immediate constant, the assembler will prepend as many #n... instructions as are necessary to express the constant's value.

You can force the assembler to assemble a specified number of instructions using an assembly language prefix. For example, 3.CALL label is guaranteed to assemble two #n... instructions followed by a CALL (or to report an error). This feature is useful in code that will be relocated and linked, in which the final value of label might not be known. In order to support this feature, longer #n... prefixes can encode all the values encodeable by shorter sequences.

d:f     Written         Meaning
110     #n...           RESULT <- uxt(0:c) | (Q ? (S0 << 13) : 0)
                        Q <- 1

This is the only instruction which sets Q to 1. All other instructions set Q to 0. Therefore, the Q flag indicates whether the previous instruction was #n.... If so, the value it computed is used to supply 13 additional bits to the following instruction. Notably, it can be used to supply the high bits of an immediate constant in the immediately following instruction, but it has other uses too.

Direct jump

A direct jump instruction jumps to a constant program address. It is distinguished by c:f being 1110.

b:f     Written         Meaning
11100   JUMP label      RESULT <- PC
                        PC <- HERE + even(uxt(0:a) | (Q ? (S0 << 11) : 0))
11101   CALL label      RESULT <- PC
                        PC <- HERE + even(uxt(0:a) | (Q ? (S0 << 11) : 0))

The only difference between JUMP and CALL is only in the behaviour of the branch predictor. CALL pushes the return address onto the predictor's return stack, whereas JUMP doesn't. Both compute a return address, though it is usually useful after a CALL.

Valid code addresses are all even numbers, so bit 0 is available to indicate a negative jump offset. The even() function (defined earlier) maps the odd values to even negative values.

When not preceded by a constant continuation, direct jump instructions can jump up to 2K in either direction. Each preceding constant continuation instruction multiplies the range by 8192. With one, the range is 16M, which is large enough to reach anywhere in a static binary. With two, the range is 128G, which is large enough to reach a dynamic library loaded at a random address.

Immediate constant

An immediate constant instruction computes an unsigned constant result. It is distinguished by b:f being 11110.

b:f     Written         Meaning
11110   #n              RESULT <- uxt(0:a) | (Q ? (S0 << 11) : 0)

There are a few alternative ways of expressing immediate constants for use in dataflow instructions which might be shorter or otherwise preferable (see "Unary operations"):

Code addresses are encodeable in a PC-relative way.
0 and 1 are encodeable as part of arithmetic instructions.
Negative constants are encodeable as the bitwise inverse of positive constants.
Patterned bitmasks are encodeable as logic immediates.
If the same constant is needed twice in quick succession, one immediate constant instruction suffices.
Constants larger than 26 bits are best loaded from a PC-relative address.

When not preceded by a constant continuation, an immediate constant instruction can express an 11-bit unsigned number. Each preceding constant continuation instruction adds 13 more bits. Using more than two constant continuation instructions, though possible, is in most situations inferior to using a PC-relative load.

Conditional branch

Conditional branch instructions are used to implement "if". They test a condition code: a boolean function of the N, Z, C and V flags. If the test passes, the instruction does a direct jump.

7:f         Written         Meaning
11111ccc0   B.cond label    if cond(N, Z, C, V) {
                                RESULT <- -1
                                PC <- HERE + even(uxt(0:6) | (Q ? (S0 << 7) : 0))
                            } else {
                                RESULT <- 0
                            }
11111ccc1   B.cond label    if !cond(N, Z, C, V) {
                                RESULT <- -1
                                PC <- HERE + even(uxt(0:6) | (Q ? (S0 << 7) : 0))
                            } else {
                                RESULT <- 0
                            }

There are only seven condition codes. If 8:a is 111 then the instruction is not a conditional branch instruction.

The negated form of B.cond is distinguished by setting bit 7. In assembly language, the negated form is indicated by changing the name of the condition code.

Note that a conditional branch to the immediately following instruction can be useful for computing a boolean value.

When not preceded by a constant continuation, conditional branch instructions can jump up to 128 bytes in either direction: a factor of 16 less than a direct jump. Each preceding constant continuation instruction multiplies the range by 8192. With one, the range is 1M, and with two, the range is 8G.

Conditional codes

The available condition codes are similar to the ones that ARM uses. Each has two spellings in assembly language, one of which indicates that the boolean should be inverted:

8:a     Written     Inverted    Meaning         Interpretation after SUB x y
000     ZS or EQ    ZC or NE    Z               x == y
001     CS or HS    CC or LO    C               x >= y (unsigned)
010     NS or MI    NC or PL    N               (x - y) < 0
011     VS          VC          V               x - y overflows (signed)
100     HI          LS          C & !Z          x > y (unsigned)
101     GE          LT          N == V          x >= y (signed)
110     GT          LE          N == V & !Z     x > y (signed)
111     Not a condition code

The formal meaning of the condition codes is in the "meaning" column. It is defined in terms of the condition flags. The section "Setting the condition flags" below explains how the condition flags are set by various arithmetic instructions. Combining the two, you can implement a variety of useful tests. The "interpretation" column gives some examples: the tests that can be achieved by preceding each condition code with a SUB instruction.

Indirect jump

An indirect jump instruction jumps to a computed code address. It is distinguished by 7:f being 111111110.

5:f             0:4     Written         Meaning
11111111000     RSr     JUMP RSr        RESULT <- PC; PC <- RSr
11111111001     RSr     CALL RSr        RESULT <- PC; PC <- RSr
11111111010     RSr     RET RSr         RESULT <- PC; PC <- RSr
11111111011     n       SWI #n          Enter supervisor mode.
                                        RESULT <- PC; PC <- [0]

The difference between RET, JUMP and CALL is only in the behaviour of the branch predictor. RET pops the return address from the predictor's stack, CALL pushes the return address onto the predictor's return stack, and JUMP doesn't use the stack.

SWI generates an interrupt. This can be used for a number of purposes, but the most common is to perform a system call.

Privileged instructions

5:f             0:4     Written     Meaning
11111111100     RSr     MMAP        Map physical page RSr at virtual page 0

TODO.

TODO

Exceptions
MMAP
Make Fr depend more on Q.
Replace SEL with S.cond?
In arithmetic instructions, use bit b (?) to swap operands, not bit 9.