Computer Architecture Lab/Winter2006/PoettschacherRosenblattlWolf/ThreeMicroDiscussion

Atmel AVR
AVR is commonly said to stand for "Advanced Virtual RISC". The AVR is an 8-bit RISC architecture that was designed by two Norwegian students, Alf-Egil Bogen and Vegard Wollan. It is generally used in microcontrollers.

Details
The AVR uses a Harvard architecture with separate memories and buses for program and data. It implements single-level pipelining: while one instruction is being executed, the next one is fetched from program memory. Most operations are performed in a single clock cycle.

Like other RISC architectures, the AVR offers 32 8-bit general purpose working registers (R0 – R31) with single cycle access time. This allows single-cycle ALU operations. In a typical ALU operation, two operands are output from the Register File, the operation is executed, and the result is stored back in the Register File – in one clock cycle.

Six of the general purpose working registers can be used as three 16-bit indirect address register pointers for Data Space addressing.
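
As a minimal sketch of how such a register pair forms an address, the following Python snippet models the register file as a list of 32 bytes and combines two 8-bit registers into one 16-bit pointer. The pair assignment (X = R27:R26, Y = R29:R28, Z = R31:R30, low byte in the lower-numbered register) follows the usual AVR convention. The names `regs`, `read_pointer` and `write_pointer` are hypothetical and only serve the illustration.

```python
# Hypothetical model of the AVR register file: 32 registers of 8 bits each.
regs = [0] * 32

# Index of the low-byte register of each 16-bit pointer pair.
POINTER_LOW = {"X": 26, "Y": 28, "Z": 30}

def read_pointer(name):
    """Combine the two 8-bit registers of a pair into a 16-bit address."""
    lo = POINTER_LOW[name]
    return (regs[lo + 1] << 8) | regs[lo]

def write_pointer(name, address):
    """Split a 16-bit address into low and high byte and store both."""
    lo = POINTER_LOW[name]
    regs[lo] = address & 0xFF
    regs[lo + 1] = (address >> 8) & 0xFF

write_pointer("Z", 0x01A4)
print(hex(regs[30]), hex(regs[31]), hex(read_pointer("Z")))
```

The sketch only shows how the address is composed; it does not model the indirect load and store instructions that use these pointers.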

The AVR offers conditional and unconditional jump and call instructions, with which the whole address space can be addressed directly.

Instruction set
The AVR’s instruction set is of the register-register type. It can be divided into five categories:
 * ARITHMETIC AND LOGIC INSTRUCTIONS
 * BRANCH INSTRUCTIONS
 * DATA TRANSFER INSTRUCTIONS
 * BIT AND BIT-TEST INSTRUCTIONS
 * MCU CONTROL INSTRUCTIONS

This gives a total of 130 instructions.

The AVR uses a two-address format, so the result of an operation overwrites one of its operands. Most AVR instructions are encoded as a single 16-bit word, and most of them are executed in a single clock cycle.
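
The two-address format and the single 16-bit word can be illustrated with the ADD instruction. The sketch below assumes the bit pattern `0000 11rd dddd rrrr` given for `ADD Rd, Rr` in the AVR instruction set documentation. The function names are hypothetical, and the status flags that a real ADD updates are deliberately left out.

```python
def encode_add(rd, rr):
    """Encode 'ADD Rd, Rr' as one 16-bit word: 0000 11rd dddd rrrr."""
    if not (0 <= rd <= 31 and 0 <= rr <= 31):
        raise ValueError("register numbers must be in 0..31")
    return (0b000011 << 10) | ((rr & 0x10) << 5) | (rd << 4) | (rr & 0x0F)

def execute_add(regs, rd, rr):
    """Two-address semantics: Rd <- Rd + Rr (8-bit), so Rd is overwritten."""
    regs[rd] = (regs[rd] + regs[rr]) & 0xFF

regs = [0] * 32
regs[24], regs[22] = 200, 100
execute_add(regs, 24, 22)
print(hex(encode_add(24, 22)), regs[24], regs[22])
```

After the call, R24 holds the 8-bit sum and the old value of R24 is gone, while R22 is unchanged. This is the practical consequence of having only two register fields in the instruction word.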

General Purpose Register File
The Register File is optimized for the AVR Enhanced RISC instruction set. Most of the instructions operating on the Register File have direct access to all registers. To achieve the required performance and flexibility, the following input/output schemes are supported by the Register File:
 * One 8-bit output operand and one 8-bit result input.
 * Two 8-bit output operands and one 8-bit result input.
 * Two 8-bit output operands and one 16-bit result input.
 * One 16-bit output operand and one 16-bit result input.
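
As an example of the scheme with two 8-bit output operands and one 16-bit result, the following sketch models an unsigned multiply in the style of the AVR `MUL` instruction, which places its 16-bit product in the register pair R1:R0. The function name and the list-based register file are assumptions made for illustration; status flags are not modelled.

```python
def mul(regs, rd, rr):
    """Unsigned 8-bit x 8-bit multiply; the 16-bit product goes to R1:R0."""
    product = regs[rd] * regs[rr]
    regs[0] = product & 0xFF          # low byte of the result
    regs[1] = (product >> 8) & 0xFF   # high byte of the result

regs = [0] * 32
regs[16], regs[17] = 200, 100
mul(regs, 16, 17)
print(hex(regs[1]), hex(regs[0]))
```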

DEC Alpha
DEC Alpha is a 64-bit RISC load-store von Neumann architecture, used in PCs, workstations and servers until further development was cancelled in 2003. In contrast to the other two architectures, Alpha is not designed for microcontrollers, so it has no integrated peripherals (e.g. timers/counters).

Details
Like other non-embedded microprocessors, Alpha is a von Neumann architecture without separate data and program buses. It has 29 general-purpose integer registers (R0-R28), 31 general-purpose floating-point registers (F0-F30), one data frame pointer register (R29), one stack pointer register (R30) and two special registers (R31 and F31) that always read as integer and floating-point zero, respectively. All registers and buses are 64 bits wide, which allows up to 16 exabytes of memory to be addressed. The floating-point registers follow the IEEE 754-1985 format for single and double precision.
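
The zero register and the size of the address space can be sketched in a few lines. The class below is a hypothetical model, not an emulator: it only shows that R31 always reads as zero, that writes to it are discarded, and that a 64-bit byte address covers 2^64 bytes, which is 16 exabytes when counted in binary units (2^60 bytes each).

```python
MASK64 = (1 << 64) - 1

class IntRegisterFile:
    """Hypothetical model of the 32 integer registers; R31 reads as zero."""

    def __init__(self):
        self._regs = [0] * 32

    def read(self, n):
        return 0 if n == 31 else self._regs[n]

    def write(self, n, value):
        if n != 31:                      # writes to R31 are discarded
            self._regs[n] = value & MASK64

rf = IntRegisterFile()
rf.write(31, 12345)
rf.write(1, -1)                          # stored as a 64-bit pattern

address_space_bytes = 1 << 64
exbibytes = address_space_bytes // (1 << 60)
print(rf.read(31), hex(rf.read(1)), exbibytes)
```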

A feature of the Alpha architecture is the lack of a program status register. All instructions operate on registers only, which allows instructions to be executed in parallel without the bottleneck of a single flag register. Together with 7 to 13 pipeline stages (version-dependent) and support for out-of-order execution in version EV6 and above, this made Alpha one of the fastest systems available until further development was cancelled.
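
How a program can work without a flag register is sketched below: a compare writes its result (0 or 1) into an ordinary register, and a conditional branch tests a register. The function names mirror the Alpha mnemonics CMPLT and BNE, but the Python model itself is hypothetical and ignores the 64-bit register width.

```python
def cmplt(regs, ra, rb, rc):
    """Compare: write 1 to Rc if Ra < Rb, else 0. No flags are involved."""
    regs[rc] = 1 if regs[ra] < regs[rb] else 0

def bne_taken(regs, ra):
    """Conditional branch: taken if register Ra is non-zero."""
    return regs[ra] != 0

regs = {1: 5, 2: 9, 3: 0, 4: 7, 5: 2, 6: 0}

# Two compares write to different registers, so neither overwrites the
# other's result, as would happen with one shared flag register.
cmplt(regs, 1, 2, 3)   # R3 <- (5 < 9)
cmplt(regs, 4, 5, 6)   # R6 <- (7 < 2)
print(bne_taken(regs, 3), bne_taken(regs, 6))
```

Because both results live in separate registers, the two compares have no dependency on each other and could in principle be executed in either order or at the same time.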

Implementations of Alpha started with version EV4 (developed by Digital Equipment Corp) in 1992, running at 192 MHz, and ended with version EV7z (developed by HP) in 2002, running at up to 1300 MHz.

For details, see http://en.wikipedia.org/wiki/DEC_Alpha and The Alpha Architecture Handbook

Instruction set
Every instruction is 32 bits wide and comes in one of four formats, described below:

PALcodes
The Privileged Architecture Library (PAL) is an operating-system-dependent set of subroutines, callable by software or hardware.

Branches
Depending on the value of the register Ra, a branch is executed according to the displacement parameter.

Memory Access
Register Rb holds the base address of the memory accessed, to which the sign-extended displacement is added. Register Ra is either loaded with the value at that memory address or stored to that memory address.

Operations
Alpha uses a three-address format, so the three registers Ra, Rb and Rc are accessed: Ra and Rb are the source operands and Rc receives the result. All registers used in one instruction have to be either integer registers or floating-point registers.
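
The four formats described above can be sketched as a small field decoder. The bit positions used here (opcode in bits 31..26, Ra in 25..21, Rb in 20..16, a 21-bit branch displacement, a 16-bit memory displacement, and function code and Rc in the low bits of the integer operate format) are taken from the format diagrams in the Alpha Architecture Handbook as best recalled, so they should be checked against it. All function names are hypothetical, and the example words are assembled from arbitrary illustrative field values.

```python
def field(word, hi, lo):
    """Extract bits hi..lo (inclusive) of a 32-bit instruction word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

def sign_extend(value, bits):
    """Interpret a 'bits'-wide field as a two's complement number."""
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign

def decode_palcode(word):
    return {"opcode": field(word, 31, 26), "function": field(word, 25, 0)}

def decode_branch(word):
    return {"opcode": field(word, 31, 26), "ra": field(word, 25, 21),
            "disp": sign_extend(field(word, 20, 0), 21)}

def decode_memory(word):
    return {"opcode": field(word, 31, 26), "ra": field(word, 25, 21),
            "rb": field(word, 20, 16),
            "disp": sign_extend(field(word, 15, 0), 16)}

def decode_operate(word):
    # Integer operate format, register variant (no literal operand).
    return {"opcode": field(word, 31, 26), "ra": field(word, 25, 21),
            "rb": field(word, 20, 16), "function": field(word, 11, 5),
            "rc": field(word, 4, 0)}

# Memory format: Ra = 4, Rb = 5, displacement = -8 (0xFFF8 as 16 bits).
mem_word = (0x29 << 26) | (4 << 21) | (5 << 16) | 0xFFF8
# Operate format: Ra = 1, Rb = 2, Rc = 3.
op_word = (0x10 << 26) | (1 << 21) | (2 << 16) | (0x20 << 5) | 3

print(decode_memory(mem_word))
print(decode_operate(op_word))
```

With these fields, the effective address of a memory access is the content of Rb plus the decoded displacement. Since every instruction has the same width and the register fields sit at fixed positions, decoding needs nothing more than shifts and masks.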

Infineon TriCore
The Infineon TriCore architecture is very complex and has a huge feature set, which includes
 * 32-bit architecture
 * RISC with DSP instructions
 * Little-endian byte ordering
 * 4-GByte virtual or physical data, program, and input/output address spaces
 * Full-featured memory management system
 * Memory protection
 * 16-/32-bit instructions for reduced code size
 * Fast automatic context switching between two tasks
 * Multiply-accumulate unit
 * Saturating integer arithmetic
 * Bit handling
 * Byte and bit addressing
 * Packed data operations
 * Zero-overhead loop
 * Low interrupt latency
 * Flexible interrupt prioritization scheme
 * Flexible power management
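
To illustrate what "saturating integer arithmetic" in the list above means, the following sketch contrasts an ordinary wrapping 32-bit addition with a saturating one. It is a generic model of the technique, not a description of specific TriCore instructions; the function names are hypothetical.

```python
INT32_MIN = -(1 << 31)
INT32_MAX = (1 << 31) - 1

def add_wrap(a, b):
    """Ordinary 32-bit two's complement addition: wraps on overflow."""
    s = (a + b) & 0xFFFFFFFF
    return s - (1 << 32) if s & 0x80000000 else s

def add_sat(a, b):
    """Saturating 32-bit addition: clamps to the representable range."""
    return max(INT32_MIN, min(INT32_MAX, a + b))

print(add_wrap(INT32_MAX, 1), add_sat(INT32_MAX, 1))
```

Saturation is useful in signal processing, where a clipped sample is usually far less harmful than one that wraps around to the opposite sign.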

Currently, two versions of the TriCore architecture exist. We will deal with version 1.3, which is also used in current microcontrollers like the TC1796.

All TriCore microcontrollers have large on-chip memory blocks of different types, such as RAM, ROM, DRAM, OTP and flash.

Details
The architecture is mainly a Harvard architecture, although the buses are not strictly separated and have bridges for flexible data exchange (with a performance penalty). As an example, we looked at the TC1796, the biggest and most powerful microcontroller of the TriCore family.

The CPU has a 64-bit wide bus to the program memory interface (PMI), which has 48 KB of scratchpad RAM and 16 KB of instruction cache, both running at full CPU speed (up to 150 MHz). Both memories are optionally parity protected. Over the program local memory bus, the PMI is connected to the program memory unit as well as to the data memory unit and the external bus unit. The program local memory bus also runs at full CPU speed and is 64 bits wide.

The program memory unit has 64-bit wide access to the 2 MB program flash as well as to the 128 KB data flash, 8 KB boot ROM and 8 KB test ROM. All flash memories are ECC protected using 8 ECC bits for each block of 64 bits of data, which enables correction of single-bit errors and detection of double-bit errors per block.
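
The figure of 8 ECC bits per 64 data bits matches what a standard single-error-correcting, double-error-detecting (SEC-DED) code needs. The source does not say which code the TC1796 actually uses, so the following calculation assumes an extended Hamming code: the Hamming bound gives the check bits for single-error correction, and one additional overall parity bit adds double-error detection. The function names are hypothetical.

```python
def hamming_check_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r

def secded_check_bits(data_bits):
    """One extra overall parity bit adds double-error detection."""
    return hamming_check_bits(data_bits) + 1

print(hamming_check_bits(64), secded_check_bits(64))
```

Under this assumption, each 64-bit block is stored as a 72-bit code word, i.e. the protection costs 12.5 % additional flash cells.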

Two 64-bit wide buses connect the CPU with the data memory interface (DMI), which has 56 KB of local data RAM and 8 KB of dual-ported RAM that is connected to the second master channel of the on-board DMA controller. Both memories are optionally parity protected. Via the data local memory bus, the DMI can access the data memory unit, which has 64 KB of SRAM and 16 KB of stand-by RAM. Both of these are also optionally parity protected.

Internal peripherals are connected to the system peripheral bus and can be accessed either through the CPU slave interface or from the data local memory bus via the LFI bridge. Both the interface and the bridge can act as bus masters, as can the first master channel of the on-board DMA controller. The second master channel of the DMA controller is connected to the remote peripheral bus, which connects additional on-board peripherals and the dual-ported RAM of the DMI. The system peripheral bus and the remote peripheral bus run with the system clock, which runs either at CPU speed or at a fraction of it, up to a maximum of 75 MHz.

For details, see http://www.infineon.com/cgi-bin/ifx/portal/ep/programView.do?channelId=-64423&programId=33110&programPage=%2Fep%2Fprogram%2Fdocument.jsp&pageTypeId=17099

Instruction set
The instruction set splits up into eight categories with a total of more than 150 instructions.
 * Branch
 * Arithmetic (Integer, DSP and SIMD Packed Arithmetic)
 * Load/Store
 * Comparison
 * System
 * Bit Manipulation
 * 16-Bit Subset
 * Address Arithmetic and Address Comparison

Most instructions are 32 bits wide, some are 16 bits wide. Most instructions have two or three operands.

CPU internals
The CPU has sixteen 32-bit data registers and sixteen 32-bit address registers as well as three status and program counter registers. Together, these registers are referred to as the context of the running task; they can be saved to and loaded from the local data RAM for context switching. Shadow registers enable fast context switching.

Most instructions are executed within one CPU cycle, some within 2 or 3 cycles. Branches are executed within 1, 2 or 3 cycles, depending on the branch prediction.

Pipelines
Instructions are fetched by the instruction fetch unit, which directs each instruction to the appropriate pipeline. The three pipelines are:
 * Integer pipeline
 * Load/store pipeline
 * Loop pipeline

The integer and the load/store pipeline have four stages (fetch, decode, execute and write-back), the loop pipeline has two stages (decode and write-back). All pipelines work in parallel, enabling up to three instructions to be executed within one clock cycle.
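
The effect of three parallel pipelines on throughput can be sketched with a very simple issue model. It assumes in-order issue and at most one instruction per pipeline per cycle; the real TriCore issue rules are more restrictive and are not described in this text, so the model is only an illustration. The names `PIPELINES` and `issue_cycles` are hypothetical.

```python
PIPELINES = ("integer", "load_store", "loop")

def issue_cycles(stream):
    """Count the cycles needed to issue a stream of instructions in order.

    Each element of 'stream' names the pipeline the instruction needs.
    Per cycle, consecutive instructions are issued until one of them
    would need a pipeline that is already occupied in that cycle.
    """
    for kind in stream:
        if kind not in PIPELINES:
            raise ValueError("unknown pipeline: " + repr(kind))
    cycles = 0
    i = 0
    while i < len(stream):
        used = set()
        while i < len(stream) and stream[i] not in used:
            used.add(stream[i])
            i += 1
        cycles += 1
    return cycles

print(issue_cycles(["integer", "load_store", "loop"]))
print(issue_cycles(["integer", "integer", "load_store"]))
```

In this model, the best case of three instructions per cycle is reached only when the instruction stream mixes the three instruction kinds; a run of instructions of the same kind is issued one per cycle.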