Max Jayapaul's Cyberport Electronics Page - Article 1

Current Microprocessors Used In Computing Devices (Jan 2001)

Introduction

(I) Alpha Processors

21164 21264 21364

(II) ARM Processors (inc. StrongARM)

ARM 6 ARM 7 ARM 8 ARM 9 ARM 10

(III) MIPS Processors

R3000 R4000 R5000 R8000 R10000 R12000
TinyRISC MIPS V

(IV) PowerPC

601 603 604 620 750 POWER2
POWER3

(V) SPARC

UltraSPARC SPARC64

(VI) Intel

80286 80386 80486 Pentium P55C Pentium Pro
Pentium II Pentium III/4 Itanium

(VII) Intel clones

AMD K5 AMD Nx586 AMD K6 AMD K7 Cyrix 6x86 Cyrix MediaGX
Cyrix M3 IDT-C6

(VIII) Motorola 680xx (Dragonball, ColdFire)

68000 68008 68010 68020 68030 68040
68060 ColdFire

(IX) Hitachi SuperH

SH3 SH4

(X) HP PA-RISC

8500

(XI) Transmeta Crusoe

TM3200 TM5400/5600

Conclusion

     Microprocessors form the heart of the computer.  It was the advent of these silicon-based devices that brought about the personal computing revolution.  Early computers were large, consumed a lot of electricity and were built from discrete components such as vacuum tubes and, later, transistors.  A microprocessor places the circuitry that controls the various operations of the computer on a single silicon chip; current designs contain millions of transistors.  Intel Corp. of the USA introduced the world's first microprocessor, the 4004, in November 1971; it was used in calculators.  Since then microprocessors have grown enormously in performance and speed.  Today there are many microprocessors designed for different computers, ranging from Personal Digital Assistants to supercomputers.  Given below are some of the common ones in use.

     Alpha processors are used in engineering workstations, large servers and supercomputers.  Alpha processors hold the highest benchmark values in the microprocessor industry.  The architecture was designed in 1992 and is expected to have a design life of 25 years.  Designed by Digital Equipment, now owned by Compaq, the Alpha is a 64 bit RISC architecture (with 32 bit instructions) that has versions that run Tru64 UNIX, OpenVMS, Linux and Windows NT.  Compaq is building the world's fastest military computer for the US Dept. of Energy for use in nuclear weapons testing.  Code named "Q", it is a 30+ TeraFLOPS machine that contains 12,000 Alpha 21264 chips.  Compaq is also building the world's fastest non-military computer for the Pittsburgh Supercomputing Center.  This is a 6 TeraFLOPS, 2,728 Alpha 21264 machine running Tru64 UNIX and will be mainly used in scientific research.  Compaq's range of servers, NonStop Himalaya, AlphaServers and AlphaStation workstations all run on the Alpha 21264.  The microprocessors are manufactured by Intel, IBM and Samsung.
The Alpha architecture doesn't support 8- or 16-bit operations, but allows conversions, so no functionality is lost.  The Alpha processor can execute up to 4 instructions in one clock cycle.  Alpha 32-bit operations differ from 64 bit only in overflow detection.  The chip provides both IEEE and VAX 32 and 64 bit floating point operations, and features Privileged Architecture Library (PAL) calls, a set of programmable (non-interruptible) macros written in the Alpha instruction set to simplify conversion from other instruction sets using a binary translator, as well as providing flexible support for a variety of operating systems.  The Alpha concentrates on the original RISC idea of simplicity and a higher clock rate - though that also has its drawback, in the form of very high power consumption.  The first model, the 21064, was introduced with one integer, one floating point, and one load/store unit.  Current versions are as follows:
21164:  This microprocessor was released in early 1995.  The 21164 began expanding instruction parallelism by adding one combined integer/load/store unit with byte vector (multimedia-type) instructions (replacing the load/store unit) and one floating point unit.  It increased clock speed from 200 MHz to 300 MHz and introduced the idea of a level 2 cache on chip (8K each for instruction and data at level 1, 96K combined at level 2).  Various versions run Windows NT and the 64 bit Digital UNIX, with speeds ranging from 300 MHz to 600 MHz.  It has a SPECfp95 of 27.0 and executes up to 2.4 billion instructions per second.
21264:  The 21264, released in mid 1998, expanded to four integer units (two add/logic/shift/branch (one also with multiply, one with multimedia) and two add/logic/load/store) and two different floating point units (one for add/div/square root and one for multiply), with the ability to load four, dispatch six, and retire eight instructions per cycle (and for the first time including 40 integer and 40 floating point rename registers and out of order execution), at up to 660 MHz.  Multimedia extensions introduced with the 21264 are simple, but include VIS-type motion estimation (MPEG).  This version also includes support for the Linux operating system.  It has a SPECfp95 benchmark of 50.
21364 (EV7):  The 21364, expected in 2000 or 2001, adds five high speed interconnects (four CPU (10 GB/s) and one I/O (3 GB/s)) to an enhanced 21264 core.
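The 2.4 billion instructions per second quoted for the 21164 follows from the figures given above: four instructions per clock at the top 600 MHz clock rate.  A minimal sketch of that arithmetic (the function name is illustrative only, and the result is a peak figure that assumes every issue slot is filled on every cycle):

```python
def peak_instructions_per_second(issue_width: int, clock_hz: int) -> int:
    """Upper bound on throughput: every issue slot filled on every clock cycle."""
    return issue_width * clock_hz


# 4-way issue at the fastest 21164 clock quoted above (600 MHz)
print(peak_instructions_per_second(4, 600_000_000))  # 2400000000
```

Real programs rarely sustain the peak rate, since dependencies and cache misses leave issue slots empty.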

 

ARM processors are mainly used in handheld computers and Personal Digital Assistants.  ARM (originally Acorn RISC Machine) was designed by Acorn Computers, UK, and first manufactured by VLSI Technology; the design is now owned by ARM Ltd.  Originally designed for the Archimedes home computer in 1986, the original ARM (ARM1, 2 and 3) was a 32 bit CPU, but used 26 bit addressing.  The newer ARM6 is completely 32 bits.  ARM Ltd licenses the core to various manufacturers like DEC (now Intel), Compaq, HP etc.  DEC and ARM Ltd collaborated to make the StrongARM series, which was bought by Intel.  ARM processors run the Apple Newton (ARM 610) running NewtOS, the Compaq iPAQ Pocket PC (StrongARM), the HP Jornada 840 handheld PC (StrongARM) and Psion handhelds (ARM 710T).
The ARM6 has user, supervisor, and various interrupt modes (including 26 bit modes for ARM2 compatibility).  The ARM architecture has sixteen registers with multiple load/store instructions, though many registers are shadowed in interrupt modes (2 in supervisor and IRQ, 7 in FIQ) so they need not be saved, for fast response.  A feature introduced by the ARM is that every instruction is predicated, using a 4 bit condition code.  Another bit indicates whether the instruction should set condition codes, so intervening instructions don't change them.  This eliminates many branches and can speed execution.  Another unique and useful feature is a barrel shifter which operates on the second operand of most ALU operations, allowing shifts to be combined with most operations (and index registers for addressing), effectively combining two or more instructions into one.  These features make ARM code both dense (unlike most RISC processors) and efficient, despite the relatively low clock rate and short pipeline.  ARM has developed a low cost 16-bit version called Thumb, which recodes a subset of ARM CPU instructions into 16 bits (decoded to native 32-bit ARM instructions without penalty).  Thumb programs can be 30-40% smaller than already dense ARM programs.  Native ARM code can be mixed with Thumb code when the full instruction set is needed.
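To illustrate how predication and the barrel shifter remove instructions, here is a small Python model of the well-known ARM greatest-common-divisor loop.  It is a sketch under stated assumptions, not a full simulator: only a few of the sixteen condition codes are modelled, the function names are invented for this example, and both inputs are assumed to be positive 32 bit values.

```python
MASK = 0xFFFFFFFF  # registers are 32 bits wide


def cmp_flags(a, b):
    """Flags produced by CMP a, b (the 32 bit subtraction a - b)."""
    result = (a - b) & MASK
    n = result >> 31                      # negative
    z = int(result == 0)                  # zero
    c = int(a >= b)                       # carry set means "no borrow"
    v = int((a >> 31) != (b >> 31) and n != (a >> 31))  # signed overflow
    return {"N": n, "Z": z, "C": c, "V": v}


# A subset of the condition codes, expressed as tests on the flags.
CONDITIONS = {
    "NE": lambda f: f["Z"] == 0,
    "GT": lambda f: f["Z"] == 0 and f["N"] == f["V"],
    "LT": lambda f: f["N"] != f["V"],
}


def gcd_predicated(r0, r1):
    """Model of: loop: CMP r0,r1 / SUBGT r0,r0,r1 / SUBLT r1,r1,r0 / BNE loop."""
    while True:
        flags = cmp_flags(r0, r1)        # CMP sets the flags
        if CONDITIONS["GT"](flags):      # SUBGT executes only if r0 > r1
            r0 = (r0 - r1) & MASK
        if CONDITIONS["LT"](flags):      # SUBLT leaves the flags alone (no S bit)
            r1 = (r1 - r0) & MASK
        if not CONDITIONS["NE"](flags):  # BNE is the only branch in the loop
            return r0


def add_shifted(rn, rm, shift):
    """Model of ADD rd, rn, rm, LSL #shift: the shift costs no extra instruction."""
    return (rn + (rm << shift)) & MASK


print(gcd_predicated(48, 18))  # 6
print(add_shifted(7, 7, 2))    # 35, i.e. 7 * 5 in a single instruction
```

Because neither subtract sets the flags, both are judged against the same comparison, which is what the "set condition codes" bit described above makes possible.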
ARM6 (StrongARM SA-110, 1100, 1110): The ARM6 series consists of the ARM6 CPU core (35,000 transistors, which can be used as the basis for a custom CPU), the ARM60 base CPU, and the ARM600, which also includes a 4K 64-way set-associative cache, MMU, write buffer, and coprocessor interface (for an FPU).  The ARM CPU was chosen for the Apple Newton handheld system because of its speed, combined with low power consumption, low cost and customizable design (the ARM610 version used by Apple includes a custom MMU supporting object oriented protection and access to memory for the Newton's NewtOS).  DEC introduced the StrongARM SA-110 in February 1996 (the line has since passed to Intel), running a 5-stage pipeline at 100 to 233 MHz (using only 1 watt of power), with a 5-port register file, faster multiplier, single cycle shift-add, and Harvard architecture (16K each 32-way I/D caches).  Later versions include the SA-1100, which runs at 133/190 MHz and provides system support logic, multiple serial communication channels, LCD controllers and I/O ports, and the SA-1110, which runs at 206 MHz.
ARM7: The ARM7 series, designed in December 1994, increases performance by optimising the multiplier, and adding DSP-like extensions including 32 bit and 64 bit multiply and multiply/accumulate instructions (operand data paths lead from registers through the multiplier, then the shifter (one operand), and then to the integer ALU for up to three independent operations). It also doubles cache size to 8K, includes embedded In Circuit Emulator (ICE) support, and raises the clock rate significantly.
ARM8: To fill the gap between ARM7 and DEC StrongARM, ARM also developed the ARM8/800 which includes many StrongARM features.
ARM9: ARM9 is an improvement over the previous design with Harvard busses, write buffers, and flexible memory protection mapping.
ARM10: A vector floating-point unit is being added to the ARM10.

 

MIPS produced the first commercial RISC processor, the R2000, in 1986.  Currently microprocessors from this stable power devices ranging from supercomputers to hand held devices (Cassiopeia Windows CE devices) as well as gaming consoles (PlayStation) and TV set top boxes.  MIPS is owned by SGI.  MIPS stands for Microprocessor without Interlocked Pipeline Stages and came out of a Stanford Univ. project.  MIPS processors power handhelds like the Casio Cassiopeia (VR 4122) running Windows CE, Symbol (VR 4181) and the Agenda portable PC (VR 4181) running Linux; they also power workstations like the SGI O2 (QED RM5200) and SGI Octane (MIPS R12000), and servers like the SGI Origin 2000 (MIPS R12000) and Fujitsu-Siemens RM300 (MIPS R12000).
MIPS was intended to simplify processor design by eliminating hardware interlocks between the five pipeline stages. This means that only single execution cycle instructions can access the thirty two 32 bit general registers, so that the compiler can schedule them to avoid conflicts. This also means that LOAD/STORE and branch instructions have a 1 cycle delay to account for. However, because of the importance of multiply and divide instructions, a special HI/LO pair of multiply/divide registers exist which do have hardware interlocks, since these take several cycles to execute and produce scheduling difficulties.  The R2000 has no condition code register considering it a potential bottleneck. The PC is user readable. The CPU includes an MMU unit that can also control a cache, and the CPU was one of the first which could operate as a big or little endian processor. An FPU, the R2010, is also specified for the processor.
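As a rough illustration of what the HI/LO register pair holds, the sketch below models the results of the unsigned multiply and divide instructions (MULTU and DIVU) in Python.  The function names simply mirror the mnemonics; timing and the hardware interlocks themselves are not modelled.

```python
MASK32 = 0xFFFFFFFF


def multu(rs, rt):
    """Unsigned multiply: the 64 bit product is split across (HI, LO)."""
    product = (rs & MASK32) * (rt & MASK32)
    return product >> 32, product & MASK32


def divu(rs, rt):
    """Unsigned divide: HI receives the remainder, LO the quotient."""
    return (rs & MASK32) % (rt & MASK32), (rs & MASK32) // (rt & MASK32)


hi, lo = multu(0xFFFFFFFF, 2)
print(hex(hi), hex(lo))  # 0x1 0xfffffffe
print(divu(17, 5))       # (2, 3)
```

A program reads the results back into general registers with MFHI and MFLO, which is the point where the interlock makes it wait if the multi-cycle operation has not finished.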
R3000: released in 1988 has improved cache control.
R4000: released in 1991, expanded to 64 bits and is superpipelined: twice as many pipeline stages do less work at each stage, allowing a higher clock rate and twice as many instructions in the pipeline at once, at the expense of increased latency when the pipeline can't be filled, such as during a branch.  This required interlocks to be added between stages for compatibility, making the original "I" in the "MIPS" acronym meaningless.  The R4400 and above integrated the FPU with on-chip caches.  The R4600 and later versions abandoned superpipelines.
R8000: released in 1994, was superscalar and was optimised for floating point operation, issuing two integer or load/store operations (from four integer and two load/store units) and two floating point operations simultaneously.  FP instructions are sent to the independent R8010 floating point coprocessor, which has its own set of thirty-two 64-bit registers and load/store queues.
R10000: released early 1996, added multiple FPU units, as well as almost every advanced modern CPU feature, including separate 2-way I/D caches (32K each) plus an on-chip secondary cache controller and a high speed 8-way split transaction bus (up to 8 transactions can be issued before the first completes), superscalar execution (load four, dispatch five instructions to any of two integer, two floating point, and one load/store units), dynamic register renaming (thirty two integer and floating point rename registers in the R10K), and an instruction cache where instructions are partially decoded when loaded into the cache, simplifying the processor decode stage.  Branch prediction and target caches are also included.
R12000/R14000: released in May 1997, was similar to the R10000 but had forty eight rename registers.  The R10000, R12000 and the R14000 are the last of the high performance microprocessors from the MIPS stable.
R5000: released in January, 1996 is a 2-way (int/float) superscalar architecture and was added to fill the gap between R4600 and R10000, without the out of order or branch prediction buffers.
TinyRISC: released in October 1996 is used in embedded applications. MIPS and LSI Logic added a compact 16 bit instruction set which can be mixed with the 32 bit set.
MIPS V/MDMX: released in October 1996, is similar to the TinyRISC, but MIPS V adds parallel floating point operations (two 32 bit fields in 64 bit registers), while MDMX adds integer 8 or 16 bit subwords in 64 bit FPU registers and 24 and 48 bit subwords in a 192 bit accumulator for multimedia instructions.  Vector-scalar operations (e.g. multiply all subwords in a register by subword 3 from another register) are also supported.  These instructions are partly derived from Cray vector instructions and are much more extensive than the earlier multimedia extensions of other CPUs.  Future versions are expected to add Java Virtual Machine support.

 

PowerPC microprocessors: IBM, Motorola, and Apple formed a coalition (around 1992) to produce a microprocessor version of the POWER design as a successor to both the Motorola 68000 and Intel 80x86, resulting in the PowerPC.  PowerPC derivatives power IBM RS/6000 (Power3 - II) workstations and servers and Apple's iMac (PowerPC G3), PowerMac (PowerPC G4) desktops and Macintosh Server G4 (PowerPC G4) running OS X.
PowerPC 601: released in 1993, was the first of the series (considered first generation or G1), and included both POWER and PowerPC features, based strongly on the POWER1, except it had a single 32K cache rather than separate I/D caches.
PowerPC 603: released late 1993 (the first second generation, or G2, design), separated the main functional units further, removing load/store operations from the integer unit (four functional units total - integer, floating point, load/store (using integer registers), branch), and splitting the branch unit into a fetch/branch unit, a dispatch unit, and a completion/exception unit.  The 603 also added a rename buffer in the dispatch unit for speculative execution using renamed integer and floating point registers, which are ordered properly by the completion/exception unit, or discarded for mispredicted branches and exceptions.  Separate 8K and 16K I/D cache versions were available.
PowerPC 604: released in mid 1995, added dynamic branch prediction using a branch history table, and added two simplified integer units, giving three integer units (two for single-cycle operations, one for multicycle operations such as multiply/divide) plus floating point, load/store and branch units, a total of six.  Four instructions could be dispatched at once.  The CC register could also be renamed.
PowerPC 620: expanded the 604 design to 64 bits (with a 'backside' L2 cache bus), and added new 64 bit instructions, but was delivered much later and slower than promised, and was further delayed when it was withdrawn for a redesign.
PowerPC 750: released in early 1998 (G3), is 32 bit, refined in design and performance, adding a P620-style backside cache bus, but made no other significant changes (notably though, they used a 603-based 32-bit FPU, rather than the 64-bit 604 FPU).
POWER2: released in 1993,  is a workstation version with a high bandwidth design with two floating point load/store units, 256K of data cache, and added 128-bit floating point support and a square root instruction. Initially a multichip design, it was later combined into one chip (P2SC), and then into an eight CPU "SuperChip". It could issue up to six instructions and four simultaneous loads or stores.
POWER3 (PowerPC A35, PowerPC RS64): released in early 1998, had eight functional units (two FPU, three integer (two single cycle, one multicycle), two load/store, and a branch unit), and was capable of operating at much higher clock speeds.  In addition, a 64 bit design intended as the CPU for the AS/400 E series, the PowerPC A35 (Apache), with added decimal arithmetic and string instructions, was also used in the RS/6000 S70 workstation (where it was called the PowerPC RS64).
PowerPC G4: adds the AltiVec extensions, a direct response to Intel's MMX instructions, in CPUs from Motorola (but not IBM).

 

SPARC or the Scalable (originally Sun) Processor ARChitecture was designed by Sun Microsystems for their own use. Sun was a maker of workstations, and used standard 68000-based CPUs and a standard operating system, Unix. Research versions of load-store processors had promised a major step forward in speed, but existing manufacturers were slow to introduce a RISC processor, so Sun went ahead and developed its own. In keeping with their open philosophy, they licensed it to other companies, rather than manufacture it themselves.  Sun uses the UltraSPARC III in its Sun Blade and Sun Enterprise Server range and UltraSPARC II in its Ultra 5 workstation.  Fujitsu-Siemens run their PrimePower line of servers on HAL/SPARC64 processors running Solaris.
The SPARC design was radical at the time, even omitting multiple cycle multiply and divide instructions (added in later versions), using single-cycle "step" instructions instead, while most RISC CPUs were more conventional.  SPARC usually contains about 128 or 144 registers (memory-data designs typically had 16 or fewer).  At any time 32 registers are available - 8 are global, the rest are allocated in a 'window' from a stack of registers.  The window is moved 16 registers down the stack during a function call, so that the upper and lower 8 registers are shared between functions, to pass and return values, and 8 are local.  The window is moved up on return, so registers are loaded or saved only at the top or bottom of the register stack.  This allows functions to be called in as little as 1 cycle.  Like most RISC processors, global register zero is wired to zero to simplify instructions, and SPARC is pipelined for performance (a new instruction can start execution before a previous one has finished), but not as deeply as others - it has branch delay slots.  Also like previous processors, a dedicated CCR holds comparison results.  SPARC is 'scalable' mainly because the register stack can be expanded (up to 512 registers, or 32 windows), to reduce loads and saves between functions, or scaled down to reduce interrupt or context switch time, when the entire register set has to be saved.  Function calls are usually much more frequent than interrupts, so the large register set is usually a plus, but compilers now can usually produce code which uses a fixed register set as efficiently as a windowed register set across function calls.  SPARC is not a chip, but a specification, and so there are various designs of it.  It has undergone revisions, and now has multiply and divide instructions.
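The window mechanism can be sketched in a few lines of Python.  This is a simplified model under stated assumptions: the class and method names are invented for illustration, the register file is treated as circular, and the overflow/underflow traps that spill windows to memory are left out.  Registers 0-7 are the globals, 8-15 the "out" registers, 16-23 the locals and 24-31 the "in" registers.

```python
class RegisterWindows:
    """Toy model of a SPARC-style windowed register file (no overflow traps)."""

    def __init__(self, nwindows=8):
        self.nwindows = nwindows
        self.globals = [0] * 8
        self.windowed = [0] * (16 * nwindows)  # 16 new registers per window
        self.cwp = 0                           # current window pointer

    def _slot(self, r):
        """Map visible register r (8..31) to a physical slot."""
        if r < 24:                             # outs (8-15) and locals (16-23)
            index = self.cwp * 16 + (r - 8)
        else:                                  # ins (24-31) overlap the caller's outs
            index = (self.cwp + 1) * 16 + (r - 24)
        return index % len(self.windowed)

    def read(self, r):
        if r == 0:
            return 0                           # global register zero is wired to zero
        return self.globals[r] if r < 8 else self.windowed[self._slot(r)]

    def write(self, r, value):
        if r == 0:
            return                             # writes to register zero are discarded
        if r < 8:
            self.globals[r] = value
        else:
            self.windowed[self._slot(r)] = value

    def save(self):                            # function call: move the window down
        self.cwp = (self.cwp - 1) % self.nwindows

    def restore(self):                         # return: move the window back up
        self.cwp = (self.cwp + 1) % self.nwindows


regs = RegisterWindows()
regs.write(8, 42)     # caller places an argument in its first "out" register
regs.write(16, 7)     # caller's local value
regs.save()           # call
print(regs.read(24))  # 42: the callee sees the argument in its first "in" register
print(regs.read(16))  # 0: the callee gets fresh locals
regs.write(24, 99)    # callee leaves a return value in the shared register
regs.restore()        # return
print(regs.read(8))   # 99
print(regs.read(16))  # 7: the caller's local survived without a load or store
```

No memory traffic occurs on the call or the return, which is why a call can take as little as one cycle as long as free windows remain.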
Original versions were 32 bits, but 64 bit and superscalar versions were designed and implemented (beginning with the Texas Instruments SuperSparc in late 1992), but performance lagged behind other load-store and even Intel 80x86 processors until the UltraSPARC (late 1995) from Texas Instruments and Sun, and superscalar HAL/Fuji SPARC64 multichip CPU. 
UltraSPARC: is a 64-bit superscalar processor series which can issue up to four instructions at once to any of nine units: two integer units, two of the five floating point/graphics units, the branch unit and the load/store unit.  The UltraSPARC also added a block move instruction which bypasses the caches (2-way 16K instruction, 16K direct mapped data) to avoid disrupting them, and specialized pixel operations (VIS - the Visual Instruction Set) which can operate in parallel on 8, 16, or 32-bit integer values packed in a 64-bit floating point register (for example, four 8 x 16 -> 16 bit multiplications in a 64 bit word), a sort of simple SIMD/vector operation.  More extensive than the Intel MMX instructions, VIS also includes some 3D to 2D conversion, edge processing and pixel distance operations (for MPEG and pattern-matching support).
HAL/Fuji SPARC64: can issue up to four in order instructions simultaneously to four buffers, then to four integer, two floating point, two load/store, and the branch unit, and may complete out of order (an instruction completes when it finishes without error, is committed when all instructions ahead of it have completed, and is retired when its resources are freed - these are 'invisible' stages in the SPARC64 pipeline). A combination of register renaming, a branch history table, and processor state storage allow for speculative execution while maintaining precise exceptions/interrupts (renamed integer, floating, and CC registers - trap levels are also renamed and can be entered speculatively).
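The packed ("partitioned") arithmetic mentioned in the UltraSPARC entry above can be pictured with a short Python sketch.  It models one simple case, adding four 16-bit lanes held in a single 64-bit word; the helper names are invented for illustration and do not correspond to actual VIS mnemonics.

```python
LANE_BITS = 16
LANE_MASK = 0xFFFF
LANES = 4


def pack16(values):
    """Pack four 16-bit values into one 64-bit word, lane 0 in the low bits."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & LANE_MASK) << (LANE_BITS * i)
    return word


def unpack16(word):
    """Split a 64-bit word back into its four 16-bit lanes."""
    return [(word >> (LANE_BITS * i)) & LANE_MASK for i in range(LANES)]


def partitioned_add16(a, b):
    """Add lane by lane; a carry out of one lane never reaches the next."""
    return pack16([x + y for x, y in zip(unpack16(a), unpack16(b))])


a = pack16([1, 2, 0xFFFF, 4])
b = pack16([10, 20, 1, 40])
print(unpack16(partitioned_add16(a, b)))          # [11, 22, 0, 44]
print(unpack16((a + b) & 0xFFFFFFFFFFFFFFFF))     # [11, 22, 0, 45]
```

The second line shows why an ordinary 64-bit add is not enough: the overflow from the third lane spills into the fourth, which the partitioned form prevents.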

 

Intel produced the first commercial microprocessors used in personal computers.  The 8086 was based on the design of the 8080/8085 (it was source compatible with the 8080) with a similar register set, but was expanded to 16 bits.  Intel processors power most of the desktops in the world, including those of IBM (NetVista, IntelliStation), Compaq (Presario, iPAQ), Dell (Dimension, OptiPlex) and HP (Brio).  Intel processors also power servers that mainly run Windows NT, like the SGI 1000, Compaq's ProLiant and Dell's PowerEdge servers.  The computer running the International Space Station runs on a variation of the Intel 386SX processor!
The 8086's Bus Interface Unit fed the instruction stream to the Execution Unit through a 6 byte prefetch queue, so fetch and execution were concurrent - a primitive form of pipelining (8086 instructions varied from 1 to 6 bytes).  It featured four 16 bit general registers, which could also be accessed as eight 8 bit registers, and four 16 bit index registers (including the stack pointer).  The data registers were often used implicitly by instructions, complicating register allocation for temporary values.  It featured 64K 8-bit I/O (or 32K 16-bit) ports and fixed vectored interrupts.  There were also four segment registers that could be set from index registers.  Although segmentation was largely acceptable for assembly language, where control of the segments was complete (it could even be useful then), in higher level languages it caused constant confusion (e.g. near/far pointers).  Even worse, it made expanding the address space to more than 1 MB difficult.
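The paragraph above does not spell out how a segment register and a 16 bit offset combine, so the following sketch adds that detail: on the 8086 the segment value is shifted left by four bits and added to the offset, giving a 20 bit physical address.  The function name is illustrative only.

```python
def physical_address(segment, offset):
    """8086 real-mode address: segment * 16 + offset, wrapped to 20 bits."""
    return ((segment << 4) + offset) & 0xFFFFF


print(hex(physical_address(0x1234, 0x0010)))  # 0x12350
print(hex(physical_address(0x1235, 0x0000)))  # 0x12350, same byte, different pointer
print(hex(physical_address(0xFFFF, 0x0010)))  # 0x0, wraps past the 1 MB limit
```

Two different segment:offset pairs naming the same byte is one source of the near/far pointer confusion, and the 20 bit wrap shows where the 1 MB ceiling comes from.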
80286: released in 1982, remained a 16 bit design but expanded the address space by adding a new mode (switching from 'Real' to 'Protected' mode was supported, but switching back required using a bug in the original 80286, which then had to be preserved).  Protected mode greatly increased the number of segments by using a 16 bit selector for a 'segment descriptor', which contained the location within a 24 bit address space, size (still limited to 64K), and attributes (for Virtual Memory support) of a segment.
80386: released in 1985, expanded the design to 32 bits, broke the 64K segment size restriction and included much improved addressing: base reg + index reg * scale (1, 2, 4 or 8) + displacement (8 or 32 bit constant) = 32 bit address, in the form of paged segments (using six 16-bit segment registers).  It also had several processor modes (including separate paged and segmented modes) for compatibility with the previous awkward design.  In fact, with the right assembler, code written for the 8008 can still be run on the most recent Pentium Pro.  The 80386 also added an MMU, security modes (called "rings" of privilege - kernel, system services, application services, applications) and new op codes.
80486: released in 1989, added full pipelines, single on chip 8K cache, integrated FPU (based on the eight element 80-bit stack-oriented FPU in the 80387 FPU), and clock doubling versions.
Pentium: released in late 1993, was superscalar (up to two instructions at once in dual integer units and single FPU) with separate 8K I/D caches.  Intel gave the name Pentium to the 80586 version because it could not legally protect the name "586" to prevent other companies from using it.  MMX (initially reported as MultiMedia eXtension, but later said by Intel to mean Matrix Math eXtension) performs integer operations on vectors of 8, 16, or 32 bit words, using the 80 bit FPU stack elements as eight 64 bit registers.
P55C Pentium: released January 1997, is the first Intel CPU to include MMX instructions, followed by the AMD K6 and the Pentium II.  The P55C is a pure CMOS design.
Pentium Pro: (Pentium's successor, late 1995) does not clone the Pentium, but emulates it with specialized hardware decoders which convert Pentium instructions to RISC-like instructions, which are executed on specially designed superscalar RISC-style cores faster than on the Pentium itself.  Intel also used BiCMOS in the Pentium and Pentium Pro to achieve clock rates competitive with CMOS load-store processors.  The Pentium Pro (code named "P6") is a 1 or 2-chip (CPU plus 256K or 512K L2 cache - the I/D L1 caches (8K each) are on the CPU), 14-stage superpipelined processor.  It uses extensive multiple branch prediction and speculative execution via register renaming.  Three decoders (one for complex instructions of four or fewer micro-ops, two for simpler ones) each decode one 80x86 instruction into micro-ops (one per simple decoder + up to four from the complex decoder = three to six per cycle).  Up to five (usually three) micro-ops can be issued in parallel and out of order (six units - FPU, 2 integer, 2 address, 1 load/store), but are held and retired (results written to registers or memory) as a group to prevent an inconsistent state (equivalent to half an instruction being executed when an interrupt occurs, for example).  80x86 instructions may produce several micro-ops in CPUs like this, so the actual instruction rate is lower.  In fact, due to problems handling instruction alignment in the Pentium Pro, emulated 16-bit instructions execute slower than on a Pentium.
Pentium II: released April 1997, version of the Pentium Pro added MMX instructions, doubled cache to 32K, and was packaged in a processor card instead of an IC package.
Pentium III/Pentium 4: Faster versions of the Pentium line ranging from 600 MHz to 1 GHz.  Celeron versions are similar to the Pentium but have less L2 cache (about 128KB compared to 256KB on the normal Pentium) and Xeon versions have 512KB L2 cache.
Itanium: (code named Merced, or IA-64) Intel, with partner Hewlett-Packard, has begun development of a next generation 64-bit processor.  It is expected to be a variable length instruction group design (what HP/Intel call EPIC, Explicit Parallel Instruction Computing) with instruction dependencies grouped from 1 to 9+.  The processor is expected to read instructions in 128 bit bundles of three, plus three "template bits" which indicate dependencies.  Most instructions are predicated, with 128 general 64-bit and 128 floating point registers, and 64 predicate bits (a type of condition code).  To reduce page faults, a speculative load instruction executes a load, but does not trap if there is an exception until a second instruction completes it - if the second instruction is predicated and never executes, then a page fault is avoided, and loads can be rescheduled more flexibly.  It's expected to be compatible in some way with both the PA-RISC and 80x86 - it will include 80x86 data and segment registers, with additional instructions to switch between instruction/register sets and transfer data between 80x86 and IA-64 registers.  It is expected to translate 80x86 instructions into VLIW instructions (or directly to decoded instructions) the same way that Pentium Pro and AMD K5/K6 CPUs do, but with a larger number of instructions issued using the VLIW design, it should be faster.  However, native IA-64 code should be even faster, and this may finally produce the incentive to let the 80x86 architecture fade away.
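Looking back at the 80386 entry in this section, its addressing form (base register + index register * scale + displacement) can be written out as a short Python sketch.  The function name and the sample values are illustrative only; the result is wrapped to 32 bits as the hardware would do.

```python
def effective_address(base, index, scale, displacement):
    """80386-style address: base + index * scale + displacement, modulo 2**32."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4 or 8")
    return (base + index * scale + displacement) & 0xFFFFFFFF


# Field at byte offset 8 of element 3 in a table of 4-byte entries at 0x1000
print(hex(effective_address(0x1000, 3, 4, 8)))  # 0x1014
```

The scale factors match common element sizes, so indexing an array takes a single addressing mode rather than a separate shift and add.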

 

Intel clone processors: Due to the popularity of the Intel processors, other companies started to clone the design to make cheaper and faster versions. Earliest clones of the Intel processors were the NEC V20/V30 (slightly faster clones of the 8088/8086 (could also run 8085 code)).  Intel clones run desktops made by companies like HP, Compaq, IBM, Gateway and Everex.
AMD K5: translates 80x86 code to ROPs (RISC OPerations), which execute on a RISC-style core. Up to four ROPs can be dispatched to six units (two integer, one FPU, two load/store, one branch unit), and five can be retired at a time. The complexity led to low clock speeds for the K5, prompting AMD to buy NexGen and integrate its designs for the next generation K6.
NexGen/AMD Nx586: released early 1995, is unique by being able to execute its micro-ops (called RISC86 code) directly, allowing optimised RISC86 programs to be written which are faster than an equivalent x86 program would be, but this feature is seldom used. It also features two 16K I/D L1 caches, a dedicated L2 cache bus (like that in the Pentium Pro 2-chip module) and an off-chip FPU (either separate chip, or later as in 2-chip module).
AMD K6: released April 1997, actually has three caches - 32K each for data and instructions, and a half-size 16K cache containing instruction decode information. It also brings the FPU on-chip and eliminates the dedicated cache bus of the Nx586, allowing it to be pin-compatible with the P54C model Pentium. Another decoder is added (two complex decoders, compared to the Pentium Pro's one complex and two simple decoders) producing up to four micro-ops and issuing up to six (to seven units - load, store, complex/simple integer, FPU, branch, multimedia) and retiring four per cycle. It includes MMX instructions, licensed from Intel, and AMD has designed and added 3DNow! graphics extensions without waiting for Intel's 3D MMX extensions.
AMD K7: expected in 1999, is based on a superscalar design, decoding x86 instructions into 'MacroOps' (made up of one or two 'micro-ops') in two decoders (one for simple and one for complex instructions) producing up to three MacroOps per cycle.  Up to nine decoded operations per cycle can be issued in six MacroOps to six functional units (three integer, each able to execute one simple integer and one address op simultaneously, and three FPU/MMX/3DNow! units), with extensive stack and register renaming, and a separate integer multiply unit which follows integer ALU 0, and can forward results to either ALU 0 or 1.  The K7 replaces the Intel-compatible bus of the K6 with the high speed Alpha EV6 bus because Intel decided to prevent competitors from using its own higher speed bus designs.  This makes it easier to use either Alpha or AMD K7 processors in a single design.
Cyrix 6x86: released early 1996, initially manufactured by IBM before Cyrix merged with National Semiconductor, still directly executes 80x86 instructions (in two integer and one FPU pipeline), but partly out of order, making it faster than a Pentium at the same clock speed.
Cyrix MediaGX: is an integrated version with graphics and audio on-chip.  MMX instructions were added to the 6x86MX, and 3DNow! graphics instructions to the 6x86MXi.
Cyrix M3: released mid 1998, turned to superpipelining (eleven stages compared to six (seven?) for the M2) for a higher clock rate (partly for marketing purposes, as MHz is often preferred to performance in the PC market), and provides dual floating point/MMX/3DNow! units.
IDT-C6 WinChip: released May 1997, manufactured by Centaur, a subsidiary of Integrated Device Technology, uses a much simpler (6-stage, 2 way integer/simple-FPU execution) design than the Intel and AMD translation-based designs by using micro-ops more closely resembling 80x86 than RISC code, which allows for a higher clock rate and larger L1 (32K each I/D) and TLB caches in a lower cost, lower power consumption design.  Simplifications include replacing branch prediction (less important with a short pipeline) with an eight entry call/return stack, depending more on caches.  The FPU unit includes MMX support (the C6+ version added a second FPU/MMX unit and 3D graphics enhancements).  Like Cyrix, IDT opted for a superpipelined eleven-stage design for added performance, combined with sophisticated early branch prediction, in its WinChip 4.  The design also pays attention to supporting common code sequences - for example, loads occur earlier in the pipeline than stores, allowing load-alu-store sequences to be more efficient.

 

Motorola 680xx (DragonBall, ColdFire): was commonly used to power early home and personal computers like the Apple Macintosh, Amiga and Atari ST, which competed with Intel powered IBM PCs. Currently the Motorola 68040 and 68060 power the Amiga 4000T video workstation.  A DragonBall version, the 68328, powers the Palm Pilot computers running the Palm OS.
68000: was initially an 8 MHz, 32 bit architecture internally, but had only a 16 bit data bus and 24 bit address bus to fit in a 64 pin package. Lack of forced segments made programming the 68000 easier than some competing processors, without the 64K size limit on directly accessed arrays or data structures.  It was designed for expansion, including specifications for floating point and string operations. The 68000 could fetch the next instruction during execution (a 2 stage pipeline).
68008: reduced the data bus to 8 bits and address to 20 bits.
68010: released in 1982, added virtual memory support (the 68000 couldn't restart interrupted instructions) and a special loop mode - small decrement-and-branch loops could be executed from the instruction fetch buffer.
68020: released in 1984, was fully 32 bit externally. Addresses were computed as 32 bits (without using segment registers) - the unused upper bits were ignored in the 68000 and 68008, but some programmers stored type tags in the upper 8 bits, causing compatibility problems with the 68020's 32 bit addresses. It expanded the external data and address buses to 32 bits, used a simple 3-stage pipeline, and added a 256 byte cache.
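The compatibility problem with type tags can be shown with a little bit arithmetic. The Python sketch below is only an illustration: the function names and the tag value are hypothetical, and the two functions simply model which pointer bits reach the address bus on each chip.

```python
ADDRESS_MASK_24 = 0x00FFFFFF  # the 68000 drives only 24 address lines


def bus_address_68000(pointer):
    # The upper 8 bits never reach the bus, so a tag stored there is harmless.
    return pointer & ADDRESS_MASK_24


def bus_address_68020(pointer):
    # All 32 bits are significant, so the tag becomes part of the address.
    return pointer & 0xFFFFFFFF


TYPE_TAG = 0x80  # hypothetical tag value chosen for this example
tagged = (TYPE_TAG << 24) | 0x00012345

print(hex(bus_address_68000(tagged)))  # 0x12345
print(hex(bus_address_68020(tagged)))  # 0x80012345
```

The same 32 bit value therefore points at the intended location on a 68000 but at a completely different location on a 68020, unless the software masks the tag off before using the pointer.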
68030: released in 1987, brought the MMU onto the chip.
68040: released in 1991, added an on-chip FPU with eight 80 bit floating point registers (compatible with the 68881/2 coprocessors). It also added fully cached Harvard busses (4K each for data and instructions) and a 6 stage pipeline.
68060: released in April 1994, expanded the design to a superscalar version. The third stage of the 10-stage 68060 pipeline translates the 680x0 instructions to a decoded RISC-like form (stored in a 16 entry buffer in stage four). There is also a branch cache, and branches are folded into the decoded instruction stream, which is then dispatched to two pipelines (three stages: decode, address generation, operand fetch) and finally to two of three execution units (2 integer, 1 floating point) before reaching two 'writeback' stages. Cache sizes are doubled over the 68040. The 68060 also includes many innovative power-saving features (3.3V operation, execution unit pipelines that can be shut down, reducing power consumption at the expense of slower execution, and a clock that can be reduced to zero), so power use is lower than the 68040's (3.9-4.9 watts vs. 4-6 watts for the 68040). Another innovation is that simple register-register instructions which don't generate addresses may use the address stage ALU to execute 2 cycles early.
ColdFire: released early 1995, in which complex instructions and addressing modes (added to the 68020) were removed and the instruction set was recoded, simplifying it at the expense of compatibility (source only, not binary) with the 680x0 line. The embedded market became the main market for the 680x0 series after workstation vendors (and the Apple Macintosh) turned to faster load-store processors, so a variety of embedded versions were introduced.

Hitachi SuperH:   The Hitachi SH series was meant to replace the 8-bit and 16-bit H8 microcontrollers, a series of memory-data CPUs with sixteen 16-bit registers. The SH is also designed for the embedded market, and is similar to the ARM architecture in many ways. The SH is used in many of Hitachi's own products, and is one of the first Japanese CPUs to become widely popular outside of Japan. It is most prominently featured in the Sega Saturn video game system (which uses two SH2 CPUs) and many Windows CE palmtop computers (SH3 chip set), including some HP Jornada versions.
SH is a 32 bit processor with a 16 bit instruction format, sixteen general purpose registers and a load/store architecture. The short instructions result in a very high code density.   Because of the small instruction size, there is no instruction to load a 32 bit immediate value, but a PC-relative addressing mode is supported to load 32 bit values. The SH also has a Multiply ACcumulate (MAC) instruction, and MACH/L (high/low word) result registers - 42 bit results (32 low, 10 high) in the SH1, 64 bit results (both 32 bit) in the SH2 and later.
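To make the MACH/MACL split concrete, the Python sketch below accumulates a few products and divides the result into a high and a low word for the two result widths given above. It is only a sketch of the bit widths: the function name and sample data are hypothetical, values are treated as unsigned bit patterns, and the real instruction's operand handling and overflow behaviour are not modelled.

```python
def split_mac_result(acc, high_bits):
    """Split an accumulator bit pattern into (MACH, MACL) words."""
    total_bits = 32 + high_bits
    acc &= (1 << total_bits) - 1   # keep only the bits the result can hold
    macl = acc & 0xFFFFFFFF        # low 32 bits
    mach = acc >> 32               # remaining high bits
    return mach, macl


# Multiply-accumulate over two short vectors (sample data, illustrative only)
a = [30000] * 5
b = [30000] * 5
acc = 0
for x, y in zip(a, b):
    acc += x * y

print(split_mac_result(acc, high_bits=10))  # 42 bit result, as in the SH1
print(split_mac_result(acc, high_bits=32))  # 64 bit result, as in the SH2
```

Here the sum no longer fits in 32 bits, so the overflow lands in the high word. The narrower high word of the 42 bit format simply gives less headroom before accumulated bits are lost.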
SH3: includes an MMU and 2K to 8K of unified cache.
SH4: released in mid-1998, is a superscalar version with extensions for 3-D graphics support. It can issue two instructions at a time to any of four units: integer, floating point, load/store, branch (except for certain non-superscalar instructions, such as modifying control registers). Certain instructions, such as register-register move, can be executed by either the integer or load/store unit, so two can be issued at the same time. Each unit has a separate pipeline, five stages for integer and load/store, five or six for floating point, and three for branch.
Other enhancements such as support for MPEG operations are planned for the SH5.

 

PA-RISC: The PA-RISC (Precision Architecture, originally code-named Spectrum) was designed to replace older processors in HP-3000 MPE minicomputers, and Motorola 680x0 processors in the HP-9000 HP-UX Unix minicomputers and workstations. It has an unusually large instruction set for a RISC processor (including a conditional skip instruction), partly because initial design took place before RISC philosophy was popular, and partly because careful analysis showed that performance benefited from the instructions chosen - in fact, version 1.1 added new multiple operation instructions combined from frequent instruction sequences, and HP was among the first to add multimedia instructions (the MAX-1 and MAX-2 instructions). Much of the RISC philosophy was independently invented at HP from lessons learned from FOCUS (pre 1984), HP's (and the world's) first fully 32 bit microprocessor. It has a 5 stage pipeline, which had hardware interlocks from the beginning for instructions which take more than one cycle, as well as result forwarding (a result can be used by a subsequent instruction without waiting for it to be stored in a register first). It is a load/store architecture, originally with a single instruction/data bus, later expanded to a Harvard architecture (separate instruction and data buses). It has thirty-two 32-bit integer registers (GR0 wired to constant 0, GR31 used as a link register for procedure calls), with seven 'shadow registers' which preserve the contents of a subset of the GR set during fast interrupts, and thirty-two 64-bit floating point registers (also usable as sixty-four 32-bit or sixteen 128-bit registers), in an FPU (which could execute a floating point instruction simultaneously, derived from the Apollo-designed Prism architecture (1988?) after Hewlett-Packard acquired the company). Later versions (the PA-RISC 7200 in 1994) added a second integer unit (still dispatching only two instructions at a time to any of the three units).
Addressing originally was 48 bits, and expanded to 64 bits, using a segmented addressing scheme.  The PA-RISC 8000 (April 1996, intended to compete with the R10000, SPARC, and others) expands the registers and architecture to 64 bits (eliminating the need for segments), and adds an aggressive superscalar design - up to 5 instructions out of order, using fifty-six rename registers, to ten units (five pairs of: ALU, shift/merge, FPU mult/add, divide/sqrt, load/store). The CPU is split in two, with load/store (high latency) instructions dispatched from a separate queue from operations (except for branch or read/modify/write instructions, which are copied to both queues). It also has a deep pipeline and speculative execution of branches.
PA-RISC 8500: released mid 1998, breaks with HP tradition and adds on-chip cache - 1.5 MB of L1 cache.

 

Transmeta's premier product is the Crusoe processor, a revolutionary x86-compatible family of solutions specially designed for the new world of Mobile Internet Computing. Its remarkably low power consumption allows the processor to run cooler than conventional chips, and battery life is extended up to a whole day. Transmeta has pioneered a revolutionary new approach to microprocessor design. Rather than implementing the entire x86 processor in hardware, the Crusoe processor solution consists of a compact hardware engine surrounded by a software layer.
The hardware component is a very simple, high-performance, low-power VLIW (Very Long Instruction Word) engine with an instruction set that bears no resemblance to that of x86 processors. Instead, it is the surrounding software layer that gives programs the impression that they are running on x86 hardware. This innovative software layer is called the Code Morphing software because it dynamically "morphs" (that is, translates) x86 instructions into the hardware engine's native instruction set. Transmeta's software translates blocks of x86 instructions once, saving the resulting translation in a translation cache. The next time the (now translated) code is executed, the system skips the translation step and directly executes the existing optimized translation at full speed.  This unique approach to executing x86 code eliminates millions of transistors, replacing them with software. The current implementation of the Crusoe processor uses roughly one-quarter of the logic transistors required for an all-hardware design of similar performance. This offers the following benefits:
The hardware component is considerably smaller, faster, and more power efficient than conventional chips.
The hardware is fully decoupled from the x86 instruction set architecture, enabling Transmeta's engineers to take advantage of the latest and best in hardware design trends without affecting legacy software.
The Code Morphing software can evolve separately from hardware. This means that upgrades to the software portion of the microprocessor can be rolled out independently of hardware chip revisions.
Transmeta's Code Morphing technology is obviously not limited to x86 implementations. As such, it has the potential to revolutionize the way microprocessors are designed in the future.
Crusoe processors are versatile enough to power a broad spectrum of ultra-light mobile PCs and Internet devices. Currently, designers of Mobile Internet Computers can choose from three processor models: TM3200 (333-400MHz), TM5400 (500-700MHz) and TM5600 (500-700MHz).
TM3200: The TM3200 is the ideal engine for a new class of mobile Internet devices weighing just a pound or two. With up to 400 MHz in performance, the TM3200 is designed to allow a full day of web browsing on a single battery charge.  The TM3200 delivers the full performance needed to run a wide range of Internet applications - from web browsers and email applications to heavy-duty streaming video clips. The TM3200 with the new Mobile Linux operating system implements many of the same power management features found in today's laptop computers. These include a deep-sleep idle mode that operates at levels as low as 20 mW. The TM3200 is compatible with the complete range of x86-based operating systems, including those offered by Microsoft and Linux suppliers.
TM5400/TM5600: The TM5400/TM5600 is the first solution designed to solve the problems of poor battery life and sub-par performance in the emerging class of ultra-light (weighing less than four pounds) mobile PCs. It performs at speeds up to 700 MHz and provides deep-sleep power levels as low as 60 mW. TM5400/5600-based laptops can last up to eight hours on battery running everyday office applications and three to four hours running heavy-duty multimedia applications like DVD movies. The TM5400/5600 offers application performance comparable to many desktop processors but with significantly lower power consumption.  The model TM5400/5600 offers LongRun technology, a new feature that allows the processor to adjust both its frequency and voltage to exactly the levels required by an application. This approach achieves unprecedented power savings. The TM5400/5600 typically operates at less than 1 watt while running ordinary office applications and as little as 60 mW when idle between keystrokes. Heavy-duty applications such as DVD movies consume on average less than 2 watts. The TM5400/5600 is compatible with the complete range of x86-based operating systems. This includes all versions of Linux, as well as Microsoft's popular Windows 98, Windows NT, and Windows 2000 operating systems.
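Why adjusting both frequency and voltage saves so much power can be seen from the usual rule of thumb that the dynamic power of CMOS logic scales roughly with frequency times the square of the supply voltage. The Python sketch below applies that rule to two operating points. The function name and the operating points are hypothetical values picked for the arithmetic, not Transmeta specifications.

```python
def relative_dynamic_power(freq_mhz, volts, ref_freq_mhz, ref_volts):
    """Dynamic CMOS power scales roughly with frequency * voltage squared."""
    return (freq_mhz / ref_freq_mhz) * (volts / ref_volts) ** 2


# Hypothetical operating points (MHz, volts), not Transmeta specifications
full_speed = (700, 1.6)
half_speed = (350, 1.2)

ratio = relative_dynamic_power(*half_speed, *full_speed)
print(f"{ratio:.2f}")  # 0.28
```

Halving the frequency alone would halve the power. Lowering the voltage at the same time brings it down to roughly 28% in this example, which is the effect a scheme like LongRun exploits.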

Conclusion: There are many microprocessors that are used to power computing devices.  Intel's Pentium range is the most common in the 32-bit range.  64-bit chips like those from Sun and MIPS are mainly used to power high performance servers.  Low power chips from ARM, MIPS and Hitachi power the handheld PCs.  Intel is working on the new 64-bit Itanium (Merced or IA-64) to replace the existing 32-bit Pentium, along with HP, which wants it to replace its PA-RISC.  Even SGI has shelved further development in the MIPS range in favor of the IA-64 architecture.  Therefore IA-64 is the upcoming chip to watch.


Last Updated on 1 January 2001
