Introduction
(I) Alpha Processors: 21164, 21264, 21364
(II) ARM Processors (inc. StrongARM): ARM 6, ARM 7, ARM 8, ARM 9, ARM 10
(III) MIPS Processors: R3000, R4000, R5000, R8000, R10000, R12000, TinyRISC, MIPS V
(IV) PowerPC: 601, 603, 604, 620, 750, POWER2, POWER3
(V) SPARC: UltraSPARC, SPARC64
(VI) Intel: 80286, 80386, 80486, Pentium, P55C, Pentium Pro, Pentium II, Pentium III/4, Itanium
(VII) Intel clones: AMD K5, AMD Nx586, AMD K6, AMD K7, Cyrix 6x86, Cyrix MediaGX, Cyrix M3, IDT-C6
(VIII) Motorola 680xx (Dragonball, ColdFire): 68000, 68008, 68010, 68020, 68030, 68040, 68060, ColdFire
(IX) Hitachi SuperH: SH3, SH4
(X) HP PA-RISC: 8500
(XI) Transmeta Crusoe: TM3200, TM5400/5600
Conclusion
|
Microprocessors form the heart of the computer. It
was the advent of these silicon-based devices that brought about the personal computing
revolution. Early computers were large, consumed a lot of electricity and ran
on discrete components such as vacuum tubes and, later, transistors. A microprocessor
integrates thousands to millions of transistors on a single silicon chip that controls the
various operations of the computer. Intel Corp. of the USA introduced the world's first
microprocessor, the 4004, in November 1971; it was used in calculators.
Since then microprocessors have grown in performance and speed. There are now
many microprocessors designed for different computers, ranging from Personal
Digital Assistants to supercomputers. Given below are some of the common ones in
use.
|
Alpha processors are used in engineering workstations, large servers and
supercomputers. Alpha processors hold some of the highest benchmark values in the
microprocessor industry. The architecture was introduced in 1992 and is expected to have a
design life of 25 years. Designed by Digital Equipment, now owned by Compaq, the Alpha is a
64 bit RISC architecture (32 bit instructions) that has versions that run
Tru64 UNIX, OpenVMS, Linux and Windows NT. Compaq is building the world's fastest
military computer for the US Dept. of Energy for use in nuclear weapons testing. Code
named "Q", it is a 30+ TeraFLOPS machine that contains 12,000 Alpha 21264 chips.
Compaq is also building the world's fastest non-military computer for the
Pittsburgh Supercomputing Centre. This is a 6 TeraFLOPS, 2,728 Alpha 21264 machine
running Tru64 UNIX and will be mainly used in scientific research. Compaq's range of
servers, NonStop Himalaya, AlphaServers and AlphaStation workstations all run on the Alpha
21264. The microprocessors are manufactured by Intel, IBM and Samsung.
Alpha is a 64 bit architecture (32 bit instructions) that doesn't support 8- or 16-bit
operations, but allows conversions, so no functionality is lost. The Alpha processor
can issue up to 4 instructions per clock cycle. Alpha 32-bit operations differ from 64
bit only in overflow detection. The chip provides both IEEE and VAX 32 and 64 bit floating
point operations, and features Privileged Architecture Library (PAL) calls, a set of
programmable (non-interruptible) macros written in the Alpha instruction set to simplify
conversion from other instruction sets using a binary translator, as well as providing
flexible support for a variety of operating systems. The Alpha concentrates on the
original RISC idea of simplicity and a higher clock rate - though that also has its
drawback, in terms of very high power consumption. The first model, the 21064, was
introduced with one integer, one floating point, and one load/store unit. Current
versions are as follows:
21164: This microprocessor was released in early 1995.
The 21164 began expanding instruction parallelism by adding one integer/load/store unit
with byte vector (multimedia-type) instructions (replacing the load/store unit) and one
floating point unit, increased clock speed from 200 MHz to 300 MHz, and introduced the
idea of a level 2 cache on chip (8K each inst/data level 1, 96K combined level 2).
Various versions run Windows NT and the 64 bit Digital UNIX at speeds ranging from
300 MHz to 600 MHz. It has a SPECfp95 of 27.0 and a peak rate of 2.4 billion instructions
per second.
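The 2.4 billion figure is consistent with four instructions per clock at the top 600 MHz speed. A minimal sketch of that arithmetic follows; the function name is made up for illustration, and the result is an upper bound that assumes every issue slot is filled on every cycle, which real code rarely achieves.

```python
def peak_instructions_per_second(issue_width: int, clock_hz: float) -> float:
    """Upper bound on throughput: every issue slot filled on every cycle."""
    return issue_width * clock_hz


# Figures taken from the text: four instructions per cycle at 600 MHz.
peak = peak_instructions_per_second(4, 600e6)
print(f"{peak / 1e9:.1f} billion instructions per second")  # 2.4 billion instructions per second
```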
21264: The 21264, released in mid 1998, expanded to four
integer units (two add/logic/shift/branch (one also with multiply, one with multimedia)
and two add/logic/load/store), two different floating point units (one for add/div/square
root and one for multiply), with the ability to load four, dispatch six, and retire eight
instructions per cycle (and for the first time including 40 integer and 40 floating point
rename registers and out of order execution), at up to 660 MHz. Multimedia
extensions introduced with the 21264 are simple, but include VIS-type motion estimation
(MPEG). This version also includes support for the Linux operating system. It has
a SPECfp95 benchmark of 50.
21364 (EV 7): The 21364, expected in 2000 or 2001, adds five high
speed interconnects (four CPU (10 GB/s) and one I/O (3 GB/s)) to an enhanced 21264 core.
|
ARM processors are mainly used in handheld computers and Personal Digital
Assistants. ARM (originally Acorn RISC Machine) was designed by Acorn Computers, UK, with
VLSI Technology as its silicon partner; the design is now owned by ARM Ltd.
Originally designed for the Archimedes home computer in 1986, the original ARM (ARM1, 2 and 3) was a 32 bit
CPU, but used 26 bit addressing. The newer ARM6 is completely 32 bits. ARM Ltd.
licenses the core to various manufacturers like DEC (now Intel), Compaq, HP etc. DEC
and ARM Ltd. collaborated to make the StrongARM series, which was bought by Intel.
ARM processors run the Apple Newton (ARM 610) running NewtOS, the Compaq iPAQ Pocket PC
(StrongARM), the HP Jornada 840 handheld PC (StrongARM) and Psion handhelds (ARM 710T).
The ARM6 has user, supervisor, and various interrupt modes (including 26 bit modes for
ARM2 compatibility). The ARM architecture has sixteen registers with a multiple
load/save instruction, though many registers are shadowed in interrupt modes (2 in
supervisor and IRQ, 7 in FIQ) so need not be saved, for fast response. A feature
introduced by the ARM is that every instruction is predicated, using a 4 bit condition
code. Another bit indicates whether the instruction should set condition codes, so
intervening instructions don't change them. This easily eliminates many branches and can
speed execution. Another unique and useful feature is a barrel shifter which operates on
the second operand of most ALU operations, allowing shifts to be combined with most
operations (and index registers for addressing), effectively combining two or more
instructions into one. These features make ARM code both dense (unlike most RISC
processors) and efficient, despite the relatively low clock rate and short pipeline.
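The predication, the optional flag update and the barrel shifter described above can be illustrated with a toy model. This is a sketch only: the `Flags` class, the `add` helper and the small subset of condition names are assumptions made for illustration, not ARM's encoding or a complete simulator.

```python
from dataclasses import dataclass

MASK32 = 0xFFFFFFFF


@dataclass
class Flags:
    n: bool = False  # negative
    z: bool = False  # zero
    c: bool = False  # carry
    v: bool = False  # signed overflow


# A few of the sixteen condition codes, written as predicates over the flags.
CONDITIONS = {
    "EQ": lambda f: f.z,
    "NE": lambda f: not f.z,
    "GE": lambda f: f.n == f.v,
    "LT": lambda f: f.n != f.v,
    "AL": lambda f: True,
}


def lsl(value: int, amount: int) -> int:
    """Logical shift left within 32 bits (one of the barrel shifter's modes)."""
    return (value << amount) & MASK32


def add(regs, flags, cond, rd, rn, rm, shift=0, set_flags=False):
    """Toy model of a predicated add: rd = rn + (rm << shift)."""
    if not CONDITIONS[cond](flags):
        return  # predicate false: the instruction has no effect
    a = regs[rn]
    b = lsl(regs[rm], shift)  # shift applied to the second operand for free
    total = a + b
    result = total & MASK32
    regs[rd] = result
    if set_flags:  # flags change only when the instruction asks for it
        flags.n = bool(result & 0x80000000)
        flags.z = result == 0
        flags.c = total > MASK32
        flags.v = bool(~(a ^ b) & (a ^ result) & 0x80000000)


regs = [0] * 16
regs[1], regs[2] = 10, 3
flags = Flags(z=True)                        # as if a previous compare found equality
add(regs, flags, "EQ", 0, 1, 2, shift=2)     # executes: r0 = 10 + (3 << 2)
add(regs, flags, "NE", 3, 1, 2)              # skipped: r3 is left untouched
print(regs[0], regs[3])                      # 22 0
```

The two `add` calls take the place of a compare-and-branch sequence: both are issued, and the condition field decides which one takes effect.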
ARM has developed a low cost 16-bit version called Thumb, which recodes a subset of
ARM CPU instructions into 16 bits (decoded to native 32-bit ARM instructions without
penalty). Thumb programs can be 30-40% smaller than already dense ARM programs. Native ARM
code can be mixed with Thumb code when the full instruction set is needed.
ARM6 (StrongARM SA-110,
1100, 1110): The ARM6 series consists of the ARM6 CPU core (35,000
transistors, which can be used as the basis for a custom CPU), the ARM60 base CPU, and the
ARM600, which also includes a 4K 64-way set-associative cache, MMU, write buffer, and
coprocessor interface (for FPU). The ARM CPU was chosen for the Apple Newton
handheld system because of its speed, combined with low power consumption, low cost
and customizable design (the ARM610 version used by Apple includes a custom MMU supporting
object oriented protection and access to memory for the Newton's NewtOS). DEC
introduced the StrongARM SA-110 in February 1996, running a 5-stage pipeline at 100 to
233 MHz (using only 1 watt of power), with a 5-port register file, faster multiplier, single
cycle shift-add, and Harvard Architecture (16K each 32-way I/D caches). Later
versions include the SA-1100, which runs at 133/190 MHz and provides system support logic,
multiple serial communication channels, LCD controllers and I/O ports, and the SA-1110, which
runs at 206 MHz.
ARM7: The ARM7 series, designed in
December 1994, increases performance by optimising the multiplier, and adding DSP-like
extensions including 32 bit and 64 bit multiply and multiply/accumulate instructions
(operand data paths lead from registers through the multiplier, then the shifter (one
operand), and then to the integer ALU for up to three independent operations). It also
doubles cache size to 8K, includes embedded In Circuit Emulator (ICE) support, and raises
the clock rate significantly.
ARM8: To fill the gap between ARM7
and DEC StrongARM, ARM also developed the ARM8/800 which includes many StrongARM features.
ARM9: ARM9 is an improvement over
the previous design with Harvard busses, write buffers, and flexible memory protection
mapping.
ARM10: A vector floating-point
unit is being added to the ARM10.
|
MIPS
Processors: MIPS produced the first commercial RISC processor, the R2000, in
1986. Currently microprocessors from this stable power devices ranging from supercomputers
to hand held devices (Cassiopeia Windows CE devices), as well as gaming consoles
(PlayStation) and TV set-top boxes. MIPS is owned by SGI. MIPS stands for
Microprocessor without Interlocked Pipeline Stages and came out of a Stanford University
project. MIPS processors power handhelds like the Casio Cassiopeia (VR 4122) running
Windows CE, Symbol (VR 4181) and the Agenda portable PC (VR 4181) running Linux, and also
power workstations like the SGI O2 (QED RM5200) and SGI Octane (MIPS R12000), and servers like
the SGI Origin 2000 (MIPS R12000) and Fujitsu-Siemens RM300 (MIPS R12000).
MIPS was intended to simplify processor design by eliminating hardware interlocks between
the five pipeline stages. This means that only single execution cycle instructions can
access the thirty two 32 bit general registers, so that the compiler can schedule them to
avoid conflicts. This also means that LOAD/STORE and branch instructions have a 1 cycle
delay to account for. However, because of the importance of multiply and divide
instructions, a special HI/LO pair of multiply/divide registers exists which does have
hardware interlocks, since these instructions take several cycles to execute and produce
scheduling difficulties. The R2000 has no condition code register, considering it a potential
bottleneck. The PC is user readable. The CPU includes an MMU that can also control a
cache, and the CPU was one of the first which could operate as a big or little endian
processor. An FPU, the R2010, is also specified for the processor.
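Because the hardware does not interlock, the compiler or assembler has to make sure that the instruction right after a load does not read the loaded register. The following is a toy sketch of the simplest possible fix, padding with a NOP; the `Instr` tuple and the `lw` mnemonic check are illustrative assumptions, and a real scheduler would first try to move an independent instruction into the slot.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    op: str                 # mnemonic, e.g. "lw", "add"
    dest: str               # register written ("" if none)
    srcs: Tuple[str, ...]   # registers read


NOP = Instr("nop", "", ())


def fill_load_delay_slots(program: List[Instr]) -> List[Instr]:
    """Insert a NOP after any load whose result is read by the very next instruction."""
    out: List[Instr] = []
    for i, ins in enumerate(program):
        out.append(ins)
        nxt = program[i + 1] if i + 1 < len(program) else None
        if ins.op == "lw" and nxt is not None and ins.dest in nxt.srcs:
            out.append(NOP)  # no hardware interlock, so software must wait
    return out


program = [
    Instr("lw", "r1", ("r2",)),        # r1 = memory[r2]
    Instr("add", "r3", ("r1", "r4")),  # reads r1 one instruction too early
    Instr("sub", "r5", ("r6", "r7")),  # independent of the load
]
print([ins.op for ins in fill_load_delay_slots(program)])  # ['lw', 'nop', 'add', 'sub']
```

In this example a better schedule would place the independent `sub` in the delay slot instead of the NOP, which is the kind of reordering the MIPS design leaves to the compiler.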
R3000: released in 1988 has
improved cache control.
R4000: released in 1991, expanded
to 64 bits and is superpipelined: twice as many pipeline stages do less work at each
stage, allowing a higher clock rate and twice as many instructions in the pipeline at
once, at the expense of increased latency when the pipeline can't be filled, such as
during a branch. This required interlocks to be added between stages for compatibility, making
the original "I" in the "MIPS" acronym meaningless. The R4400 and
above integrated the FPU with on-chip caches. The R4600 and later versions abandoned
superpipelines.
R8000: released in 1994, was
superscalar and was optimised for floating point operation, issuing two integer or
load/store operations (from four integer and two load/store units) and two floating point
operations simultaneously (FP instructions are sent to the independent R8010 floating point
coprocessor, which has its own set of thirty-two 64-bit registers and load/store queues).
R10000: released early 1996,
added multiple FPU units, as well as almost every advanced modern CPU feature, including
separate 2-way I/D caches (32K each) plus an on-chip secondary cache controller (and a high
speed 8-way split transaction bus: up to 8 transactions can be issued before the first
completes), superscalar execution (load four, dispatch five instructions to any of two
integer, two floating point, and one load/store units), dynamic register renaming (integer
and floating point rename registers, thirty two in the R10K), and an instruction cache
where instructions are partially decoded when loaded into the cache, simplifying the
processor decode stage. Branch prediction and target caches are also included.
R12000/R14000: released in May
1997, was similar to the R10000 but had forty eight rename registers. The R10000,
R12000 and the R14000 are the last of the high performance microprocessors from the MIPS
stable.
R5000: released in January, 1996
is a 2-way (int/float) superscalar architecture and was added to fill the gap between
R4600 and R10000, without the out of order or branch prediction buffers.
TinyRISC: released in October
1996 is used in embedded applications. MIPS and LSI Logic added a compact 16 bit
instruction set which can be mixed with the 32 bit set.
MIPS V/MDMX: released in October
1996, is similar to the TinyRISC but MIPS V adds parallel floating point (two 32 bit
fields in 64 bit registers) operations, while MDMX adds integer 8 or 16 bit subwords in 64 bit
FPU registers and 24 and 48 bit subwords in a 192 bit accumulator for multimedia
instructions. Vector-scalar operations (e.g. multiply all subwords in a register by subword
3 from another register) are also supported. These extensive instructions are partly
derived from Cray vector instructions and are much more extensive than the earlier
multimedia extensions of other CPUs. Future versions are expected to add Java Virtual
Machine support.
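The vector-scalar example mentioned above (multiply all subwords in a register by subword 3 from another register) can be sketched as follows. This is an illustration only: it assumes unsigned 8 bit subwords with index 0 in the lowest byte, ignores saturation and rounding, and models the accumulator simply as eight 24 bit lanes (8 x 24 = 192 bits). The function names are made up and do not correspond to MDMX mnemonics.

```python
def unpack8(reg64: int) -> list:
    """Split a 64 bit value into eight unsigned 8 bit subwords (index 0 = lowest byte)."""
    return [(reg64 >> (8 * i)) & 0xFF for i in range(8)]


def vector_scalar_multiply(vec_reg: int, scalar_reg: int, select: int) -> list:
    """Multiply every subword of vec_reg by one selected subword of scalar_reg.

    Each product is kept in its own lane, masked to 24 bits, standing in for
    the 24 bit subwords of the 192 bit accumulator.
    """
    scalar = unpack8(scalar_reg)[select]
    return [(x * scalar) & 0xFFFFFF for x in unpack8(vec_reg)]


vec = 0x0807060504030201   # subwords 1, 2, ..., 8
other = 0x000000000A000000  # subword 3 holds 10
print(vector_scalar_multiply(vec, other, 3))  # [10, 20, 30, 40, 50, 60, 70, 80]
```

The wide accumulator lanes are what let products of two 8 bit values (up to 16 bits each) be summed repeatedly without overflowing.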
|
PowerPC
microprocessors: IBM, Motorola, and Apple formed a coalition (around 1992) to produce a
microprocessor version of the POWER design as a successor to both the Motorola 68000 and
Intel 80x86, resulting in the PowerPC. PowerPC derivatives power IBM RS/6000 (POWER3-II)
workstations and servers, and Apple's iMac (PowerPC G3), PowerMac (PowerPC G4)
desktops and Macintosh Server G4 (PowerPC G4) running OS X.
PowerPC 601: released in 1993, was the first of the series (considered
first generation or G1), and included both POWER and PowerPC features, based strongly on
the POWER1, except it had a single 32K cache rather than separate I/D caches.
PowerPC 603: released late 1993, (first second generation G2) separated
the main functional units further, removing load/store operations from the integer unit
(four functional units total - integer, floating point, load/store (using integer
registers), branch), and splitting the branch unit into a fetch/branch unit, a dispatch
unit, and a completion/exception unit. The 603 also added a rename buffer in the dispatch
unit for speculative execution using renamed integer and floating point registers, which
are ordered properly by the completion/exception unit, or discarded for mispredicted
branches and exceptions. Separate 8K and 16K I/D cache versions were available.
PowerPC 604: released in mid 1995, added dynamic branch prediction using a branch
history table, and added two simplified integer units, for three integer units in all (two for
single-cycle operations, one for multicycle operations such as multiply/divide), plus
floating point, load/store and branch, for a total of six. Four instructions could be dispatched
at once. The CC register could also be renamed.
PowerPC 620: expanded the 604 design to 64 bits (but with a 'backside'
L2 cache bus), and added new 64 bit instructions, but was delivered much later and slower
than promised, and was further delayed when it was withdrawn for a redesign.
PowerPC 750: released in early 1998 (G3), is 32 bit, refined in design
and performance, adding a P620-style backside cache bus, but made no other significant
changes (notably though, they used a 603-based 32-bit FPU, rather than the 64-bit 604
FPU).
POWER2: released in 1993, is a workstation version with a
high bandwidth design with two floating point load/store units, 256K of data cache, and
added 128-bit floating point support and a square root instruction. Initially a multichip
design, it was later combined into one chip (P2SC), and then into an eight CPU
"SuperChip". It could issue up to six instructions and four simultaneous loads
or stores.
POWER3 (PowerPC A35, PowerPC
RS64): released in early 1998, had eight
functional units (two FPU, three integer (two single cycle, one multicycle), two
load/store, and branch unit), and was capable of operating at much higher clock speeds. In
addition, a 64 bit PowerPC designed as the CPU for the AS/400 E series, the PowerPC A35 (Apache),
with added decimal arithmetic and string instructions, was also used in the RS/6000 S70
workstation (called the PowerPC RS64).
PowerPC G4 adds AltiVec vector extensions, a direct response to Intel's MMX instructions,
in CPUs from Motorola (but not IBM).
|
SPARC
or the Scalable (originally Sun) Processor ARChitecture was designed by Sun Microsystems
for their own use. Sun was a maker of workstations, and used standard 68000-based CPUs and
a standard operating system, Unix. Research versions of load-store processors had promised
a major step forward in speed, but existing manufacturers were slow to introduce a RISC
processor, so Sun went ahead and developed its own. In keeping with their open philosophy,
they licensed it to other companies, rather than manufacture it themselves. Sun uses
the UltraSPARC III in its Sun Blade and Sun Enterprise Server range and UltraSPARC II in
its Ultra 5 workstation. Fujitsu-Siemens run their PrimePower line of servers on
HAL/SPARC64 processors running Solaris.
The SPARC design was radical at the time, even omitting multiple cycle multiply and divide
instructions (added in later versions), using single-cycle "step" instructions
instead, while most RISC CPUs were more conventional. SPARC usually contains about
128 or 144 registers (memory-data designs typically had 16 or fewer). At any one time 32
registers are available: 8 are global, and the rest are allocated in a 'window' from a stack
of registers. The window is moved 16 registers down the stack during a function call, so
that the upper and lower 8 registers are shared between functions, to pass and return
values, and 8 are local. The window is moved up on return, so registers are loaded or
saved only at the top or bottom of the register stack. This allows functions to be called
in as little as 1 cycle. Like most RISC processors, global register zero is wired to zero
to simplify instructions, and SPARC is pipelined for performance (a new instruction can
start execution before a previous one has finished), but not as deeply as others - it has
branch delay slots. Also like previous processors, a dedicated CCR holds comparison
results. SPARC is 'scalable' mainly because the register stack can be expanded (up
to 512, or 32 windows), to reduce loads and saves between functions, or scaled down to
reduce interrupt or context switch time, when the entire register set has to be saved.
Function calls are usually much more frequent than interrupts, so the large register set
is usually a plus, but compilers now can usually produce code which uses a fixed register
set as efficiently as a windowed register set across function calls. SPARC is not a
chip, but a specification, and so there are various designs of it. It has undergone
revisions, and now has multiply and divide instructions. Original versions were 32 bits,
but 64 bit and superscalar versions were designed and implemented (beginning with the
Texas Instruments SuperSparc in late 1992), but performance lagged behind other load-store
and even Intel 80x86 processors until the UltraSPARC (late 1995) from Texas Instruments
and Sun, and superscalar HAL/Fuji SPARC64 multichip CPU.
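The register window scheme described above can be sketched as a mapping from a visible register number to a storage slot. This is a simplified model under stated assumptions: registers 8-15 are treated as the outgoing group, 16-23 as locals and 24-31 as the incoming group, the window pointer decreases on a call, and window overflow/underflow traps are ignored. The function names are illustrative.

```python
GLOBALS = 8        # registers 0-7 are shared by every window
WINDOW_STEP = 16   # the window moves 16 registers per call


def physical_index(reg: int, cwp: int, nwindows: int):
    """Map a visible register (0-31) in window `cwp` to ("global" | "windowed", slot)."""
    if not 0 <= reg < 32:
        raise ValueError("visible registers are numbered 0-31")
    if reg < GLOBALS:
        return ("global", reg)
    size = WINDOW_STEP * nwindows  # windowed registers form a circular stack
    return ("windowed", (cwp * WINDOW_STEP + (reg - GLOBALS)) % size)


def call(cwp: int, nwindows: int) -> int:
    """A function call moves the window down the register stack."""
    return (cwp - 1) % nwindows


def ret(cwp: int, nwindows: int) -> int:
    """A return moves the window back up."""
    return (cwp + 1) % nwindows


nwindows = 8
caller = 3
callee = call(caller, nwindows)
# The caller's outgoing register 8 and the callee's incoming register 24
# are the same storage slot, so arguments are passed without copying.
print(physical_index(8, caller, nwindows), physical_index(24, callee, nwindows))
```

With 32 windows this model gives 16 x 32 = 512 windowed registers, matching the upper limit quoted in the text.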
UltraSPARC: is a 64-bit
superscalar processor series which can issue up to four instructions at once to any of
nine units: two integer units, two of the five floating point/graphics units, the branch
and load/store unit. The UltraSPARC also added a block move instruction which bypasses the
caches (2-way 16K instr, 16K direct mapped data), to avoid disrupting them, and specialized
pixel operations (VIS - the Visual Instruction Set) which can operate in parallel on 8,
16, or 32-bit integer values packed in a 64-bit floating point register (for example, four
8 X 16 -> 16 bit multiplications in a 64 bit word), a sort of simple SIMD/vector
operation. More extensive than the Intel MMX instructions, VIS also includes some 3D
to 2D conversion, edge processing and pixel distance (for MPEG, pattern-matching support).
HAL/Fuji SPARC64: can issue up
to four in order instructions simultaneously to four buffers, then to four integer, two
floating point, two load/store, and the branch unit, and may complete out of order (an
instruction completes when it finishes without error, is committed when all instructions
ahead of it have completed, and is retired when its resources are freed - these are
'invisible' stages in the SPARC64 pipeline). A combination of register renaming, a branch
history table, and processor state storage allow for speculative execution while
maintaining precise exceptions/interrupts (renamed integer, floating, and CC registers -
trap levels are also renamed and can be entered speculatively).
|
Intel processors: The 8086
was one of the first commercial microprocessors used in personal computers. It
was based on the design of the 8080/8085 (source compatible with the 8080) with a
similar register set, but was expanded to 16 bits. Intel processors power most of
the desktops in the world, including those of IBM (NetVista, IntelliStation), Compaq (Presario,
iPAQ), Dell (Dimension, OptiPlex) and HP (Brio). Intel processors also power servers
that mainly run Windows NT, like the SGI 1000, Compaq's ProLiant and Dell's PowerEdge servers.
The computer running the International Space Station runs on a variation of the
Intel 386SX processor.
In the 8086, the Bus Interface Unit fed the instruction stream to the Execution Unit through
a 6 byte prefetch queue, so fetch and execution were concurrent - a primitive form of pipelining
(8086 instructions varied from 1 to 6 bytes). It featured four 16 bit general
registers, which could also be accessed as eight 8 bit registers, and four 16 bit index
registers (including the stack pointer). The data registers were often used implicitly by
instructions, complicating register allocation for temporary values. It featured 64K 8-bit
I/O (or 32K 16-bit) ports and fixed vectored interrupts. There were also four segment
registers that could be set from index registers, each selecting a 64K segment within the
1 meg address space. Although segmentation was largely
acceptable for assembly language, where control of the segments was complete (it could
even be useful then), in higher level languages it caused constant confusion (e.g. near/far
pointers). Even worse, it made expanding the address space to more than 1 meg difficult.
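On the 8086, a physical address is formed by shifting the 16 bit segment value left four bits and adding a 16 bit offset, giving a 20 bit (1 meg) address. A minimal sketch of that calculation follows; the function name is made up for illustration.

```python
def real_mode_address(segment: int, offset: int) -> int:
    """20 bit physical address from a 16 bit segment and a 16 bit offset."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must fit in 16 bits")
    # The result wraps at 1 meg because the 8086 has only 20 address lines.
    return ((segment << 4) + offset) & 0xFFFFF


# Two different segment:offset pairs that name the same byte.
print(hex(real_mode_address(0x1234, 0x0010)))  # 0x12350
print(hex(real_mode_address(0x1235, 0x0000)))  # 0x12350
```

Since segments start every 16 bytes but span 64K, many segment:offset pairs alias the same location, which is one source of the near/far pointer confusion mentioned above.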
80286: released in 1982, expanded
the address space to 24 bits (16 meg) only by adding a new mode (switching from 'Real' to
'Protected' mode was supported, but switching back required using a bug in the original 80286,
which then had to be preserved). Protected mode greatly increased the number of segments by
using a 16 bit selector for a 'segment descriptor', which contained the location within a 24
bit address space, size (still no more than 64K), and attributes (for Virtual Memory support)
of a segment.
80386: released in 1985, broke the
64K segment memory access restriction and included much improved addressing: base reg +
index reg * scale (1, 2, 4 or 8) + displacement (8 or 32 bit constant) = 32 bit
address, in the form of paged segments (using six 16-bit segment registers). It also had
several processor modes (including separate paged and segmented modes) for compatibility
with the previous awkward design. In fact, with the right assembler, code written for the
8008 can still be run on the most recent Pentium Pro. The 80386 also added an MMU,
security modes (called "rings" of privilege - kernel, system services,
application services, applications) and new op codes.
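The 80386 addressing form (base + index * scale + displacement) can be written out as a small calculation. This sketch covers only the 32 bit offset within a segment; segment translation and paging are left out, and the function name is illustrative.

```python
def effective_address(base: int, index: int, scale: int, displacement: int = 0) -> int:
    """32 bit effective address: base + index * scale + displacement."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4 or 8")
    return (base + index * scale + displacement) & 0xFFFFFFFF


# Element 5 of an array of 4 byte items that starts 8 bytes into a
# structure located at 0x1000.
print(hex(effective_address(0x1000, 5, 4, 8)))  # 0x101c
```

The scale factors match common element sizes (1, 2, 4 and 8 bytes), so an array access needs no separate multiply or shift instruction.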
80486: released in 1989, added
full pipelines, single on chip 8K cache, integrated FPU (based on the eight element 80-bit
stack-oriented FPU in the 80387 FPU), and clock doubling versions.
Pentium: released in 1993,
was superscalar (up to two instructions at once in dual integer units and single FPU) with
separate 8K I/D caches. Pentium was the name Intel gave the 80586 version
because it could not legally protect the name "586" to prevent other companies
from using it. MMX (initially reported as MultiMedia eXtension, but later said by
Intel to mean Matrix Math eXtension) performs integer operations on vectors of 8, 16, or 32
bit words, using the 80 bit FPU stack elements as eight 64 bit registers.
P55C Pentium: this version, released
January 1997, is the first Intel CPU to include MMX instructions, followed by the AMD K6
and Pentium II. The P55C is a pure CMOS design.
Pentium Pro: (Pentium's successor,
late 1995) does not execute 80x86 instructions directly, but uses specialized hardware decoders
to convert them to RISC-like micro-ops, which are executed on a
specially designed superscalar RISC-style core faster than on the Pentium itself. Intel also
used BiCMOS in the Pentium and Pentium Pro to achieve clock rates competitive with CMOS
load-store processors. Pentium Pro (code named "P6") is a 1 or 2-chip (CPU
plus 256K or 512K L2 cache - I/D L1 cache (8K each) is on the CPU), 14-stage
superpipelined processor. It uses extensive multiple branch prediction and speculative
execution via register renaming. Three decoders (one for complex instructions of up to four
micro-ops, two for simple single micro-op instructions) each decode one 80x86 instruction into
micro-ops (one per simple decoder + up to four from the complex decoder = three to six per cycle).
Up to five (usually three) micro-ops can be issued in parallel and out of order (six units
- FPU, 2 integer, 2 address, 1 load/store), but are held and retired (results written to
registers or memory) as a group to prevent an inconsistent state (equivalent to half an
instruction being executed when an interrupt occurs, for example). 80x86 instructions may
produce several micro-ops in CPUs like this, so the actual instruction rate is
lower than the micro-op rate. In fact, due to problems handling instruction alignment in the
Pentium Pro, emulated 16-bit instructions execute slower than on a Pentium.
Pentium II: released May 1997,
this version of the Pentium Pro added MMX instructions, doubled the L1 cache to 32K, and was
packaged in a processor card instead of an IC package.
Pentium III/Pentium 4: Faster
versions of the Pentium line, ranging from 600 MHz to 1 GHz. Celeron versions are similar
to the Pentium but have less L2 cache (about 128KB compared to 256KB on the normal Pentium),
and Xeon versions have 512KB L2 cache.
Itanium: (code named Merced or
IA-64) Intel, with partner Hewlett-Packard, has begun development of a next generation
64-bit processor. It is expected to be a variable length instruction group (or what
HP/Intel call EPIC (Explicit Parallel Instruction Computing)) with instruction
dependencies grouped from 1 to 9+. The processor is expected to read instructions in 128
bit bundles of three instructions plus five "template bits" which indicate dependencies.
Most instructions are predicated, and there are 128 general 64-bit and 128 floating point
registers, and 64 predicate bits (a type of condition code). To reduce page faults, speculative
load instructions execute a load, but do not trap if there is an exception until a second
instruction completes it - if the second instruction is predicated and never executes,
then a page fault is avoided, and loads can be rescheduled more flexibly. It's expected
to be compatible in some way with both the PA-RISC and 80x86 - it will include 80x86 data
and segment registers, with additional instructions to switch between instruction/register
sets and transfer data between 80x86 and IA-64 registers. It is expected to translate
80x86 instructions into VLIW instructions (or directly to decoded instructions) the same
way that Pentium Pro and AMD K5/K6 CPUs do, but with a larger number of instructions
issued using the VLIW design, it should be faster. However, native IA-64 code should be
even faster, and this may finally produce the incentive to let the 80x86 architecture
finally fade away.
|
Intel clone processors: Due to the popularity of the Intel processors, other companies started to
clone the design to make cheaper and faster versions. The earliest clones of the Intel
processors were the NEC V20/V30 (slightly faster clones of the 8088/8086, which could also run
8080 code). Intel clones run desktops made by companies like HP, Compaq, IBM,
Gateway and Everex.
AMD K5: translates 80x86 code to ROPs
(RISC OPerations), which execute on a RISC-style core. Up to four ROPs can be dispatched
to six units (two integer, one FPU, two load/store, one branch unit), and five can be
retired at a time. The complexity led to low clock speeds for the K5, prompting AMD to buy
NexGen and integrate its designs for the next generation K6.
NexGen/AMD Nx586: released early
1995, is unique by being able to execute its micro-ops (called RISC86 code) directly,
allowing optimised RISC86 programs to be written which are faster than an equivalent x86
program would be, but this feature is seldom used. It also features two 16K I/D L1 caches,
a dedicated L2 cache bus (like that in the Pentium Pro 2-chip module) and an off-chip FPU
(either separate chip, or later as in 2-chip module).
AMD K6: released April 1997, actually
has three caches - 32K each for data and instructions, and a half-size 16K cache
containing instruction decode information. It also brings the FPU on-chip and eliminates
the dedicated cache bus of the Nx586, allowing it to be pin-compatible with the P54C model
Pentium. Another decoder is added (two complex decoders, compared to the Pentium Pro's one
complex and two simple decoders) producing up to four micro-ops and issuing up to six (to
seven units - load, store, complex/simple integer, FPU, branch, multimedia) and retiring
four per cycle. It includes MMX instructions, licensed from Intel, and AMD has designed
and added 3DNow! graphics extensions without waiting for Intel's 3D MMX extensions.
AMD K7: expected in 1999, is
based on a superscalar design, decoding x86 instructions into 'MacroOps' (made up of one or
two 'micro-ops') in two decoders (one for simple and one for complex instructions)
producing up to three MacroOps per cycle. Up to nine decoded operations per cycle can be
issued in six MacroOps to six functional units (three integer, each able to execute one
simple integer and one address op simultaneously, and three FPU/MMX/3DNow! instructions
with extensive stack and register renaming, and a separate integer multiply unit which
follows integer ALU 0, and can forward results to either ALU 0 or 1). The K7 replaces
the Intel-compatible bus of the K6 with the high speed Alpha EV6 bus because Intel decided
to prevent competitors from using its own higher speed bus designs. This makes it easier
to use either Alpha or AMD K7 processors in a single design.
Cyrix 6x86: released early 1996,
initially manufactured by IBM before Cyrix merged with National Semiconductor, still
directly executes 80x86 instructions (in two integer and one FPU pipeline), but partly out
of order, making it faster than a Pentium at the same clock speed.
Cyrix MediaGX: is an integrated
version with graphics and audio on-chip. MMX instructions were added to the
6x86MX, and 3DNow! graphics instructions to the 6x86MXi.
Cyrix M3: released mid 1998, turned
to superpipelining (eleven stages compared to six (seven?) for the M2) for a higher clock
rate (partly for marketing purposes, as MHz is often preferred to performance in the PC
market), and provides dual floating point/MMX/3DNow! units.
IDT-C6 WinChip: released May
1997, manufactured by Centaur, a subsidiary of Integrated Device Technology. It uses a
much simpler (6-stage, 2 way integer/simple-FPU execution) design than Intel and AMD
translation-based designs by using micro-ops more closely resembling 80x86 than RISC code,
which allows for a higher clock rate and larger L1 (32K each I/D) and TLB caches in a
lower cost, lower power consumption design. Simplifications include replacing branch
prediction (less important with a short pipeline) with an eight entry call/return stack,
depending more on caches. The FPU unit includes MMX support (the C6+ version added a
second FPU/MMX unit and 3D graphics enhancements). Like Cyrix, IDT opted for a
superpipelined eleven-stage design for added performance, combined with sophisticated
early branch prediction in its WinChip 4. The design also pays attention to supporting
common code sequences - for example, loads occur earlier in the pipeline than stores,
allowing load-alu-store sequences to be more efficient.
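The eight entry call/return stack mentioned above can be pictured as a small fixed-depth stack of return addresses. The following is a minimal sketch in Python, not a description of the actual C6 circuitry. The class name, the overflow policy (discard the oldest entry) and the addresses are assumptions made for illustration.

```python
from collections import deque


class ReturnStack:
    """Toy model of a fixed-depth call/return predictor (illustrative only)."""

    def __init__(self, depth=8):
        # A deque with maxlen silently drops the oldest entry on overflow.
        # This overflow policy is an assumption, not documented C6 behaviour.
        self.entries = deque(maxlen=depth)

    def on_call(self, return_address):
        # A call pushes the address of the instruction after the call.
        self.entries.append(return_address)

    def on_return(self):
        # A return pops the predicted target; None means "no prediction".
        return self.entries.pop() if self.entries else None


stack = ReturnStack()
stack.on_call(0x1004)
stack.on_call(0x2008)
first = stack.on_return()   # predicted target of the inner return
second = stack.on_return()  # predicted target of the outer return
third = stack.on_return()   # stack is empty, so no prediction
```

Because calls and returns nest, the most recent call is almost always the next one to return, which is why such a small structure can stand in for a general branch predictor on this kind of branch.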
|
Motorola 680xx
(Dragonball, ColdFire): was commonly used to power early home and personal
computers like the Apple Macintosh, Amiga and Atari ST, competing with Intel powered IBM PCs.
Currently the Motorola 68040 and 68060 power the Amiga 4000T video workstation. A
DragonBall version, the 68328, powers the Palm Pilot computers running the Palm OS.
68000: was initially an 8 MHz, 32
bit architecture internally, but had only a 16 bit data bus and 24 bit address bus to fit
in a 64 pin package. Lack of forced segments made programming the 68000 easier than some
competing processors, without the 64K size limit on directly accessed arrays or data
structures. It was designed for expansion, including specifications for floating
point and string operations. The 68000 could fetch the next instruction during execution
(a 2 stage pipeline).
68008: reduced the data bus to 8
bits and address to 20 bits.
68010: released in 1982, added
virtual memory support (the 68000 couldn't restart interrupted instructions) and a special
loop mode - small decrement-and-branch loops could be executed from the instruction fetch
buffer.
68020: released in 1984, was fully
32 bit externally. Addresses were computed as 32 bits (without using segment registers) -
the unused upper address bits were ignored by the 68000 and 68008, but some programmers stored
type tags in the upper 8 bits, causing compatibility problems with the 68020's 32 bit
addresses. It expanded the external data and address buses to 32 bits, used a simple 3-stage pipeline,
and added a 256 byte cache.
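The type tag problem can be shown with a little arithmetic. The sketch below is only an illustration: the helper names, the tag value and the address are hypothetical, and the two masks simply model a 24 bit versus a 32 bit address bus.

```python
ADDRESS_MASK_24 = 0x00FFFFFF  # the 68000 drives only 24 address lines
ADDRESS_MASK_32 = 0xFFFFFFFF  # the 68020 drives all 32


def tag_pointer(address, tag):
    # Store an 8 bit type tag in the otherwise unused top byte.
    return ((tag & 0xFF) << 24) | (address & ADDRESS_MASK_24)


def bus_address(pointer, mask):
    # The address that actually reaches memory for a given bus width.
    return pointer & mask


pointer = tag_pointer(0x012345, tag=0x80)
on_68000 = bus_address(pointer, ADDRESS_MASK_24)  # tag is dropped by the bus
on_68020 = bus_address(pointer, ADDRESS_MASK_32)  # tag becomes part of the address
```

On the narrower bus the tagged pointer still reaches address 0x012345, while on the full 32 bit bus it points somewhere else entirely, so software that used this trick had to mask its pointers explicitly before it could run on the 68020.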
68030: released in 1987, brought
the MMU onto the chip.
68040: released in 1991, added
an on chip FPU with eight 80 bit floating point registers (compatible with the 68881/2
coprocessors). It also added fully cached Harvard busses (4K each for data and
instructions) and a 6 stage pipeline.
68060: released in April 1994,
expanded the design to a superscalar version. The third stage of the 10-stage 68060
pipeline translates the 680x0 instructions to a decoded RISC-like form (stored in a 16
entry buffer in stage four). There is also a branch cache, and branches are folded into
the decoded instruction stream, which is then dispatched to two pipelines (three stages: decode,
address generation, operand fetch) and finally to two of three execution units (2 integer, 1
floating point) before reaching two 'writeback' stages. Cache sizes are doubled over the
68040. The 68060 also includes many innovative power-saving features (3.3V operation,
execution unit pipelines can be shut down, reducing power consumption at the
expense of slower execution, and the clock can be reduced to zero) so power use is lower
than the 68040 (3.9-4.9 watts vs. 4-6). Another innovation is that simple
register-register instructions which don't generate addresses may use the address
stage ALU to execute 2 cycles early.
ColdFire: released early 1995, is a simplified design
in which complex instructions and addressing modes (added to the 68020) were removed and
the instruction set was recoded, simplifying it at the expense of compatibility (source
only, not binary) with the 680x0 line. The embedded market became the main market for
the 680x0 series after workstation vendors (and the Apple Macintosh) turned to faster
load-store processors, so a variety of embedded versions were introduced.
|
Hitachi SuperH:
The Hitachi SH series was meant to replace the 8-bit and 16-bit H8
microcontrollers, a series of memory-data CPUs with sixteen 16-bit registers. The SH is
also designed for the embedded market, and is similar to the ARM architecture in many
ways. The SH is used in many of Hitachi's own products, and was one of the first Japanese
CPUs to gain wide popularity outside of Japan. It's most prominently featured in the Sega
Saturn video game system (which uses two SH2 CPUs) and many Windows CE palmtop computers
(SH3 chip set) inc. some HP Jornada versions.
SH is a 32 bit processor, but with a 16 bit instruction format, has sixteen general
purpose registers and a load/store architecture. This results in a very high code density.
Because of the small instruction size, there is no instruction to load a 32 bit immediate value, but
a PC-relative addressing mode is supported to load 32 bit values. The SH also has a
Multiply ACcumulate (MAC) instruction, and MACH/L (high/low word) result registers - 42
bit results (32 low, 10 high) in the SH1, 64 bit results (both 32 bit) in the SH2 and
later.
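The way a multiply-accumulate result is spread over the MACH and MACL registers can be sketched as follows for the 64 bit case (SH2 and later). This is a simplified model, not the exact hardware behaviour: the function name is hypothetical, the accumulator simply wraps at 64 bits, and any saturation mode the real instruction may offer is not modelled.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def mac(mach, macl, a, b):
    """Add the signed product a*b to the 64 bit accumulator MACH:MACL."""
    accumulator = (mach << 32) | macl
    # Wrap to 64 bits (an assumption; saturation is not modelled here).
    accumulator = (accumulator + a * b) & MASK64
    # High word goes to MACH, low word to MACL.
    return (accumulator >> 32) & MASK32, accumulator & MASK32


mach, macl = 0, 0
for a, b in [(0x10000, 0x10000), (3, 4)]:
    mach, macl = mac(mach, macl, a, b)
```

After the two steps the accumulator holds 0x1_0000_0000 + 12, so MACH contains 1 and MACL contains 12. A running sum that would overflow a single 32 bit register therefore survives intact across the register pair.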
SH3: includes an MMU and 2K to 8K of
unified cache.
SH4: released in mid-1998 is a superscalar
version with extensions for 3-D graphics support. It can issue two instructions at a time
to any of four units: integer, floating point, load/store, branch (except for certain
non-superscalar instructions, such as modifying control registers). Certain instructions,
such as register-register move, can be executed by either the integer or load/store unit,
so two can be issued at the same time. Each unit has a separate pipeline, five stages for
integer and load/store, five or six for floating point, and three for branch.
Other enhancements such as support for MPEG operations are planned for the SH5.
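The dual-issue rule described for the SH4 can be approximated with a small table of which units may execute each kind of instruction. This is a rough sketch under simplifying assumptions: the instruction class names are hypothetical, and the real chip has further pairing restrictions (such as the non-superscalar instructions noted above) that are not modelled.

```python
# Which units may execute each (hypothetical) instruction class.
UNITS = {
    "add": {"integer"},
    "load": {"load/store"},
    "fadd": {"floating point"},
    "branch": {"branch"},
    "mov": {"integer", "load/store"},  # register-register move: either unit
}


def can_dual_issue(first, second):
    # A pair can issue together if each instruction can get a different unit.
    return any(a != b for a in UNITS[first] for b in UNITS[second])


two_adds = can_dual_issue("add", "add")     # both need the integer unit
two_moves = can_dual_issue("mov", "mov")    # one per unit
add_and_load = can_dual_issue("add", "load")
```

In this model two adds compete for the single integer unit and cannot pair, while two register-register moves can, because one goes to the integer unit and the other to the load/store unit.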
|
PA-RISC:
The PA-RISC (Precision Architecture, originally code-named Spectrum) was designed to
replace older processors in HP-3000 MPE minicomputers, and Motorola 680x0 processors in
the HP-9000 HP/UX Unix minicomputers and workstations. It has an unusually large
instruction set for a RISC processor (including a conditional skip instruction), partly
because initial design took place before RISC philosophy was popular, and partly because
careful analysis showed that performance benefited from the instructions chosen - in fact,
version 1.1 added new multiple operation instructions combined from frequent instruction
sequences, and HP was among the first to add multimedia instructions (the MAX-1 and MAX-2
instructions). Much of the RISC philosophy was independently invented at HP from lessons
learned from FOCUS (pre 1984), HP's (and the world's) first fully 32 bit microprocessor.
It has a 5 stage pipeline, which had hardware interlocks from the beginning for
instructions which take more than one cycle, as well as result forwarding (a result can be
used by a subsequent instruction without waiting for it to be stored in a register first).
It is a load/store architecture, originally with a single instruction/data bus, later
expanded to a Harvard architecture (separate instruction and data buses). It has
thirty-two 32-bit integer registers (GR0 wired to constant 0, GR31 used as a link register
for procedure calls), with seven 'shadow registers' which preserve the contents of a
subset of the GR set during fast interrupts, and thirty-two 64-bit floating point
registers (also as sixty-four 32-bit and sixteen 128-bit), in an FPU which could execute
a floating point instruction simultaneously with integer instructions (derived from the
Apollo-designed Prism architecture (1988?) after Hewlett-Packard acquired the company). Later versions (the PA-RISC 7200 in
1994) added a second integer unit (still dispatching only two instructions at a time to
any of the three units). Addressing originally was 48 bits, and expanded to 64 bits, using
a segmented addressing scheme. The PA-RISC 8000 (April 1996, intended to compete
with the R10000, SPARC, and others) expands the registers and architecture to 64 bits
(eliminating the need for segments), and adds aggressive superscalar design - up to 5
instructions out of order, using fifty six rename registers, to ten units (five pairs of:
ALU, shift/merge, FPU mult/add, divide/sqrt, load/store). The CPU is split in two, with
load/store (high latency) instructions dispatched from a separate queue from operations
(except for branch or read/modify/write instructions, which are copied to both queues). It
also has a deep pipeline and speculative execution of branches.
PA-RISC 8500: released mid 1998, breaks
with HP tradition and adds on-chip cache - 1.5MB L1 cache.
|
Transmeta's
premier product is the Crusoe processor, a revolutionary x86-compatible family of
solutions specially designed for the new world of Mobile Internet Computing. Remarkably low
power consumption allows the processor to run cooler than conventional chips, and battery
life is extended up to a whole day. Transmeta has pioneered a revolutionary new approach
to microprocessor design. Rather than implementing the entire x86 processor in hardware,
the Crusoe processor solution consists of a compact hardware engine surrounded by a
software layer.
The hardware component is a very simple, high-performance, low-power VLIW (Very Long
Instruction Word) engine with an instruction set that bears no resemblance to that of x86
processors. Instead, it is the surrounding software layer that gives programs the
impression that they are running on x86 hardware. This innovative software layer is called
the Code Morphing software because it dynamically "morphs" (that is, translates)
x86 instructions into the hardware engine's native instruction set. Transmeta's software
translates blocks of x86 instructions once, saving the resulting translation in a
translation cache. The next time the (now translated) code is executed, the system skips
the translation step and directly executes the existing optimized translation at full
speed. This unique approach to executing x86 code eliminates millions of
transistors, replacing them with software. The current implementation of the Crusoe
processor uses roughly one-quarter of the logic transistors required for an all-hardware
design of similar performance. This offers the following benefits:
The hardware component is considerably smaller, faster, and more power efficient than
conventional chips.
The hardware is fully decoupled from the x86 instruction set architecture, enabling
Transmeta's engineers to take advantage of the latest and best in hardware design trends
without affecting legacy software.
The Code Morphing software can evolve separately from hardware. This means that upgrades
to the software portion of the microprocessor can be rolled out independently of hardware
chip revisions.
Transmeta's Code Morphing technology is obviously not limited to x86 implementations. As
such, it has the potential to revolutionize the way microprocessors are designed in the
future.
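The translate-once, run-many behaviour of the translation cache described above can be sketched in a few lines. This is a toy illustration and not Transmeta's actual software: the class, its method names and the stand-in translator are all hypothetical, and a real translator would emit optimized VLIW code rather than upper-cased strings.

```python
class CodeMorpher:
    """Toy translate-once, run-many cache (not Transmeta's actual software)."""

    def __init__(self, translate):
        self.translate = translate  # hypothetical block translator
        self.cache = {}             # block address -> translated code
        self.translations = 0       # how many times we had to translate

    def run_block(self, address, x86_block):
        if address not in self.cache:
            # First visit: pay the translation cost and remember the result.
            self.cache[address] = self.translate(x86_block)
            self.translations += 1
        # Every visit: hand back the stored translation directly.
        return self.cache[address]


# Stand-in translator: upper-casing plays the role of emitting native code.
morpher = CodeMorpher(translate=lambda block: [op.upper() for op in block])
loop_body = ["mov eax, 1", "add eax, ebx"]
for _ in range(1000):
    native = morpher.run_block(0x4000, loop_body)
```

Although the block is executed a thousand times, the translator is called only once. The cost of translation is spread over every later execution, which is why the approach pays off best for loops and other frequently repeated code.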
Crusoe processors are versatile enough to power a broad spectrum of ultra-light mobile PCs
and Internet devices. Currently, designers of Mobile Internet Computers can choose from
three processor models: TM3200 (333-400MHz), TM5400 (500-700MHz) and TM5600 (500-700MHz).
TM3200: The TM3200 is the ideal engine
for a new class of mobile Internet devices weighing just a pound or two. With up to 400
MHz in performance, the TM3200 is designed to allow a full day of web browsing on a single
battery charge. The TM3200 delivers the full performance needed to run a wide range
of Internet applications - from web browsers and email applications to heavy-duty
streaming video clips. The TM3200 with the new Mobile Linux operating system implements
many of the same power management features found in today's laptop computers. These
include a deep-sleep idle mode that operates at levels as low as 20 mW. The TM3200 is
compatible with the complete range of x86-based operating systems, including those offered
by Microsoft and Linux suppliers.
TM5400/TM5600: The TM5400/TM5600 is the
first solution designed to solve the problems of poor battery life and sub-par performance
in the emerging class of ultra-light (weighing less than four pounds) mobile PCs. It
performs at speeds up to 700 MHz and provides deep-sleep power levels as low as 60 mW.
TM5400/5600-based laptops can last up to eight hours on battery running everyday office
applications and three to four hours running heavy-duty multimedia applications like DVD
movies. The TM5400/5600 offers application performance comparable to many desktop
processors but with significantly lower power consumption. The model TM5400/5600
offers LongRun technology, a new feature that allows the processor to adjust both its
frequency and voltage to exactly the levels required by an application. This approach
achieves unprecedented power savings. The TM5400/5600 typically operates at less than 1
watt while running ordinary office applications and as little as 60 mW when idle between
keystrokes. Heavy-duty applications such as DVD movies consume on average less than 2
watts. The TM5400/5600 is compatible with the complete range of x86-based operating
systems. This includes all versions of Linux, as well as Microsoft's popular Windows 98,
Windows NT, and Windows 2000 operating systems.
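The benefit of adjusting voltage as well as frequency can be estimated with the usual rule of thumb that dynamic CMOS power scales with frequency times voltage squared. The sketch below uses that general approximation only. The operating points are hypothetical ratios chosen for illustration, not Transmeta figures, and static (leakage) power is ignored.

```python
def relative_dynamic_power(freq_ratio, voltage_ratio):
    """Dynamic CMOS power scales roughly as frequency * voltage squared."""
    return freq_ratio * voltage_ratio ** 2


# Hypothetical operating points, expressed relative to full speed.
frequency_only = relative_dynamic_power(freq_ratio=0.5, voltage_ratio=1.0)
frequency_and_voltage = relative_dynamic_power(freq_ratio=0.5, voltage_ratio=0.75)
```

Halving the clock alone halves dynamic power in this model, while halving the clock and dropping the voltage to three quarters brings it down to roughly 28 percent, since the voltage term is squared.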
|
Conclusion:
There are many microprocessors that are used to power computing devices. Intel's
Pentium range is the most common in the 32-bit range. 64-bit chips like those
from Sun and MIPS are mainly used to power high powered servers. Low power chips
from ARM, MIPS and Hitachi power the handheld PCs. Intel is working on the new
64-bit Itanium (Merced or IA-64) to replace the existing 32-bit Pentium, along with HP,
which wants it to replace its PA-RISC. Even SGI has shelved further
development in the MIPS range in favor of the IA-64 architecture. Therefore IA-64 is
the upcoming chip to watch.
|
|