The Intel 8086 microprocessor was introduced 42 years ago this month,1 so I made some high-res die photos of the chip to celebrate. The 8086 is one of the most influential chips ever created; it started the x86 architecture that still dominates desktop and server computing today. By looking at the chip's silicon, we can see the internal features of this chip.
The photo below shows the die of the 8086. In this photo, the chip's metal layer is visible, mostly obscuring the silicon underneath. Around the edges of the die, thin bond wires provide connections between pads on the chip and the external pins. (The power and ground pads each have two bond wires to support the higher current.) The chip was complex for its time, containing 29,000 transistors.
To examine the die, I started with the 8086 integrated circuit below. Most integrated circuits are packaged in epoxy, so dangerous acids are necessary to dissolve the package. To avoid that, I obtained the 8086 in a ceramic package instead. Opening a ceramic package is a simple matter of tapping it along the seam with a chisel, popping the ceramic top off.
With the top removed, the silicon die is visible in the center. The die is connected to the chip's metal pins via tiny bond wires. This is a 40-pin DIP package, the standard packaging for microprocessors at the time. Note that the silicon die itself occupies a small fraction of the chip's size.
Using a metallurgical microscope, I took dozens of photos of the die and stitched them into a high-resolution image using a program called Hugin (details). The photo at the beginning of the blog post shows the metal layer of the chip, but this layer hid the silicon underneath.
For the die photo below, the metal and polysilicon layers were removed, showing the underlying silicon with its 29,000 transistors.2 The labels show the main functional blocks, based on my reverse engineering. The left side of the chip contains the 16-bit datapath: the chip's registers and arithmetic circuitry. The adder and upper registers form the Bus Interface Unit that communicates with external memory, while the lower registers and the ALU form the Execution Unit that processes data. The right side of the chip has control circuitry and instruction decoding, along with the microcode ROM that controls each instruction.
One feature of the 8086 was instruction prefetching, which improved performance by fetching instructions from memory before they were needed. This was implemented by the Bus Interface Unit in the upper left, which accessed external memory. The upper registers include the 8086's infamous segment registers, which provided access to a larger address space than the 64 kilobytes allowed by a 16-bit address. For each memory access, a segment register and a memory offset were added to form the final memory address. For performance, the 8086 had a separate adder for these memory address computations, rather than using the ALU. The upper registers also include six bytes of instruction prefetch buffer and the program counter.
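To make the address computation concrete, here's a minimal Python sketch. It is my illustration, not a model of the chip's adder circuitry. It assumes the standard 8086 rule that the segment value is shifted left four bits before the offset is added, giving a 20-bit address. The function name `physical_address` is hypothetical.

```python
def physical_address(segment: int, offset: int) -> int:
    """Combine a 16-bit segment and a 16-bit offset into a 20-bit address."""
    segment &= 0xFFFF
    offset &= 0xFFFF
    # Shift the segment left 4 bits (multiply by 16), then add the offset.
    # The result wraps at 20 bits, the width of the 8086 address bus.
    return ((segment << 4) + offset) & 0xFFFFF


print(hex(physical_address(0x1234, 0x0010)))  # prints 0x12350
```

Because segments overlap every 16 bytes, many different segment:offset pairs refer to the same physical address.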
The lower-left corner of the chip holds the Execution Unit, which performs data operations. The lower registers include the general-purpose registers and index registers such as the stack pointer. The 16-bit ALU performs arithmetic operations (addition and subtraction), Boolean logical operations, and shifts. The ALU does not implement multiplication or division; these operations are performed through a sequence of shifts and adds/subtracts, so they are relatively slow.
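The idea of multiplying with shifts and adds can be sketched in a few lines of Python. This is the textbook shift-and-add algorithm, shown only to illustrate why multiplication takes many steps. It is not a transcription of the 8086's actual routine, and the function name is made up for the example.

```python
def shift_add_multiply(a: int, b: int) -> int:
    """Multiply two unsigned 16-bit values using only shifts and adds."""
    a &= 0xFFFF
    b &= 0xFFFF
    product = 0
    for _ in range(16):      # one step per bit of the multiplier
        if b & 1:            # low bit set: add the shifted multiplicand
            product += a
        a <<= 1              # multiplicand moves one bit left
        b >>= 1              # examine the next multiplier bit
    return product           # the result can be up to 32 bits wide


print(shift_add_multiply(1234, 5678) == 1234 * 5678)  # prints True
```

The loop runs once per multiplier bit, so a 16-bit multiply needs 16 steps, compared with a single pass through the ALU for an addition.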
One of the hardest parts of computer design is creating the control logic that tells each part of the processor what to do to carry out each instruction. In 1951, Maurice Wilkes came up with the idea of microcode: instead of building the control logic from complex logic gate circuitry, the control logic could be replaced with special code called microcode. To execute an instruction, the computer internally executes several simpler micro-instructions, which are specified by the microcode. With microcode, building the processor's control logic becomes a programming task instead of a logic design task.
Microcode was common in mainframe computers of the 1960s, but early microprocessors such as the 6502 and Z-80 didn't use microcode because early chips didn't have room to store microcode. However, later chips such as the 8086 and 68000 used microcode, taking advantage of increasing chip densities. This allowed the 8086 to implement complex instructions (such as multiplication and string copying) without making the circuitry more complex. The downside was that the microcode took a large fraction of the 8086's die; the microcode is visible in the lower-right corner of the die photos.3
The photo above shows part of the microcode ROM. Under a microscope, the contents of the microcode ROM are visible, and the bits can be read out, based on the presence or absence of transistors in each position. The ROM consists of 512 micro-instructions, each 21 bits wide. Each micro-instruction specifies movement of data between a source and destination. It also specifies a micro-operation which can be a jump, ALU operation, memory operation, microcode subroutine call, or microcode bookkeeping. The microcode is fairly efficient; a simple instruction such as increment or decrement consists of two micro-instructions, while a more complex string copy instruction is implemented in eight micro-instructions.3
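Here's a toy Python model of the microcode idea, where each micro-instruction moves data from a source to a destination and may trigger a micro-operation. Everything in it is a simplification I made up for illustration: the register names, the tuple format, and the `execute` function are hypothetical, and it doesn't reflect the real 21-bit encoding. It does mirror two points from the text: increment takes two micro-instructions, and (per the footnote) increment and decrement share one routine, with the opcode selecting add or subtract.

```python
from typing import Dict, List, Tuple

# (source, destination, micro-operation): a toy format, not the real encoding.
MicroInstruction = Tuple[str, str, str]

# INC and DEC share one routine; the opcode decides what the ALU does.
INC_DEC_ROUTINE: List[MicroInstruction] = [
    ("operand", "tmp", "none"),        # move the operand into an ALU temporary
    ("alu_result", "operand", "alu"),  # run the ALU, then move the result back
]

MICROCODE_ROM: Dict[str, List[MicroInstruction]] = {
    "INC": INC_DEC_ROUTINE,
    "DEC": INC_DEC_ROUTINE,
}


def alu(opcode: str, value: int) -> int:
    """Stand-in for the control circuitry that picks add or subtract."""
    delta = 1 if opcode == "INC" else -1
    return (value + delta) & 0xFFFF


def execute(opcode: str, operand: int) -> int:
    """Run the micro-instructions for one instruction and return the result."""
    state = {"operand": operand & 0xFFFF, "tmp": 0, "alu_result": 0}
    for source, destination, micro_op in MICROCODE_ROM[opcode]:
        if micro_op == "alu":
            state["alu_result"] = alu(opcode, state["tmp"])
        state[destination] = state[source]
    return state["operand"]


print(execute("INC", 41))  # prints 42
```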
The path to the 8086 was not as direct and planned as you might expect. Its earliest ancestor was the Datapoint 2200, a desktop computer/terminal from 1970. The Datapoint 2200 predated the microprocessor, so it used an 8-bit processor built from a board full of individual TTL integrated circuits. Datapoint asked Intel and Texas Instruments if it would be possible to replace that board of chips with a single chip. Copying the Datapoint 2200's architecture, Texas Instruments created the TMX 1795 processor (1971) and Intel created the 8008 processor (1972). However, Datapoint rejected these processors, a fateful decision. Although Texas Instruments couldn't find a customer for the TMX 1795 processor and abandoned it, Intel decided to sell the 8008 as a product, creating the microprocessor market. Intel followed the 8008 with the improved 8080 (1974) and 8085 (1976) processors. (I've written more about early microprocessors here.)
In 1975, Intel's next big plan was the 8800 processor designed to be Intel's chief architecture for the 1980s. This processor was called a "micromainframe" because of its planned high performance. It had an entirely new instruction set designed for high-level languages such as Ada, and supported object-oriented programming and garbage collection at the hardware level. Unfortunately, this chip was too ambitious for the time and fell drastically behind schedule. It eventually launched in 1981 (as the iAPX 432) with disappointing performance, and was a commercial failure.
Because the iAPX 432 was behind schedule, Intel decided in 1976 that they needed a simple, stop-gap processor to sell until the iAPX 432 was ready. Intel rapidly designed the 8086 as a 16-bit processor somewhat compatible with the 8-bit 8080,4 released in 1978. The 8086 had its big break with the introduction of the IBM Personal Computer (PC) in 1981. By 1983, the IBM PC was the best-selling computer and became the standard for personal computers. The processor in the IBM PC was the 8088, a variant of the 8086 with an 8-bit bus. The success of the IBM PC made the 8086 architecture a standard that still persists, 42 years later.
Why did the IBM PC pick the Intel 8088 processor?7 According to Dr. David Bradley, one of the original IBM PC engineers, a key factor was the team's familiarity with Intel's development systems and processors. (They had used the Intel 8085 in the earlier IBM Datamaster desktop computer.) Another engineer, Lewis Eggebrecht, said the Motorola 68000 was a worthy competitor6 but its 16-bit data bus would significantly increase cost (as with the 8086). He also credited Intel's better support chips and development tools.5
In any case, the decision to use the 8088 processor cemented the success of the x86 family. The IBM PC AT (1984) upgraded to the compatible but more powerful 80286 processor. In 1985, the x86 line moved to 32 bits with the 80386, and then 64 bits in 2003 with AMD's Opteron architecture. The x86 architecture is still being extended with features such as AVX-512 vector operations (2016). But even through all these changes, the x86 architecture retains compatibility with the original 8086.
The 8086 chip was built with a type of transistor called NMOS. The transistor can be considered a switch, controlling the flow of current between two regions called the source and drain. These transistors are built by doping areas of the silicon substrate with impurities to create "diffusion" regions that have different electrical properties. The transistor is activated by the gate, made of a special type of silicon called polysilicon, layered above the substrate silicon. The transistors are wired together by a metal layer on top, building the complete integrated circuit. While modern processors may have over a dozen metal layers, the 8086 had a single metal layer.
The closeup photo of the silicon below shows some of the transistors from the arithmetic-logic unit (ALU). The doped, conductive silicon has a dark purple color. The white stripes are where a polysilicon wire crossed the silicon, forming the gate of a transistor. (I count 23 transistors forming 7 gates.) The transistors have complex shapes to make the layout as efficient as possible. In addition, the transistors have different sizes to provide higher power where needed. Note that neighboring transistors can share the source or drain, causing them to be connected together. The circles are connections (called vias) between the silicon layer and the metal wiring, while the small squares are connections between the silicon layer and the polysilicon.
The 8086 was intended as a temporary stop-gap processor until Intel released their flagship iAPX 432 chip, and was the descendant of a processor built from a board full of TTL chips. But from these humble beginnings, the 8086's architecture (x86) unexpectedly ended up dominating desktop and server computing until the present.
Although the 8086 is a complex chip, it can be examined under a microscope down to individual transistors. I plan to analyze the 8086 in more detail in future blog posts8, so follow me on Twitter at @kenshirriff for updates. I also have an RSS feed. Here's a bonus high-resolution photo of the 8086 with the metal and polysilicon removed; click for a large version.
The 8086 was released on June 8, 1978. ↩
To expose the chip's silicon, I used Armour Etch glass etching cream to remove the silicon dioxide layer. Then I dissolved the metal using hydrochloric acid (pool acid) from the hardware store. I repeated these steps until the bare silicon remained, revealing the transistors. ↩
The designers of the 8086 used several techniques to keep the size of the microcode manageable. For instance, instead of implementing separate microcode routines for byte operations and word operations, they re-used the microcode and implemented control circuitry (with logic gates) to handle the different sizes. Similarly, they used the same microcode for increment and decrement instructions, with circuitry to add or subtract based on the opcode. The microcode is discussed in detail in New options from big chips and patent 4449184. ↩↩
The 8086 was designed to provide an upgrade path from the 8080, but the architectures had significant differences, so they were not binary compatible or even compatible at the assembly code level. Assembly code for the 8080 could be converted to 8086 assembly via a program called CONV-86, which would usually require manual cleanup afterward. Many of the early programs for the 8086 were conversions of 8080 programs. ↩
Eggebrecht, one of the original engineers on the IBM PC, discusses the reasons for selecting the 8088 in Interfacing to the IBM Personal Computer (1990), summarized here. He discussed why other chips were rejected: IBM microprocessors lacked good development tools, and 8-bit processors such as the 6502 or Z-80 had limited performance and would make IBM a follower of the competition. I get the impression that he would have preferred the Motorola 68000. He concludes, "The 8088 was a comfortable solution for IBM. Was it the best processor architecture available at the time? Probably not, but history seems to have been kind to the decision." ↩
The Motorola 68000 processor was a 32-bit processor internally, with a 16-bit bus, and is generally considered a more advanced processor than the 8086/8088. It was used in systems such as Sun workstations (1982), Silicon Graphics IRIS (1984), the Amiga (1985), and many Apple systems. Apple used the 68000 in the original Apple Macintosh (1984), upgrading to the 68030 in the Macintosh IIx (1988), and the 68040 with the Macintosh Quadra (1991). However, in 1994, Apple switched to the RISC PowerPC chip, built by an alliance of Apple, IBM, and Motorola. In 2006, Apple moved to Intel x86 processors, almost 28 years after the introduction of the 8086. Now, Apple is rumored to be switching from Intel to its own ARM-based processors. ↩
The main reason I haven't done more analysis of the 8086 is that I etched the chip for too long while removing the metal and removed the polysilicon as well, so I couldn't photograph and study the polysilicon layer. Thus, I can't determine how the 8086 circuitry is wired together. I've ordered another 8086 chip to try again. ↩
The Nintendo Game Boy contains an audio amplifier chip for sound through a speaker or headphones. In this post, I reverse-engineer this chip and compare it with the later Game Boy Color chip (reverse-engineered earlier). Unexpectedly, the Game Boy Color uses an entirely different amplifier design from the original Game Boy, which may explain why the two systems sound different.
The diagram below shows the Game Boy amplifier's silicon die, with the main functional components labeled.1 The upper-left part of the chip has the two large driver transistors for the speaker output (one to pull the signal low and the other to pull the signal high). The headphone amplifier consists of two nearly-identical blocks: one for the left channel and one for the right. The circuitry for the current sources and current mirrors is shared by both headphone channels. The lower-left of the chip contains digital logic to enable either the speaker amp or the headphone amp, switching when the headphones are plugged in.
By examining the die closely, components such as transistors and resistors can be identified. From this, the complete circuit can be determined. In the photo above, the white lines are the chip's metal layer, connecting the components. The silicon itself appears greenish and is underneath the metal. The green squares around the outside are the pads where tiny bond wires connected the silicon die to the chip's 18 pins. Regions of the chip are treated (doped) to change the electrical properties of the silicon. The next sections explain how components are created from these different types of silicon.
The amplifier chip is built from transistors known as NPN and PNP bipolar transistors, different from the low-power MOS transistors used in processors. These transistors have three connections: the emitter, the base, and the collector. The magnified photo below shows an NPN transistor from above. The slightly different tints in the silicon indicate regions that have been doped to form N and P regions, with dark lines separating the regions. The bubbly silverish areas are the metal layer of the chip on top of the silicon—these form the wires connected to the emitter, base, and collector.
Underneath the photo is a vertical cross-section illustrating how the transistor is constructed. The emitter (E) wire is connected to N+ silicon. Below that is a P layer connected to the base contact (B). And below that is an N+ layer connected (indirectly) to the collector (C). If you look at the vertical cross-section below the 'E', you can find the N-P-N layers that form the transistor.
A different structure (below) is used for the high-current output transistors that drive the speaker. These transistors are much larger and have multiple interlocking "fingers" of the emitter and base, surrounded by the large collector. If you look back at the die photo, you can see two of these transistors filling the upper left part of the die.
The chip also uses PNP transistors, which have an entirely different construction, as shown in the diagram below. The most obvious difference is that the PNP transistors are round.2 A PNP transistor has a small circular emitter (P-silicon), surrounded by a ring-shaped base region (N-silicon), which in turn is surrounded by the collector (P-silicon). (The emitter metal covers both the emitter and the base, but is only connected to the emitter.) These regions form a P-N-P sandwich horizontally (laterally), unlike the vertical structure of the NPN transistors. Note that although the base region physically surrounds the emitter, the metal connection to the base is further away; the base signal passes through the N region underneath the collector to reach the base region.
Resistors are an important component of analog chips. The photo below shows some long, zig-zagging resistors, formed from strips of P silicon, which appears beige in the photo. Its resistance is proportional to the length of the resistor, so large-value resistors have a zig-zag shape to fit in the available space. Because resistors are relatively large and inaccurate, chip designs try to minimize the number of resistors required. Even so, an analog chip like this one requires numerous resistors.
The photo below shows seven small resistors, but only the two in the middle are connected (in parallel) to the circuit. These extra resistors allow the resistance to be modified by modifying the metal layer, which is much easier than changing the silicon. (These resistors bias the output transistor, and it appears this is a critical resistance that required adjustment.)
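The two resistor ideas above (resistance proportional to length, and spare resistors that can be wired in parallel through the metal layer) can be sketched numerically. This is a rough model using the usual sheet-resistance formula. The sheet resistance and dimensions are placeholder values I picked for illustration, not measurements from this die, and the function names are hypothetical.

```python
def strip_resistance(length_um: float, width_um: float,
                     sheet_ohms_per_square: float) -> float:
    """Resistance of a strip: sheet resistance times the number of squares."""
    return sheet_ohms_per_square * length_um / width_um


def parallel(*resistances: float) -> float:
    """Combined resistance of resistors wired in parallel."""
    return 1.0 / sum(1.0 / r for r in resistances)


SHEET = 200.0  # ohms per square: an assumed value, not from the chip

short_strip = strip_resistance(100.0, 10.0, SHEET)   # 10 squares
long_strip = strip_resistance(1000.0, 10.0, SHEET)   # 100 squares (zig-zag)

# Doubling up identical resistors in the metal layer halves the resistance.
two_in_parallel = parallel(short_strip, short_strip)
```

With identical strips, connecting one, two, or three of them in parallel gives the full, half, or a third of the single-strip resistance, which is why spare resistors make a convenient adjustment.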
This chip has three large capacitors, one for each amplifier. The photo below shows one of the capacitors. The capacitors are simply a large layer of metal over the underlying silicon, separated by a thin insulating oxide layer. At the top and right of the photo, you can see the connections between the metal wiring and the underlying silicon. In this chip, capacitors are used to ensure the stability of the amplifiers. Because they are large, the three capacitors are easy to spot in the chip die photo.
The Game Boy amplifier chip has a design very similar to the popular LM380 power audio amplifier chip (1972), so I'll start with an overview of how the LM380 works. (See the footnote5 for details.) The LM380 has positive and negative inputs and an output that amplifies the difference between the inputs by a fixed factor of 50. This may sound like an op-amp, but the LM380 is intended as an audio amplifier and is different from an op-amp in several ways: it has a small, fixed gain, it doesn't have a negative power supply, and its internal implementation is different.
The schematic below shows the main functional blocks of the LM380. The inputs go into a differential pair circuit (blue)3. The output from the differential pair (green) goes into a single-transistor amplification stage that provides more gain. The capacitor across the amplification stage stabilizes the amplifier to prevent oscillation. Finally, the output stage (purple) produces the high-current output: power transistor Q7 pulls the output high, while Q8 and Q94 pull the output low. The feedback network controls the gain of the LM380, fixing the gain at a factor of 50. Note that unlike an op-amp, the LM380's feedback network is connected to internal points of the amplifier, not the inputs.
The Game Boy amplifier chip contains three amplifiers: two identical amplifiers for the left and right headphone channels, and a more powerful mono amplifier for the speaker. The Game Boy headphone amplifiers and the speaker amplifier are somewhat different, but they are both similar to the LM380.
The schematic below shows the Game Boy headphone amplifier. Comparing it with the LM380 schematic above shows the similarities between the LM380 and the headphone amplifier, but also some differences. The input stage and feedback circuit of the LM380 are the most distinctive parts of that chip, and the headphone amplifier's circuit is essentially identical.6 The "Amplification" stage of the headphone amplifier has three transistors compared to one in the LM380, probably to produce more gain. The headphone amplifier's output stage is similar but simplified; the PNP/NPN pair that pulls the LM380 output low is replaced with a single PNP transistor. The biggest difference is the "Control" section of the headphone amplifier, which is not present in the LM380. This control circuitry powers down the headphone amplifier if headphones are not plugged in, conserving battery life.
The photo below shows the left headphone amplifier. The output pin (lower-right, next to the part number SBG14) is driven by seven PNP transistors in parallel (top-left) and seven smaller NPN transistors in parallel (lower center). The capacitor is in the upper left, near the center. Many resistors snake around the die.
The next schematic shows the Game Boy speaker amplifier. Unlike the two channels for headphone amplification, there is a single speaker amplifier, producing a mixture of the left and right channels. Again, the input stage and feedback are almost identical to the LM380. The output stage has only minor differences. However, the amplification stage for the speaker is completely different: it includes a four-transistor differential amplifier stage, which will provide much more amplification.7 Although this amplification stage looks very similar to the input stage at first glance, it is wired differently and uses NPN transistors.8
The chip provides pins for bypass capacitors to reduce the effect of power supply fluctuations.9 The headphone amplifiers have external bypass capacitors, but the speaker bypass capacitor is omitted for some reason (see the Game Boy schematic). Lack of this capacitor may contribute to the background hum that people hear in the Game Boy's sound.
I recently reverse-engineered the Game Boy Color's amplifier chip, so it's interesting to compare the two chips. The amplifier chips for the Game Boy and the Game Boy Color provide similar functions. Even at the die level (below), the two chips look similar. They both have power transistors in the upper-left for the speaker, control circuitry in the lower-left, and two headphone channels on the right.
Surprisingly, the implementations of the two chips are completely different. While the Game Boy uses LM380-style audio amplifiers, the Game Boy Color uses power op-amps with more complicated circuitry. The most important difference is that the Game Boy chip has internal feedback to control the gain, while the Game Boy Color also has an external feedback capacitor, which causes it to act as a high-pass filter. For more information, see my Game Boy Color amplifier article and schematic.
Collectors of Game Boy systems have noticed that the different versions have a very different sound (discussion). The original Game Boy has a "warm, bassy sound", while the Game Boy Color has a "thin sound" with background noise and hum. These aren't just subjective differences, but show up in the waveforms:
What's interesting is that we can explain much of the sound difference through the analysis of the amplifier chips. The Game Boy's output is close to a square wave, but the waveform drops somewhat due to the speaker's 100µF DC blocking capacitor (schematic). The amplifier in the Game Boy Color, on the other hand, is configured as a high-pass filter, so it outputs higher-frequency spikes, losing the bass sound.
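To illustrate how the same square wave can either droop slightly or collapse into spikes, here's a small Python simulation of an idealized first-order RC high-pass filter. It's a sketch under stated assumptions: the two time constants are arbitrary values chosen to show the contrast, not values derived from either Game Boy's components, and the function names are hypothetical.

```python
from typing import List


def square_wave(half_period_samples: int, cycles: int) -> List[float]:
    """Square wave alternating between +1 and -1."""
    total = 2 * half_period_samples * cycles
    return [1.0 if (i // half_period_samples) % 2 == 0 else -1.0
            for i in range(total)]


def high_pass(samples: List[float], dt: float, tau: float) -> List[float]:
    """Discrete-time approximation of a first-order RC high-pass filter."""
    alpha = tau / (tau + dt)
    out = []
    prev_in = samples[0]
    prev_out = 0.0
    for x in samples:
        prev_out = alpha * (prev_out + x - prev_in)
        prev_in = x
        out.append(prev_out)
    return out


def droop_ratio(tau_in_half_periods: float,
                half_period_samples: int = 1000, cycles: int = 4) -> float:
    """Fraction of the signal left at the end of a half cycle."""
    dt = 1.0 / half_period_samples       # time unit: one half period
    out = high_pass(square_wave(half_period_samples, cycles),
                    dt, tau_in_half_periods)
    start = out[-half_period_samples]    # just after the last transition
    end = out[-1]                        # just before the next one
    return end / start


slow = droop_ratio(10.0)   # long time constant: the flat top sags a little
fast = droop_ratio(0.1)    # short time constant: only a spike survives
```

With the long time constant, roughly 90% of the level remains at the end of each half cycle, so the waveform still looks like a square wave. With the short one, the output decays almost to zero, leaving spikes at each transition.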
The Game Boy (1989) and Game Boy Color (1998) use custom amplifier chips. By examining die photos, the circuitry can be reverse engineered. The chips are different from common amplifier chips in two main ways, which probably explains why custom chips were created. First, each chip has three amplifiers: two for headphone channels and one for the speaker. Second, to conserve power the chip has circuitry to power down the unused amplifiers, based on whether or not headphones are plugged in. Reverse-engineering the chips also explains much of the difference in sound between the Game Boy and the Game Boy Color. The Game Boy Color's chip implements a high-pass filter, so the sound is thin and lacks the bass of the Game Boy.
I announce my latest blog posts on Twitter, so follow me @kenshirriff for future articles. I also have an RSS feed. My KiCad files for the schematic are on Github. Thanks to John McMaster for providing the chip photos; his page is here. Thanks to Herbert Weixelbaum for the sound waveforms.
Internally, the chip is labeled SBG14.
Most of the PNP transistors on this chip are round. However, when multiple PNP transistors are combined, a square structure is used instead. The square PNP transistors are larger than the square NPN transistors. The chip also has some PNP transistors with multiple collectors. Other PNP transistors have no explicit collector connection but use the substrate (ground). ↩
The inputs to the LM380 (or Game Boy amplifier) go into a differential pair (Q3, Q4), but this differential pair is different from the standard one used in op-amps. In particular, the emitters receive different, varying currents, and this is where the feedback happens. ↩
The output stages of the LM380 and the Game Boy speaker amplifier use two transistors for pull-down configured as a Sziklai pair. The combined PNP and NPN transistors act as a higher-performance PNP transistor, somewhat like a Darlington pair. ↩
I'll explain the feedback network since the Game Boy chip operates the same way. The diagram below shows how the feedback network in the LM380 operates with no input. In the upper left, the supply voltage VS across R1 creates a current I. Transistors Q5 and Q6 form a current mirror: this forces the current through Q6 to match the current (I) through Q5. The current from Q4 to the rest of the chip must be approximately 0 (since it is strongly amplified by the rest of the chip). Putting this all together, the current through R2 (generated from the output voltage feedback) must also be I. Since R2 is half the resistance of R1, the output voltage must be half of the supply voltage. The conclusion is that the output voltage at idle will be half of the supply voltage, as desired.
When inputs are applied, the feedback network acts as seen below. Suppose a voltage ΔV is applied to the positive input. Emitter-follower transistors Q3 and Q4 buffer and raise the inputs, so the same ΔV appears across resistor R3. This generates a current ΔI through the resistor. This increases the current through Q5 to I+ΔI, and because of the current mirror, the same current will flow through Q6. Adding up the various currents, the current through R2 must be I+2ΔI. Since R2 has 25 times the resistance of R3, 2ΔI corresponds to an increase in the output voltage of 50ΔV. Therefore, the input voltage is multiplied by a factor of 50. The point of this is that the feedback network fixes the gain at 50.
It seems to me that the best way to understand the LM380 is to consider it as constructed from an operational transresistance amplifier (OTRA), an obscure relative of the op-amp. An OTRA acts like an op-amp, except the two inputs are currents instead of voltages, and the difference between the currents is amplified to produce the output voltage. The two currents (I) into the OTRA must be approximately equal, but the input voltages can diverge (unlike an op-amp).
The schematic above shows the LM380's circuitry represented as an operational transresistance amplifier and feedback network. Equating the two currents yields Vout = Vs/2 + 51V+ - 50.5V- or approximately Vout = Vs/2 + 50*(V+-V-). In other words, the output is centered at half the supply voltage, and the difference in input voltages is amplified by a factor of 50. (Nobody else describes the LM380 in this way, so it's quite possible that I am looking at it wrong, but this analysis makes sense to me.) ↩
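The current-balance argument in this footnote can be checked with a few lines of Python. This is an idealized sketch: it ignores transistor voltage drops and assumes a perfect current mirror. The resistor values are placeholders chosen only to match the ratios in the text (R2 is half of R1 and 25 times R3), and the function name is hypothetical.

```python
def lm380_output(v_supply: float, delta_v: float,
                 r1: float = 50_000.0, r2: float = 25_000.0,
                 r3: float = 1_000.0) -> float:
    """Output voltage from the idealized current-balance argument."""
    i_idle = v_supply / r1        # current I set by the supply across R1
    delta_i = delta_v / r3        # input voltage across R3 produces delta-I
    i_r2 = i_idle + 2 * delta_i   # current that must flow through R2
    return i_r2 * r2              # output voltage that drives this current


idle = lm380_output(9.0, 0.0)            # no input: half the supply
driven = lm380_output(9.0, 0.01)         # 10 mV on the positive input
gain = (driven - idle) / 0.01            # approximately 50
```

With a 9 V supply the idle output comes out at 4.5 V, and a 10 mV input moves it by 0.5 V, matching the gain of 50 set by the resistor ratio.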
I don't know the exact values of the resistances on the die, but by comparing lengths on the die I can determine ratios of resistances. Looking at resistors R48, R49, R50, and R51, I calculate that the headphone amplifier has a gain factor of 22. From resistors R2, R3, R4, and R7, I calculate that the speaker amplifier has a gain factor of 30, significantly more than the headphone amplifiers. ↩
Note that the overall amplification of the chip is limited by the feedback network. The idea of an op-amp is the raw gain will be something like 100,000, but the feedback reduces the gain to something reasonable like a factor of 50. The "extra" gain improves performance and reduces distortion. In other words, the additional amplification stage in the Game Boy compared to the LM380 isn't going to make it 100 times louder. ↩
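The effect of feedback on gain can be illustrated with the standard negative-feedback formula, gain = A / (1 + A·β). This is a generic idealized model rather than an analysis of this particular chip. The raw gain of 100,000 and the target of 50 come from the footnote above, and the feedback fraction of 1/50 is my assumption for the example.

```python
def closed_loop_gain(open_loop_gain: float, feedback_fraction: float) -> float:
    """Gain of an idealized amplifier with negative feedback."""
    return open_loop_gain / (1.0 + open_loop_gain * feedback_fraction)


BETA = 1.0 / 50.0  # feedback fraction for a nominal gain of 50 (assumed)

modest = closed_loop_gain(1_000.0, BETA)        # about 47.6
typical = closed_loop_gain(100_000.0, BETA)     # about 49.975
extra = closed_loop_gain(10_000_000.0, BETA)    # about 49.9998
```

Raising the raw gain by a factor of 100 only nudges the overall gain closer to 50; it doesn't make the output 100 times larger.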
I'm a bit puzzled by the second amplification stage for the speaker amplifier. It looks like a differential amplifier, except a differential amplifier normally has the emitters connected and this circuit has the collectors connected. ↩
The bypass capacitors used by the Game Boy chip (and the LM380) help reduce the impact of power supply fluctuations. It's common for chips to have a bypass capacitor between power and ground, but this bypass capacitor is a bit different. It is connected to a point in the feedback network where it is more effective than a regular bypass capacitor. ↩
The revolutionary Intel 8086 microprocessor was introduced 42 years ago this month so I've been studying its die.1 I came across two 8086 dies with different sizes, which reveal details of how a die shrink works. The concept of a die shrink is that as technology improved, a manufacturer could shrink the silicon die, reducing costs and improving performance. But there's more to it than simply scaling down the whole die. Although the internal circuitry can be directly scaled down,2 external-facing features can't shrink as easily. For instance, the bonding pads need a minimum size so wires can be attached, and the power-distribution traces must be large enough for the current. The result is that Intel scaled the interior of the 8086 without change, but the circuitry and pads around the edge of the chip were redesigned.
The photo below shows an 8086 chip from 1979, and a version with a visibly smaller die from 1986.3 (The ceramic lids have been removed to show the silicon dies inside.) In the updated 8086, the internal circuitry was scaled to about 64% of the original size by length, so it took 40% of the original area. The die as a whole wasn't reduced as much; it was about 54% of the original area. (The chip's package was unchanged, the 40-pin DIP package commonly used for microprocessors of that era.)
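The scaling arithmetic is easy to check: area goes as the square of the linear scale. The short Python sketch below uses the 64% and 54% figures from the paragraph above. The function names are just for illustration.

```python
import math


def area_ratio(linear_scale: float) -> float:
    """Area ratio produced by scaling both dimensions by linear_scale."""
    return linear_scale ** 2


def equivalent_linear_scale(area_fraction: float) -> float:
    """Linear scale that would give the stated area ratio."""
    return math.sqrt(area_fraction)


interior_area = area_ratio(0.64)                  # about 0.41, i.e. roughly 40%
whole_die_scale = equivalent_linear_scale(0.54)   # about 0.73
```

So the die as a whole shrank as if it had been scaled to about 73% by length, less aggressively than the 64% applied to the interior circuitry.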
The 8086 is one of the most influential chips ever created; it started the x86 architecture that still dominates desktop and server computing today. Unlike modern CMOS processors, the 8086 was built from NMOS transistors, as were the 6502, Z-80, and other early processors.4 The first chip was built with HMOS,5 Intel's name for this process. Intel introduced improved HMOS-II in 1979 and in 1982, Intel moved to HMOS-III, the process used for the newer 8086 chip.6 Each newer HMOS version shrank the size of features on the chip and improved performance.
The photo above shows the two 8086 dies at the same scale. The two chips have identical layout in the interior,7 although they may look different at first. The chip on the right has many dark lines in the middle that don't appear on the left, but this is an artifact. These lines are the polysilicon layer, underneath the metal; the die on the left has the same wiring, but it is very faint. I think the newer chip has a thinner metal layer, making the polysilicon more visible.
The magnified photo below shows the same circuitry on the two dies. There is an exact correspondence between components in the two images, showing the circuitry was reduced in size, not redesigned. (These photos show the metal layer on top of the chip; some polysilicon is visible in the right photo.)
However, there are significant differences around the edges of the dies. The bond pads around the outside are closer together, especially in the bottom right. There are two reasons for this. First, the bond pads can't shrink very much, since they need to be large enough to attach bond wires. Second, the power distribution traces around the edges are wider in order to support the necessary current. (Look to the right of the microcode ROM in the lower right, for instance.) Part of this is because the power traces in the middle of the circuitry were scaled down with the rest of the circuitry, so they are smaller; the outside traces need to pick up the slack. In addition, the thinner metal layer in the newer chip can't support as much current without being widened.
The photo above shows a bonding pad with an attached bond wire. The drive transistors are above the pad. The newer chip has almost the same size pad, but the power drive transistors have both shrunk and been redesigned. Note the much thicker metal power wiring on the newer chip. The Intel logo was moved from the bottom right to the bottom left, probably because that's where there was room.
First, a bit of background on the NMOS construction used in the 8086 and other chips of that era. These chips consist of a silicon substrate, which is doped (diffusion) with arsenic or boron to form transistors. On top, a layer of polysilicon creates the gates of the transistors as well as providing wiring between components. Finally, a single metal layer on top wires up the components.
A semiconductor process (such as HMOS-III) has specific rules on the minimum size and spacing for features on the silicon, polysilicon, and metal layers. By looking closely at the chips, we can see how the features correspond to the design rules for HMOS I and HMOS III. The table below (from HMOS III Technology) summarizes the characteristics of the different HMOS processes. The features get smaller and the performance gets better with each version. (Intel got a 40% overall performance improvement going from HMOS-II to HMOS-III.)
| Parameter | HMOS I | HMOS II | HMOS III |
|---|---|---|---|
| Diffusion Pitch (µm) | 8.0 | 6.4 | 5.0 |
| Poly Pitch (µm) | 7.0 | 5.6 | 4.0 |
| Metal Pitch (µm) | 11.0 | 8.0 | 6.4 |
| Gate Oxide Thickness (Å) | 700 | 400 | 250 |
| Channel Length (µm) | 3.0 | 2.0 | 1.5 |
| Minimum Gate-Delay (ps) | 1000 | 400 | 200 |
| Speed-Power Product (pJ) | 1.0 | 0.5 | 0.25 |
| Linear Shrink Factor | 1.0 | 0.8 | 0.64 |
The microscope photo below shows a complex arrangement of transistors in the older 8086 chip. The dark regions are doped silicon, while the white rectangles are the transistor gates. (There are about 21 transistors in this photo.) A key measurement is the channel length, the length of the gate between the source and drain. (This is the narrower dimension of the white rectangles.) I measured 3 μm for these transistors, which nicely matches the published value for HMOS I.8 This indicates the chip was manufactured with a 3 μm process; in comparison, processors are now moving to a 5 nm process, 600 times smaller.
The photo below shows transistors in the newer 8086 at the same scale; the transistors are much smaller. The linear dimensions are scaled by 64%, so the transistors have 40% of their original area. Because I processed this die differently, the polysilicon remained on the die, the yellowish lines. The doped silicon appears pinkish, much less visible than before. I measured the gate length as 1.9 μm, which is 64% of the previous 3 μm. Note that HMOS-III supports a considerably smaller 1.5 μm channel length, but since everything shrinks by the same 64% factor, the channel length is larger than necessary. This illustrates that uniformly shrinking the die wastes some of the potential gain from the new process, but it is much easier than completely redesigning the chip.
I also looked at the spacing (pitch) of lines in the metal layer. The photo below shows some horizontal and vertical metal wiring in the older chip. I measured 11 μm pitch for the metal lines, which matches the published HMOS I figure. The shrink to 64% yields 7 μm pitch on the new chip, even though HMOS III supported 6.4 μm. As before, the constant shrink factor doesn't take full advantage of the new process.
Finally, I looked at the pitch of the polysilicon wiring. The photo below shows the older 8086; the polysilicon has been removed, leaving faint white traces. These parallel polysilicon lines probably formed a bus, routing signals from one part of the chip to another. I measured 7 μm pitch for the polysilicon lines, matching the published HMOS I figure. (Interestingly, polysilicon wiring can be denser than metal wiring under HMOS rules.) The newer chip has 4.5 μm polysilicon pitch, compared to a possible 4.0 μm.
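To put the three measurements side by side, the sketch below applies the uniform 0.64 shrink to the published HMOS I dimensions and compares the result with the HMOS III minimums from the table. The dictionary and function names are my own, and the inputs are the table values, not new measurements.

```python
SHRINK = 0.64  # linear shrink factor for HMOS III, from the table

# (HMOS I value, HMOS III minimum), both in micrometers, from the table.
RULES = {
    "channel length": (3.0, 1.5),
    "metal pitch": (11.0, 6.4),
    "poly pitch": (7.0, 4.0),
}


def compare(rules, shrink):
    """Return {feature: (scaled size, ratio of scaled size to HMOS III minimum)}."""
    result = {}
    for name, (hmos1, hmos3_min) in rules.items():
        scaled = hmos1 * shrink
        result[name] = (scaled, scaled / hmos3_min)
    return result


for name, (scaled, ratio) in compare(RULES, SHRINK).items():
    print(f"{name}: {scaled:.2f} um after shrink, "
          f"{ratio:.2f}x the HMOS III minimum")
```

Every ratio comes out above 1, with the channel length furthest from the limit. That is the headroom a full redesign could have used but a uniform shrink leaves on the table.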
A die shrink provides a way to improve the performance of a processor and reduce its cost without the effort of a complete redesign. Comparing the two chips, however, shows that a die shrink is more complex than uniformly shrinking the whole die. While most of the circuitry is a straightforward shrink, the bond pads didn't shrink to the same degree, so they needed to be moved around. The power distribution was also modified, adding more power wiring around the outer part of the chip.
Modern microprocessors still use die shrinks. In 2007, Intel moved to a tick-tock model, where they would alternate shrinks of an existing chip (the "tick") with the production of a new microarchitecture (the "tock").
The 8086 was released on June 8, 1978. ↩
It's actually quite remarkable that MOSFET circuits still work after being scaled down over a large range, since most things don't scale as easily. For instance, you can't scale down an engine by a factor of 10 and expect it to work. Most physical things suffer from the square-cube law: the area scales with the square of the ratio, while the volume scales with the cube of the ratio. For MOS circuits, however, most things either stay the same with scaling, or get better (such as frequency and power consumption). For more details on scaling, see Mead and Conway's Introduction to VLSI Systems Ch 1 sect 2. Interestingly, that 1978 book says that scaling had a fundamental limit of 1/4 micron (250 nm) channel length due to physical effects. That limit was wildly wrong; transistors are now moving to 5 nm, through technologies such as FinFETs. ↩
The older chip says ©'78, ©'79 on the package and ©1979 on the die and has a 7947 (47th week of 1979) date code on the underside. The newer chip says ©1978 on the package but ©1986 on the die and has no identifiable date code, so I figure it is from 1986 or slightly later. It's unclear why the newer chip has an older copyright date on the external package. ↩
A brief description of the technologies in early processors. N-channel MOSFETs are a particular type of MOSFET transistor. They have considerably better performance than the P-channel MOSFETs used in the earliest microprocessors, such as the Intel 4004. (Modern processors use N-channel and P-channel transistors together for lower power consumption; this is CMOS.) Gates built from N-channel MOSFETs require a pull-up resistor, which is implemented by a transistor. Depletion load transistors are a type of transistor introduced in the mid-1970s that perform better as pull-up resistors and don't require an extra power supply voltage. Finally, MOS transistors originally used metal for the gate (the M in MOS). But in the late 1960s, Fairchild developed the use of polysilicon for the gate instead of metal. This provided much better performance and was easier to manufacture. The point of all this is that between the late 1960s and mid-1970s, several radical changes were introduced in MOS integrated circuit production, and these led to the success of the 6502, Z-80, 8085, 8086, and other early processors. In the 1980s, CMOS processors took over due to their lower power consumption and better performance. ↩
Strangely, it's unclear what the "H" stands for in HMOS. I couldn't find anywhere that Intel expands the acronym; databooks refer to "Intel's advanced N-channel silicon gate HMOS process" or say "HMOS is a high-performance n-channel MOS process". Intel later defined CHMOS as Complementary High Speed Metal Oxide Semiconductor (example). Motorola defined HMOS as High-density MOS (example) while other sources defined it as High-speed MOS or High-density, short channel MOS. Intel has a patent on "High density/high speed MOS process and device", so perhaps the "H" stands for both "high density" and "high speed". ↩
Interestingly, Intel used a 4K static RAM chip to develop each of their HMOS processes, before using the process for their microprocessors and other chips. They probably developed with the RAM chip because it has dense circuitry, but is relatively easy to design because it repeats the same memory cell over and over. Once they had all the design rules figured out, then they could create the much more complex processor. ↩
I scaled complete, high-resolution images of the two chips to compare and the main part of the chips is an exact match except for some trivial changes. I found a couple of places where a via was slightly moved, which is puzzling because I see no logical reason for that. The circuit was unchanged, so it's not a bug fix. One question is if there were any microcode changes. The microcode looks identical, but I didn't do a bit-by-bit comparison. ↩
You may have noticed that three transistors in the photo have much larger gates. These are transistors that are acting as pull-up resistors, as is typical for NMOS circuits. The larger size makes the transistors weaker, so they provide a weak pull-up current. ↩
The Intel 8086 microprocessor is one of the most influential chips ever created; it led to the x86 architecture that dominates desktop and server computing today. I've been reverse-engineering the 8086 from die photos, and in this post I discuss how its register file is implemented.
The photo above shows the silicon die of the 8086 processor under a microscope. The metal layer on top of the chip is visible, with the silicon hidden underneath. Around the outside edge, bond wires connect pads on the die to the chip's 40 external pins.
The highlighted region indicates the 8086's fifteen 16-bit registers and six bytes of instruction prefetch queue.1 Registers take up a significant portion of the die, even though they are just 36 bytes in total. Due to space limitations, early microprocessors had a relatively small number of registers; in comparison, a modern processor chip has kilobytes of registers and megabytes of cache storage.2
I'll start by explaining how the 8086 is built from NMOS transistors. Then I'll explain how an inverter is constructed, how a single bit is stored using inverters, and how a register is constructed.
The 8086 and other chips of that era were built from a type of transistor called NMOS. These chips consisted of a silicon substrate, which was "doped" by diffusion of arsenic or boron to form transistors. Above the silicon, polysilicon wiring created the gates of the transistors and wired components together. Finally, a metal layer on top provided more wiring. (Modern processors, in comparison, use CMOS technology, which combines NMOS and PMOS transistors, and they have many metal layers.)
The schematic below shows an inverter built from an NMOS transistor and a resistor.3 With a low input, the transistor is off, so the pull-up resistor pulls the output high. With a high input, the transistor turns on, connecting ground and the output, pulling the output low. Thus, the input signal is inverted.
The photo above shows how an inverter is physically constructed in the 8086. The pinkish regions are conductive doped silicon and the sparkly copper-colored lines are polysilicon on top. A transistor is created where polysilicon crosses silicon: the polysilicon forms the transistor's gate, while the silicon regions on either side are the transistor's source and drain. The large polysilicon rectangle forms the pull-up resistor between +5 volts and the output. Thus, the chip's circuitry matches the inverter schematic. Under a microscope, circuits such as this inverter are visible and can be reverse-engineered.
The building block for the registers is two inverters in a feedback loop, storing a single bit, as shown below. If the top wire has a 0, the right inverter will output a 1 to the bottom wire. The left inverter will then output a 0 to the top wire, completing the cycle. Thus, the circuit is stable and will "remember" the 0. Likewise, if the top wire is a 1, this will get inverted to a 0 at the bottom wire, and back to a 1 at the top. Thus, this circuit can store either a 0 or a 1, forming a 1-bit memory.
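The feedback argument can be traced in a few lines of code. This is a purely logical sketch with idealized inverters and no timing or voltages; the function names `inverter` and `settle` are my own.

```python
def inverter(x: int) -> int:
    """Idealized inverter: 0 -> 1, 1 -> 0."""
    return 1 - x


def settle(top: int, trips: int = 4):
    """Follow a value around the loop several times.

    The right inverter drives the bottom wire from the top wire;
    the left inverter drives the top wire from the bottom wire.
    Returns the final (top, bottom) values.
    """
    bottom = inverter(top)
    for _ in range(trips):
        bottom = inverter(top)   # right inverter
        top = inverter(bottom)   # left inverter
    return top, bottom


print(settle(0))
print(settle(1))
```

Whichever value the top wire starts with, it is still there after any number of trips around the loop, which is what makes the pair usable as a 1-bit memory.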
Three transistors are added to make a usable register cell from the inverter pair.4 One transistor selects the cell for reading, another transistor selects the cell for writing, and the third amplifies the signal when reading. In the center of the schematic below, two inverters store the bit. To read the bit, the read line is energized. This connects the inverter output to the bit line through the amplifying transistor. To write the bit, the write line is energized, connecting the bit line to the inverters. By putting a high-current 0 or 1 signal on the bit line, the inverters (and thus the stored bit) are forced to the desired value. Note that the bit line is used for both reading and writing.
The register file consists of a matrix of register cells like the one above. The matrix is 16 cells wide since registers are 16 bits wide. Each register is arranged horizontally, so a read line or write line selects all the cells for a particular register. The 16 vertical bit lines form a bus, so all 16 bits in the selected register are read or written in parallel.
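The row-and-column organization can be modeled behaviorally as follows. This is a minimal sketch, not a circuit simulation: it ignores signal polarity, the amplifying transistor, and timing, and the class and helper names (`RegisterFile`, `to_bits`, `from_bits`) are hypothetical.

```python
class RegisterFile:
    """Behavioral model: each row is a register, each column a bit line."""

    def __init__(self, num_registers: int = 8, width: int = 16):
        self.width = width
        self.cells = [[0] * width for _ in range(num_registers)]

    def write(self, row: int, bit_lines: list) -> None:
        """Energize one write line: each cell in the row takes its bit line's value."""
        if len(bit_lines) != self.width:
            raise ValueError("need one value per bit line")
        self.cells[row] = [b & 1 for b in bit_lines]

    def read(self, row: int) -> list:
        """Energize one read line: each cell in the row drives its bit line."""
        return list(self.cells[row])


def to_bits(value: int, width: int = 16) -> list:
    """Split an integer into bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits: list) -> int:
    """Reassemble an integer from bits, least significant bit first."""
    return sum(b << i for i, b in enumerate(bits))


rf = RegisterFile()
rf.write(3, to_bits(0xBEEF))
print(hex(from_bits(rf.read(3))))
```

Because a single control line selects a whole row, all 16 bits move over the bit lines in one step; selecting a different row reuses the same 16 bit lines.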
The photo below zooms in on the 8086's general-purpose register file, showing the matrix of register cells: 16 columns and 8 rows for eight 16-bit registers. It then zooms in on a single register cell in the register file. I'll now explain how this cell is implemented.
The 8086 is constructed from doped silicon and polysilicon wiring with metal wiring on top. The left photo below shows the vertical metal wiring of a register cell. The ground, power, and bit line wires are indicated. (The remaining wire crosses the register file but isn't connected to it.) In the right photo, the metal layer has been dissolved to show the polysilicon and silicon underneath. The read and write lines are horizontal polysilicon wires. (Because the chip has only one layer of metal, the register uses metal for the vertical lines and polysilicon for the horizontal lines so they don't run into each other.) The connections (called vias) between the metal and the silicon are visible as brighter circles in the metal photo and as circular spots in the silicon photo.
The diagram below shows how the physical layout of the register cell matches up with the schematic. The inverters are formed from transistors A and B, along with the resistors. Transistors C, D, and E are formed by the labeled strips of polysilicon. The bit line is not visible below, since it is in the metal layer. Note that the layout of the memory cell is highly optimized to minimize its size. Also note that transistor A is much smaller than the other transistors; inverter A has a weak output so it can be overpowered by the bit line when a value is written.
Careful examination of the die shows that some of the register cells have a slightly different structure. On the left is a pair of the register cells discussed above,5 while the right photo shows a pair of register cells with two write control lines instead of one. In the left photo, the write line crosses the silicon in both register cells. However, in the right photo, the "write right" line crosses the silicon on the right side but goes between the silicon regions on the left. Conversely, the "write left" line crosses the silicon on the left side and goes between the silicon on the right. Thus, one write line controls writes to the right-hand bit, while the other controls writes to the left-hand bit. In the full 16-bit register, this allows alternating 8-bit parts to be written separately.6
Why do some registers have two write lines while others have one? The reason is that the 8086 has 16-bit registers, but four of them can also be accessed as 8-bit registers, as shown below. For example, the 16-bit accumulator A can be accessed as an 8-bit AH (accumulator high) register and an 8-bit AL (accumulator low) register. By implementing the registers with two write control lines, either half of the register can be written separately.7
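The effect of the two write lines can be sketched in code. In this toy model, the 16 physical columns alternate between the two 8-bit halves, and each write line only affects its own columns. Which half sits in the even columns is an assumption I made for illustration, and the names `interleave` and `SplitRegister` are hypothetical.

```python
def interleave(high: int, low: int) -> list:
    """Lay out two bytes across 16 columns, alternating bit by bit.

    Assumption for illustration: low-byte bits sit in even columns,
    high-byte bits in odd columns.
    """
    cols = []
    for i in range(8):
        cols.append((low >> i) & 1)
        cols.append((high >> i) & 1)
    return cols


class SplitRegister:
    """16-bit register whose halves have separate write lines."""

    def __init__(self):
        self.cols = [0] * 16

    def write(self, bit_lines: list, write_low: bool, write_high: bool) -> None:
        """Only columns whose write line is energized take the bit-line value."""
        for i, bit in enumerate(bit_lines):
            is_low_column = (i % 2 == 0)
            if (is_low_column and write_low) or (not is_low_column and write_high):
                self.cols[i] = bit & 1

    def value(self) -> int:
        low = sum(self.cols[2 * i] << i for i in range(8))
        high = sum(self.cols[2 * i + 1] << i for i in range(8))
        return (high << 8) | low


reg = SplitRegister()
reg.write(interleave(0x12, 0x34), write_low=True, write_high=True)   # full 16-bit write
reg.write(interleave(0xFF, 0xAB), write_low=True, write_high=False)  # low half only
print(hex(reg.value()))
```

In the second write, the bit lines carry values for all 16 columns, but only the low-half columns change because only their write line is energized.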
So far, I've discussed the eight general-purpose "lower registers". The 8086 also has seven "upper registers" used for memory accesses, including the infamous segment registers.8 These registers have a more complex "multi-port" design, allowing multiple reads and writes to take place simultaneously.9 For instance, the multi-ported register file would allow the program counter to be read, a segment register to be read, and a different segment register to be written, all at the same time.
The multi-ported register cell below is built around the same two-inverter circuit as before but it has three bit lines (compared to one earlier) and five control lines (compared to two). The three read control lines allow the register cell contents to be read to any of the three bit lines, while the two write control lines allow bit line A or bit line C to be written to the register cell.
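A behavioral sketch of the multi-port idea follows: three shared buses (standing in for bit lines A, B, and C), any of which can be read onto, with writes taken from A or C. The class name, the `cycle` interface, and the register names and values in the usage lines are all hypothetical; in the real chip not every upper register has every port.

```python
class MultiPortFile:
    """Toy model of registers sharing three buses (A, B, C)."""

    READ_BUSES = ("A", "B", "C")
    WRITE_BUSES = ("A", "C")

    def __init__(self, names):
        self.regs = {name: 0 for name in names}

    def cycle(self, reads=None, writes=None, driven=None):
        """Perform simultaneous accesses in one step.

        reads:  {bus: register} - register contents placed on that bus
        writes: {bus: register} - register loaded from that bus
        driven: {bus: value}    - buses driven from outside the register file
        Returns the value on each active bus.
        """
        buses = dict(driven or {})
        for bus, reg in (reads or {}).items():
            if bus not in self.READ_BUSES:
                raise ValueError(f"no read port on bus {bus}")
            if bus in buses:
                raise ValueError(f"bus {bus} driven twice")
            buses[bus] = self.regs[reg]
        for bus, reg in (writes or {}).items():
            if bus not in self.WRITE_BUSES:
                raise ValueError(f"no write port on bus {bus}")
            if bus not in buses:
                raise ValueError(f"nothing is driving bus {bus}")
            self.regs[reg] = buses[bus] & 0xFFFF
        return buses


f = MultiPortFile(["PC", "CS", "ES"])
f.regs["PC"] = 0x0100
f.regs["CS"] = 0x2000
buses = f.cycle(reads={"A": "PC", "B": "CS"},
                writes={"C": "ES"},
                driven={"C": 0x3000})
print(buses, hex(f.regs["ES"]))
```

The single `cycle` call mirrors the example in the text: two registers are read and a third is written in the same step, something the single-bit-line cell cannot do.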
At first glance, the 8086's register file looked like a uniform set of registers, but close examination reveals that each register has been optimized based on its function.10 Some registers are simple 16-bit registers, which have the most compact layout. Other 16-bit registers can also be accessed as two 8-bit registers, requiring another control line. The most complex registers have two or three read ports and one or two write ports. In each case, the physical layout of the register cell has been carefully designed to be as compact as possible, with elaborate transistor shapes, as seen below. Intel's engineers shrunk the register layout as much as possible to fit all the registers in the available space.
Although the 8086 processor is 42 years old, it still heavily influences modern computing through the x86 architecture in heavy use today. The registers of the 8086 still exist in modern x86 computers, although the registers are now 64 bits long and have been joined by many new registers.
The 8086 is an interesting subject for die analysis since its transistors are large enough to be visible under a microscope. It was a complex processor at the time, with 29,000 transistors, but it is still simple enough that the circuitry can be traced out and understood. I plan to analyze the 8086 in more detail in future blog posts so follow me on Twitter @kenshirriff or on RSS for updates.
The 8086 was apparently the first microprocessor to implement instruction prefetching. The Motorola 68000 (1979) had a 4-byte instruction prefetch buffer. Prefetching in mainframes goes back to the IBM Stretch (1961), CDC 6600 (1964), and IBM System/360 Model 91 (1966). ↩
It's difficult to determine how many registers are in a modern processor; the only accurate description I could find was in The Anatomy of a High-Performance Microprocessor, which describes the AMD K6 processor (1997) in detail. Due to register renaming modern processors have many more physical registers than architectural registers (the registers visible to a programmer), and the number of physical registers is not documented. (In addition to the eight general-purpose x86 registers, the K6 had 16 microarchitecture scratch registers for renaming.)
Processors supporting AVX-512 include 32 512-bit registers, so that's 2 kilobytes of registers for that feature alone. This makes it even harder to determine the register size. As for cache size, high-end processors have up to 77 MB of cache storage. ↩
Other processors use slightly different register storage cells. The 6502 uses an additional transistor in the inverter feedback loop to break the feedback loop when writing a new value. The Z-80 writes to both inverters at the same time, making the transition "easier" but requiring two write wires. While the 8086 has an amplification transistor in each register cell for reads, other processors read the outputs from both inverters and use an external differential amplifier to strengthen the signal. The 8086's basic register cell uses 7 transistors (7T), more than a typical 6-transistor (6T) or 4-transistor (4T) static RAM cell, but it only uses one bit line rather than two differential bit lines. Dynamic memory (DRAM) is much more efficient, using one transistor and a capacitor, but data will be lost without refresh. ↩
On the die, register cells are not repeated uniformly, but instead alternating cells are mirror images. This improves the density of the register cells because a power line running between two mirror-image cells can feed both of them (and the same with ground). Thus, the mirror-image layout reduces the number of power and ground lines by half. ↩
Although block diagrams always show the 16-bit registers split into a left half and a right half, the actual implementation alternates the bits from each half instead of storing one 8-bit part on the left and the other on the right. This implementation makes it easier to swap the two halves of a 16-bit word, which is required in several cases. (One is an unaligned memory read or write. Another is an ALU operation using the top half of a register, such as AH.) Swapping bits between the left half and the right half would require running long wires between the halves for each bit. But with the interleaved implementation, swapping the two halves is a matter of swapping each pair of neighboring bits, which doesn't need long wires. In other words, the interleaved layout in the 8086's registers simplifies the wiring for swapping the two halves of a word. ↩
If the register file only supported 16-bit registers instead of 8-bit half-registers, the processor could still work but would be less efficient. Writes to an 8-bit half could be done by reading the full 16 bits, modifying the 8-bit half, and then writing back the full 16 bits. This would take three register accesses instead of one. Note that the register file doesn't need special support for 8-bit reads since the unwanted half can be ignored. ↩
The block diagram below is different from most 8086 block diagrams because it shows the actual physical implementation, rather than the programmer's view of the processor. In particular, this diagram shows two "Internal Communication Registers" in the Bus Interface Unit registers (right) along with the segment registers, matching the 7 registers visible on the die. (The temporary registers below are physically part of the ALU, so I'm not discussing them in this blog post.)
The book Modern Processor Design discusses the complex register systems of processors from the early 2000s. It says that circuit complexity increases rapidly beyond 3 ports, but some high-end processors had register files with 20 ports or more. ↩
The upper registers have differing numbers of read and write ports, as follows: two registers with 3 read control lines and 2 write lines, one register with 2 read lines and 2 write lines, and four registers with 2 read lines and 1 write line. The first three registers are probably the program counter, the "indirect" temporary register, and the "operand" temporary register. The last four are probably the CS, DS, SS, and ES segment registers. There are also three instruction prefetch buffer registers, each with 1 read line and 1 write line.
The 8088 processor, used in the original IBM PC, was essentially identical to the 8086, except it had an external 8-bit bus instead of a 16-bit bus to reduce system cost. The 8088's prefetch buffer was four bytes instead of six, presumably because four bytes was sufficient with the 8088's slower memory bus.
Unlike the 8086, the prefetch registers in the 8088 support writing to 8-bit halves independently (similar to the 8088's A, B, C, and D registers, but with a different register cell design). The reason is the 8088 fetched instructions one byte at a time instead of one word at a time, due to its narrower bus. Thus, the 8088's prefetch registers need to support byte-sized writes, while the 8086 does word-sized prefetches. ↩
Introduced in 1978, the revolutionary Intel 8086 microprocessor led to the x86 processors used in most desktop and server computing today. This chip is built from digital circuits, as you would expect. However, it also has analog circuits: charge pumps that turn the 8086's 5-volt supply into a negative voltage to improve performance.1 I've been reverse-engineering the 8086 from die photos, and in this post I discuss the construction of these charge pumps and how they work.
The photo above shows the tiny silicon die of the 8086 processor under a microscope. The metal layer on top of the chip is visible, with the silicon hidden underneath. Around the outside edge, bond wires connect pads on the die to the chip's 40 external pins. However, careful examination shows that the die has 42 bond pads, not 40. Why are there two extra ones?
An integrated circuit starts with a silicon substrate, and transistors are built on this. For high-performance integrated circuits, it is beneficial to apply a negative "bias" voltage to the substrate.2 To obtain this substrate bias voltage, many chips in the 1970s had an external pin that was connected to -5V,3 but this additional power supply was inconvenient for the engineers using these chips. By the end of the 1970s, however, on-chip "charge pump" circuits were designed that generated the negative voltage internally. These chips used a single convenient +5V supply, making engineers happier.
On the 8086 die, the two extra pads feed this negative bias voltage to the substrate. The photo above shows the silicon die as mounted in the chip, with bond wires connected to the lead frame that forms the pins. Looking carefully, there are two small gray squares above and below the die; each is connected to one of the "extra" bond pads. The charge pumps on the 8086 die generate a negative voltage, which passes through the bond wires to these squares, and then through the metal plate underneath to the 8086's substrate.
The photo below highlights the two charge pumps in the 8086. I'll discuss the top one; the bottom one has the same circuitry but a different layout to fit in the available space. Each pump has driver circuitry, a large capacitor, and a pad with the bond wire to the substrate. Each pump is located next to one of the 8086's two ground pads, presumably to minimize electrical noise.
You might wonder how a charge pump can turn a positive voltage into a negative voltage. The trick is a "flying" capacitor, as shown below. On the left, the capacitor is charged to 5 volts. Now, disconnect the capacitor and connect the positive side to ground. The capacitor still has its 5-volt charge, so now the low side must be at -5 volts. By rapidly switching the capacitor between the two states, the charge pump produces a negative voltage.
The 8086's charge pump circuit uses MOSFET transistors and diodes to switch the capacitor between the two states, with an oscillator to control the transistors, as shown in the schematic below. The ring oscillator consists of three inverters connected in a loop (or ring). Because the number of inverters is odd, the system is unstable and will oscillate.5 For instance, if the input to the first inverter is 0, its output will be 1, the second output will be 0, and the third output will be 1. This will flip the first inverter, and the "flip" will travel through the loop causing oscillation. To slow down the oscillation rate, two resistor-capacitor networks are inserted into the ring. Since the capacitors will take some time to charge and discharge, the oscillations will be slowed, giving the charge pump time to operate.4
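The odd-versus-even behavior can be seen in a simple logic simulation. This is a sketch under strong assumptions: every inverter has the same one-step delay and all outputs update together, which is not how the real R-C-slowed circuit behaves. The function name `simulate_ring` is my own.

```python
def simulate_ring(num_inverters: int = 3, steps: int = 12):
    """Simulate a ring of idealized inverters with unit delay.

    Each step, every inverter outputs the complement of the previous
    inverter's output from the step before. The starting outputs
    alternate 1, 0, 1, ... as in the example in the text.
    Returns the list of output tuples, one per step.
    """
    state = [(i + 1) % 2 for i in range(num_inverters)]
    history = [tuple(state)]
    for _ in range(steps):
        # state[i - 1] wraps around, so inverter 0 is fed by the last inverter.
        state = [1 - state[i - 1] for i in range(num_inverters)]
        history.append(tuple(state))
    return history


for outputs in simulate_ring(3, 6):
    print(outputs)
```

With three inverters, the pattern repeats every six steps (twice the number of inverters times the per-stage delay), so lengthening the stage delay directly lengthens the period. Calling `simulate_ring(2)` instead gives a state that never changes, which is the two-inverter memory cell rather than an oscillator.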
The outputs from the ring oscillator are fed to the transistors that drive the capacitor. In the first step, the upper transistor is switched on, causing the capacitor to charge through the first diode to 5 volts with respect to ground. The second step is where the magic happens. The lower transistor turns on, connecting the high side of the capacitor to ground. Since the capacitor is still charged to 5 volts, the low side of the capacitor must now be at -5 volts, producing the desired negative voltage. This goes through the second diode and the bond wire to the substrate. When the oscillator flips again, the upper transistor turns on and the cycle repeats. The charge pump gets its name because it pumps charge from the output to ground.6 The diodes are similar to check valves in a water pump, making sure charge moves in the right direction.
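The two-step cycle can be turned into a rough numerical model showing the output settling toward the negative voltage over many cycles. Everything here is an idealization: the capacitor values are arbitrary, leakage and load current are ignored, the upper transistor is assumed to reach the full supply voltage, and the diode drop defaults to zero to match the simplified description above. The function name and parameters are hypothetical.

```python
def pump(cycles: int, vcc: float = 5.0, c_pump: float = 1.0,
         c_sub: float = 10.0, diode_drop: float = 0.0):
    """Idealized charge pump; returns the substrate voltage after each cycle.

    c_pump and c_sub are the pump and substrate capacitances in
    arbitrary units (only their ratio matters here).
    """
    v_sub = 0.0
    trace = []
    for _ in range(cycles):
        # Step 1: the capacitor charges through the first diode.
        v_cap = vcc - diode_drop
        # Step 2: the high side is grounded, so the low side swings negative.
        v_low = -v_cap
        # The second diode conducts only if the substrate is sufficiently
        # above the low side; charge is then shared between the capacitors.
        if v_sub - v_low > diode_drop:
            v_sub = (c_sub * v_sub + c_pump * (v_low + diode_drop)) / (c_sub + c_pump)
        trace.append(v_sub)
    return trace


trace = pump(50)
print(f"after 1 cycle: {trace[0]:.2f} V, after 50 cycles: {trace[-1]:.2f} V")
```

In this model, each cycle moves only a fraction of the remaining difference, so the substrate voltage approaches its limit gradually. With a nonzero `diode_drop`, the limit is less negative than -5 V because each diode subtracts its drop.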
The photo below shows the charge pump as it is implemented on the chip. In this photo, the metal wiring is visible on top, with reddish polysilicon underneath and beige silicon at the bottom. The main capacitor is visible in the center, with H-shaped wiring connecting it to the circuitry on the left. (Part of the capacitor is hidden under the wide metal power trace at the top.) On the right, the substrate bond wire is attached to the pad. A test pattern is below the pad; it has a square for each mask used to produce a layer of the chip.
Removing the metal layer shows the circuitry more clearly, below. The large charge pump capacitor takes up the right half of the photo. Although microscopic, this capacitor is huge by chip standards, about the size of a 16-bit register. The capacitor consists of polysilicon over a silicon region, separated by insulating oxide; the polysilicon and silicon form the plates of the capacitor. On the left side are the smaller capacitors and the resistors that provide the R-C delay for the oscillator. Below them is the oscillator circuitry and the drive transistors.7
One interesting feature of the charge pump is the two diodes, each built from eight transistors in a regular pattern. The diagram below shows the structure of a transistor. Regions of the silicon are doped with impurities to create diffusion regions with desired properties. The transistor can be viewed as a switch, allowing current to flow between two diffusion regions called the source and drain. The transistor is controlled by the gate, made of a special type of silicon called polysilicon. A high voltage on the gate lets current flow between the source and drain, while a low voltage blocks current flow. These tiny transistors can be combined to form logic gates, the components of microprocessors and other digital chips. But in this case, the transistors are used as diodes.
The photo below shows a transistor in the charge pump, viewed from above. As in the diagram, polysilicon forms the gate between the silicon diffusion regions on either side. A diode can be formed from a MOSFET by connecting the gate and drain together (details) through the silicon/polysilicon connection at the bottom of the photo. The silicon can also be connected to the metal layer through a "via". The metal layer was removed for this photo, but faint circles indicate the position of silicon/metal vias.
The diagram below shows how the two diodes are implemented from 16 transistors. To support the relatively high current of the charge pump, eight transistors are used in parallel for each diode. Note that neighboring transistors share source or drain regions, allowing transistors to be packed densely. The blue lines indicate the metal wires; the metal was removed for this photo. The dark circles indicate connections (vias) between the metal and silicon.
Putting this all together, the upper eight transistors have their sources connected to ground by a metal wire. Their gates and drains are connected together by the polysilicon below the transistors, making them into diodes, and they are connected to the capacitor by a metal wire. The lower eight transistors form a second diode; their gates and drains are wired together by the lower metal wire loop. Note how the layout has been optimized; for example, the gates have bent shapes to avoid the vias (black dots).
The substrate bias generator on the 8086 chip9 is an interesting combination of digital circuitry (a ring oscillator formed from inverters) and an analog charge pump. While the bias generator may seem like an obscure part of 1970s computer history, bias generation is still part of modern integrated circuits. It is much more complex in modern chips, which have multiple carefully regulated biases in multiple power domains.8 In a sense, it is analogous to the x86 architecture: something that started in the 1970s and is even more popular today, but has become unimaginably more complex in the quest for higher performance.
If you're interested in the 8086, I wrote about the 8086 die, its die shrink process and the 8086 registers earlier. I plan to analyze the 8086 in more detail in future blog posts so follow me on Twitter @kenshirriff or RSS for updates.
Strictly speaking, the entire chip is analog: there's an old saying that "Digital computers are made from analog parts". This saying came from DEC engineer Don Vonada and was published in DEC's Computer Engineering in 1978.
Putting a negative bias voltage on the substrate had several benefits. It decreased parasitic capacitance, making the chip faster; it also made the transistor threshold voltage more predictable and reduced leakage current. ↩
Early DRAM memory chips and microprocessor chips often required three supplies: +5V (Vcc), +12V (Vdd) and -5V (Vbb) bias voltage. In the late 1970s, improvements in chip technology allowed a single supply to be used instead. For example, Mostek's MK4116 (a 16 kilobit DRAM from 1977) required three voltages while the improved MK4516 (1981) operated on a single +5V supply, simplifying hardware designs. (Amusingly, some of these chips still kept the Vbb and Vcc pins for backward compatibility but left them unconnected.) Intel's memory chips followed a similar path, with the 2116 DRAM (16K, 1977) using three voltages and the improved 2118 (1979) using a single voltage. Similarly, the famous Intel 8080 microprocessor (1974) used enhancement-mode transistors and required three voltages. An improved version, the 8085 (1976), used depletion-mode transistors and was powered by a single +5V supply. The Motorola 6800 microprocessor (1974) used a different approach for a single supply; although the 6800 was built from the older enhancement-load transistors, it avoided the +12V supply by implementing an on-chip voltage doubler, a charge pump that increased the voltage. ↩
I tried to measure the frequency of the charge pump by looking at the chip's current to see fluctuations due to the charge pump. I measured 90 MHz fluctuations, but I suspect I was measuring noise and not the charge pump's oscillations. ↩
Because the circuit has an odd number of inverters, it oscillates. If, on the other hand, it had an even number of inverters, it would be stable in two different states. This technique is used in the 8086's registers: a pair of inverters stores each bit (details). ↩
I've simplified the charge pump discussion slightly. Due to voltage drops in the transistors, the substrate voltage will probably be around -3V, not -5V. (If a chip requires a larger voltage drop, charge pump stages can be cascaded.) For the pump direction, I'm referring to current flow. If you think of it as pumping electrons, the negative electrons are being pumped the opposite direction, into the substrate. ↩
The oscillator is built from 13 transistors. Seven transistors form the 3 inverters (one inverter has an extra transistor to provide extra output current). The six drive transistors consist of two transistors pulling the output high and four transistors pulling the output low. The layout is strangely different from normal inverter circuitry, probably because the current requirements are different from normal digital logic. ↩
Bias generators are now available as IP blocks that can be licensed and be plugged into a chip design. For more information on bias in modern chips, see Body bias, Multi bias domain implementation, or this presentation. There is even a standard IEEE 1801 power format that allows IC design tools to generate the necessary circuitry. ↩
The Intel 8087, the math coprocessor chip that goes along with the 8086, also has a substrate bias generator. It uses the same principles, but unexpectedly has a different circuit, using 5 inverters. I wrote about it in detail here. ↩
The Intel 8086 processor contains many interesting components that can be understood through reverse engineering. In this article, I'll discuss the adder that is used for address calculations. The photo below shows the tiny silicon die of the 8086 processor under a microscope. The left part of the chip has the 16-bit datapath including the registers and the Arithmetic-Logic Unit (ALU); you can see the pattern of circuitry repeated 16 times. The rectangle in the lower-right is the microcode ROM, defining the execution of each instruction.
The 16-bit adder, the topic of this post, is in the upper left. The magnified view shows how the adder is constructed from 16 stages, one for each bit. The upper row handles the top bits (15-8) and the lower row handles the low bits (7-0).1 Studying the die reveals how this 16-bit adder was optimized through clever circuit design, specialized logic gates, and careful layout techniques.
You might wonder why the 8086 contains both an adder and an ALU (arithmetic-logic unit). The reason is that the adder is used for address calculations, while the ALU is used for data calculations. The 8086 prefetches instructions using a "Bus Interface Unit", which runs semi-independently from the "Execution Unit" that executes instructions. It would have been difficult for the Bus Interface Unit and the Execution Unit to share the ALU without conflicts. By providing both an adder2 and the ALU, the two calculations can take place in parallel.
Microprocessors of the early 1970s typically had 16-bit addresses, capable of accessing 64 kilobytes of memory. At first, 64 kilobytes seemed like more memory than anyone would need (or afford), but as the price of memory chips plunged, the demand for memory grew.4 To support a larger address space, Intel added segment registers to the 8086, a hack that allowed the processor to access a megabyte of memory but led to years of gnashed teeth. The concept is to break memory into 64-kilobyte segments. A segment register specifies the start of the memory segment, and a 16-bit address indicates an address within that segment. These are combined in the adder, as shown below, to obtain the memory address. One downside is that accessing regions of memory larger than 64 kilobytes is difficult; the segment register must be modified to get outside the current segment.3
How does the 16-bit adder compute a 20-bit address? The trick is that since the segment register is shifted 4 bits, the adder sums the 16 bits of the segment register and the top 12 bits of the offset. The four low bits of the offset bypass the adder since they are unchanged. For other purposes (such as incrementing the instruction counter), the adder operates on unshifted 16-bit addresses. Thus, the register circuitry has logic to feed either shifted or non-shifted values to the adder.
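The address calculation can be sketched in a few lines of Python. This is only an illustrative model, not a description of the actual circuitry: the function name `physical_address` is made up for this example, and the final mask assumes that the result wraps around to the 20 address bits.

```python
def physical_address(segment: int, offset: int) -> int:
    """Illustrative model of the 20-bit address computation."""
    segment &= 0xFFFF
    offset &= 0xFFFF
    low4 = offset & 0xF        # the four low offset bits bypass the adder
    high12 = offset >> 4       # the top 12 bits of the offset go to the adder
    total = segment + high12   # the 16-bit addition
    # Reattach the four low bits; assume the result wraps to 20 bits.
    return ((total << 4) | low4) & 0xFFFFF


print(hex(physical_address(0x1234, 0x5678)))  # 0x179b8
```

The result matches the usual formula of segment × 16 + offset. The model also shows that different segment/offset pairs can reach the same address: segment 0x1000 with offset 0x0010 and segment 0x1001 with offset 0x0000 both yield 0x10010.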
The diagram below, from the 8086 patent, shows how the adder sits between the segment registers and the address pins, computing the address. In the patent, the segment registers were named RC, RD, RS, and RA, not their current names: CS, DS, SS, and ES.
If you've studied digital logic, you may be familiar with the full adder, a building-block for adding binary numbers. Specifically, a full adder takes two bits and a carry-in bit. It adds these three bits and outputs the 1-bit sum, as well as a carry-out bit. (For instance 1+0+1 = 10 in binary, so the carry-out is 1 and the sum bit is 0.) A 16-bit adder can be created by joining 16 full-adders, with the carry-out from one feeding into the carry-in of the next. Just as you add two decimal numbers, moving carries to the next column on the left, each full adder adds one column in the binary numbers, and the carry is passed on to the left.
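For readers who prefer code, here is a minimal Python sketch of a full adder and a 16-bit adder built by chaining 16 of them. It models only the arithmetic, not the 8086's gates; the names `full_adder` and `ripple_add16` are invented for this example.

```python
def full_adder(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """Add three bits; return (sum_bit, carry_out)."""
    total = a + b + carry_in
    return total & 1, total >> 1


def ripple_add16(x: int, y: int, carry_in: int = 0) -> tuple[int, int]:
    """Chain 16 full adders, passing each carry-out to the next stage."""
    result = 0
    carry = carry_in
    for bit in range(16):  # bit 0 is the rightmost column
        sum_bit, carry = full_adder((x >> bit) & 1, (y >> bit) & 1, carry)
        result |= sum_bit << bit
    return result, carry


print(full_adder(1, 0, 1))        # (0, 1): 1+0+1 = 10 in binary
print(ripple_add16(0xFFFF, 1))    # (0, 1): the carry ripples through all 16 stages
```

Note that the loop handles one bit at a time because each stage needs the carry from the stage before it.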
A full adder can be implemented in different ways; the 8086's circuit is shown below. (This circuit is repeated 16 times in the 16-bit adder.) Each adder stage takes two inputs (at the bottom) and the carry-in (inverted, at the right). These are summed to form a 1-bit sum output (bottom) and a carry-out (at the left). The sum bit is formed by the two exclusive-NOR gates that combine the two inputs and the carry-in.5 The output passes through a tri-state buffer (at the top), allowing it to be connected to an internal data bus.6
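A short Python check can confirm that two exclusive-NOR gates still produce the right sum bit when the carry-in is inverted and a final inverter sits in the path (the inverter is mentioned in the footnote). This is a logic-level sketch only; the order in which the signals reach the gates is an assumption, and the function names are made up for the example.

```python
from itertools import product


def xnor(p: int, q: int) -> int:
    """Exclusive-NOR: 1 when both inputs are equal."""
    return 1 - (p ^ q)


def sum_bit(a: int, b: int, carry_in_inverted: int) -> int:
    """Sum bit from two XNOR gates, an inverted carry-in, and an inverter."""
    first = xnor(a, b)                       # first XNOR combines the two inputs
    second = xnor(first, carry_in_inverted)  # second XNOR brings in the inverted carry
    return 1 - second                        # final inverter in the path


# The four inversions cancel, giving the plain XOR of the three values.
for a, b, carry_in in product((0, 1), repeat=3):
    assert sum_bit(a, b, 1 - carry_in) == a ^ b ^ carry_in
```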
The carry computation uses an optimization called the Manchester carry chain7, dating back to 1959. The problem with addition is carries are slow; in the straightforward approach, each bit sum can't be computed until the carry to the right has been computed. (Similar to computing 99999999+1 with long addition; each digit requires you to carry the one.) If each bit must wait for the previous carry, addition becomes a slow, serial process.
The idea behind the Manchester carry chain is to decide, in parallel, if each stage will generate a carry, propagate an existing carry, or block any carry. Then, the carry can rapidly flow through the "carry chain" without sequential evaluation. To understand this, consider the cases when adding two bits and a carry-in. For 0+0, there will be no carry, regardless of any carry-in. On the other hand, adding 1+1 will always produce a carry, regardless of any carry-in; this case is called "carry generate". The interesting cases are 0+1 and 1+0; there will be a carry-out if there was a carry-in. This case is called "carry propagate" since the carry-in propagates through the stage unchanged.
The "carry generate" and "carry propagate" signals are used to open or close switches (i.e. transistors) in the carry line. For "carry propagate", carry-in is connected to carry-out, so the carry can flow through. Otherwise, the incoming carry is disconnected. For "carry generate", a carry signal is sent to carry-out. Since these switches can all be set in parallel, carry computation is quick. There is still some propagation delay as the carry-in flows through the switches, potentially from bit 0 all the way to bit 15, but this is much faster than computing the carry through a sequence of logic gates.
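The generate/propagate idea can be modeled in Python as follows. This is a behavioral sketch with invented names (`carry_chain`, `add_with_chain`), not the 8086's circuit. In particular, the loop below walks the chain one stage at a time, whereas in hardware all the switches are set in parallel and the carry simply flows through them.

```python
def carry_chain(x: int, y: int, carry_in: int = 0, width: int = 16):
    """Return (carries, carry_out); carries[i] is the carry into bit i."""
    generate = x & y    # 1+1: this stage always produces a carry
    propagate = x ^ y   # 0+1 or 1+0: this stage passes the carry along
    carries = []
    carry = carry_in
    for bit in range(width):
        carries.append(carry)
        if (generate >> bit) & 1:
            carry = 1   # a carry is injected into the chain
        elif not ((propagate >> bit) & 1):
            carry = 0   # 0+0: the incoming carry is blocked
        # Otherwise the propagate switch is closed and the carry flows through.
    return carries, carry


def add_with_chain(x: int, y: int, carry_in: int = 0, width: int = 16):
    """Form each sum bit from the propagate bit and the carry into that bit."""
    carries, carry_out = carry_chain(x, y, carry_in, width)
    propagate = x ^ y
    total = 0
    for bit, carry in enumerate(carries):
        total |= (((propagate >> bit) & 1) ^ carry) << bit
    return total, carry_out


print(add_with_chain(0xFFFF, 1))  # (0, 1): bit 0 generates, bits 1-15 propagate
```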
The carry chain is visible on the die; the photo above shows four stages of the adder. The horizontal lines are the metal wiring: control signals, ground, and power (the thick line near the bottom). The silicon circuitry is barely visible underneath the metal. The carry chain wires are interrupted at each stage, to connect to the transistors underneath, and the new carry continues on to the next stage.
Careful examination of the adder shows that while the 16 single-bit stages are very similar, they are not all identical. The extra circuitry indicated below turns out to be a performance optimization called the carry-skip adder.
The idea of carry-skip is to skip over some of the stages in the carry chain if possible, reducing the worst-case delay through the chain. For example, if there is a carry-in to bit 8, and the carry propagate is set for bits 8, 9, 10, and 11, then it can be immediately determined that there is a carry-in to bit 12. Thus, by ANDing together the carry-in and the four carry-propagate values, the carry-in to bit 12 can be calculated immediately for this case. In other words, the carry skips from bit 8 to bit 12. Likewise, similar carry-skip circuits allow the carry to skip from bit 2 to bit 4, and bit 4 to bit 8. These carry-skip circuits reduced the adder's worst-case computation time.8
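As a rough illustration, the skip condition is just an AND of the carry-in with the propagate bits of the group. The Python sketch below models that condition; the function `skip_carry` and its parameters are hypothetical names for this example.

```python
def skip_carry(x: int, y: int, carry_in: int, start: int, end: int) -> int:
    """Return 1 if a carry into bit `start` skips directly to bit `end`."""
    propagate = x ^ y
    group = ((1 << (end - start)) - 1) << start  # mask for bits start..end-1
    return int(carry_in == 1 and (propagate & group) == group)


# Adding 0x0FFF + 0x0001: bits 8-11 all propagate, so a carry into bit 8
# skips straight to bit 12.
print(skip_carry(0x0FFF, 0x0001, carry_in=1, start=8, end=12))  # 1
```

A result of 0 does not mean there is no carry into the later bit; it only means the shortcut doesn't apply, and the regular carry chain determines the carry.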
The performance of the adder is critical to the overall speed of the 8086, so it uses some interesting techniques to implement fast logic gates. Some of the adder's gates are built with dynamic logic. A standard logic gate is straightforward: you put signals in and you get the result out. In contrast, a dynamic logic gate uses a periodic clock signal to compute the logic function.9 Since dynamic logic can be faster and more compact, it is used in modern processors, in the form of domino logic.
Dynamic logic depends on a two-phase clock, commonly used for timing in microprocessors of that era. The two-phase clock consists of two clock signals that are active in alternation. First, phase 1 (ɸ1) is high and phase 2 (ɸ2) is low. Then phase 1 is low and phase 2 is high. This cycle repeats at the clock frequency, such as 5 MHz.
The schematic below shows a dynamic NAND gate from the adder. In phase 1, the clock ɸ1 turns on the lower transistor, pulling the input to the inverter low. Phase 2 is the evaluation phase, where the logic function is computed. If both inputs are high, the two input transistors will turn on, allowing clock ɸ2 to pass through to the inverter input, pulling it high and causing the output to be low. On the other hand, if either input is low, the clock ɸ2 cannot pass through the transistors. Instead, the inverter input remains low from the previous phase, due to the stray capacitance of the wire, so the output is high. Thus, in either case, the circuit implements the NAND functionality, with a low output only if the inputs are both high. Note that unlike a standard logic gate, the dynamic logic gate's output is only valid during clock phase 2.
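The two-phase behavior can be mimicked with a small Python model in which a variable stands in for the charge on the inverter's input. This is a simplified sketch (the class `DynamicNand` is invented for illustration, and charge leakage is ignored).

```python
class DynamicNand:
    """Behavioral model of a dynamic NAND gate (illustrative only)."""

    def __init__(self) -> None:
        self.node = 0  # charge on the inverter input, held by stray capacitance

    def phase1(self) -> None:
        self.node = 0  # ɸ1 turns on the lower transistor, pulling the node low

    def phase2(self, a: int, b: int) -> int:
        if a and b:
            self.node = 1  # both input transistors conduct, so ɸ2 pulls the node high
        # Otherwise the node keeps whatever charge it already had.
        return 1 - self.node  # the inverter produces the output


gate = DynamicNand()
for a in (0, 1):
    for b in (0, 1):
        gate.phase1()
        print(a, b, gate.phase2(a, b))  # output is 0 only for a = b = 1
```

In this model, skipping `phase1` after the inputs were both high leaves the node charged, so the output stays low even if the inputs change. That is why the node must be pulled low on every cycle.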
The diagram below shows how the dynamic NAND gate is physically implemented on the die; the layout of the schematic corresponds to the physical layout. In the photo, the metal layer has been removed, showing the silicon underneath. The yellowish regions are doped, conductive silicon. The brownish, metallic lines are polysilicon, a special type of silicon used as wiring. A transistor is formed when polysilicon crosses doped silicon; the polysilicon is the gate, controlling conduction between the silicon on either side. The transistors have complex, twisted shapes to fit the circuitry in as little space as possible. Each transistor was given a particular size for the best balance between speed and power consumption. For example, the input transistors are small, while the inverter transistor is much larger.
The diagram below shows the location of a NAND gate in the 8086 chip. The first box zooms in on one of the 16 single-bit adder circuits. The second box shows the position of the NAND gate within the adder. The NAND gate is almost visible in the overall die photo, showing how large the features are compared to a modern chip.
Another interesting dynamic logic gate in the adder is exclusive-NOR (XNOR, the complement of XOR), which outputs 1 if both inputs are the same, and 0 otherwise. The schematic below shows the implementation of XNOR.10 As before, during phase 1, the inverter input is pulled to ground. In the evaluation phase, clock ɸ2 can pull the inverter input high through either the upper pair of transistors or the lower pair of transistors. This will happen if the inputs are different (input 2 is high and input 1 is low, or if input 1 is high and input 2 is low), causing the inverter output to be low. Otherwise, the inverter input will remain low from phase 1, and the inverter output will be high. Thus, the output is high if the two inputs are equal, and low otherwise, the desired XNOR behavior.
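The XNOR gate's two pull-up paths translate directly into a short Python sketch. As with any behavioral model, this is a simplification; `dynamic_xnor` is a made-up name, and the function runs one full clock cycle (phase 1 followed by phase 2) per call.

```python
def dynamic_xnor(input1: int, input2: int) -> int:
    """One clock cycle of a dynamic XNOR gate (illustrative model)."""
    node = 0  # phase 1: the inverter input is pulled to ground
    # Phase 2: ɸ2 can reach the node through either pair of transistors.
    upper_path = input2 == 1 and input1 == 0
    lower_path = input1 == 1 and input2 == 0
    if upper_path or lower_path:
        node = 1  # the inputs differ, so the node is pulled high
    return 1 - node  # the inverter output is high only if the node stayed low


for input1 in (0, 1):
    for input2 in (0, 1):
        print(input1, input2, dynamic_xnor(input1, input2))
```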
The adder in the 8086 has a critical role, computing addresses for every memory access. A 16-bit adder may seem like a straightforward circuit, but the adder in the 8086 was highly optimized so it wouldn't be a performance bottleneck. To speed up carry processing, the adder uses a Manchester carry chain, with carry-skip circuitry on top of that. The adder uses three different designs for logic gates: standard NMOS gates, pass-transistor logic, and dynamic logic. Even at the transistor level, the circuit is highly optimized, with transistors of all shapes and sizes carefully packed together.
The Intel 8086 is an interesting processor with complex circuits but still simple enough that its circuits can be studied under a microscope. The 8086 has 29,000 transistors and features that are a few micrometers large. In comparison, modern processors have billions of transistors and transistors that are measured in nanometers. While the progress of Moore's law has yielded great improvements in modern processors, the processors of the 1970s are much better for reverse engineering.
If you're interested in the 8086, I wrote about the 8086 die, its die shrink process and the 8086 registers earlier. I plan to write more about the 8086 so follow me on Twitter @kenshirriff or RSS for updates.
The adder's layout has bits 15-8 in the top and bits 7-0 below. This layout is a consequence of the bit ordering in the data path: the bits are interleaved 15-7-14-6-...-8-0, instead of linearly 15-14-...-0. The reason behind this interleaving is that it makes it easy to swap the two bytes in the 16-bit word, by swapping pairs of bits. The adder is split into two rows so it fits into the horizontal space available. Even with the tall, narrow layout of an adder stage, a bit of the adder is wider than a bit of the register file. Splitting the adder into two rows keeps the bit spacing approximately the same, avoiding long wires between the register file and the adder. ↩
Many early microprocessors (such as the 6502 and Z-80) had an incrementer for the program counter, separate from the ALU. (One motivation was the ALU was 8 bits while the program counter was 16 bits.) The 68000 had address adders, separate from the ALU. ↩
The 8086's segmented architecture led to programming with near pointers and far pointers. A near pointer was a 16-bit pointer that could be held in a register and manipulated easily, but couldn't access more than 64 kilobytes. A far pointer was the combination of an offset and a segment value, resulting in a pointer that could access the full memory but required twice the storage for each pointer. Comparing far pointers was problematic, since they were not unique; multiple offset/segment combinations could address the same physical memory address. ↩
In contrast to the 8086, the Motorola 68000 microprocessor (1979) had 32-bit registers. Its address bus was 24 bits wide, allowing it to access 16 megabytes of memory directly, without segment registers. The 68020 (1984) extended the address bus to 32 bits, allowing 4 gigabytes of memory to be accessed.
The 68000 was provided in a 64-pin package, providing plenty of pins for the 24 address lines and 16 data lines. In comparison, Intel didn't like large IC packages and used a 40-pin package for the 8086. As a result, the 8086 used 20 pins for the address lines, and reused (i.e. multiplexed) 16 of these pins for data lines. The 8086 also multiplexed many of the control pins, complicating system design. ↩
The desired sum output is input1⊕input2⊕carry-in. In the 8086 adder, the carry-in is inverted, there are two exclusive-NOR gates, and an inverter in the path. Thus, the circuit has four inversions in total; since this number is even, they cancel out and the circuit produces the desired exclusive-OR of the three values. ↩
A tri-state buffer has three different outputs: high (1), low (0), or high-impedance (hi-Z). In the hi-Z state, the buffer is not outputting anything and is electrically disconnected. The motivation for this is that multiple signals can be connected to a bus through tri-state buffers. By enabling one buffer and disabling the rest, the desired signal can be output to the bus. (Regular buffers wouldn't work because electrical problems would arise if one buffer outputs a 1 and another outputs a 0.) Open-collector outputs are an alternative for connecting multiple signals to a bus. ↩
The Manchester carry chain was developed by the University of Manchester and described in the article Parallel addition in digital computers: a new fast 'carry' circuit, 1959. It was used in the Atlas supercomputer (1962).
The diagram above, from the original article, shows the structure of the Manchester carry chain. Although the switches look like relay contacts, the carry chain was implemented with transistors (2N501 micro-alloy diffused-base transistors). The structure of the carry chain in the 8086 is similar to the diagram above, but the top switches are replaced by XNOR gates. ↩
A few notes on the carry-skip implementation. Conceptually the signals are ANDed together, but the implementation uses a NOR gate since the carry and propagate signal inputs are inverted. For carry-skip to be useful, computing the carry with a gate must be faster than the carry chain, which was achieved by skipping four stages at a time. (I don't know why the first stage was implemented with a smaller skip.) Note that carry-skip helps in specific cases (which include the worst-case), so the regular carry circuitry is still required. ↩
Processors always have a maximum clock speed, the fastest they can run. (The original 8086 ran at up to 5 MHz, while the later 8086-1 supported 10 MHz.) However, due to the use of dynamic logic, the 8086 also had a minimum clock speed of 2 MHz. If the clock ran slower than that, there was a risk of the charge on a wire leaking away before it was used, causing errors. ↩
Surprisingly, the adder uses a completely different implementation for the upper XNOR gate; it is implemented with pass-transistor logic rather than dynamic logic. I think the motivation is that the carry-in signal to these XNOR gates is not quite synchronous, due to propagation delay through the carry chain. Dynamic logic has the disadvantage that if an input signal switches low after the clock, the gate can't recover; the circuit has been charged and won't be discharged until the next clock phase. In particular, if a carry comes in after clock phase 2 has started, it can't switch the output high. By using non-dynamic logic, the output will switch correctly when the carry arrives, even if it is not aligned with the clock.
Pass-transistor logic is different from "regular" NMOS logic gates, but provides a more efficient way of implementing XNOR. The circuit is similar to the XNOR in the Z-80 microprocessor, which I've described earlier, so I won't go into more detail here.
Pass-transistor logic is also used to implement the input and output latches on the adder. On the patent diagram shown earlier, these latches appear as "TMP B" and "TMP C" on the input side of the adder and "TMP ɸ1" on the output side. These latches are necessary because otherwise the adder's output would be connected directly to the input, causing the adder to repeatedly add. The implementation of these latches is simply clocked pass transistors in the path, holding the value by capacitance. ↩
The Intel 8086 microprocessor is one of the most influential chips ever created; it led to the x86 architecture that dominates desktop and server computing today. But it is still simple enough that its circuitry can be studied under the microscope and understood. In this post, I explain the implementation of a dynamic latch, a circuit that holds a single bit. The 8086 has over 80 latches scattered throughout the chip, holding a variety of important processor state bits,1 but I'll focus on the eight latches that implement the instruction register and hold the instruction that is being executed.
The photo above shows the silicon die of the 8086 processor under a microscope. I removed the metal and polysilicon layers to reveal the transistors, approximately 29,000 of them. The highlighted region indicates the 8086's 8-bit instruction buffer, consisting of eight latches. (This 1978 processor is simple enough that a single 8-bit register occupies a substantial region of the die.) The closeup shows the silicon and transistors making up a single latch.
The latch is one of the most important circuits in the 8086, since the latches keep track of what the processor is doing. While latches can be made in many ways,2 the 8086 uses a compact circuit called the dynamic latch. The dynamic latch depends on a two-phase clock, commonly used to control microprocessors of that era.3 A two-phase clock consists of two clock signals that are active in alternation. In the first phase, clock is high and the complement clock is low. Then they switch so clock is low and the complement clock is high. This cycle repeats at the clock frequency, such as 5 MHz.
The schematic above shows a typical latch in the 8086. It consists of two inverters and several pass transistors. For our purposes, the pass transistor can be considered a switch: if the gate input is 1, the transistor passes the signal through. If the gate input is 0, the transistor blocks the signal. The pass transistors are controlled by several signals: load, which loads a bit into the latch; hold, which holds the existing bit value; clock, the first clock phase; and the complement clock, the second, inverted clock phase.
The diagram below shows how a value (1 in this case) is loaded into the latch. The load signal is brought high, allowing the input (1 in this example) to pass through the first transistor. Since clock is high, the signal passes through the second transistor to the inverter, which outputs 0. At this point, the third (complement clock) transistor blocks the signal.
In the next clock phase (below), the complement clock goes high, allowing the 0 signal to reach the second inverter, which outputs 1. Since hold is high, the signal loops back, but is blocked by the clock transistor. The important point, which makes this circuit dynamic, is that at this time there is no active input to the first inverter. Instead, its input remains 1 (shown in gray) due to the capacitance of the circuit. Eventually, this charge would leak away, losing the value, but before that happens, the clocks toggle.
After the clocks switch state, the second inverter's input is provided by the capacitance of the circuit (below). The signal loops around, recharging and refreshing the input to the first inverter. As the clock signals continue to toggle, the latch switches between this diagram and the previous diagram, preserving the value in the latch and keeping the output stable.4
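The load-and-refresh cycle described above can be modeled in Python by treating the two inverter inputs as stored charges. This is a behavioral sketch under simplifying assumptions: the class `DynamicLatch` and its method names are invented here, charge leakage is not modeled, and load is assumed to take priority if both load and hold are high.

```python
class DynamicLatch:
    """Behavioral model of the dynamic latch (not transistor-level)."""

    def __init__(self) -> None:
        self.inv1_in = 0  # charge held on the first inverter's input
        self.inv2_in = 1  # charge held on the second inverter's input

    @property
    def output(self) -> int:
        return 1 - self.inv2_in  # the second inverter drives the output

    def clock_phase(self, data: int = 0, load: int = 0, hold: int = 1) -> int:
        """Phase where clock is high and the complement clock is low."""
        if load:
            self.inv1_in = data         # the input passes through to inverter 1
        elif hold:
            self.inv1_in = self.output  # the output loops back, refreshing the charge
        # With neither load nor hold, the node keeps its old charge.
        return self.output

    def complement_phase(self) -> int:
        """Phase where clock is low and the complement clock is high."""
        self.inv2_in = 1 - self.inv1_in  # inverter 1's output reaches inverter 2
        return self.output


latch = DynamicLatch()
latch.clock_phase(data=1, load=1, hold=0)  # load a 1
print(latch.complement_phase())            # 1
for _ in range(3):                         # the value survives as the clocks toggle
    latch.clock_phase()
    print(latch.complement_phase())        # 1 each time
```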
The 8086 and other processors of that era were built from a type of transistor called NMOS. They were constructed from a silicon substrate that was "doped" by diffusion of arsenic or boron to form the transistors. On top of the silicon, polysilicon wiring created the gates of the transistors and wired components together. Finally, a metal layer on top provided more wiring. (In comparison, modern processors are built from CMOS technology, which combines NMOS and PMOS transistors, and they have many layers of metal wiring.)
The diagram above shows the structure of a transistor. The transistor can be viewed as a switch, allowing current to flow between two diffusion regions called the source and drain. The transistor is controlled by the gate, made of a special type of silicon called polysilicon. Applying voltage to the gate lets current flow between the source and drain, while pulling the gate to 0 volts blocks the current flow. The gate is separated from the silicon by an insulating oxide layer; this makes the gate act like a capacitor as seen in the dynamic latch.
An inverter (below) is built from an NMOS transistor and a resistor.5 With a low input, the transistor is off, so the pull-up resistor pulls the output high. With a high input, the transistor turns on. This connects the output to ground, pulling the output low. Thus, the circuit inverts the input signal.
The photo on the right shows how an inverter is physically constructed in the 8086. The yellowish regions are conductive doped silicon and the speckled regions are the polysilicon on top. A transistor is created where polysilicon crosses doped silicon: the polysilicon forms the transistor's gate, while the silicon regions on either side are the transistor's source and drain. The large polysilicon rectangle forms the pull-up resistor between +5 volts and the output. These physical structures can be matched with the schematic.
The diagram below shows the implementation of a latch on the chip. The pass transistors and the two inverters are indicated; the first inverter is the one described above. Polysilicon wiring connects the components together; the metal layer (removed) provided additional wiring. The transistors have complex shapes to make the most efficient use of the space.
The latch includes output buffers, not shown on the schematic above, that provide high-current signals for the output and inverted output. This type of buffer has the amusing name "superbuffer" because it provides much higher current than a regular NMOS inverter. The problem with an NMOS inverter is it is slow when driving something with high capacitance. Since the superbuffer provides more current, it will switch the signal much faster. The superbuffer accomplishes this by replacing the pullup resistor with a transistor, which provides higher current. The downside is that the pullup transistor requires an inverter to drive it, so the superbuffer circuit is more complex. Thus, superbuffers are only used when necessary, typically when sending a signal to many gates or when driving a long bus line.
The diagram above shows the superbuffer circuit in the 8086's latches. Unlike the typical superbuffer, this one includes both an inverting and non-inverting superbuffer. To understand the circuit, note that the central resistor and transistor form an inverter. The inverter output is connected to the upper transistors, while the uninverted input is connected to the lower transistors. Thus, if the input is 1, the lower transistors will turn on, while if the input is 0, the upper transistors will turn on due to the inverter. Thus, for a 1 input, the lower transistors will pull Output high and the complement Output low. But for a 0 input, the upper transistors will pull Output low and the complement Output high.6
The 8086, like most processors, has an instruction register that holds the instruction that is currently being executed. In the 8086, the instruction register holds the first byte of an instruction (which may consist of multiple bytes), so it is built from eight latches (below). You might expect the latches to be identical, but each latch has a different shape. Since the layout of the 8086 is highly optimized, each latch is shaped to make the best use of the available space, constrained by the neighboring wiring. In particular, note that some latches are merged together so they can share power and ground connections. Layout optimization is also probably why the latches are not in sequential order.
An instruction takes a winding journey through the 8086 chip. The 8086 processor uses prefetching, improving performance by loading instructions from memory before they are required. Prefetched instructions are stored in the instruction queue, a 6-byte queue in the middle of the 8086's register file. (In comparison, modern processors can have megabytes of instruction cache.) When an instruction is executed, it is stored in the instruction register, roughly in the middle of the chip. (The relatively large distances explains the use of superbuffers.) The instruction register feeds the instruction to the "group decode ROM". This ROM determines the high-level characteristics of the instruction, such as if it is a single-byte instruction, a multi-byte instruction, or an instruction prefix. (This is only a piece of the 8086's complex instruction handling. Other latches hold pieces of the instruction indicating register usage and the ALU operation, while a separate circuit controls the microcode engine, but I'll discuss that in another post.)
The 8086 makes extensive use of dynamic latches to store state internally. These latches are visible under a microscope and their circuitry can be traced out and understood. The 8086 is an interesting subject for die analysis since, unlike modern processors, its transistors are large enough to see under a microscope. It was a complex processor at the time, with 29,000 transistors, but it is still simple enough that the circuitry can be traced out and understood.
The 8086 has over 80 latches. Some latches hold values for the AD (address/data) pins or control pins. Other latches hold the current microcode address and the microinstruction, as well as the return address for a microcode subroutine call. Other latches hold the source and destination register bits from the instruction, and the ALU operation from the instruction. Many latches hold internal state values that I'm still investigating. ↩
Many microprocessors use cross-coupled NOR (or NAND) gates to form an SR latch. An SR latch typically takes up more space than a dynamic latch, especially if additional circuitry is added to make it clocked. Edge-triggered flip flops are popular, but are even more complex, using six gates. In many cases, a pass transistor provides sufficient storage; it can hold a value across a clock cycle, but doesn't provide the long-term storage of a latch. ↩
Processors always have a maximum clock speed, the fastest they can run. (The original 8086 ran at up to 5 MHz, while the later 8086-1 supported 10 MHz.) However, due to the use of dynamic logic, the 8086 also had a minimum clock speed of 2 MHz. If the clock ran slower than that, there was a risk of the charge on a wire leaking away before it was used, causing errors. ↩
A key to the operation of the latch is that there are two inverters, so the output is stable. An odd number of inverters would result in oscillation, a feature used by the 8086's charge pump oscillator. The 8086's register file also uses pairs of inverters to store bits. However, in the register file, the two inverters are connected to each other directly, without the clocked pass transistors, resulting in storage that is more compact but more difficult to control. ↩
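The difference between an even and an odd number of inverters in a loop can be sketched in a few lines of Python. This is an idealized model, not a circuit simulation: the function name `run_inverter_loop` is hypothetical, and each step simply sends the signal once around the whole loop.

```python
def run_inverter_loop(n_inverters, steps, initial=0):
    """Feed the output of a chain of inverters back into its input.

    Each step sends the signal once through all the inverters.
    Returns the list of values seen at the loop input over time.
    """
    value = initial
    history = [value]
    for _ in range(steps):
        for _ in range(n_inverters):
            value = 1 - value  # one inversion
        history.append(value)
    return history


stable = run_inverter_loop(2, 4, initial=1)       # two inverters: value holds
oscillating = run_inverter_loop(3, 4, initial=1)  # three inverters: value flips
```

With two inverters the stored value comes back unchanged on every pass, while with three inverters it flips on every pass, which is the oscillation described above.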
Some more information on superbuffers. The problem with an NMOS inverter is that the pull-up resistor provides limited current. When outputting a 0, the transistor in an inverter pulls the output low quickly, with a relatively high current. However, when outputting a 1, the output is pulled high by the much weaker pullup resistor.
The superbuffer is somewhat like a CMOS inverter in that it has a pullup transistor and a pulldown transistor. The difference is that CMOS uses both PMOS and NMOS transistors, and the PMOS transistor has an inverted gate input. In contrast, with an NMOS superbuffer, a separate inverter is required. In other words, a CMOS inverter uses two transistors, while a superbuffer is much less efficient, requiring four transistors.
The superbuffer uses a depletion mode transistor for the pullup and an enhancement mode transistor for the pulldown. The depletion-mode transistor has a threshold voltage below zero, allowing its output (source) to get pulled up to 5V, rather than shutting off a bit lower. When the output is low, the depletion-mode transistor will still be (somewhat) on, acting like the pullup in a regular inverter, so there is some current flow through it. For more on superbuffers, see Introduction to VLSI Systems, page 28. ↩
One under-appreciated characteristic of early microprocessors is the difficulty of distributing power inside the integrated circuit. While a modern processor might have 15 layers of metal wiring, chips from the 1970s such as the 8086 had just a single layer of metal, making routing a challenge. Similarly, clock signals must be delivered to all parts of the chip to keep it in synchronization.
The photo below shows the 8086's die under a microscope. The metal layer on top of the chip is visible, with the silicon substrate and polysilicon wiring hidden underneath. Around the outside of the die, tiny bond wires connect pads on the die to the external pins. The 8086 has a power pad at the top and ground pads at the top and bottom. Each power and ground pad has two bond wires connected to support twice the current. You can see the wide metal traces from the power and ground pads; these distribute power throughout the chip.
Timing in the 8086 is controlled by two internal clock signals. An external oscillator provides a clock signal to the 8086 through the clock input pad at the bottom. The on-chip clock driver circuitry generates two high-current clock signals from this external clock. Note that the clock driver takes up a not-insignificant part of the chip.
In this blog post, I'll discuss how the 8086 routes power and clock signals through the chip, and how the clock driver circuit generates the necessary clock pulses.
The 8086 is constructed with three layers that can be used for wiring. The metal layer on top is best for wiring, since metal has low resistance. Underneath the metal is a layer of polysilicon wiring, made from a special type of silicon. Polysilicon has higher resistance than metal, but can still be used to transmit signals across the chip. The silicon substrate is where the transistors are formed. Silicon has relatively high resistance, so it is only used for short-distance connections, such as inside a gate.
Power routing in a chip like the 8086 creates a topological puzzle of sorts: The metal layer is the only practical layer for routing power and ground, due to its low resistance. Power and ground must be provided to nearly every gate in the chip.1 And since the chip has a single metal layer, power and ground can't cross.
The diagram below highlights these metal wiring networks in the 8086. Power, connected to the power pin at the top, is shown in red, traveling throughout the chip. A major branch flows down and to the right from the power pin, then splits into multiple paths. Power also travels around the border of the entire chip, supplying the I/O pins.
There are two ground pins. The wiring in blue is connected to the upper ground pin, while the wiring in green is connected to the lower ground pin. The blue ground wiring has a large branch downwards through the center of the chip, branching in complex directions. The green ground wiring flows along the bottom, left, and right sides of the chip, supporting the I/O pins, as well as connecting to the microcode ROM in the lower right.
The power wires get thinner from their source to their final destination as they branch or deliver power along the way and the current diminishes. This is visible in the ground wire to the address / data pins, below. At the left, the ground wire below the pins is very wide, but it tapers off to the right. In other words, at the left, the wire must handle current for all the pins, but at the right the wire is supporting just the remaining pin.
The metal layer is used for many signals besides power and ground; it is the best layer for delivering signals due to its low resistance. However, the extensive power and ground wiring constrains the other uses of the metal layer. To avoid intersections, most of the metal signal lines run parallel to the power lines; the polysilicon layer underneath is used to run perpendicular signals. But what happens if metal wires need to cross a power or ground line? The solution is to use a "crossunder", where the signal goes down to the polysilicon layer and crosses under the power line, popping back up on the other side,3 as shown below.
While power and ground are almost entirely routed in the metal layer, there are a few places where this breaks down and a crossunder is used for power. This typically happens near the end of the line, where the current is small. One example is shown below, where ground passes through two polysilicon crossunders. To reduce the resistance, these crossunders are much wider than the crossunders for signals and also use the silicon and polysilicon layers together. The small circles are connections (called vias) between the metal layer and the polysilicon layer.
The silicon layer plays a minor part in routing power. In particular, many gates are stretched out to reach the power and ground on either side. The photo below shows some gates in the 8086. Note the large doped silicon regions (white) that extend to reach the power and ground lines. Only a small part of this silicon is used for transistors, while the rest looks like wasted space. However, these empty silicon regions connect the gate to the metal power and ground wires. Since silicon has relatively high resistance, wide regions are used for these connections, and over short distances.
Other power routing issues arose as the 8086 was revised and became physically smaller. As manufacturing technology improved, Intel performed "die shrinks", keeping the same circuitry but scaling it down uniformly to produce a smaller die. Unfortunately, shrinking the power lines reduces the current they can handle. The solution was to beef up the power lines around the edge of the chip, while allowing the internal circuitry and wiring to shrink. This can be seen in the photo below; the lower-right corner of the smaller 8086 has much more power wiring, for instance. (I wrote more about the 8086 die shrink here.)
Almost all computers use a clock signal to control the timing of the processor.4 Like many microprocessors, the 8086 uses a two-phase clock internally.5 In a two-phase clock, there are two clock signals: when the first clock is high, the second is low, and vice versa, as shown below. One set of circuitry is enabled by the first clock, while a second set of circuitry is enabled by the second clock. The 8086's circuitry requires that the two clock phases are non-overlapping (there is a gap after one goes low before the other goes high) and asymmetrical.6
In modern processors, clock routing is complex because the clock signals must reach all parts of the chip at the same time. Modern processors use a hierarchy of clock paths, balancing the time along each path, and often provide separate buffering for each path. In comparison, the 8086's clock routing is straightforward because its 5 to 10 MHz clock7 is orders of magnitude slower than modern processors. At these comparatively low speeds, the length of the path doesn't make much difference, so the 8086's clock signals can meander around the chip.
The diagram above shows the 8086's clock routing. Phase 1 is in green and phase 2 is in red. At the bottom of the chip, the circuitry that generates the clocks appears as large blobs. From there, the clock signals branch and wind around the chip. For the most part, the two clock phases are routed parallel to each other, unlike power and ground, which form opposing branches.
Because the clock signals go to all parts of the chip, they require much more current than typical signals and are routed in the metal layer for the most part. When the clock signals must cross the power lines, they use large crossunders as shown below. Note that the irregularly-shaped clock crossunders are much larger than the crossunders for other signals, such as the Q bus below.
To provide the high-current clock signals, the clock signals have special driver circuitry built from large transistors. The photo below compares one of these driver transistors to a typical logic transistor. The driver transistor is about 300 times as large, so it can provide about 300 times the current. This transistor is constructed as 10 transistors in parallel; the 10 vertical polysilicon lines form the 10 gates. Each clock signal is driven by a pair of large transistors, one to pull the signal high and one to pull the signal low.
The photo below shows the clock driver circuitry. This circuit splits the external clock signal into two phases, makes the phases non-overlapping, and amplifies them. At the left, the pink square is the pad for the externally-supplied clock. The signal passes through a series of transistors, ending with the large driver transistors at the right for the clock signal. The brownish wiring is the polysilicon that forms the gates. Many transistors have zig-zagging gates to fit a larger transistor into the available space.
The schematic below shows the driver circuitry, slightly simplified. The triangles indicate high-current drivers, built from two or three transistors; an inverting input (indicated by a bubble) pulls the output low. At the left, the clock input pin has a small resistor and a diode to provide some protection (like the other input pins). Next, the clock is split into an uninverted phase (top) and an inverted phase (bottom).
The additional circuitry keeps the clocks from overlapping: when one clock is high, it forces the other side low, through the inverted inputs. To see how this works, let's start with the clk in pin high, so clk in and clock are high while the inverted clk in and the inverted clock are low. Now, suppose the clk in pin input goes low, causing clk in to go low and the inverted clk in to go high. However, the inverted clock output can't go high until clock goes low, due to the negative inputs on the buffers. Once that happens, the inverted clk in proceeds through the lower drivers, pulling the inverted clock high after two gate delays.8 The point of this is that clock and the inverted clock don't switch at the same time; after one goes low, there is a delay before the other goes high. This generates the desired non-overlapping clock signals.
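The cross-coupling idea can be sketched with a small discrete-time model in Python. This is a simplification under stated assumptions, not the 8086's actual driver chain: the function name `two_phase` is hypothetical, each output is modeled with a single unit of gate delay, and an output may go high only when its input phase is high and the other output was low on the previous step.

```python
def two_phase(clk_in_samples):
    """Return a list of (clock, inverted_clock) pairs for a sampled input clock.

    Assumed model: each output goes high only if its input phase is high AND
    the opposite output was low one step earlier (one unit of gate delay).
    """
    clock, inverted_clock = 0, 0
    trace = []
    for clk_in in clk_in_samples:
        next_clock = 1 if (clk_in == 1 and inverted_clock == 0) else 0
        next_inverted = 1 if (clk_in == 0 and clock == 0) else 0
        clock, inverted_clock = next_clock, next_inverted
        trace.append((clock, inverted_clock))
    return trace


trace = two_phase([1, 1, 1, 0, 0, 0, 1, 1, 1])
# The two outputs are never high together, and each transition of the
# input produces one step where both outputs are low (the gap).
never_overlap = all(not (c and n) for c, n in trace)
```

Because each side waits for the other to go low before rising, the model produces a short interval with both phases low at every input transition, which is the non-overlap property described above.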
The 8086 uses some interesting routing for power, but modern processors operate at a whole different level. While the 8086 required 350 milliamps of current, a modern processor might require over a hundred amps. The 8086 used 3 of its 40 pins for power and ground, compared to a modern Intel Core i5 processor with 128 power pins and 377 ground pins (out of 1151 pins). Although the numerous metal layers in modern chips solved the 8086's routing issues, modern chips have new complications such as multiple power domains that allow unused parts of the chip to be powered down.
Clock routing is much harder on modern processors since at multi-gigahertz speeds, even an extra millimeter of path can affect the clock. To deal with this, modern processors use techniques such as H-trees or grids to distribute the clock, rather than the 8086's meandering paths. While the 8086 has a simple circuit to generate the two-phase clock, modern processors often use a phase-locked loop (PLL) to synthesize the clock and use multiple circuits scattered across the chip to generate and control clock signals.
Even though the 8086 is much simpler than modern processors, it contains a lot of interesting circuitry. I plan to reverse-engineer more of the 8086, so follow me on Twitter at @kenshirriff for updates. I also have an RSS feed.
Power and ground must be provided to almost every gate in the chip since a standard NMOS gate requires ground for its pull-down network and power for its pull-up resistor. There are a few exceptions, though. The 8086 uses some dynamic logic gates, especially in the ALU for speed. These gates are pulled high by the clock, so they don't need a direct power connection. The 8086 also uses some pass-transistor XOR gates, which are pulled low by the inputs, so they don't need ground.
The microcode ROM forms a large region with no power connections, just ground. This is because each row in the ROM is implemented as a very large NOR gate with the power pull-up on the right-hand edge. Thus, the ROM gates all have power and ground, even though it looks like the ROM lacks power connections. ↩
Integrated circuits often have power and ground on opposite corners or opposite sides of the chip. This placement makes it easier to construct the non-intersecting power and ground networks in the chips. The 8086 is slightly unusual to have power and ground on diagonally-opposite pins, but then a second ground pin close to the power pin. The solution is to have tree-like branching networks for power and ground. These networks are interdigitated, meshed like fingers to reach all parts of the chip.2 ↩
Crossunders are used for many wire crossings, not just power, but power wiring is a key contributor. Typically, metal wiring is used for signals in one direction, while polysilicon wiring is used for signals in the perpendicular direction. (These directions vary in different parts of the chip, depending on the predominant direction for signals.) Thus, signals for the most part can travel unimpeded. Even so, signals often bounce from layer to layer to make the routing work. ↩
While almost all computers are synchronous and operate with a clock, the IAS machine architecture (popular in the 1950s) was asynchronous, operating without a clock. Instead, each circuit would send a pulse to the next when it was done, triggering the next step. Many early computers of the 1950s were based on the IAS machine architecture, including CYCLONE, ILLIAC, JOHNNIAC, MANIAC, SEAC, and the IBM 701. Research into asynchronous computing continues (link, link), but synchronous designs are dominant. ↩
Among other things, processors use the clock to prevent unwanted feedback in the circuitry. For instance, consider a program counter with a circuit to increment it and feed the result back to the program counter. You don't want the new value to get repeatedly incremented.
One approach is to use edge-sensitive circuits (flip flops) that will update that value in the program counter at the moment the clock goes high. Thus, there will be a single update as desired. However, with a two-phase clock, the circuit can be built from level-sensitive latches, which are much simpler than edge-sensitive flip flops. The idea is that when the first clock is high, the first half of the circuit receives input and does its logic calculations. When the second clock is high, the second half of the circuit receives input from the first half and does any necessary calculations, while the first half is blocked. The point is that only half of the circuitry can update at any time, preventing uncontrolled feedback. ↩
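The feedback problem and the two-phase fix can be illustrated with a rough Python model. This is a sketch under stated assumptions: the names `single_latch` and `TwoPhaseCounter` are hypothetical, and the `passes` parameter stands in for the signal going around the loop several times while a clock phase stays high.

```python
MASK = 0xFFFF  # model a 16-bit counter


def single_latch(value, passes_while_clock_high):
    """One level-sensitive latch feeding its own incrementer.

    While the clock is high, each pass around the loop increments again,
    so the result depends on how long the clock stays high.
    """
    for _ in range(passes_while_clock_high):
        value = (value + 1) & MASK
    return value


class TwoPhaseCounter:
    """Two level-sensitive latches, each open during a different clock phase."""

    def __init__(self):
        self.first_latch = 0   # open while phase 1 is high
        self.second_latch = 0  # open while phase 2 is high; holds the counter

    def phase1(self, passes=1):
        # The second latch is blocked, so repeated passes compute the same value.
        for _ in range(passes):
            self.first_latch = (self.second_latch + 1) & MASK

    def phase2(self, passes=1):
        # The first latch is blocked, so the copied value cannot change.
        for _ in range(passes):
            self.second_latch = self.first_latch


counter = TwoPhaseCounter()
for _ in range(3):            # three full clock cycles
    counter.phase1(passes=5)  # phase stays high for "a long time"
    counter.phase2(passes=5)
```

In this model the single latch increments five times during one long clock pulse, while the two-phase counter advances exactly once per cycle no matter how long each phase lasts.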
The 8086 has strict requirements on its input clock, which must be high for 1/3 of the time. The clock signal into the 8086 was typically produced by an 8284 chip and a quartz crystal. This chip divided its input clock by 3 to generate the 33% duty cycle clock required by the 8086. ↩
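As a rough illustration of how dividing by 3 yields a 33% duty cycle, the sketch below models a divide-by-3 counter in Python. The function name `divide_by_three` is hypothetical, the model is not taken from the 8284's internals, and the 15 MHz figure is just an example input frequency chosen so the arithmetic gives 5 MHz.

```python
def divide_by_three(n_input_cycles):
    """Model a divide-by-3 counter: output is high for one of every three
    input cycles and low for the other two."""
    out = []
    count = 0
    for _ in range(n_input_cycles):
        out.append(1 if count == 0 else 0)
        count = (count + 1) % 3
    return out


wave = divide_by_three(9)
high_fraction = sum(wave) / len(wave)  # one third of the samples are high

# Example arithmetic (assumed input): a 15 MHz source divided by 3
# gives a 5 MHz output clock.
output_mhz = 15 / 3
```

Since one output period spans three input periods and the output is high for one of them, the output is high for 1/3 of the time.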
Because the 8086 used dynamic logic, it also had a minimum clock speed of 2 MHz. If the clock ran slower than this, there was a risk of charges leaking away before they were refreshed, causing failures. The minimum clock speed was inconvenient for debugging, since you couldn't slow down or stop the clock. ↩
This is a somewhat handwaving description of the clock driver circuit. In particular, I'm not sure what happens when one transistor is pulling a signal high and another is pulling the same signal low. An accurate simulation would depend on the relative sizes of the two transistors. ↩
Intel introduced the 8086 processor in 1978, leading to the x86 architecture in use today. I'm currently reverse-engineering the circuitry of the 8086 so I've been purchasing vintage 8086 chips off eBay. One chip I received is shown below. From the outside, it looks like a typical Intel 8086.
I opened up the chip and looked at it under the microscope, creating the die photo below. The whitish lines are the metal layer, connecting the chip's circuitry. Underneath, the silicon has a purple hue. Around the outside of the die, bond wires connect the square pads to the 40 external pins on the IC.
I quickly noticed, however, that this wasn't an 8086 processor but something entirely different! For comparison, look at my die photo of a genuine 8086 below. As you can see, the chips are entirely different and the 8086 is much more complex. Someone had taken a random 40-pin chip and relabeled it as an Intel 8086 processor. The genuine 8086 has various functional blocks visible: the 16-bit registers and ALU on the left, the large microcode ROM in the lower right, and various other blocks of circuitry throughout the chip. (The genuine chip also has a tiny Intel copyright and the 8086 part number in the lower right. Click the image to magnify.) The fake chip above, on the other hand, is an irregular grid of horizontal and vertical wiring, with thicker horizontal and vertical lines for power.
If the chip isn't an 8086, what is it? I believe the fake chip is an Uncommitted Logic Array, a type of gate array. A gate array is a way of making semi-custom integrated circuits without the expense of a fully-custom design. The idea behind a gate array is that the silicon die has a standard array of transistors that can be wired up to create the desired logic functions. This wiring is done in the chip's metal layers, which are designed for the customer's requirements.2 Although a gate array doesn't provide the flexibility of a fully-custom design, it was considerably cheaper and faster to design.
Ferranti invented the ULA in 1972, claiming that it was the first "to turn the logic array concept into a practical proposition." A ULA allowed a single LSI chip to replace hundreds or even thousands of gates that otherwise would be implemented in a board full of 7400-series TTL chips. The most well-known users of a ULA are the popular Sinclair ZX 81 and ZX Spectrum home computers.3
A ULA was based on a matrix of identical cells that were wired to form the logic gates. Around the edges of the chip, standardized peripheral cells provided the desired I/O capabilities. The diagram below shows a typical cell in the matrix. The cell contains multiple transistors and resistors, which are mostly unconnected by default. The ULA is customized by creating connections between the components to build a set of logic gates.
The photo below shows the fake chip with the metal layers removed, revealing the transistor array underneath. Each small green/yellow rectangle is a transistor; there are nearly 1000 of them. Note the repeated pattern of cells in the matrix,1 as well as the different peripheral cells around the outside. The density of transistors is fairly low; the chip has empty columns to provide room to route the metal layer.
The fake chip uses bipolar transistors,4 completely different from the NMOS transistors in the 8086 processor. The closeup below shows transistors (the striped rectangles) and the two layers of metal wiring connecting them. (The genuine 8086 only has one metal layer, so the fake chip is probably more recent, from the 1980s.)
There is no manufacturer printed on the die of the fake chip. The matrix cells don't look like the Ferranti cells. The photo below shows a ULA built by Plessey, another ULA manufacturer. That die has a smaller transistor matrix than my chip, but the overall structure is roughly similar, so Plessey might be the manufacturer.
The photo below shows another detail of the fake chip. Matrix cells are at the top. The peripheral cell below has much larger transistors for I/O. (There are also resistors in the brownish regions, but they aren't really visible.) The upper metal layer consists of horizontal wiring, while the lower metal layer is mostly vertical. The thick metal line at the right is for power (or perhaps ground) and is connected to a horizontal power distribution trace at the bottom.
To summarize, the position of the transistors and resistors in the ULA is fixed. This allows the same underlying silicon wafers to be manufactured for all the customers, keeping volume high and costs low. But by customizing the metal wiring layers, the ULA can be completed to fulfill the logic functions each customer needs.
Why would someone go to all the work of relabeling a $3.80 chip? I guess someone had a stack of old custom ICs with no value. By re-labeling them, they could at least get something for them. It hardly seems worth the effort, but I guess they make up for it in volume. The seller has sold over 215 of these 8086s, although I don't know if they were all fake or if I was unlucky. In any case, the seller gave me a prompt refund.
The seller's feedback (below) shows a lot of complaints about fake chips. Even so, the seller's feedback is 99.2% positive, so I suspect that there are just a few fake chips mixed in with many types of real chips. It's also possible that most vintage 8086s are purchased by IC collectors who never test the chip.
I've been asked if this chip would actually work as an 8086. Sometimes counterfeiters sell a lower-quality chip in place of the real thing, such as the fake expensive op amps found by Zeptobars. But other times the fake chip is unrelated, such as the vintage bipolar RAM chip that I determined was a Touch-Tone dialer. Since an 8086 has 29,000 MOS transistors but the fake chip has under 1000 bipolar transistors, it's clear that this chip won't function as an 8086.
The moral is to always be careful when you're buying chips, since you never know what you might find. Semiconductor counterfeiting is a big business and I've encountered just a tiny piece of it. I plan to write more about reverse-engineering the (real) 8086, so follow me on Twitter at @kenshirriff for updates. I also have an RSS feed.
I think the fake chip has a matrix of 8×12 cells, with each of the large "IXI" patterns composed of four cells. ↩
At first, a ULA was designed by hand by an engineer drawing the interconnects on paper, but by the 1980s, CAD software automated most of the design and testing. The CAD station below is pretty wild.
The book The ZX Spectrum ULA: How to design a microcomputer discusses Ferranti ULAs in detail along with a complete explanation of the ULA in the ZX Spectrum. ↩
Early ULAs used bipolar transistors, with CMOS circuitry introduced later. Different logic families were supported, depending on the needs of the application. Ferranti's ULAs had three types of matrix cells: RTL (resistor-transistor logic), CML (current-mode logic), and buffered current-mode logic. Other ULAs supported fast ECL (emitter-coupled logic) or standard TTL (transistor-transistor logic). ↩
The Intel 8086 processor was introduced in 1978, setting the course of modern computing. While the x86 processor family has supported 64-bit processing for decades, the original 8086 was a 16-bit processor. As such, it has a 16-bit arithmetic logic unit (ALU).1 The arithmetic logic unit is the heart of a processor: it performs arithmetic operations such as addition and subtraction. It also carries out Boolean logic operations such as bitwise AND and OR as well as bit shifts and rotates. Since a fast ALU is essential to the overall performance of a processor, ALUs often incorporate interesting design tricks.
The die photo below shows the silicon die of the 8086 processor. The ALU is in the lower-left corner. Above it are the general- and special-purpose registers. An adder, used for address calculation, is in the upper left. (For performance, the 8086 has a separate adder to add the segment register and memory offset when accessing memory.) The large microcode ROM is in the lower right.
Zooming in on the ALU shows that it is constructed from 16 nearly-identical stages, one for each bit. The upper row handles bits 7 to 0 while the lower row handles bits 15 to 8.3 In between, the flag circuitry indicates the status of an arithmetic operation through condition codes such as zero or nonzero, positive or negative, carry, overflow, parity, and so forth. These are typically used for conditional branches.
In this blog post, I reverse-engineer the 8086's ALU and explain how it works. It's more complex than other vintage ALUs that I've studied,2 using a flexible circuit that can implement arbitrary bit functions. The carry is implemented with a Manchester carry chain, a fast design dating back to a 1960s supercomputer.
The 8086's ALU circuitry is a bit tricky, so I'll start by explaining how it adds two numbers. If you've studied digital logic, you may be familiar with the full adder, a building-block for adding binary numbers. Specifically, a full adder takes two bits and a carry-in bit. It adds these three bits and outputs the 1-bit sum, as well as a carry-out bit. (For instance 1+0+1 = 10 in binary, so the carry-out is 1 and the sum bit is 0.) A 16-bit adder can be created by joining 16 full-adders, with the carry-out from one fed into the carry-in of the next.
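The full adder and the 16-stage chain can be sketched in Python as follows. The function names `full_adder` and `ripple_add16` are hypothetical, and this models plain ripple carry, not the 8086's actual carry circuit.

```python
def full_adder(a, b, carry_in):
    """Add three bits; return (sum_bit, carry_out)."""
    total = a + b + carry_in
    return total & 1, total >> 1


def ripple_add16(x, y, carry_in=0):
    """Add two 16-bit values by chaining 16 full adders.

    The carry-out of each stage feeds the carry-in of the next.
    Returns (16-bit result, final carry-out).
    """
    result = 0
    carry = carry_in
    for bit in range(16):
        a = (x >> bit) & 1
        b = (y >> bit) & 1
        sum_bit, carry = full_adder(a, b, carry)
        result |= sum_bit << bit
    return result, carry


example = full_adder(1, 0, 1)        # the 1+0+1 case from the text
wrapped = ripple_add16(0xFFFF, 1)    # overflow wraps to 0 with carry-out 1
```

The weakness of this structure is that each stage must wait for the carry from the stage before it, so the worst case passes through all 16 stages in sequence.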
The simplified diagram below represents one stage of the ALU's adder. It takes two inputs and the carry-in and sums them, forming a 1-bit sum output and a carry-out. (Note that the carry signal travels right-to-left.) The sum bit output is generated by the exclusive-or of the two arguments and the carry-in, using the two exclusive-or gates at the bottom. Generating the carry, however, is more complex.
The carry computation uses an optimization called the Manchester carry chain4, dating back to 1959, to avoid delays as the carry ripples from one stage to the next. The idea is to decide, in parallel, if each stage will generate a carry, propagate an existing carry, or block an incoming carry. Then, the carry can rapidly flow through the "carry chain" without sequential evaluation. To understand this, consider the cases when adding two bits and a carry-in. For 0+0, there will be no carry-out, regardless of any carry-in. On the other hand, adding 1+1 will always produce a carry, regardless of any carry-in; this case is called "carry-generate". The interesting cases are 0+1 and 1+0; there will be a carry-out if there was a carry-in. This case is called "carry-propagate" since the carry-in propagates through the stage unchanged.
In the Manchester carry chain, the carry-propagate signal opens or closes transistors in the carry line. In the carry-propagate case, the top transistor is activated, connecting carry-in to carry-out, so the carry can flow through. Otherwise, the lower transistor is activated and the carry-out receives the carry-generate signal, generating a carry if both arguments are 1. Since these transistors can all be set in parallel, carry computation is quick. There is still some propagation delay as the carry signal flows through the transistors in the carry chain, but this is much faster than computing the carry through a sequence of logic gates.5
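The generate/propagate idea can be modeled in Python as below. This is a logical sketch only: the function name `manchester_add16` is hypothetical, the loop that walks the chain stands in for a signal flowing through pass transistors, and it ignores the inverted carry polarity used on the real chip.

```python
def manchester_add16(x, y, carry_in=0):
    """Add two 16-bit values using per-bit propagate/generate signals.

    Returns (16-bit result, final carry-out).
    """
    # Step 1: every stage decides, independently of the carry, whether it
    # will propagate an incoming carry or generate one of its own.
    propagate = [((x >> i) & 1) ^ ((y >> i) & 1) for i in range(16)]
    generate = [((x >> i) & 1) & ((y >> i) & 1) for i in range(16)]

    # Step 2: the carry flows along the chain. Each stage either passes
    # carry-in through (propagate) or outputs its generate signal.
    carries = [carry_in]
    for i in range(16):
        carries.append(carries[i] if propagate[i] else generate[i])

    # Step 3: each sum bit is the XOR of the two arguments and the carry-in,
    # which is propagate XOR carry-in.
    result = 0
    for i in range(16):
        result |= (propagate[i] ^ carries[i]) << i
    return result, carries[16]


total, carry_out = manchester_add16(0x00FF, 0x0001)  # carry ripples through 8 bits
```

When a stage is not propagating, its carry-out is just its generate signal, which is 1 only for the 1+1 case and 0 for the 0+0 case, matching the cases described above.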
That explains how the ALU performs addition,6 but what about logic functions? How does it compute AND, OR, or XOR? Suppose you replace the carry-propagate XOR gate with a logic gate (AND, OR, or XOR) and replace the carry-generate gate with 0, as shown below. The output will simply be the AND (or OR or XOR) of the two arguments, depending on the new gate. (The right XOR gate has no effect since XOR with 0 passes the value through unchanged.) The point is that if you could somehow replace the gates, the same circuit could compute the AND, OR, and XOR logic functions, as well as addition.
Another important operation is bit shifting. The ALU shifts a value to the left by taking advantage of the carry line in an unusual way (below).7 The bit from the first argument is directed into the carry-out, sending it one bit position to the left. The received carry bit passes through the XOR gate, resulting in a left shift by one bit. The carry-propagate signal is set to 0; this both directs the argument bit to carry-out, and turns the XOR gate into a pass-through. (A right shift is implemented with a separate circuit, as will be explained below.)
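The left-shift trick can be checked with a small Python model of the same stage structure. The function name `shift_left16` is hypothetical; the sketch assumes the carry-propagate signal is forced to 0 and the carry-generate signal is the argument bit, as described above.

```python
def shift_left16(a, carry_in=0):
    """Shift a 16-bit value left by one bit using the carry line.

    Returns (16-bit result, bit shifted out of the top).
    """
    carries = [carry_in]
    for i in range(16):
        propagate = 0              # forced to 0 for a shift
        generate = (a >> i) & 1    # the argument bit itself
        # With propagate at 0, carry-out always receives the argument bit.
        carries.append(carries[i] if propagate else generate)

    result = 0
    for i in range(16):
        # XOR with propagate (0) passes the incoming carry straight through.
        result |= (0 ^ carries[i]) << i
    return result, carries[16]


shifted, shifted_out = shift_left16(0x8001)
```

Each result bit is the carry arriving from the stage to its right, so bit i of the result is bit i-1 of the argument, the lowest bit is the carry-in, and the top argument bit leaves as the final carry-out.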
Thus, the ALU can reuse this circuit to perform a variety of operations, by reprogramming the carry-propagate and generate gates with different functions. But how are these magic reprogrammable gates implemented? The trick is that any Boolean function of two variables can be specified by the four values in the truth table. For instance, AND has the truth table below, so it can be specified by the four values: 0, 0, 0, 1:
A  B  A AND B
0  0  0
0  1  0
1  0  0
1  1  1
If we feed those values into a multiplexer, and select the desired value based on the two inputs, we will get the AND of the inputs. If instead, we feed 0, 1, 1, 0 into the multiplexer, we will get the XOR of the inputs. Other inputs create other logic functions similarly. With the appropriate values, any logic function of two variables can be implemented.8 (Some special cases: 0, 0, 0, 0 will output the constant 0; while 0, 0, 1, 1 will output the input A.) This multiplexer circuit is used for the carry-propagate gate. A similar but half-sized circuit is used for the carry-generate gate.9
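The multiplexer idea is easy to try out in software. The sketch below is a simplified Python model, not the chip's circuit: `mux4`, `alu`, and the table names are hypothetical, and the carry-generate gate is modelled as a full four-entry table rather than the half-sized circuit.

```python
def mux4(table, a, b):
    """Pick one of four truth-table values, selected by inputs a and b."""
    return table[(a << 1) | b]

# Truth tables listed for (a, b) = 00, 01, 10, 11.
ZERO = (0, 0, 0, 0)
AND = (0, 0, 0, 1)
XOR = (0, 1, 1, 0)
OR = (0, 1, 1, 1)
PASS_A = (0, 0, 1, 1)

def alu(propagate_table, generate_table, a, b, width=16):
    """Run every bit through the same stage, with both gates 'programmed'."""
    carry = 0
    result = 0
    for i in range(width):
        bit_a = (a >> i) & 1
        bit_b = (b >> i) & 1
        propagate = mux4(propagate_table, bit_a, bit_b)
        generate = mux4(generate_table, bit_a, bit_b)
        result |= (propagate ^ carry) << i
        carry = carry if propagate else generate
    return result

total = alu(XOR, AND, 1234, 4321)        # addition: 5555
masked = alu(AND, ZERO, 0xF0F0, 0xFF00)  # logical AND: 0xF000
```

Changing only the two tables switches the same loop between addition and the logic functions, which is the point of the function-generator design.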
Now that I've presented the background, the complete ALU circuit is shown below, with multiplexers in place of the carry-propagate and generate gates. On the chip, the carry-in and carry-out are inverted, and this is reflected below. The schematic also shows the connection from the ALU to the bus, outputting the result. The circuitry at the bottom supports the shift right operation, which doesn't fit into the general circuit of the ALU. For this blog post, I'll ignore how the control signals are generated.10
The 8086 and other processors of that era were built from a type of transistor called NMOS. The silicon substrate was "doped" by diffusion of arsenic or boron to form conductive silicon and transistors. On top of the silicon, polysilicon wiring created the gates of the transistors and wired components together. Finally, a metal layer on top provided more wiring. (In comparison, modern processors are built from CMOS technology, which combines NMOS and PMOS transistors, and they have many layers of metal wiring.)
The diagram above shows the structure of an NMOS transistor. The transistor can be viewed as a switch, allowing current to flow between two diffusion regions called the source and drain. The transistor is controlled by the gate, made of a special type of silicon called polysilicon. A high voltage on the gate lets current flow between the source and drain, while low voltage on the gate blocks the current flow.
The simplest logic gate is an inverter; the diagram below shows how an inverter is built from an NMOS transistor and a resistor.11 The pinkish regions are doped silicon, while the brownish lines are the polysilicon wiring on top. A transistor is formed where the polysilicon line crosses the doped silicon. With a low input, the transistor is off, so the pull-up resistor pulls the output high. With a high input, the transistor turns on. This connects the output to ground, pulling the output low. Thus, the input signal is inverted.
A more complex gate, such as the 2-input NOR gate below, uses similar principles. With low inputs, the transistors are turned off, so the pullup resistor pulls the output high. If either input is high, the corresponding transistor turns on and pulls the output low. Thus, this circuit implements a NOR gate. The die layout matches the schematic, but has a complicated appearance due to space-saving optimization. You might expect the transistors to be simple rectangles, but the silicon regions have irregular shapes to make the most use of the space. In addition, other transistors (not part of the NOR gate) share the ground connections to save space.
The multiplexers are built using a completely different technique: pass transistors. Instead of pulling the output to ground, pass transistors pass an input signal through to the output. In the multiplexer, each input is connected to a different pair of transistors. Depending on the arguments, exactly one pair will have both transistors on. For instance, if arg2 is 0 and arg1 is 1, the transistor pair in the upper left will connect ctl01 to the output. Each other input will be blocked by a transistor that is turned off. Thus, the multiplexer selects one of the four inputs, passing it through to the output. (This pass-transistor approach is more compact than building a multiplexer out of standard logic gates.)
The diagram below shows an ALU stage with the major components labeled. You may spot the inverter, NOR gate, and multiplexer described earlier. Other components are implemented with similar techniques. This diagram can be compared with the earlier schematic. The reddish horizontal lines are remnants of the metal layer, which was removed for this photo. These lines carried the control signals, power, and ground.
The diagram below (from the 8086 patent) shows how the ALU is connected to the rest of the processor by the ALU bus. The discussion above covered the "Full Function ALU" in the middle of the diagram, which takes two 16-bit inputs and produces a 16-bit output. These inputs are supplied from three temporary registers: A, B, and C. (These temporary registers are invisible to the programmer and should not be confused with the 8086's AX, BX, and CX registers.) I'll mention a few features of these registers that will be important later. Any register can provide the ALU's first input, but the second input always comes from the B register. These registers have a bidirectional connection to the ALU bus, so they can be both written and read. One unusual feature of the ALU is that it has a single data connection to the rest of the 8086, through the ALU bus.12 This seems like a bottleneck, since two clock cycles are required to load the registers, followed by another clock cycle to access the result. But apparently the single bus worked well enough for the 8086.
The Processor Status Word (PSW) shown above holds the condition flags, status bits on the ALU result: zero, negative, overflow, and so forth. Although the PSW looks trivial in the diagram above, the die photo at the top of the article shows that it constitutes about a third of the ALU circuitry. I'll leave the flag circuitry for a later discussion due to its complexity: each flag has unique circuitry that handles many special cases.
The schematic below shows one bit of the reverse-engineered implementation of the ALU's temporary registers. The registers are implemented with latches; each box represents a latch, a circuit that holds one bit. The two large AND-NOR gates act as multiplexers, selecting the output from one of the latches. The upper gate selects one of the registers for reading. The lower gate selects one of the registers as an argument for the ALU.
While the 6-input AND-NOR gate multiplexer may look complex, it is straightforward to implement with NMOS transistors. The schematic shows how it is built from 6 transistors and a pull-up. You can verify that if both transistors in a pair are energized, the output will be pulled to ground, providing the AND-NOR function.
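As a quick check of the AND-NOR behavior, here is a small Python sketch. It assumes each transistor pair consists of a select line and a latch output, which is an interpretation for illustration; `and_nor` is a hypothetical name.

```python
def and_nor(pairs):
    """Model an AND-NOR gate built from series transistor pairs.

    Each pair is (select, data). If both transistors in any pair are on,
    the output is pulled to ground; otherwise the pull-up holds it high.
    """
    pulled_low = any(select and data for select, data in pairs)
    return 0 if pulled_low else 1

# Selecting the first of three registers: the output is the inverse
# of that register's bit, and the other two bits are ignored.
out_when_bit_is_1 = and_nor([(1, 1), (0, 1), (0, 0)])
out_when_bit_is_0 = and_nor([(1, 0), (0, 1), (0, 1)])
```

With exactly one select line high, the gate acts as an inverting multiplexer.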
The latch circuit is shown below. I've written about the 8086's latches in detail, so I'll give just a quick summary. The idea of the latch is that it can stably hold either a 0 or a 1 bit. When the clock signal clk' is high, the upper transistor is on, connecting the inverters into a loop. If the input to the first inverter is 1, it outputs a 0 to the second inverter, which outputs a 1 to the first, so they stay in that state, storing the bit. Similarly, the loop is stable if the input is a 0.
The special thing about this latch is that it's a dynamic latch. When the clock signal clk' is low, the loop is broken, but the input on the first inverter remains, due to the capacitance of the wire and transistor. When clk' goes high again, this voltage is refreshed. Alternatively, when clk' is low, a new value can be loaded into the latch by activating load, turning on the first transistor and allowing a new input signal to pass into the latch. The 8086 uses dynamic latches because the latch is compact, using just two transistors and two inverters. The latch is implemented in silicon as shown below.
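The latch's behavior can be approximated with a tiny state machine. This Python sketch is a logic-level idealization: it treats the stored charge as a perfect variable and ignores leakage, and the class and parameter names (`DynamicLatch`, `clk_bar`, `load`) are assumptions for illustration.

```python
class DynamicLatch:
    """Idealized one-bit dynamic latch: two inverters plus two pass transistors."""

    def __init__(self, value=0):
        # Voltage on the first inverter's input, held by capacitance.
        self.node = value

    def step(self, clk_bar, load, data=0):
        if load and not clk_bar:
            # Load transistor on: the new input passes into the latch.
            self.node = data
        first = 1 - self.node    # first inverter
        second = 1 - first       # second inverter
        if clk_bar:
            # Feedback transistor on: the loop re-drives (refreshes) the node.
            self.node = second
        # With both clk_bar and load low, the node simply keeps its charge.
        return second
```

A real latch would lose its charge if the clock stopped for long enough, which this model does not capture.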
The diagram below summarizes the components of the temporary register implementation. This circuitry is repeated 16 times to complete the registers.13 The output from the registers is fed into the ALU circuitry described earlier.
Although the Intel 8086 has complex circuits, its features are large enough that it can be studied under a microscope. The ALU is a key part of the processor and takes up a large fraction of the die. Its circuitry can be reverse-engineered through careful examination, revealing its interesting construction. It uses a Manchester carry chain for fast carry propagation. The carry-generate and carry-propagate signals are created by multiplexers that operate as arbitrary function generators, creating a flexible ALU with a small amount of circuitry. The ALU is built from a combination of standard logic, pass-transistor logic, and dynamic logic to optimize performance and minimize size.
You might have noticed that the 8086's ALU doesn't have support for multiplication, division, or multiple-bit shifts, even though the 8086 has instructions for these operations. These operations are computed in microcode using simpler ALU operations (shift, add, subtract for multiplication and division, and repeated single-bit shifts for larger shifts).
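To show how a multiplication can be built from nothing but adds and single-bit shifts, here is a generic shift-and-add sketch in Python. It is the textbook algorithm for unsigned values, not a transcription of the 8086's microcode, and the function name is hypothetical.

```python
def multiply_shift_add(a, b, width=16):
    """Unsigned multiply using only addition and one-bit shifts."""
    mask = (1 << width) - 1
    multiplicand = a & mask
    multiplier = b & mask
    product = 0
    for _ in range(width):
        if multiplier & 1:
            product += multiplicand   # add when the low multiplier bit is set
        multiplicand <<= 1            # single-bit left shift
        multiplier >>= 1              # single-bit right shift
    return product
```

Each loop iteration corresponds to one or two simple ALU operations, which suggests why such instructions take many clock cycles.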
Some features of the ALU remain to be described, in particular the condition flags and how the ALU control signals are generated from opcodes. I plan to write about these soon, so follow me on Twitter @kenshirriff or RSS for updates.
The ALU size almost always matches the processor word size, but there are exceptions. Notably, the Z-80 is an 8-bit processor but has a 4-bit ALU. As a result, the Z-80's ALU runs twice for each arithmetic operation, processing half the byte at a time. Some early computers used a 1-bit ALU to keep costs down, but these serial processors were slow. ↩
The ALU's layout has bits 15-8 in the top and bits 7-0 below. This layout is a consequence of the bit ordering in the data path: the bits are interleaved 15-7-14-6-...-8-0, instead of linearly 15-14-...-0. The reason behind this interleaving is that it makes it easy to swap the two bytes in the 16-bit word, by swapping pairs of bits. The ALU is split into two rows so it fits into the horizontal space available. Even with the tall, narrow layout of an ALU stage, a bit of the ALU is wider than a bit of the register file. Splitting the ALU into two rows keeps the bit spacing approximately the same, avoiding long wires between the register file and the ALU. ↩
The Manchester carry chain was developed by the University of Manchester and described in the article Parallel addition in digital computers: a new fast 'carry' circuit, 1959. It was used in the Atlas supercomputer (1962). ↩
The ALU also uses carry-skip techniques to speed up carry calculation; I'll briefly summarize. The idea of carry-skip is to skip over some of the stages in the carry chain if possible, reducing the worst-case delay through the chain. For example, if there is a carry-in to bit 8, and the carry-propagate is set for bits 8, 9, 10, and 11, then it can be immediately determined that there is a carry-in to bit 12. Thus, by ANDing together the carry-in and the four carry-propagate values, the carry-in to bit 12 can be calculated immediately for this case. In other words, the carry skips from bit 8 to bit 12. Likewise, similar carry-skip circuits allow the carry to skip from bit 2 to bit 4, and bit 4 to bit 8. These carry-skip circuits reduced the ALU's worst-case computation time. The carry-skip circuitry explains why each stage in the ALU is similar but not quite identical. Note that for logic operations or shift, either carry-propagate or carry-generate is 0, so the carry-skip won't activate and corrupt the result. ↩
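The bit 8 to bit 12 skip can be sketched in a few lines of Python. This is a logic-level illustration for the addition case only; `carry_into` is a slow ripple reference and `skip_8_to_12` is a hypothetical name for the skip condition described above.

```python
def carry_into(a, b, bit, carry_in=0):
    """Reference: ripple the carry through the chain up to the given bit."""
    carry = carry_in
    for i in range(bit):
        bit_a = (a >> i) & 1
        bit_b = (b >> i) & 1
        carry = carry if (bit_a ^ bit_b) else (bit_a & bit_b)
    return carry

def skip_8_to_12(a, b, carry_into_8):
    """Return 1 if the carry into bit 8 can skip straight to bit 12."""
    propagate = a ^ b
    all_propagate = ((propagate >> 8) & 0xF) == 0xF   # bits 8, 9, 10, 11
    return 1 if (carry_into_8 and all_propagate) else 0
```

When the skip fires, the carry into bit 12 is known to be 1 without waiting for the chain. When it does not fire, the result still comes from the normal chain.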
I should mention how subtraction is handled. A typical ALU inverts one of the inputs before adding, reusing the addition circuitry for subtraction. However, the 8086's ALU implements subtraction by changing the inputs to the multiplexers, as shown below. This leverages the general-purpose multiplexer and avoids implementing separate negation circuitry. (The comparison operation is implemented as subtraction but without storing the result. If the difference is zero, the values are equal, while a positive difference indicates the first value is larger.)
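As a rough illustration of subtraction through the multiplexers, the Python sketch below computes A - B as A + NOT B + 1 by changing only the truth tables and the carry-in. The specific table values are an assumption that yields the right arithmetic; they are not read from the chip, which also inverts its carry signals.

```python
def mux4(table, a, b):
    """Pick one of four truth-table values, selected by inputs a and b."""
    return table[(a << 1) | b]

XNOR = (1, 0, 0, 1)          # propagate: a XOR (NOT b)
A_AND_NOT_B = (0, 0, 1, 0)   # generate: a AND (NOT b)

def subtract(a, b, width=16):
    """Compute (a - b) modulo 2**width with the adder-style bit loop."""
    carry = 1   # the "+ 1" of two's-complement negation
    result = 0
    for i in range(width):
        bit_a = (a >> i) & 1
        bit_b = (b >> i) & 1
        propagate = mux4(XNOR, bit_a, bit_b)
        generate = mux4(A_AND_NOT_B, bit_a, bit_b)
        result |= (propagate ^ carry) << i
        carry = carry if propagate else generate
    return result
```

No separate inverter for B appears anywhere; the inversion is folded into the table values.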
The typical way a processor implements a left shift by one bit is by adding the value to itself. I don't know why the 8086 used the carry approach rather than the adding approach. ↩
An FPGA (field-programmable gate array) uses similar techniques to implement arbitrary logic functions. The truth table is stored in a lookup table (LUT). These lookup tables are typically larger; a 6-input lookup table has 2^6 = 64 entries. One difference between the FPGA and the ALU is that the FPGA is programmed and then the gate functions are fixed, while the ALU's gates can change functions every operation. ↩
The carry-generate multiplexer returns 0 if argument 1 is 0. In other words, it only implements two cases of the truth table and has two control inputs. To handle the other two cases, it is pulled low by the clock signal so it outputs 0. Because it is driven by the clock and depends on the value held by the circuit capacitance, it is a form of dynamic logic. The 8086 primarily uses standard static logic, but uses dynamic logic in some places. ↩
The control signals for the ALU are generated from a PLA (similar to a ROM) that takes a 5-bit opcode as input. This opcode can either come from the instruction or be specified by the microcode. For an instruction, the ALU portion of the instruction is typically bits 5-3 of the first byte of the instruction or bits 5-3 of the MOD R/M byte. The point of this is that one microcode routine can handle all the similar arithmetic/logic instructions, making the microcode smaller. The ALU control PLA generates the signals to perform the correct ALU operation, transparently to the microcode. I should mention that there are many more ALU control signals than I described. Many of these control the flag handling, while others control various special cases.
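The bits 5-3 field can be extracted with a shift and a mask. The Python sketch below uses the standard x86 encoding of the eight basic arithmetic/logic operations in that field; the function names are hypothetical, and it models only this 3-bit field, not the full 5-bit PLA input.

```python
# Standard x86 encoding of the 3-bit ALU operation field.
ALU_OPS = ("ADD", "OR", "ADC", "SBB", "AND", "SUB", "XOR", "CMP")

def alu_op_field(byte):
    """Extract bits 5-3 from an opcode byte or a MOD R/M byte."""
    return (byte >> 3) & 0b111

def alu_op_name(byte):
    return ALU_OPS[alu_op_field(byte)]

from_opcode = alu_op_name(0x29)   # SUB r/m16, r16: field is in the opcode
from_modrm = alu_op_name(0xE8)    # MOD R/M byte following opcode 0x81
```

Both calls select SUB: in the first case the field comes from the opcode byte itself, and in the second from the MOD R/M byte of an immediate-operand instruction.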
The control signals pass through the peculiar circuit below. If the input is high, it sends a clock pulse to the ALU. Otherwise, it remains low. The drive signal is discharged to ground on the negative clock phase by the lower transistor. In the absence of an input, the signal is not driven during the positive clock phase, but remains low due to dynamic capacitance. One mystery is the transistor with its gate tied to +5V, leaving it permanently on, which seems pointless. It will reduce the voltage to the gate of the clk transistor, and thus the output voltage, but I don't see why. Maybe to reduce current? To slow the signal?
The pull-up resistor in an NMOS gate is implemented by a special depletion-mode transistor. The depletion-mode transistor acts as a resistor but is more compact and performs better than an actual resistor. ↩
In the 6502, the two inputs of the ALU are connected to separate buses (details), so they can both be loaded at the same time. The 8085 (and many other early microprocessors) connect the accumulator register to one input of the ALU to avoid use of the bus (details). ↩
The silicon implementation of the lower eight bits of the ALU / registers is flipped compared to the upper eight bits. The motivation is to put the ALU signals next to the flag circuitry that needs these signals. Since the flag circuitry is sandwiched between the two halves of the ALU, the two halves become (approximate) mirror images. (See the die photo at the top of the article.) ↩
The Nanoprocessor is a mostly-forgotten processor developed by Hewlett-Packard in 19741 as a microcontroller2 for their products. Strangely, this processor couldn't even add or subtract,3 which is probably why it was called a nanoprocessor and not a microprocessor. Despite this limitation, the Nanoprocessor powered numerous Hewlett-Packard devices ranging from interface boards and voltmeters to spectrum analyzers and data capture terminals.4 The Nanoprocessor's key features were its low cost and high speed: compared against the contemporary Motorola 6800,7 the Nanoprocessor cost $15 instead of $360 and was an order of magnitude faster for control tasks.
Recently, the six masks used to manufacture the Nanoprocessor were released by Larry Bower, the chip's designer, revealing details about its design. The composite mask image below shows the internal circuitry of the integrated circuit.5 The blue layer shows the metal on top of the chip, while the green shows the silicon underneath. The black squares around the outside are the 40 pads for connection to the IC's external pins. I used these masks to reverse-engineer the circuitry of the processor and understand its simple but clever RISC-like design.6
The Nanoprocessor was designed in 1974, the same year that the classic Intel 8080 and Motorola 6800 microprocessors were announced. However, the Nanoprocessor's silicon fabrication technology was a few years behind, using metal-gate transistors rather than silicon-gate transistors that were developed in the late 1960s. This may seem like an obscure difference, but silicon gate technology was much better in several ways. First, silicon-gate transistors were smaller, faster, and more reliable. Second, silicon-gate chips had a layer of polysilicon wiring in addition to the metal wiring; this made chip layouts about twice as dense.8 Third, metal-gate circuitry required an additional +12 V power supply. The Intel 4004 processor used silicon gates in 1971, so I'm surprised that HP was still using metal gates in 1974.9
A bizarre characteristic of the Nanoprocessor is its variable substrate bias voltage. For performance reasons, many 1970s microprocessors applied a negative voltage to the silicon substrate, with -5V provided through a bias pin.10 The Nanoprocessor has a bias pin, but strangely the bias voltage varied from chip to chip, from -2 volts to -5 volts. During manufacturing, the required voltage was hand-written on the chip (below). Each Nanoprocessor had to be installed with a matching resistor to provide the right voltage. If a Nanoprocessor was replaced on a board, the resistor had to be replaced as well. The variable bias voltage seems like a flaw in the manufacturing process; I can't imagine Intel making a processor like that.
Like most processors of that era, the Nanoprocessor was an 8-bit processor. However, it didn't use RAM, but ran code from an external 2-kilobyte ROM. It contained 16 8-bit registers, more than most processors and enough to make up for the lack of RAM in many applications. Based on transistor count, the Nanoprocessor is more complex than the Intel 8008 (1972) and slightly less complex than the 6800 (1974) or 6502 (1975).11 Its architecture spends those transistors differently from these processors, though. The Nanoprocessor lacks ALU functionality but in exchange, it has a large register set, taking up much of the die area. The Nanoprocessor has 48 instructions, a considerably smaller instruction set than the 6800's 72 instructions. However, the Nanoprocessor includes convenient bit set, clear, and test operations, which these other processors lacked.12 The Nanoprocessor supports indexed register access, but lacks the complex addressing modes of the other processors.
The block diagram below shows the internal structure of the Nanoprocessor. The main I/O feature is the 4-bit "I/O Instruction Device Select" which allows 15 devices to receive I/O operations. In other words, the select pins indicate which I/O device is being read or written over the data lines. External circuitry uses these signals to do whatever is necessary for the particular application, such as storing the data in a latch, sending it to another system, or reading values. More I/O is provided through seven "Direct Control I/O" pins (GPIO pins) that can be used for inputs or outputs. If not connected to external circuitry, these pins operate as convenient bit flags; the Nanoprocessor can set a value and then read it back. The Control Logic Unit performs increments, decrements, shifts, and bit operations on the accumulator, lacking the arithmetic and logical operations of a standard Arithmetic/Logic Unit (ALU).
I reverse-engineered the Nanoprocessor's circuitry from the masks and determined how the functional blocks map onto the die, below. The largest feature is the set of 16 registers in the center-left. To the right is the comparator and then the accumulator, along with its increment, decrement, shift, and complement circuitry. The instruction decoder circuitry takes up much of the space above and to the right of the comparator and accumulator. The bottom part of the chip is dominated by the 11-bit program counter, along with the one-entry interrupt stack and subroutine stack. The control circuitry implements the Nanoprocessor's almost-trivial instruction timing: one fetch cycle followed by one execute cycle.13 In most microprocessors, the control circuitry takes up a large fraction of the chip, but the Nanoprocessor's control circuitry is just a small block.
The chip was fabricated using six masks, each used for constructing one layer of the processor using photolithography. The photo below shows the masks; each one is a 47.2×39.8 cm Mylar sheet. These sheets are 100× enlargements of the masks used to produce the 4.72×3.98 mm silicon die (for comparison, about 33% smaller than the 6800's die). Each 3-inch silicon wafer held about 200 integrated circuits, fabricated together on the wafer, and then tested, cut apart, and packaged.
To explain the role of the masks, I'll start with the structure of a metal-gate MOSFET, the transistor used in the Nanoprocessor. At the bottom, two regions of silicon (green) are doped to make them conductive, forming the source and drain of the transistor. A metal strip in between forms the gate, separated from the silicon by a thin layer of insulating oxide. (These layers—Metal, Oxide, Semiconductor—give the MOS transistor its name.) The transistor can be considered a switch controlled by the gate. The metal layer also provides the main wiring of the integrated circuit, although the silicon layer is also used for some wiring.
Masks are a key part of the integrated circuit construction process, specifying the position of the components. The diagram below shows how a mask is used to dope regions of the silicon. First, the silicon wafer is oxidized to form an insulating oxide layer on top, and then light-sensitive photoresist is applied. Ultraviolet light polymerizes and hardens the photoresist, except where the mask blocks the light. Next, the soft, unexposed photoresist is dissolved. The wafer is exposed to hydrofluoric acid, which removes the oxide layer where it is not protected by photoresist. This yields holes in the oxide that match the mask pattern. The wafer is then exposed to a high-temperature gas which diffuses into the unprotected silicon regions, modifying the silicon's conductivity. These processing steps create tiny doped silicon regions matching the mask's pattern. As will be shown below, the other masks are used for different processing steps, but using the same photoresist-and-mask process.
I'll zoom in on the Nanoprocessor's die and show how one of its circuits is constructed from the six masks. (This two-transistor circuit is an inverter, flipping the binary value of its input.) The first mask dopes regions of silicon to make them conductive, using the photolithography steps described above. The doped regions (green) will become transistor source/drains or wiring between components.
Next, the die is covered with an insulating oxide layer. The second mask (magenta) is used to etch openings in the oxide, exposing the silicon underneath. These openings will be used to create transistor gates as well as connecting metal wiring to the silicon.
The third mask (gray) exposes a region to ion implantation, which changes the doping of the silicon, and thus the transistor's properties. This turns the upper transistor into a special depletion-mode transistor that pulls logic gate outputs high.
Next, the silicon is covered with an additional thin layer of insulating oxide, forming the gate oxide for the transistors. The fourth mask (orange) removes this oxide from regions that will become contacts between the silicon and the metal layer. After this step, most of the die is covered with a thick insulating oxide layer. The oxide layer is very thin over the transistor gates (magenta), and there are contact holes in the oxide from the current mask (orange).
The fifth mask (blue) is used to create the metal wiring on top; a uniform metal layer is applied and then the undesired parts are etched off. In locations where the fourth mask created holes in the oxide, the metal layer contacts the silicon and forms a connection. In locations where the second mask's openings left only a thin oxide layer, the metal layer forms the transistor gate between two silicon regions. Finally, the entire wafer is covered with a protective glassy layer. The sixth mask (not shown) is used to form holes in this layer over the pads around the edges of each chip. Once the wafer is cut into individual silicon dies (dice?), bond wires are attached to the pads, connecting the die to the external pins.
The schematic below shows how the circuitry above forms a two-transistor inverter. The two transistor symbols correspond to the two transistors created by the masks. When the input is low, the lower transistor is off, so the upper transistor (connected to +5 volts) pulls the output high. When the input is high, it turns on the lower transistor. This connects the output to ground, pulling the output low. Thus, the circuit inverts the input.
Although the diagrams above show just a single inverter, these masking steps create the entire processor with its 4639 transistors.11 The diagram below shows a larger part of the die with dozens of transistors forming more complex gates and circuitry. One cute thing I noticed on the masks is a tiny heart with HP inside, below the chip's number.14
To understand how the Nanoprocessor was used in practice, I reverse-engineered the code from an HP 98035 clock module. This module was plugged into an HP desktop computer15 to provide a real-time clock, as well as millisecond-accurate timings, intervals, and periodic events. The design of the clock module was rather unusual. To preserve the time when the computer was powered down, the clock module was built around a digital watch chip with a backup battery.17 Inconveniently, the digital watch chip wasn't designed for computer control: it generated 7-segment signals to drive an LED, and it was set through three buttons. To read the time, the Nanoprocessor had to convert the 7-segment display outputs back into digits. And to set the time, the Nanoprocessor had to simulate the right sequence of button presses to advance through the digits.
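Converting 7-segment outputs back into digits amounts to a table lookup. The Python sketch below shows the idea using the conventional a-g segment patterns; the bit assignment (bit 0 = segment a through bit 6 = segment g) and the function names are assumptions for illustration and are not taken from the clock module's ROM or wiring.

```python
# Conventional 7-segment patterns, bit 0 = segment a ... bit 6 = segment g.
# (Assumed encoding; some chips draw 6, 7, and 9 with slightly different shapes.)
SEGMENTS_TO_DIGIT = {
    0x3F: 0, 0x06: 1, 0x5B: 2, 0x4F: 3, 0x66: 4,
    0x6D: 5, 0x7D: 6, 0x07: 7, 0x7F: 8, 0x6F: 9,
}

def decode_digit(segments):
    """Return the digit shown by a segment pattern, or None if unrecognized."""
    return SEGMENTS_TO_DIGIT.get(segments & 0x7F)

def decode_digits(patterns):
    """Decode a sequence of segment patterns into a list of digits."""
    return [decode_digit(p) for p in patterns]

hours_minutes = decode_digits([0x06, 0x5B, 0x66, 0x6D])   # digits 1, 2, 4, 5
```

A processor without a convenient table lookup might instead test individual segment bits, which fits the Nanoprocessor's bit-test instructions, but that is speculation here.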
The host computer controlled the clock module by sending it ASCII strings such as "S 12:07:12:45:00" to set the clock to 12:45:00 on December 7 (or on July 12 if the module was running in European mode). The module's various interval timers, periodic alarms, and counters were controlled with similar commands such as "Unit 2 Period 12345". The module supported 24 different commands, and the Nanoprocessor had to parse them. (See the manual for details.)
Here's some sample code reverse-engineered from the clock board ROM. This code is from the interrupt handler that increases the time and date every second. The code below determines how many days are in the current month so it knows when to move to the next month. The columns are the byte value, the corresponding opcode, and my description of the instruction.
This code takes a month number (01-12 BCD) in the accumulator and returns (in register 0) the number of days in the month (28, 30, or 31 BCD). Not bad for 16 bytes of code, even if it ignores leap years. How does it work? For months past 7 (July), it subtracts 1. Then, if the month is odd, it has 31 days, while an even month has 30 days. To handle February, the code clears bit 1 of the month. If the month is now 0 (i.e. February), it has 28 days.
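The logic described above can be restated in Python to check that it gives the right answer for every month. This is a paraphrase of the described steps, not the Nanoprocessor code itself; the function name is hypothetical, and the comparison and decrement are done on the raw BCD byte as plain binary operations.

```python
def days_in_month_bcd(month_bcd):
    """Return the days in a month (as BCD), ignoring leap years.

    month_bcd is the month in BCD: 0x01 through 0x12.
    """
    m = month_bcd
    if m > 0x07:
        m -= 1                    # past July, the odd/even pattern flips
    days = 0x31 if (m & 1) else 0x30   # odd months have 31 days, even have 30
    m &= 0xFD                     # clear bit 1
    if m == 0:
        days = 0x28               # only February (0x02) becomes 0 here
    return days

MONTHS_BCD = [0x01, 0x02, 0x03, 0x04, 0x05, 0x06,
              0x07, 0x08, 0x09, 0x10, 0x11, 0x12]
table = [days_in_month_bcd(m) for m in MONTHS_BCD]
```

Note that the binary decrement turns October's 0x10 into 0x0F rather than a valid BCD value, but only the low bit is examined afterwards, so the result is still correct.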
This code demonstrates that even though a processor without addition sounds useless, the Nanoprocessor's bit operations and increment/decrement allow more computation than you'd expect.16 It also shows that Nanoprocessor code is compact and efficient. Many things can be done in a single byte (such as bit test and skip) that would take multiple bytes on other processors.12 The Nanoprocessor's large register file also avoids much of the tedious shuffling of data back and forth often required in other processors. Although some call the Nanoprocessor more of a state machine controller than a microprocessor, that understates the capabilities and role of the Nanoprocessor.
While the Nanoprocessor doesn't include an ALU or have instructions for accessing RAM, these could be added as I/O devices. The clock module has 256 bytes of RAM to hold its multiple counter and timer values, accessed through four I/O ports. Other products added ALU chips to support arithmetic operations.18
The Nanoprocessor is an unusual processor. My first impression was that it wasn't even a "real processor", lacking basic arithmetic functionality. The chip was built with obsolete metal-gate technology, a few years behind other microprocessors. Most bizarrely, each chip required a different voltage, hand-written on the package, suggesting difficulty with manufacturing consistency. However, the Nanoprocessor provided high performance in its microcontroller role, much faster than other processors at the time. Hewlett-Packard used the Nanoprocessor in many products in the 1970s and 1980s, in roles that were more complex than you'd expect, such as parsing strings and performing calculations.
While the Nanoprocessor has languished in obscurity, lacking even a mention on Wikipedia, the masks recently revealed by its designer shed light on this unusual corner of processor history. Thanks to Antoine Bercovici for scanning and remastering the masks, Larry Bower for the donation, and John Culver at The CPU Shack for sharing the donation. Thanks to Marc Verdiell for dumping the clock board ROM.
I'm not completely comfortable calling the Nanoprocessor a microcontroller since it uses an external program ROM, while a microcontroller usually has everything, including the ROM, on a single chip. (It is like the Intel 4004 in this way.) However, the Nanoprocessor resembles a microcontroller in most ways: it is designed for embedded control applications, with a Harvard architecture and an instruction set optimized for I/O, running a program from ROM with minimal storage. ↩
On the topic of computers that can't add, the desk-sized IBM 1620 computer (1959) didn't have addition circuitry, but used table lookup for addition. It had the codename CADET; people joked that this stood for "Can't Add, Doesn't Even Try." ↩
I've determined that the Nanoprocessor was used in the following HP products (and probably others): HP 9845B, HP 3585A spectrum analyzer, HP 3325A Synthesizer / Function Generator, HP 9885 floppy disk drive, HP 3070B data capture terminal, HP 98034 HPIB interface for the HP 9825 calculator, HP 98035 real time clock for the HP 9825 desktop computer, HP 7970E tape drive interface, HP 4262A LCR meter, HP 3852 Spectrum Analyzer, and HP 3455A voltmeter. ↩
The Nanoprocessor is like a RISC (Reduced Instruction Set Computer) processor in many ways, although it predated the RISC concept by several years. In particular, the Nanoprocessor is designed with a simple opcode structure, all instructions execute in one cycle (after the fetch cycle), the register set is large and orthogonal, and addressing is simple. These RISC characteristics yielded a high clock speed compared to more complex processors. ↩
Interestingly, the Nanoprocessor's competition during development was the Motorola 6800, rather than an Intel processor. The Nanoprocessor's key feature was performance: it ran at 4 MHz, compared to 1 MHz for the 6800. (Both processors took 2 cycles to perform a basic instruction, while the 6800 took up to 7 cycles for more complex instructions.)
The Nanoprocessor designers wrote a timing comparison, estimating that the Nanoprocessor could count six times faster than the 6800 and handle interrupts over sixteen times faster. The proposal assumed a 5 MHz Nanoprocessor while the actual chip fell a bit short, running at 4 MHz. The projected cost of the Nanoprocessor was $15 per chip, compared to $360 for the Motorola 6800. ↩
I'm impressed with the density of the Nanoprocessor's layout given its limitations: one layer of metal wiring and no polysilicon. I've looked at other metal-gate chips and their layouts are horribly inefficient, with a lot more wiring than transistors. However, the Nanoprocessor's circuits are arranged efficiently, with very little wasted space. ↩
The Nanoprocessor's fabrication technology was ahead of the Intel 8080 and Motorola 6800 in one way: it used depletion-mode pull-up transistors, more advanced than the enhancement-mode transistors in the 8080 and 6800. Depletion-mode transistors resulted in faster, lower-power logic gates, but required an additional manufacturing step. For the Nanoprocessor, this step used mask #3 (the gray mask). In processors such as the MOS Technology 6502 and Zilog Z-80, depletion-mode transistors allowed the processor to run off a single voltage instead of three. Unfortunately, the Nanoprocessor still required three voltages due to its metal-gate transistors. ↩
Early DRAM memory chips and microprocessor chips often required three supplies: +5V (Vcc), +12V (Vdd) and -5V (Vbb) bias voltage. In the late 1970s, improvements in chip technology allowed a single supply to be used instead. The Intel 8080 microprocessor (1974) used enhancement-mode transistors and required three voltages, but the improved 8085 (1976) used depletion-mode transistors and was powered by a single +5V supply. Starting in the late 1970s, many microprocessors used an on-chip charge pump to generate the negative bias voltage. I wrote about the 8086's charge pump here. ↩
By my count, the Nanoprocessor has 4639 transistors. The instruction decoder is constructed from pairs of small transistors for layout reasons; combining these pairs yields 3829 unique transistors. Of these, 1061 act as pull-ups, while 2768 are active. In comparison, the 6502 has 4237 transistors, of which 3218 are active. The 8008 has 3500 transistors and the Motorola 6800 has 4100 transistors. ↩↩
Early microprocessors didn't have bit set, reset, and test operations (although these could be accomplished with AND and OR). The Z-80 (1976) added bit operations, but they took two bytes and were much slower than the Nanoprocessor. ↩↩
The Nanoprocessor sticks to its model of executing the instruction in one cycle even for two-byte instructions: The second byte is fetched during the execute cycle, so the overall timing is unchanged. ↩
The Nanoprocessor has two different part numbers. The 1820-1691 was the 2.66 MHz version, while the 1820-1692 was the 4 MHz version. The last digit of the part number was hand-written on each chip after testing its performance. (The part number is unrelated to the chip's number 9-4332A on the die.) ↩
The HP 9825 was a 16-bit desktop computer, running a BASIC-like language. It was introduced in 1976, five years before the IBM PC, and was a remarkably advanced system for its time. The back of the HP 9825 had three I/O slots for adding modules such as the real time clock.
I came across one place in the code where it needs to add two BCD digits to form one byte. This was accomplished by a loop that decremented one number while incrementing the second. When the first number reached zero, the result was the sum. Thus, even without an ALU, addition is possible but slow. ↩
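The counting trick is easy to sketch in Python. This is only an illustration of the idea described above, not the actual Nanoprocessor code; the function name `add_by_counting` is made up for the example.

```python
def add_by_counting(a: int, b: int) -> int:
    """Add two small non-negative values using only decrement, increment and a zero test."""
    while a != 0:
        a -= 1   # decrement the first number...
        b += 1   # ...while incrementing the second
    return b     # once the first number reaches zero, the second holds the sum


# For two BCD digits (0-9 each) the loop runs at most 9 times.
print(add_by_counting(7, 5))  # prints 12
```

The running time grows with the value of the first operand, which is why this is tolerable for single digits but slow as a general-purpose adder.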
The Texas Instruments watch chip was implemented with Integrated Injection Logic (I2L) to keep power consumption low. Nowadays, a low-power chip would use CMOS, but that wasn't common at the time. Integrated Injection Logic was built from bipolar transistors, similar to TTL, but using different high-density, low-power circuitry. I discussed Integrated Injection Logic in detail in this blog post. The Texas Instruments chip may be the X-902 in a DIP package. ↩
The clock board schematic shows how the two 256×4 RAM chips are connected to the Nanoprocessor. The Nanoprocessor's I/O port select pins are connected to the "3-8 Decoder" U5, which produces a separate signal for each I/O port. Three of these signals go to the RAM chip's control pins, while one signal controls the Data Latch chips U9 and U10 that hold write data.
All I/O ports use the Nanoprocessor's data bus (top) for communication, so the data bus is connected to both the address and data pins of the RAM chips. For a read, the RAM address is written to the RAM chips via one I/O port and then the data is read from RAM via a second port. In both cases, the values go across the data bus, while the signal from the "3-8 Decoder" indicates what to do with the values. For a write, the first I/O operation stores the byte value in the latches, and then the second I/O operation sends the address to the RAM chips. While this may seem like a clunky, Rube-Goldberg approach, it works well in practice; a read or write can be done with two bytes of instructions.
(Many processors, such as the 6502, used memory-mapped I/O; I/O devices were mapped into the memory address space and accessed through memory read/write operations. The Nanoprocessor is the opposite, putting RAM into the I/O port space and accessing it through I/O operations.)
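The two-step read and write sequences can be modeled with a toy Python class. This is a sketch under assumptions: the port names (`READ_ADDR`, `READ_DATA`, `LATCH`, `WRITE_ADDR`) are hypothetical labels for the decoder outputs, and the model ignores all timing.

```python
class PortMappedRam:
    """Toy model of 256 bytes of RAM reached only through I/O ports."""

    def __init__(self) -> None:
        self.cells = [0] * 256      # two 256x4 chips side by side hold 256 bytes
        self.read_address = 0       # address supplied for a pending read
        self.write_latch = 0        # byte held in the data latches

    def out(self, port: str, value: int) -> None:
        value &= 0xFF               # everything travels over the 8-bit data bus
        if port == "READ_ADDR":     # step 1 of a read: send the address
            self.read_address = value
        elif port == "LATCH":       # step 1 of a write: latch the data byte
            self.write_latch = value
        elif port == "WRITE_ADDR":  # step 2 of a write: the address triggers the store
            self.cells[value] = self.write_latch
        else:
            raise ValueError(f"unknown port {port!r}")

    def inp(self, port: str) -> int:
        if port == "READ_DATA":     # step 2 of a read: fetch the addressed byte
            return self.cells[self.read_address]
        raise ValueError(f"unknown port {port!r}")


ram = PortMappedRam()
ram.out("LATCH", 0x42)        # write: data first...
ram.out("WRITE_ADDR", 0x10)   # ...then the address
ram.out("READ_ADDR", 0x10)    # read: address first...
print(hex(ram.inp("READ_DATA")))  # prints 0x42
```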
Adding an ALU uses a similar approach, as in the HP 3455A voltmeter (schematic), which contains two Nanoprocessors. The voltmeter uses two 74LS181 ALU chips to implement an 8-bit ALU that it uses to scale values and compute percentage error. Two output ports provide the arguments and another port specifies the operation. The 8-bit result is read from a port, while the processor reads the carry through a GPIO pin. (At this point, I'd wonder if it wouldn't be better to use a processor that includes arithmetic.) ↩
A Field-Programmable Gate Array (FPGA) can implement arbitrary digital logic, anything from a microprocessor to a video generator or crypto miner. An FPGA consists of many logic blocks, each typically consisting of a flip flop and a logic function, along with a routing network that connects the logic blocks. What makes an FPGA special is that it is programmable hardware: you can redefine each logic block and the connections between them. The result is you can build a complex digital circuit without physically wiring up individual gates and flip flops or going to the expense of designing a custom integrated circuit.
The FPGA was invented by Ross Freeman1 who co-founded Xilinx2 in 1984 and introduced the first FPGA, the XC2064. 3 This FPGA is much simpler than modern FPGAs—it contains just 64 logic blocks, compared to thousands or millions in modern FPGAs—but it led to the current multi-billion-dollar FPGA industry. Because of its importance, the XC2064 is in the Chip Hall of Fame. I reverse-engineered Xilinx's XC2064, and in this blog post I explain its internal circuitry (above) and how a "bitstream" programs it.
Nowadays, an FPGA is programmed in a hardware description language such as Verilog or VHDL, but back then Xilinx provided their own development software, an MS-DOS application named XACT with a hefty $12,000 price tag. XACT operated at a lower level than modern tools: the user defined the function of each logic block, as shown in the screenshot below, and the connections between logic blocks. XACT routed the connections and generated a bitstream file that could be loaded into the FPGA.
An FPGA is configured via the bitstream, a sequence of bits with a proprietary format. If you look at the bitstream for the XC2064 (below), it's a puzzling mixture of patterns that repeat irregularly with sequences scattered through the bitstream. There's no clear connection between the function definitions in XACT and the data in the bitstream. However, studying the physical circuitry of the FPGA reveals the structure of the bitstream data, making it possible to understand.
The diagram below, from the original FPGA patent, shows the basic structure of an FPGA. In this simplified FPGA, there are 9 logic blocks (blue) and 12 I/O pins. An interconnection network connects the components together. By setting switches (diagonal lines) on the interconnect, the logic blocks are connected to each other and to the I/O pins. Each logic element can be programmed with the desired logic function. The result is a highly programmable chip that can implement anything that fits in the available circuitry.
While the diagram above shows nine configurable logic blocks (CLBs), the XC2064 has 64 CLBs. The diagram below shows the structure of each CLB. Each CLB has four inputs (A, B, C, D) and two outputs (X and Y). In between is combinatorial logic, which can be programmed with any desired logic function. The CLB also contains a flip flop, allowing the FPGA to implement counters, shift registers, state machines and other stateful circuits. The trapezoids are multiplexers, which can be programmed to pass through any of their inputs. The multiplexers allow the CLB to be configured for a particular task, selecting the desired signals for the flip flop controls and the outputs.
You might wonder how the combinatorial logic implements arbitrary logic functions. Does it select between a collection of AND gates, OR gates, XOR gates, and so forth? No, it uses a clever trick called a lookup table (LUT), in effect holding the truth table for the function. For instance, a function of three variables is defined by the 8 rows in its truth table. The LUT consists of 8 bits of memory, along with a multiplexer circuit to select the right value. By storing values in these 8 bits of memory, any 3-input logic function can be implemented.4
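To make the lookup-table idea concrete, here is a minimal Python sketch. The helper names (`program_lut`, `lut_lookup`) and the bit ordering (A as the least significant address bit) are assumptions for illustration, not the XC2064's actual layout.

```python
from itertools import product


def program_lut(func) -> list[int]:
    """Fill the 8 memory bits with the truth table of a 3-input function."""
    # product() varies its last element fastest, so the index is a + 2*b + 4*c.
    return [func(a, b, c) & 1 for c, b, a in product((0, 1), repeat=3)]


def lut_lookup(bits: list[int], a: int, b: int, c: int) -> int:
    """The multiplexer: the inputs simply address one of the stored bits."""
    return bits[a | (b << 1) | (c << 2)]


xor3 = program_lut(lambda a, b, c: a ^ b ^ c)
majority = program_lut(lambda a, b, c: (a & b) | (a & c) | (b & c))
print(xor3)      # [0, 1, 1, 0, 1, 0, 0, 1]
print(majority)  # [0, 0, 0, 1, 0, 1, 1, 1]
```

The same hardware computes XOR, majority, or any other 3-input function; only the eight stored bits differ.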
The second key piece of the FPGA is the interconnect, which can be programmed to connect the CLBs in different ways. The interconnect is fairly complicated, but a rough description is that there are several horizontal and vertical line segments between each CLB. Interconnect points allow connections to be made between a horizontal line and a vertical line, allowing arbitrary paths to be created. More complex connections are done via "switch matrices". Each switch matrix has 8 pins, which can be wired together in (almost) arbitrary ways.
The diagram below shows the interconnect structure of the XC2064, providing connections to the logic blocks (cyan) and the I/O pins (yellow). The inset shows a closeup of the routing features. The green boxes are the 8-pin switch matrices, while the small squares are the programmable interconnection points.
The interconnect can wire, for example, an output of block DC to an input of block DE, as shown below. The red line indicates the routing path and the small red squares indicate activated routing points. After leaving block DC, the signal is directed by the first routing point to an 8-pin switch (green) which directs it to two more routing points and another 8-pin switch. (The unused vertical and horizontal paths are not shown.) Note that routing is fairly complex; even this short path used four routing points and two switches.
The screenshot below shows what routing looks like in the XACT program. The yellow lines indicate routing between the logic blocks. As more signals are added, the challenge is to route efficiently without the paths colliding. The XACT software package performs automatic routing, but routes can also be edited manually.
The remainder of this post discusses the internal circuitry of the XC2064, reverse-engineered from die photos.5 Be warned that this assumes some familiarity with the XC2064.
The die photo below shows the layout of the XC2064 chip. The main part of the FPGA is the 8×8 grid of tiles; each tile holds one logic block and the neighboring routing circuitry. Although FPGA diagrams show the logic blocks (CLBs) as separate entities from the routing that surrounds them, that is not how the FPGA is implemented. Instead, each logic block and the neighboring routing are implemented as a single entity, a tile. (Specifically, the tile includes the routing above and to the left of each CLB.)
Around the edges of the integrated circuit, I/O blocks provide communication with the outside world. They are connected to the small green square pads, which are wired to the chip's external pins. The die is divided by buffers (green): two vertical and two horizontal. These buffers amplify signals that travel long distances across the circuit, reducing delay. The vertical shift register (pink) and horizontal column select circuit (blue) are used to load the bitstream into the chip, as will be explained below.
The diagram below shows the layout of a single tile in the XC2064; the chip contains 64 of these tiles packed together as shown above. About 40% of each tile is taken up by the memory cells (green) that hold the configuration bits. The top third (roughly) of the tile handles the interconnect routing through two switch matrices and numerous individual routing switches. Below that is the logic block. Key parts of the logic block are multiplexers for the input, the flip flop, and the lookup tables (LUTs). The tile is connected to neighboring tiles through vertical and horizontal wiring for interconnect, power and ground. The configuration data bits are fed into the memory cells horizontally, while vertical signals select a particular column of memory cells to load.
The FPGA is implemented with CMOS logic, built from NMOS and PMOS transistors. Transistors have two main roles in the FPGA. First, they can be combined to form logic gates. Second, transistors are used as switches that signals pass through, for instance to control routing. In this role, the transistor is called a pass transistor. The diagram below shows the basic structure of an MOS transistor. Two regions of silicon are doped with impurities to form the source and drain regions. In between, the gate turns the transistor on or off, controlling current flow between the source and drain. The gate is made of a special type of silicon called polysilicon, and is separated from the underlying silicon by a thin insulating oxide layer. Above this, two layers of metal provide wiring to connect the circuitry.
The die photo closeup below shows what a transistor looks like under a microscope. The polysilicon gate is the snaking line between the two doped silicon regions. The circles are vias, connections between the silicon and the metal layer (which has been removed for this photo).
The configuration information in the XC2064 is stored in configuration memory cells. Instead of using a block of RAM for storage, the FPGA's memory is distributed across the chip in a 160×71 grid, ensuring that each bit is next to the circuitry that it controls. The diagram below shows how the configuration bitstream is loaded into the FPGA. The bitstream is fed into the shift register that runs down the center of the chip (pink). Once 71 bits have been loaded into the shift register, the column select circuit (blue) selects a particular column of memory and the bits are loaded into this column in parallel. Then, the next 71 bits are loaded into the shift register and the next column to the left becomes the selected column. This process repeats for all 160 columns of the FPGA, loading the entire bitstream into the chip. Using a shift register avoids bulky memory addressing circuitry.
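The loading process can be sketched in a few lines of Python. This is a simplified model under assumptions: it starts at the rightmost column and keeps the bits of each 71-bit group in arrival order, neither of which is confirmed here; the function name `load_bitstream` is made up.

```python
COLUMNS, ROWS = 160, 71   # configuration memory dimensions given above


def load_bitstream(bits: list[int]) -> list[list[int]]:
    """Return memory[column][row] after shifting in the whole bitstream."""
    if len(bits) != COLUMNS * ROWS:
        raise ValueError("expected 160 * 71 = 11360 configuration bits")
    memory = [[0] * ROWS for _ in range(COLUMNS)]
    column = COLUMNS - 1                            # assumed starting column
    for start in range(0, len(bits), ROWS):
        shift_register = bits[start:start + ROWS]   # 71 bits shifted in serially
        memory[column] = list(shift_register)       # written to the column in parallel
        column -= 1                                 # next column to the left
    return memory


memory = load_bitstream([0] * (COLUMNS * ROWS))
print(len(memory), len(memory[0]))  # prints 160 71
```

Note that no address is ever supplied: the position of a bit in the stream alone determines which memory cell it lands in.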
The important point is that the bitstream is distributed across the chip exactly as it appears in the file: the layout of bits in the bitstream file matches the physical layout on the chip. As will be shown below, each bit is stored in the FPGA next to the circuit it controls. Thus, the bitstream file format is directly determined by the layout of the hardware circuits. For instance, when there is a gap between FPGA tiles because of the buffer circuitry, the same gap appears in the bitstream. The content of the bitstream is not designed around software concepts such as fields or data tables or configuration blocks. Understanding the bitstream depends on thinking of it in hardware terms, not in software terms.7
Each bit of configuration memory is implemented as shown below.8 Each memory cell consists of two inverters connected in a loop. This circuit has two stable states so it can store a bit: either the top inverter is 1 and the bottom is 0 or vice versa. To write to the cell, the pass transistor on the left is activated, passing the data signal through. The signal on the data line simply overpowers the inverters, writing the desired bit. (You can also read the configuration data out of the FPGA using the same path.) The Q and inverted Q outputs control the desired function in the FPGA, such as closing a routing connection, providing a bit for a lookup table, or controlling the flip flops. (In most cases, just the Q output is used.)
The diagram below shows the physical layout of memory cells. The photo on the left shows eight memory cells, with one cell highlighted. Each horizontal data line feeds all the memory cells in that row. Each column select line selects all the memory cells in that column for writing. The middle photo zooms in on the silicon and polysilicon transistors for one memory cell. The metal layers were removed to expose the underlying transistors. The metal layers wire together the transistors; the circles are connections (vias) between the silicon or polysilicon and the metal. The schematic shows how the five transistors are connected; the schematic's physical layout matches the photo. Two pairs of transistors form two CMOS inverters, while the pass transistor in the lower left provides access to the cell.
As explained earlier, the FPGA implements arbitrary logic functions by using a lookup table. The diagram below shows how a lookup table is implemented in the XC2064. The eight values on the left are stored in eight memory cells. Four multiplexers select one of each pair of values, depending on the value of the A input; if A is 0, the top value is selected and if A is 1 the bottom value is selected. Next, a larger multiplexer selects one of the four values based on B and C. The result is the desired value, in this case A XOR B XOR C. By putting different values in the lookup table, the logic function can be changed as desired.
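The two-level multiplexer tree can be mimicked in Python. This is a sketch, assuming the second-stage multiplexer is steered by the two remaining inputs, B and C, and assuming a cell order in which A varies fastest; the names `mux2` and `lut3` are made up.

```python
def mux2(select: int, top: int, bottom: int) -> int:
    """2:1 multiplexer: select=0 passes the top input, select=1 the bottom."""
    return bottom if select else top


def lut3(cells: list[int], a: int, b: int, c: int) -> int:
    """Evaluate an 8-cell lookup table the way the multiplexer tree does."""
    # First stage: four small multiplexers, each choosing within a pair by A.
    stage1 = [mux2(a, cells[i], cells[i + 1]) for i in range(0, 8, 2)]
    # Second stage: one 4:1 multiplexer (assumed to be steered by B and C).
    return stage1[b | (c << 1)]


xor_cells = [0, 1, 1, 0, 1, 0, 0, 1]   # assumed cell order: A fastest, then B, then C
for c in (0, 1):
    for b in (0, 1):
        for a in (0, 1):
            assert lut3(xor_cells, a, b, c) == a ^ b ^ c
print("all 8 input combinations match A XOR B XOR C")
```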
Each multiplexer is implemented with pass transistors. Depending on the control signals, one of the pass transistors is activated, passing that input to the output. The diagram below shows part of the LUT circuitry, multiplexing two of the bits. At the right are two of the memory cells. Each bit goes through an inverter to amplify it, and then passes through the multiplexer's pass transistors in the middle, selecting one of the bits.
Each CLB contains a flip flop, allowing the FPGA to implement latches, state machines, and other stateful circuits. The diagram below shows the (slightly unusual) implementation of the flip flop. It uses a primary/secondary design. When the clock is low, the first multiplexer lets the data into the primary latch. When the clock goes high, the multiplexer closes the loop for the first latch, holding the value. (The bit is inverted twice going through the OR gate, NAND gate, and inverter, so it is held constant.) Meanwhile, the secondary latch's multiplexer receives the bit from the first latch when the clock goes high (note that the clock is inverted). This value becomes the flip flop's output. When the clock goes low, the secondary's multiplexer closes the loop, latching the bit. Thus, the flip flop is edge-sensitive, latching the value on the rising edge of the clock. The set and reset lines force the flip flop high or low.
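A behavioral Python model shows why two level-sensitive latches on opposite clock phases act as an edge-triggered flip flop. This is a simplification: it omits the set and reset lines and the gate-level details, and the class name is made up.

```python
class PrimarySecondaryFlipFlop:
    """Two transparent latches clocked on opposite phases."""

    def __init__(self) -> None:
        self.primary = 0
        self.secondary = 0

    def step(self, clock: int, data: int) -> int:
        if clock == 0:
            self.primary = data            # primary is transparent; secondary holds
        else:
            self.secondary = self.primary  # primary holds; secondary passes it on
        return self.secondary


ff = PrimarySecondaryFlipFlop()
trace = [(0, 1), (1, 1), (1, 0), (0, 0), (1, 0)]   # (clock, data) samples
print([ff.step(clk, d) for clk, d in trace])        # [0, 1, 1, 1, 0]
```

In the third sample the data input drops while the clock is still high, but the output stays at 1: only the value present at the rising edge is captured.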
The switch matrix is an important routing element. Each switch has eight "pins" (two on each side) and can connect almost any combination of pins together. This allows signals to turn, split, or cross over, allowing more flexibility than the individual routing nodes. The diagram below shows part of the routing network between four CLBs (cyan). The switch matrices (green) can be connected with any combination of the connections on the right. Note that each pin can connect to 5 of the 7 other pins. For instance, pin 1 can connect to pin 3 but not pin 2 or 4. This makes the matrix almost a crossbar, with 20 potential connections rather than 28.
The switch matrix is implemented by a row of pass transistors controlled by memory cells above and below. The two sides of the transistor are the two switch matrix pins that can be connected by that transistor. Thus, each switch matrix has 20 associated control bits;9 two matrices per tile yields 40 control bits per tile. The photo below indicates one of the memory cells, connected to the long squiggly gate of the pass transistor below. This transistor controls the connection between pin 5 and pin 1. Thus, the bit in the bitstream corresponding to that memory cell corresponds to the switch connection between pin 5 and pin 1. Likewise, the other memory cells and their associated transistors control other switch connections. Note that the ordering of these connections follows no particular pattern; consequently, the mapping between bitstream bits and the switch pins appears random.
The inputs to a CLB use a different encoding scheme in the bitstream, which is explained by the hardware implementation. In the diagram below, the eight circled nodes are potential inputs to CLB box DD. Only one node (at most) can be configured as an input, since connecting two signals to the same input would short them together.
The desired input is selected using a multiplexer. A straightforward solution would use an 8-way multiplexer, with 3 control bits selecting one of the 8 signals. Another straightforward solution would be to use 8 pass transistors, each with its own control signal, with one of them selecting the desired signal. However, the FPGA uses a hybrid approach that avoids the decoding hardware of the first approach but uses 5 control signals instead of the eight required by the second approach.
The schematic above shows the two-stage multiplexer approach used in the FPGA. In the first stage, one of the control signals is activated. The second stage picks either the top or bottom signal for the output.10 For instance, suppose control signal B/F is sent to the first stage and 'ABCD' to the second stage; input B is the only one that will pass through to the output. Thus, selecting one of the eight inputs requires 5 bits in the bitstream and uses 5 memory cells.
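The selection logic can be sketched in Python. The first-stage signal names other than B/F (that is, A/E, C/G and D/H) are assumed by analogy, and the function name `selected_input` is made up for the example.

```python
FIRST_STAGE = ("A/E", "B/F", "C/G", "D/H")   # one-hot: exactly one is active
SECOND_STAGE = ("ABCD", "EFGH")              # picks the top or bottom group


def selected_input(first: str, second: str) -> str:
    """Return the name of the single input that both stages let through."""
    if first not in FIRST_STAGE or second not in SECOND_STAGE:
        raise ValueError("unknown control signal")
    top_name, bottom_name = first.split("/")   # first stage passes one input per group
    return top_name if second == "ABCD" else bottom_name   # second stage picks a group


print(selected_input("B/F", "ABCD"))  # prints B
all_choices = sorted(selected_input(f, s) for f in FIRST_STAGE for s in SECOND_STAGE)
print(all_choices)  # ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
```

Four first-stage signals times two second-stage choices cover all eight inputs without a binary decoder.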
The XC2064 uses a variety of highly-optimized circuits to implement its logic blocks and routing. This circuitry required a tight layout in order to fit onto the die. Even so, the XC2064 was a very large chip, larger than microprocessors of the time, so it was difficult to manufacture at first and cost hundreds of dollars. Compared to modern FPGAs, the XC2064 had an absurdly small number of cells, but even so it sparked a revolutionary new product line.
Two concepts are the key to understanding the XC2064's bitstream. First, the FPGA is implemented from 64 tiles, repeated blocks that combine the logic block and routing. Although FPGAs are described as having logic blocks surrounded by routing, that is not how they are implemented. The second concept is that there are no abstractions in the bitstream; it is mapped directly onto the two-dimensional layout of the FPGA. Thus, the bitstream only makes sense if you consider the physical layout of the FPGA.
I've determined how most of the XC2064 bitstream is configured (see footnote 11) and I've made a program to generate the CLB information from a bitstream file. Unfortunately, this is one of those projects where the last 20% takes most of the time, so there's still work to be done. One problem is handling I/O pins, which are full of irregularities and their own routing configuration. Another problem is the tiles around the edges have slightly different configurations. Combining the individual routing points into an overall netlist also requires some tedious graph calculations.
Xilinx was one of the first fabless semiconductor companies. Unlike most semiconductor companies that designed and manufactured semiconductors, Xilinx only created the design while a fab company did the manufacturing. Xilinx used Seiko Epson Semiconductor Division (as in Seiko watches and Epson printers) for their initial fab. ↩
Custom integrated circuits have the problems of high cost and the long time (months or years) to design and manufacture the chip. One solution was Programmable Logic Devices (PLD), chips with gate arrays that can be programmed with various functions, which were developed around 1967. Originally they were mask-programmable; the metal layer of the chip was designed for the desired functionality, a new mask was made, and chips were manufactured to the specifications. Later chips contained a PROM that could be "field programmed" by blowing tiny fuses inside the chip to program it, or an EPROM that could be reprogrammed. Programmable logic devices had a variety of marketing names including Programmable Logic Array, Programmable Array Logic (1978), Generic Array Logic and Uncommitted Logic Array. For the most part, these devices consisted of logic gates arranged as a sum-of-products, although some included flip flops. The main innovation of the FPGA was to provide a programmable interconnect between logic blocks, rather than a fixed gate architecture, as well as logic blocks with flip flops. For an in-depth look at FPGA history and the effects of scalability, see Three Ages of FPGAs: A Retrospective on the First Thirty Years of FPGA Technology. Also see A Brief History of FPGAs. ↩
The lookup tables in the XC2064 are more complex than just a table. Each CLB contains two 3-input lookup tables. The inputs to the lookup tables in the XC2064 have programmable multiplexers, allowing selection of four different potential inputs. In addition, the two lookup tables can be tied together to create a function on four variables or other combinations.
To analyze the XC2064, I used my own die photos of the XC20186 as well as the siliconpr0n photos of the XC2064 and XC2018. Under a light microscope, the FPGA is hard to analyze because it has two metal layers. John McMaster used his electron microscope to help disambiguate the two layers. The photo below shows how the top metal layer is emphasized by the electron microscope.
The Xilinx XC2018 FPGA (below) is a 100-cell version of the XC2064 FPGA. Internally, it uses the same tiles as the 64-cell XC2064, except it has a 10×10 grid of tiles instead of an 8×8 grid. The bitstream format of the XC2018 is very similar, except with more entries.
The image below compares the XC2064 die with the XC2018 die. The dies are very similar, except the larger chip has two more rows and columns of tiles.
While the bitstream directly maps onto the hardware layout, the bitstream file (.RBT) does have a small amount of formatting, shown below.
The configuration memory is implemented as static RAM (SRAM) cells. (Technically, the memory is not RAM since it must be accessed sequentially through the shift register, but people still call it SRAM.) These memory cells have five transistors, so they are known as 5T SRAM.
One question that comes up is if there are any unused bits in the bitstream. It turns out that many bits are unused. For instance, each tile has an 18×8 block of bits assigned to it, of which 27 bits are unused. Looking at the die shows that the memory cell for an unused bit is omitted entirely, allowing that die area to be used for other circuitry. The die photo below shows 9 implemented bits and one missing bit.
The switch matrix has 20 pass transistors. Since each tile is 18 memory cells wide, two of the transistors are connected to slightly more distant memory cells. ↩
A few notes on the CLB input multiplexer. First, the control signal EFGH is the complement of ABCD, so only one control signal is needed in the bitstream and only one memory cell for this signal. Second, other inputs to the CLB have 6 or 10 choices; the same two-level multiplexer approach is used, changing the number of inputs and control signals. Finally, a few of the control signals are inverted (probably because the inverted memory output was closer). This can cause confusion when trying to understand the bitstream, since some bits appear to select 6 inputs instead of 2. Looking at the complemented bit, instead, restores the pattern. ↩
The following table summarizes the meaning of each bit in a tile's 8×18 part of the bitstream. Each entry in the table corresponds to one bit in the bitstream and indicates what part of the FPGA is controlled by that bit. Empty entries indicate unused bits.
|#2: 1-3||#2: 3-4||PIP D2,D5 (bit inverted)||Gin_3 = D||G = 1 2' 3'|
|#2: 1-2||#2: 2-6||#2: 2-4||PIP A2,A5 (bit inverted)||Gin_3 = C||G = 1' 2' 3'|
|#2: 3-7||#2: 3-6||PIP D3, D4, D5||PIP A3, A4, A5||G = 1' 2 3'|
|#2: 2-7||#2: 2-8||ND 11||PIP A1, A4||G = 1 2 3'|
|#2: 1-5||#2: 3-5||PIP A3, AX||PIP D1, D4||Y=F||G = 1 2' 3|
|#2: 4-8||#2: 5-8||ND 10||PIP D3, DX||Y=G||Gin_2 = B||G = 1' 2' 3|
|#2: 7-8||#2: 6-8||ND 9||PIP B2, B5, B6, BX, BY||PIP Y2||X=G||Gin_1 = A||G = 1' 2 3|
|#2: 5-6||#2: 5-7||ND 8||PIP B3,BX (bit inverted)||PIP Y4||X=F||G = 1 2 3|
|#2: 4-6||#2: 1-4||#2: 1-7||PIP C1, C3, C4, C7||PIP X3||Q = LATCH||Base FG (separate LUTs)|
|#1: 3-5||#1: 5-8||#1: 2-8||PIP X2|
|#1: 3-4||#1: 2-4||ND 7||PIP C3,CX (bit inverted)||PIP X1||Fin_1 = A||F = 1 2 3|
|#1: 1-2||#1: 1-3||ND 6||PIP B6, B7||CLK = enabled||Fin_2 = B||F = 1' 2 3|
|#1: 1-5||#1: 1-4||ND 5||PIP C6, C7||CLK = inverted (FF), noninverted (LATCH)||F = 1' 2' 3|
|#1: 4-8||#1: 4-6||ND 4||PIP C4, C5||CLK = C||F = 1 2' 3|
|#1: 2-7||#1: 1-7||ND 3||PIP B4, B5||PIP K1||SET = F||F = 1 2 3'|
|#1: 2-6||#1: 3-6||ND 2||PIP B2, BC||PIP K2||SET = none||F = 1' 2 3'|
|#1: 7-8||#1: 3-7||ND 1||PIP C1, C2||PIP Y3||RES = D or G||Fin_3 = C||F = 1' 2' 3'|
|#1: 6-8||#1: 5-6||#1: 5-7||PIP B1, BY||PIP Y1||RES = G||Fin_3 = D||F = 1 2' 3'|
The first two columns of the table indicate the switch matrices. There are two switch matrices, labeled #1 (red) and #2 (green) in my diagram below. The 8 pins on matrix #1 are labeled 1-8 clockwise. (Switch #2 is the same, but there wasn't room for the labels.) For example, "#2: 1-3" indicates that bit connects pins 1 and 3 on switch #2. The next column defines the "ND" non-directional connections, the boxes below with purple numbers near the switch matrices. Each ND bit in the table controls the corresponding ND connection.
The next two columns describe what I'm calling the PIP connections, the solid boxes on the lines above. The connections from output X (brown) are controlled by individual bits (X1, X2, X3). Likewise, the connections from output Y (yellow). The connections to input B (light purple) are different. Only one of these input connections can be active at a time, so they are encoded with multiple bits using the multiplexer scheme. Inputs C (cyan), D (blue) and A (green) are similar. The remaining table columns describe the CLB; refer to the datasheet for details. Bits control the clock, set and reset lines. The X and Y outputs can be selected from the F or G LUTs. The last two columns define the LUTs. There are three inputs for LUT F and three inputs for LUT G, with multiplexers controlling the inputs. Finally, the 8 bits for each LUT are defined, specifying the output for a particular combination of three inputs. ↩
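As a quick sanity check, the switch-matrix entries can be tallied with a short Python script. The pin pairs below are copied from the "#1" entries in the table; the parsing helper `parse_entry` is made up for the example.

```python
from collections import Counter

# Switch matrix #1 entries copied from the table above ("#1: p-q").
MATRIX_1 = """
3-5 5-8 2-8 3-4 2-4 1-2 1-3 1-5 1-4 4-8
4-6 2-7 1-7 2-6 3-6 7-8 3-7 6-8 5-6 5-7
""".split()


def parse_entry(entry: str) -> tuple[int, int]:
    """Turn a table entry such as '3-5' into a pair of pin numbers."""
    a, b = entry.split("-")
    return int(a), int(b)


connections = {frozenset(parse_entry(e)) for e in MATRIX_1}
per_pin = Counter(pin for pair in connections for pin in pair)
print(len(connections))         # 20
print(sorted(per_pin.items()))  # each of the 8 pins appears 5 times
```

The tally agrees with the description of the switch matrix: 20 distinct connections, with every pin able to reach 5 of the other 7.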
Various FPGA patents provide some details on the chips: 4870302, 4642487, 4706216, 4758985, and RE34363. XACT documentation was formerly at Xilinx, but they seem to have removed it. It can now be found here. John McMaster has some xc2064 tools available. ↩
A recent Twitter thread about a counterfeit analog multiplier chip attracted my attention since I'm interested in both counterfeit integrated circuits and how analog computers multiply. In the thread, John McMaster decapped a suspicious AD633 analog multiplier chip and found an entirely different Rockwell RC4200 die inside. Why would someone do this? Probably because the RC4200 (1978) currently sells for about 85 cents, while the more modern laser-trimmed1 AD633 (1989) sells for about $7.2
Analog multiplication has many uses such as mixers, modulators, and phase detectors, but analog computers are how I encountered analog multiplication. A typical analog computer uses voltages to represent values and is wired up through a plugboard to solve a particular equation. Adding or subtracting two values is easy with an op amp, as is multiplying by a constant. Integration seems like it would be difficult, but it's almost trivial with a capacitor; analog computers excelled at solving differential equations.
Multiplying two values, however, was surprisingly difficult; multiplication techniques were slow, inaccurate, noisy, or expensive. One accurate but slow multiplier used the Rube-Goldberg configuration of servo motors turning potentiometers.3 A 1969 multiplier circuit uses a light bulb and photocells. A fast and accurate approach was the "parabolic multiplier", built from numerous expensive high-precision resistors.4 The approach I'll discuss is to multiply by adding the logarithms and taking the exponential. Inconveniently, this approach magnifies even small differences between the transistors. It is also very sensitive to temperature. As a result, this approach was simple but inaccurate.
However, the development of analog integrated circuits created new opportunities for analog multiplication circuits. In particular, since the transistors in an integrated circuit were created together, they have nearly-identical properties. And the components on a tiny silicon die are all at nearly the same temperature.5
The first analog multiplier integrated circuit I could find is a television demodulator from 1967. The Gilbert cell technique was introduced by Barrie Gilbert in 1968 and is used in most analog multipliers today.6 The AD530 was introduced around 1970, and became an industry standard, but required external adjustments for accuracy. Laser-trimming the resistors inside the integrated circuit during manufacturing greatly improved the accuracy, an approach used in the AD633, the integrated circuit that was counterfeited.
Before explaining the circuitry of the RC4200 (the multiplier inside the counterfeit chip), I'll discuss the components that it is constructed from, and how they appear in an integrated circuit. This will help you recognize these structures in the die photo.
Transistors are the key components in a chip. The photo below shows an NPN transistor in the RC4200 as it appears on the chip. The different blue colors are regions of silicon that have been doped differently, forming N and P regions. The white lines are the metal layer of the chip on top of the silicon—these form the wires connecting to the emitter (E), base (B), and collector (C).
You might expect PNP transistors to be similar to NPN transistors, just swapping the roles of N and P silicon. But for a variety of reasons, PNP transistors have an entirely different construction. They consist of a circular emitter (P), surrounded by a ring-shaped base (N), which is surrounded by the collector (P). This forms a P-N-P sandwich horizontally (laterally), unlike the vertical structure of the NPN transistors. The diagram below shows one of the PNP transistors in the RC4200.
The input and output transistors in the RC4200 are larger than the other transistors and have a different structure to support higher currents. The photo below shows one of the output transistors. Note the multiple interdigitated "fingers" of the emitter and base.
Capacitors are important in op amps to provide stability. A capacitor can be built in an integrated circuit as a large metal plate separated from the silicon by an insulating oxide layer. The main drawback of capacitors on ICs is they are physically very large. The 15pF capacitors in the RC4200 have a very small capacitance but take up a large fraction of the die area. In the photo below, the red arrows indicate the connection to the capacitor's metal layer and to the capacitor's underlying silicon layer.
Resistors are a key component of analog chips. Unfortunately, resistors in ICs are very inaccurate; the resistances can vary by 50% from chip to chip. The photo below shows four resistors, formed using different techniques. The first resistor is the zig-zagging blue region on the left. It is formed from a strip of P silicon, with metal wiring (white) attached on the left and right. Its resistance is 3320 Ω. The resistor in the upper right is much shorter, so it is only 511 Ω (long, narrow resistors have higher resistance than short, wide resistors). The remaining resistors are 20 kΩ despite their small size because they are "pinch resistors". In the pinch resistor, the square layer of brownish N silicon on top makes the conductive region much thinner (i.e. pinches it). This allows a much higher resistance for a given size. (Otherwise, a 20 kΩ resistor would be 6 times as long as the first resistor, taking up excessive space.) The tradeoff is the pinch resistor is much less accurate.
This integrated circuit multiplies using the log-antilog technique. The idea is that if you take the log of two numbers, add the logs together, and then take the antilog (i.e. exponential), you get the product of the two numbers. Conveniently, transistors have a logarithmic / exponential characteristic: the current through the transistor is an exponential of the voltage on the base. Specifically, if VBE is the voltage between the transistor's base and emitter, the current through the collector (IC) is an exponential of that voltage, as shown in the graph below. The analog multiplier takes advantage of this property.
The main complication with this approach is that the curve above is very sensitive to the temperature and to the manufacturing characteristics of the transistor. Because the curve is exponential, even a small shift in the curve will radically change the current. This was a serious difficulty when building a multiplier from discrete transistors, since the properties varied from transistor to transistor. To stabilize the temperature, some multipliers used a temperature-controlled oven. However, using an integrated circuit mostly solved these problems. The transistors in an integrated circuit are well-matched since they were built from the same piece of silicon under the same conditions. And the transistors in an integrated circuit die will be at almost the same temperature. Thus, integrated circuits made transistor-log circuits much more practical.
The diagram below shows the structure of the RC4200 multiplier chip. The user provides three current inputs (I1, I2, and I4) and the chip computes the output current I3, where I3 = I1×I2÷I4. (The use of current inputs and outputs is a bit inconvenient compared to other multipliers, such as the AD633, that use voltages.)
The four transistors in the middle of the diagram are the multiplier core, the key to the IC's operation. The transistors are configured so their base-emitter voltages sum: VBE3 = VBE1+VBE2-VBE4. Because the transistor current is related exponentially to the voltage, the result is that I3 = I1×I2÷I4.
In more detail, first note that the voltages VBE1 through VBE4 control the collector currents IC1 through IC4 through the transistors (below). The op amps adjust the base-emitter voltages so the input currents match the transistor currents, i.e. I1 = IC1 and so forth. (This is accomplished by op amp feedback.) Now, if you go through the loop of base-emitter voltages starting at the base of Q1 and ending at the base of Q4 (red arrows), you find that VBE1+VBE2-VBE3-VBE4 = 0. (The voltages must sum to zero since you start at ground and end at ground.7) Now, because IC is related to exp(VBE), taking the exponential of the equation yields IC1×IC2÷IC3÷IC4 = 1. (Details in footnote8.)
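The log-antilog arithmetic can be checked numerically. The following is a minimal sketch, not a model of the actual RC4200: it assumes ideal, perfectly matched transistors that follow a pure exponential law, and the saturation current `I_S` and the function names are illustrative choices.

```python
import math

V_T = 0.026   # thermal voltage in volts (about 26 mV at room temperature)
I_S = 1e-15   # assumed saturation current in amps, identical for all transistors


def vbe_from_current(i_c):
    """Base-emitter voltage at which an ideal transistor carries current i_c."""
    return V_T * math.log(i_c / I_S)


def current_from_vbe(v_be):
    """Collector current of an ideal transistor for a given base-emitter voltage."""
    return I_S * math.exp(v_be / V_T)


def log_antilog_multiply(i1, i2, i4):
    """Apply the loop equation VBE3 = VBE1 + VBE2 - VBE4, then take the antilog."""
    v_be3 = vbe_from_current(i1) + vbe_from_current(i2) - vbe_from_current(i4)
    return current_from_vbe(v_be3)


i3 = log_antilog_multiply(100e-6, 200e-6, 50e-6)
print(f"I3 = {i3 * 1e6:.1f} µA")
```

With inputs of 100 µA, 200 µA and 50 µA, the result is 400 µA, matching I1×I2÷I4. Because every transistor shares the same `I_S` and `V_T`, those constants cancel out of the result, which is why matching matters so much.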
Next, I'll explain how the VBE voltages are generated. Each current input has an op amp associated with it that produces the "correct" VBE voltage for the current using a feedback loop.9 For example, suppose IC is too low so not all the input current flows through the transistor. The excess current will raise the voltage on the op amp's negative input, causing it to reduce its output voltage and thus the transistor's emitter voltage. This raises VBE (since the base will now be higher compared to the emitter), causing more collector current to flow through the transistor. Similarly, if too much current is flowing through the transistor, the op amp's input will be pulled lower, reducing VBE. Thus, the feedback loop causes the op amp to find the exact VBE for the current input.10
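The feedback idea can be illustrated with a toy discrete-time loop. This is only a sketch under loud assumptions: the real op amp is a continuous analog circuit, and the update rule, the `gain` value, the normalization of the error by the input current, and the starting voltage are all invented here to keep the toy loop stable.

```python
import math

V_T = 0.026   # thermal voltage in volts
I_S = 1e-15   # assumed saturation current in amps


def settle_vbe(i_in, v_be=0.5, gain=0.5, steps=200):
    """Nudge VBE until the transistor current matches the input current.

    If the transistor carries too little current, VBE is raised;
    if it carries too much, VBE is lowered.
    """
    for _ in range(steps):
        i_c = I_S * math.exp(v_be / V_T)   # current the transistor carries now
        error = (i_in - i_c) / i_in        # positive when the current is too low
        v_be += gain * V_T * error         # hypothetical op amp correction step
    return v_be


settled = settle_vbe(100e-6)
expected = V_T * math.log(100e-6 / I_S)
print(f"settled VBE = {settled:.4f} V, log formula gives {expected:.4f} V")
```

The loop never evaluates a logarithm, yet it converges on the logarithmic VBE value, because it keeps adjusting the input of the exponential until the output matches.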
The above circuit works reasonably well, but there's a complication: the transistors have a small emitter resistance R. The voltage drop across this resistance will increase VBE by ICR, disturbing the nice exponential behavior. This creates a nonlinearity that reduces the accuracy of the result. The datasheet says that "Raytheon has developed a unique and proprietary means of inherently compensating for this undesired term." They don't explain this further, but by studying the die I have figured out how it works.
In the compensation circuit, each of the four multiplier transistors is paired with an identical "mirror" transistor with the corresponding emitters and corresponding bases connected, as shown below. These connections give the paired transistors the same base and emitter voltages, so they have the same collector currents. In other words, they form a current mirror. The mirrored currents are fed into special correction resistors that match the undesired emitter resistance, 0.1 Ω according to the datasheet.11 The voltage across the correction resistors will be the same as the excess voltage that needs to be compensated (since the resistance and current are the same). The final step is the correction resistors are connected to the base of the multiplication transistors, replacing the connection to ground. This will shrink VBE by the amount it was erroneously increased, fixing the computation.
Why are there two correction resistors? Recall that the multiplier has two transistors adding and two transistors subtracting (i.e. VBE1+VBE2-VBE3-VBE4 = 0). To handle this, the correction circuit is split in two. The left half sums IC1 and IC2 and applies this current to a correction resistor on the Q3/Q4 side, while the right half sums IC3 and IC4 and applies this to a correction resistor on the Q1/Q2 side. The addition and subtraction work out to yield the desired net correction.
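To get a feel for the size of the emitter-resistance error and how the cross-coupled correction cancels it, here is a rough numerical sketch. It assumes a simplified model in which each transistor's VBE gains an extra I×R term and the two correction voltages enter the loop with the signs shown in the comments; the sign convention and the function names are assumptions, not taken from the datasheet.

```python
import math

V_T = 0.026   # thermal voltage in volts
R_E = 0.1     # emitter resistance in ohms (the datasheet value quoted above)


def ideal_output(i1, i2, i4):
    return i1 * i2 / i4


def solve_loop(i1, i2, i4, compensated, iterations=50):
    """Solve the VBE loop for I3 by fixed-point iteration.

    Assumed loop equation (saturation currents cancel):
      VT*ln(I1) + VT*ln(I2) + (I1+I2)*R + corr_12
        = VT*ln(I3) + VT*ln(I4) + (I3+I4)*R + corr_34
    """
    i3 = ideal_output(i1, i2, i4)
    for _ in range(iterations):
        # Correction on the Q3/Q4 side is driven by the mirrored I1 + I2;
        # correction on the Q1/Q2 side is driven by the mirrored I3 + I4.
        corr_34 = (i1 + i2) * R_E if compensated else 0.0
        corr_12 = (i3 + i4) * R_E if compensated else 0.0
        extra = (i1 + i2) * R_E + corr_12 - (i3 + i4) * R_E - corr_34
        i3 = ideal_output(i1, i2, i4) * math.exp(extra / V_T)
    return i3


i1, i2, i4 = 200e-6, 200e-6, 1e-3
ideal = ideal_output(i1, i2, i4)
raw = solve_loop(i1, i2, i4, compensated=False)
fixed = solve_loop(i1, i2, i4, compensated=True)
print(f"uncompensated error: {100 * (raw / ideal - 1):+.3f}%")
print(f"compensated error:   {100 * (fixed / ideal - 1):+.3f}%")
```

In this model, an emitter resistance of only 0.1 Ω produces an error of a few tenths of a percent at these currents, and the cross-coupled correction removes it.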
The schematic below shows the complete circuitry of the RC4200; I've highlighted the main functional blocks. (Inconveniently, I didn't find this schematic until after I'd traced out the circuitry from the die photo.) The multiplier core and the correction resistors were discussed above. The op amp circuits are fairly similar to the 741 op amp, which I've written about. They lack the output stage of typical op amps; the output transistor (Q112/Q212/Q412) corresponds to the intermediate gain stage in a typical op amp. The bias circuit (orange, lower right) provides a fixed bias voltage for the op amps.12
Before integrated circuits, analog multiplication was difficult to implement. However, integrated circuits made it easy to create matched transistors, leading to fast, inexpensive analog multiplication integrated circuits. Unfortunately, analog multiplier integrated circuits were introduced just as analog computers were dying out, killed by inexpensive digital microprocessors, so analog computing missed most of the benefit of these chips.
While most analog multipliers use a circuit called the Gilbert cell, the Raytheon RC4200 analog multiplier uses a different technique to multiply and divide values represented by currents. Although it includes a special error compensation circuit to improve its accuracy, it is obsolete compared to accurate, laser-trimmed multipliers. Now, counterfeiters re-label RC4200 chips and sell them as the more-expensive AD633 multiplier.
I announce my latest blog posts on Twitter, so follow me at kenshirriff for updates. I also have an RSS feed. Thank you to John McMaster for the die photos used in this blog post; the photos are here.
One reason that the AD633 multiplier is so expensive is that the resistors on the die are laser-trimmed for accuracy. To get an accurate result, an analog multiplier requires exactly-tuned resistances. The older RC4200 requires adjustable external resistors, which is much less convenient. ↩
I'm a bit puzzled by this counterfeit chip. Sometimes people will label a cheap op amp as an expensive op amp, as explained by Zeptobars. At first glance, that's what's going on here: a cheap multiplier repackaged as an expensive one. However, the two multipliers are so different that I can't imagine one working at all in place of the other. Specifically, the AD633 takes differential voltage inputs and outputs two currents (a differential current), and it computes A×B+C. The RC4200, on the other hand, takes current inputs and outputs a single current, and it computes A×B÷C. ↩
An example of a servo multiplier is the Solartron Servo Multiplier from the late 1950s. This 17-pound unit contained a potentiometer controlled by a servo motor, allowing it to multiply numbers represented by +/- 100 volts. It's surprisingly fast considering its mechanical operation, responding in under 30 milliseconds. Power consumption was high: 70 watts, cooled by a fan. (In comparison, the RC4200 chip uses 40 milliwatts of power.)
The 1969 analog computer I'm restoring uses a parabolic multiplier, a technique used for high-accuracy multiplication. The idea is that to compute A×B, you compute ((A+B)² - (A-B)²)/4, which has the same value. That equation looks much more complex than the original product, but is easier to implement on an analog computer because op amps can perform the sums, subtraction, and division by four. Squaring is easier than multiplication because it is a function of a single variable, so it can be implemented by an "arbitrary function generator".
The photo above shows a function board from an analog computer that computes the square, i.e. a parabola. The board approximates the function by multiple piecewise-linear segments, each defined by resistors. (Note the extremely accurate 0.01% resistors on the left.) The metal block in the center holds diodes, temperature-balanced by the metal. Each diode is biased to turn on at a particular voltage; the diodes act as switches, selecting the appropriate resistors for each linear segment. Note the large amount of precision hardware required for multiplication; a single product requires two of these parabolic function boards as well as multiple op amps. ↩
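The parabolic (quarter-square) technique from this footnote can be sketched in a few lines. The piecewise-linear squaring function below only loosely mimics the diode-and-resistor function board; the evenly spaced breakpoints and the `step` size are assumptions made for illustration.

```python
import math


def piecewise_square(x, step=1.0):
    """Approximate x squared with straight segments between evenly spaced breakpoints."""
    x = abs(x)                        # a parabola is symmetric about zero
    x0 = math.floor(x / step) * step  # breakpoint at or below x
    slope = 2 * x0 + step             # slope of the chord from x0 to x0 + step
    return x0 * x0 + (x - x0) * slope


def parabolic_multiply(a, b, square):
    """Compute a*b as ((a+b)^2 - (a-b)^2) / 4 using the supplied squaring function."""
    return (square(a + b) - square(a - b)) / 4


exact = parabolic_multiply(3.0, 4.0, lambda x: x * x)
approx = parabolic_multiply(2.2, 1.5, piecewise_square)
print(exact, approx, 2.2 * 1.5)
```

With an exact squaring function the identity reproduces the product. With the piecewise-linear approximation, each square is off by at most step²/4, so the product is off by at most step²/8. That is why a real function board needs many segments and precise resistors.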
To minimize the effect of temperature on the integrated circuit, the critical multiplier transistors are placed close together in the center of the chip. If there is a thermal gradient across the chip, this will minimize the temperature difference between the transistors. (Compared to putting the transistors in the corners, for instance.) To reduce temperature gradients even more, the datasheet specifies a "thermal symmetry line". Putting a temperature source on this line ensures that the hotter transistors will tend to cancel each other out.
Barrie Gilbert, inventor of the Gilbert cell, has a video explaining translinear circuits, circuits based on the exponential current-voltage relationship of a bipolar transistor. This video explains translinear analog multipliers in detail, discussing two approaches. The first approach, used by the RC4200, is the "log-antilog" approach, where op-amps force and sense the collector currents. The second, used in the AD633 and many other multipliers, is the "integrated" approach, built from voltage-to-current conversion, a differential current-mode core, and current-to-voltage conversion. ↩
I should mention that the chip uses a -15 V supply, so ground is the highest voltage and the other internal voltages are all negative. Just a warning since this makes things confusing and backward compared to circuits where ground is the low voltage. ↩
The relationship between the base voltage and the collector current is given by the Ebers-Moll model. This equation (below) is filled with interesting constants: α: a gain factor (almost 1), k: the Boltzmann constant, IS: the saturation current (extremely small, order of 10⁻¹⁵ A), T: the absolute temperature, q: the charge on the electron. (The temperature in the exponential term reflects the importance of temperature stability for the multiplier.)
Substituting the thermal voltage VT (about 26 mV) for kT/q, making some minor approximations, and taking the log yields:
Substituting that into the multiplier's VBE loop equation yields
Taking the exponential and assuming the transistors all have the same temperature and saturation current yields the desired equation relating the four currents:
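Written out, the steps of this derivation look roughly as follows. This is a sketch that assumes the usual simplified forward-active form of the Ebers-Moll equation, with the "minor approximations" taken to be dropping the −1 term and treating α as 1; the exact form used in the original derivation may differ in detail.

```latex
\begin{align*}
I_C &= \alpha I_S \left( e^{\frac{q V_{BE}}{kT}} - 1 \right)
  && \text{(Ebers-Moll, forward-active)} \\
V_{BE} &\approx V_T \ln\frac{I_C}{I_S}, \qquad V_T = \frac{kT}{q}
  && \text{(take the log)} \\
0 &= V_T \ln\frac{I_{C1}}{I_S} + V_T \ln\frac{I_{C2}}{I_S}
   - V_T \ln\frac{I_{C3}}{I_S} - V_T \ln\frac{I_{C4}}{I_S}
  && \text{(loop: } V_{BE1}+V_{BE2}-V_{BE3}-V_{BE4}=0\text{)} \\
1 &= \frac{I_{C1}\, I_{C2}}{I_{C3}\, I_{C4}}
  && \text{(take the exponential)}
\end{align*}
```

The last line rearranges to I3 = I1×I2÷I4 once the op amps force each collector current to equal its input current.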
In a sense, the op amps compute the inverse of the transistor's exponential function. The transistor takes VBE as an input and produces the exponential current as an output. However, we have the current as the input and want the logarithmic voltage as the output. By using the op amp with a function in its feedback loop, we can find the inverse of a function, in this case giving us the logarithm. That is, the op amp will converge on the output X where f(X) equals the input, i.e. X = f⁻¹(input). The same technique can be used to generate a square root from a multiplier chip: use the multiplier to square its input, and then use an op amp to compute the inverse function, i.e. the square root. ↩
You might wonder why the op amp finds the "correct" value and doesn't overshoot and oscillate. Handwaving away all the theory, the idea is that the capacitor on the op amp input stabilizes it and prevents oscillation. Even so, the datasheet warns that the circuits become unstable as the input currents approach 0. This corresponds to dividing by zero, so it's not surprising that the circuitry doesn't handle it well. Mathematically, the op amp is trying to find ln(0), which isn't going to work. If you want to multiply by zero or negative values, the datasheet describes how the inputs can be biased with resistors to keep the inputs positive but still get the correct answer. ↩
The two resistors below are used for the emitter correction; they have unusual construction and a very small resistance, 0.1 Ω. Each resistor consists of the two vertical stripes, connected together at the bottom; the vertical region in the center is connected to the ground pin, forming the other side of each resistor. These resistors improve the accuracy of the product by correcting for the emitter resistances. Based on their purple color, which doesn't appear elsewhere on the die, they appear to be specially doped. The metal contacts at the bottom cover part of the resistor; I believe that by adjusting the size of the metal contacts, the resistor values can be tuned. I believe that the thick and thin regions allow for coarse and fine tuning.
The bias voltage circuit generates a stable voltage of one diode drop (about 800 mV) from Q4's collector; this voltage biases the op amps. The tricky part is how to keep the power supply voltage from influencing this voltage or the Zener voltage.
The idea is that the Zener diode puts 5.5 volts on the base of Q13. The voltage across R3 will be two diode drops lower (2.8 V) due to Q13 and Q12. This yields a fixed current of 2.8 V / 1430 Ω = 2 mA through Q4, resulting in a stable voltage drop across Q12 and a stable output. But a Zener's voltage fluctuates a bit with current, so the clever part is how the Zener's current is kept stable. Transistors Q14, Q15, and Q16 form a current mirror, so the current through the Zener will match the current through the resistor, which is 2 mA. Thus, the Zener voltage keeps the resistor current and output voltage stable, while the resistor current keeps the Zener stable. The final piece of the puzzle is the FET Q17, which provides a tiny current through the Zener to start the feedback cycle. ↩
Near the end of 1972, Intel introduced their first 8-bit microprocessor, the 8008. Decades later, this processor still influences computing; you probably use an x86 processor that is a descendent of the 8008. One unusual feature of the 8008 processor is its use of a "bootstrap load" or "bootstrap capacitor", a special capacitor circuit to improve performance.1 Federico Faggin, who led the development of the 8008, is the main character in this story; he invented a new way to fabricate bootstrap capacitors for the Intel 4004 and 8008 processors and says it "proved essential to the microprocessor realization" and "without [the bootstrap load], there was no microprocessor."
My photo above shows the tiny silicon die inside the 8008 package. You can barely see the wires and transistors that make up the chip. There are 90 bootstrap capacitors, visible as small yellow rectangles, especially in the upper center. The squares around the outside are the 18 pads that are connected to the external pins by tiny bond wires. 18 pins is a very small number for a microprocessor, but Intel was bizarrely committed to small packages at the time.2 This required inconvenient tradeoffs; the lack of multiple power pins was one factor forcing the use of bootstrap loads.
The 8008 processor's history is more complex than you might expect. Its roots are the Datapoint 2200, a popular computer introduced in 1970 as a programmable terminal. Created before the microprocessor, the Datapoint 2200 contained a board-sized CPU built from individual TTL chips. Datapoint talked with both Intel and Texas Instruments about replacing the processor board with a single MOS chip. Texas Instruments created the TMX 1795 processor in March 1971, while Intel created the 8008 around the end of 1971, but Datapoint rejected both chips for a variety of reasons. Texas Instruments abandoned the TMX 1795 after their attempts to market it failed. Intel, on the other hand, marketed the 8008 as a general-purpose microprocessor, creating the microprocessor industry.
(You might wonder how the Intel 4004 fits into this story. The Intel 4004 is architecturally unrelated to the 8008 in almost every way; despite the similar names, the 8008 is not an 8-bit version of the 4-bit 4004. After the Intel 4004 was launched in 1971, much of the 4004 team (including Faggin, Hoff, Mazor, and Feeney) moved over to the 8008 project. Because the 4004 and 8008 processors were built by the same team with the same PMOS3 process, they have some layout and circuit-level similarities, in particular the bootstrap load circuit.)
The purpose of the bootstrap load is to get extra voltage out of a transistor when necessary. To explain this, I'll start by showing how an inverter works when implemented in a processor. The diagram below shows an inverter, built from a PMOS3 transistor and a load resistor (which is actually a transistor). If the input to the inverter is 0 (low), the lower transistor turns on, pulling the output high (1). But if the input is 1 (high), the output transistor turns off. In that case, the load resistor pulls the output low (0). Thus, the input signal is inverted.
The diagram below shows the physical implementation of an inverter in the 8008 processor. The first die photo shows the inverter as it appears in the chip. The horizontal metal wiring on top provides VDD and the input to the circuit. For the second photo, I dissolved the metal layer to reveal the two transistors that form the circuit. The schematic on the right matches the physical layout of the transistors on the die but otherwise corresponds to the schematic above. Because creating resistors in an integrated circuit is inconvenient, the load resistor is implemented by a transistor.
There's a complication from using a transistor as a load resistor: these MOS transistors have a property called the threshold voltage VT. The problem is that when you try to pull a signal low, the transistor can't pull it all the way low. Although you'd like the signal to get pulled down to VDD (-9 volts), the threshold voltage (say -5 volts)9 means that you can only get the signal down to -4 volts. (This is one of the reasons why the 8008 requires a much larger voltage (15 volts overall) than modern integrated circuits; if you tried to run it at 5 volts, the threshold voltage would consume the entire signal.)
The diagram below explains the threshold voltage in more detail. VD, VG, and VS are the voltages on the drain, gate, and source respectively. VGS is the voltage between the gate and the source. The transistor will turn on if VGS < VT, the threshold voltage. (Inconveniently, most of these voltages are negative in a PMOS transistor, which makes things confusing.) The problem is that with a gate voltage of -9 volts and a threshold voltage of -5 volts, the transistor will only be on if VS is higher than -4 volts. Thus, the transistor can't pull VS lower than -4 volts. The only way to get VS lower is if you had a more-negative gate voltage, at least -14 volts in this case. Some chips solve this by using an additional voltage supply to provide more voltage to the gate, such as the Intel 8080 or the HP Nanoprocessor.
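The arithmetic in this paragraph fits in two one-line functions. This is a minimal sketch using the example values from the text (a -9 volt gate and a -5 volt threshold); the function names are made up for illustration.

```python
def lowest_source_voltage(v_gate, v_threshold):
    """Lowest voltage a PMOS transistor can pull its source down to.

    The transistor conducts only while VGS < VT, i.e. while VS > VG - VT,
    so the source stops falling once it reaches VG - VT.
    """
    return v_gate - v_threshold


def gate_voltage_needed(v_source_target, v_threshold):
    """Gate voltage required to pull the source all the way down to a target."""
    return v_source_target + v_threshold


print(lowest_source_voltage(-9, -5))   # -4: the output stalls at -4 volts
print(gate_voltage_needed(-9, -5))     # -14: gate voltage needed to reach -9 volts
```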
The threshold voltage isn't much of a problem when you're dealing with inverters and other gates, because the voltage levels are restored by each gate. However, there are two places where the threshold voltage is a problem: superbuffers and pass transistor logic. In these circuits (described in the footnote4), the threshold voltage drop happens twice, yielding an output that is too weak. Since these circuits are common in processors, a solution was needed: the bootstrap load. It is a way of generating more voltage for the gate to overcome the threshold voltage so the transistor can pull its output all the way to VD.
The bootstrap load is essentially a charge pump circuit that uses a bootstrap capacitor to boost the gate voltage. The diagram below shows the basic idea of a charge pump. On the left, a capacitor is charged to -9 volts from a voltage source. If you disconnect the voltage source and then re-connect the negative side to the capacitor as shown on the right, the capacitor retains its charge of -9 volts. However, since the lower side of the capacitor is now at -9 volts, the upper side of the capacitor is now at -18 volts. The bootstrap load uses this -18 volts as the gate voltage, sufficient to overcome the threshold voltage.
The diagram below shows the bootstrap load circuit. The circuit is similar to the inverter described earlier, but with the addition of a capacitor and a transistor. In the first diagram, a 0 input turns on the lower transistor (Q1), yielding a 1 output (+5 volts). Meanwhile, Q3 acts as a load resistor, pulling the top of the capacitor to -4 volts (not -9 volts, due to the threshold voltage). This results in -9 volts stored across the capacitor.
The second and third diagrams show what happens with a 1 input. The lower transistor Q1 turns off, allowing Q2 to pull the output low. With a regular inverter, -4 volts is as low as the output can go (second diagram). However, as explained earlier, the capacitor still holds -9 volts, so the top of the capacitor must be -13 volts. With -13 volts on the gate of Q2, Q2 will continue to pull the output lower, until the circuit ends up as shown on the right, with the output pulled all the way down to -9 volts. Note that the source can't get pulled down any lower than the drain, regardless of the gate voltage. (In comparison, the simple inverter described earlier could only pull the output down to -4 volts.)5
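The two phases can be traced step by step with an idealized model. This sketch assumes a perfect capacitor with no parasitic capacitance or leakage, and assumes that Q3 stays off once the gate drops below -4 volts, so the capacitor keeps its full -9 volts. In a real chip the boosted gate voltage would be somewhat smaller. The function name and step size are illustrative.

```python
V_CC = 5.0    # high supply, volts (a "1" output)
V_DD = -9.0   # low supply, volts
V_T = -5.0    # example PMOS threshold voltage from the text


def bootstrap_pull_down(step=0.5):
    """Return a list of (output, gate of Q2) voltages as the output falls."""
    # Phase 1 (input 0): Q1 holds the output at +5 V while Q3 charges the
    # top of the capacitor (the gate of Q2) to VDD - VT.
    output = V_CC
    gate = V_DD - V_T               # -4 V
    cap_voltage = gate - output     # -9 V stored across the capacitor
    trace = [(output, gate)]
    # Phase 2 (input 1): Q1 turns off. The gate floats, so it follows the
    # output through the capacitor, and Q2 stays on while VGS < VT.
    while output > V_DD and (gate - output) < V_T:
        output = max(V_DD, output - step)
        gate = output + cap_voltage
        trace.append((output, gate))
    return trace


trace = bootstrap_pull_down()
print(trace[0])    # starting point: output +5 V, gate -4 V
print(trace[-1])   # end point: output reaches -9 V
```

The trace passes through the intermediate state described above (output at -4 volts, gate at -13 volts) and ends with the output at -9 volts. In this ideal model the gate ends up at -18 volts, the same figure as in the charge pump example.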
The image below shows part of Intel's schematic for the 4004 processor, showing the circuit for a standard load and the circuit for the bootstrap load, indicated by a "B" next to the resistor.
So far, I've discussed the bootstrap load, which was extensively used with MOS circuitry, and was patented by North American Rockwell in 1966. The invention necessary for the 4004 and 8008 processors was the extension of the bootstrap load to silicon-gate integrated circuits.
One of the key inventions that made the 8008 practical was the self-aligning silicon gate transistor.6 The diagram below shows the structure of an MOS transistor. Early MOS integrated circuits used metal-gate7 transistors, which used metal, typically aluminum, instead of polysilicon for the gate. But at Fairchild in 1968, Faggin and Klein invented a practical way to make transistors with silicon gates. This may seem like a trivial difference, but silicon-gate transistors were better than metal-gate transistors in three important ways. First, the electrical properties of silicon-gate transistors are much better than metal-gate transistors, running faster and at lower power. Second, polysilicon provided a second layer for routing signals, making integrated circuit layouts much more compact.
Finally, polysilicon permitted construction of self-aligned transistors, which play an important part in the bootstrap capacitor story. Integrated circuits are constructed through a sequence of processing steps, using optical masks and photo-sensitive resist to create patterns on the surface. An integrated circuit with metal-gate transistors is constructed from the bottom up. First, the source and drain regions are doped with impurities to form P-type silicon, as shown below. In a later step, the metal gate is created between the source and the drain, using a different mask. The tricky part is making sure the gate is lined up with the source and the drain; if there's a gap, the transistor won't work. Thus, a metal gate is made larger than necessary so it will still cover the gate channel, even if the alignment of the layers is slightly off. Unfortunately, this overlap creates capacitance and harms performance.
On the other hand, the self-aligned gate is created in the opposite order. The polysilicon gate is created first. In a later step, the source and drain regions are doped. However, a mask isn't used to separate the source and drain from the gate. Instead, the gate itself blocks doping of the region in between the source and drain. Thus, the source and drain are automatically "self-aligned" with the gate, eliminating the excess capacitance from a too-large gate. (Why couldn't metal gates be self-aligned? Because doping the silicon requires high temperatures that would melt the metal, but polysilicon can handle the heat.)
Although self-aligned silicon gates are a major improvement over metal gates, there was one drawback: capacitors. With metal-gate transistors, a capacitor could be easily constructed by using metal and doped silicon as the plates: a large metal layer on top, doped silicon underneath, and a thin insulating oxide layer in between. (In other words, a transistor with a large gate is used as a capacitor.) With self-aligned gates, the polysilicon gate could be used as a capacitor plate in place of the metal layer. However, in the self-aligned process, the polysilicon gate blocks doping of the silicon underneath, which is good for a transistor but bad for a capacitor, since you can't dope the silicon under the polysilicon plate. (You could use an extra manufacturing step to dope the capacitor plates before creating the polysilicon gate, but this extra step would increase the cost.)
Faggin invented a solution that made capacitors practical with self-aligned gates.8 He realized that if you bias the capacitor correctly, the charge on the upper plate will create a conductive region in the silicon underneath it, even without any doping. He tried this at Fairchild and discovered that it worked. This solved the problem of how to use a bootstrap load with self-aligned silicon-gate transistors.
This bootstrap load technique was extensively used in the 4004 and 8008 processors. The diagram below shows the bootstrap loads in the 8008 processor, indicated with a red box. The 8008 has 90 bootstrap loads, so it is a significant circuit. Many bootstrap loads are around the periphery of the chip to help drive the output pins. The instruction register (upper center) uses bootstrap loads to drive the relatively large instruction decoder (center). At the right, bootstrap loads drive the register storage (upper right) and stack storage (lower right). Other miscellaneous circuits throughout the processor also use bootstrap loads.
A final question is whether the bootstrap load was a key invention that made the microprocessor possible (as embodied in the 4004 and 8008) or whether the microprocessor was inevitable regardless of features such as the bootstrap load. One view is that "the buried contact and particularly the bootstrap load, were indispensable to obtain the required speed within the available power budget." Feeney said in an 8008 oral history "that being limited on pins, limited on power supplies, whatever, that the bootstrap load became very, very critical." On the other hand, the development of the microprocessor seemed an inevitable, incremental process to many. Fairchild engineer Lee Boysel said in 1970,10 "The computer-on-a-chip is no big deal. It's almost here now... I've no doubt the whole computer will be on one chip within five years." Hal Feeney of Intel said, "at the time in the early 1970s, late 1960s, the industry was ripe for the invention of the microprocessor."
In the narrow sense, the bootstrap load made the 4004 and 8008 possible with their given size, performance, and power consumption. The bootstrap load also illustrates how the microprocessor is not a single invention, but the aggregation of many smaller inventions that made it possible. However, looking at the broader picture, microprocessors would have been only slightly hampered if the bootstrap capacitor didn't exist. There were many alternatives such as four-phase logic, static logic, higher gate voltages, an additional power supply, or using an extra mask for the capacitors. The Texas Instruments TMX 1795 provides a direct comparison, since it was built at the same time as the 8008 with the same architecture, but using metal-gate transistors instead of silicon-gate. The diagram below shows that the TMX 1795 was considerably larger than the 8008, and it had somewhat worse performance, but the point is that microprocessors would have proceeded essentially the same without the bootstrap load. In any case, by 1974, the switch to NMOS transistors and improvements in threshold voltages made bootstrap loads unnecessary. My conclusion is that the bootstrap load was a helpful innovation, but microprocessors would have proceeded along a similar path even without this invention. Once technology permitted a few thousand transistors to be constructed on an integrated circuit, the single-chip CPU was inevitable.
If you're interested in the 8008, my previous article has a detailed discussion of the 8008's architecture and more die photos; I also explain the 8008's ALU. I announce my latest blog posts on Twitter, so follow me at kenshirriff. I also have an RSS feed.
In his oral history, Faggin describes Intel's fixation on 16-pin packages. When a memory chip required 18 pins instead of 16, it was "like the sky had dropped from heaven. I never seen so [many] long faces at Intel, over this issue, because it was a religion in Intel; everything had to be 16 pins, in those days. Everything had to be 16 pins... It was a completely silly requirements to have 16 pins." At the time, other manufacturers were using 40- and 48-pin packages, so there was no technical limitation, just a minor cost saving from the smaller package. ↩
The classic microprocessors such as the 8080, 6502, and Z-80 were built with NMOS transistors. The earlier 4004 and 8008 used PMOS transistors, which were easier to manufacture but had poorer performance. If you're familiar with NMOS logic, PMOS logic is a mirror world, where everything is backward. PMOS used negative voltages, which were also significantly higher than the 5 volts used by standard TTL. For compatibility with TTL levels, the 8008 ran with Vcc at +5V and Vdd at -9V, so it could produce TTL-compatible outputs of roughly 0 volts and 5 volts. (See the datasheet for more details.) The 4004 required -15 volts, typically Vdd = -10V and Vss = +5V. Confusingly, the 4004 defined logic "0" as the more positive voltage and logic "1" as the more negative voltage (datasheet). ↩↩
The "superbuffer" replaces the load resistor with an active transistor and is used when more current is required, for instance to drive an internal bus or an output pin. The upper transistor is driven by an inverter, so it is on when the lower transistor is off. Instead of the weak current from the load resistor/transistor, this transistor provides a high current. The problem is that the threshold voltage limits the voltage from the upper transistor. With a regular inverter, the inverter output loses VT, so it will provide -4 volts to the upper transistor's gate. Losing another VT there yields an insufficient output voltage of +1 volt instead of the desired -9 volts.
The second case where the threshold voltage drop is a problem is with a pass transistor, used for dynamic logic. The diagram below illustrates a simple pass transistor circuit. When the control signal is low, the transistor is active, passing the input signal through to the output. But when the control signal is high, the transistor stops passing the input. Instead, the previous value is held by the circuit's capacitance (shown in gray) so the output holds its previous value. Thus, pass transistors provide an efficient way of implementing temporary storage. The problem with pass transistors is the threshold voltage. If the control signal on the gate comes from a regular gate, the "on" voltage will be -4 volts due to the threshold voltage loss. The pass transistor causes a second threshold voltage loss, so the lowest it can pull its output is +1 volt, not enough for reliable operation.
The bootstrap load fixes these problems. By putting a bootstrap load on the inverter in the superbuffer or on the circuit controlling the pass transistor, the drive voltage will be close to -9 volts. Now there is only a single threshold voltage drop, leaving the output at -4 volts, sufficiently negative for reliable operation. ↩
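The voltage arithmetic in this footnote can be sketched in a few lines of Python. This is only a back-of-the-envelope model: it assumes the ballpark figures used in this post (a -9 volt supply and a threshold of about 5 volts), and the function name `after_drops` is made up for illustration.

```python
VDD = -9.0  # negative supply voltage, in volts
VT = 5.0    # assumed magnitude of the PMOS threshold voltage (a ballpark figure)

def after_drops(drive_voltage, drops):
    """Voltage remaining after `drops` threshold-voltage losses.

    With PMOS, each loss moves the (negative) voltage toward zero and beyond.
    """
    return drive_voltage + drops * VT

# Regular inverter driving a superbuffer or pass transistor: two threshold drops.
without_bootstrap = after_drops(VDD, 2)

# Bootstrap load: the drive signal reaches roughly VDD, so only one drop remains.
with_bootstrap = after_drops(VDD, 1)

print(without_bootstrap)  # 1.0
print(with_bootstrap)     # -4.0
```

The real circuit is affected by stray capacitance and leakage, so these numbers are illustrative rather than measured values.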
This discussion of the bootstrap load is a simplified explanation. The real circuit is affected by stray capacitance, transistor leakage, and other factors, so the output wouldn't be all the way to VDD. One thing I'd like to point out, though, is that you might expect the capacitor's charge to leak out through Q3 as fast as it charged. Although Q3 is treated as a resistor, it also acts as a diode, blocking the capacitor from discharging. (With the capacitor more negative, the roles of Q3's source and drain are reversed and it no longer conducts.) ↩
The silicon-gate bootstrap capacitor exemplifies the paths of information between companies at the dawn of the microprocessor era. Practical silicon gate technology was created at Fairchild (with some earlier roots). When employees (including Faggin) left Fairchild for Intel, they took this knowledge with them. (And in some cases took "lots and lots of Fairchild internal confidential documents", see Shima oral history). From Intel, ideas spread to other companies, such as when Faggin left Intel to found Zilog, basing the Zilog Z80 on the Intel 8080. ↩
Interestingly, in 2007 Intel started using metal gates again in order to scale transistors further (details). In a way, semiconductor technology has gone full circle, back to metal gates, although now unusual metals such as hafnium are used. ↩ ↩
In the making of the first microprocessor, Federico Faggin says, "bootstrap load was a very popular circuit design trick used in just about all MOS dynamic circuits of that time. It made possible an output signal swing that was not only equal to the power supply voltage, but was also faster than possible with normal MOS loads for the same power dissipation." Faggin describes how he invented the bootstrap load in the 4004 oral history (p11) and the 8008 oral history (p8). Also see Faggin's The MOS silicon gate technology and the first microprocessors. He describes how the bootstrap load is needed for a two-phase design, and how silicon gate technology didn't support capacitors. Faggin's site describes the bootstrap load. Bootstrap load is also described at mosgate. ↩
The threshold voltage depends on various properties of the integrated circuit including the gate material and the oxide thickness. I couldn't find a specific value for the threshold voltage in the 8008 processor, but -5 volts seems like the right ballpark (and is a conveniently round number). The book MOSFET in Circuit Design discusses threshold voltages for P-channel devices. ↩
The bootstrap load illustrates the social process through which people are assigned credit for inventions and the construction of reputation. Although Faggin had a key role in the 4004 and 8008 processors, "when he left to found Zilog he got temporarily written outside of the Intel history." (See Intel disowns Faggin and Interview with Stan Mazor.) Faggin states, "They tried to erase my name from all of my contributions, including the silicon gate technology and the first microprocessor, and attribute them to others." After lobbying efforts by Faggin's wife and the pro-Faggin website intel4004.com, Intel reluctantly gave Faggin more credit. Faggin eventually received various awards including the National Medal of Technology and Innovation in 2010, so in the end he received his (deserved) recognition.
The point is that credit is not assigned objectively, but is a dynamic force depending on various corporate and personal forces and who tells the story. (Wikipedia is one modern arena for these conflicts.) One corrective is the book History of semiconductor engineering, which covers many of the key people in the history of integrated circuits, with little regard for the "generally accepted" history. I should make it clear that I am drawing most heavily on Faggin's writings for background on the bootstrap load, so this blog post should not be viewed as an "objective" view of who should get credit for it. It looks like the silicon-gate bootstrap load was invented simultaneously at National Semiconductor; patent 3912948 filed in 1971 by Dilip Bapat describes an identical silicon-gate bootstrap load circuit. ↩
In 1978, a memory chip stored just 16 kilobits of data. To make a 32-kilobit memory chip, Mostek came up with the idea of putting two 16K chips onto a carrier the size of a standard integrated circuit, creating the first memory module, the MK4332 "RAM-pak". This module allowed computer manufacturers to double the density of their memory systems and by 1982, Mostek had sold over 3 million modules. The Apple III is the best-known system that used these memory modules.
This module was built from two 16-kilobit memory chips, constructed from the standard MK4116 dynamic RAM (DRAM) chip packaged in a leadless ceramic chip carrier; these are the golden rectangles on top of the carrier.
You might wonder why customers didn't simply use these surface-mount packages directly, but at the time soldering surface-mount components was still a challenge for many customers. Mounting two leadless chips on a dual inline package (DIP) carrier allowed customers to double their memory density while still using their standard through-hole soldering techniques.
The purple carrier holding the chips was a ceramic substrate designed for thermal compatibility with the chips.1 There is no circuitry inside the ceramic carrier except wiring between the chips and the eighteen DIP pins. The two memory chips were wired in parallel except for their two select lines, which were kept separate. This allowed the desired memory chip to be selected. As a result, the MK4332 module has 18 pins, compared to 16 pins for the chips on top. Mostek used the same module design with the next generation of RAM chips, creating a 128-kilobit RAM module (MK4528) from two 64-kilobit RAM chips (MK4564).
Although you might expect a complex mounting technique, the two 4116 chips are simply soldered onto the substrate with standard reflow techniques. For the photo below, I removed the metal lid from the left chip with a chisel and unsoldered the right chip with a hot air gun. On the left, you can see the rectangular silicon die inside the leadless carrier package. On the right are the 16 solder pads on the ceramic substrate. The wiring between the solder pads and the DIP pins is inside the ceramic substrate.
I created the die photo below from multiple microscope images. The white lines are the metal wiring on top of the chip, while the silicon underneath appears dark red. The two large rectangular regions are the 16,384 memory cells, arranged as a 128×128 matrix, split in two. The circuitry in between these regions consists of 128 sense amplifiers to amplify the bits read from memory, and selection circuitry to select one bit out of the 128. (Externally, the chip is accessed as 16,384×1, outputting a single bit. Typically, eight of these chips were used to store bytes.) The control and interface circuitry is at the left and right, connected to the external pads via tiny bond wires.
In dynamic RAM, a bit is stored in a capacitor, with a transistor providing access to the capacitor. The value of the bit is represented by the presence or absence of charge on the capacitor. The advantage of dynamic RAM is that each memory cell is very small, constructed from just two components,2 allowing a high memory density. (In comparison, static RAM may require six transistors per cell.) The downside of dynamic RAM is that the charge on a capacitor leaks away after a few milliseconds. To avoid losing data, dynamic RAM must be constantly refreshed: bits are read from the capacitors, amplified, and then written back to the capacitors. For this particular chip, all the data must be refreshed every two milliseconds.
The diagram below illustrates the wiring of the memory cells, showing two of the 128 rows and columns. To read or write data, a row select line is energized. The transistors in that row turn on, connecting that row's capacitors to the data in/out lines. The data from that row is read out of the capacitors and amplified. At that point, the data can either be written back to refresh the row, or a new bit can be written. Note that although the chip accesses 128 bits in parallel internally, the chip provides access to one bit at a time externally, selecting one of the 128 bits to read or write.
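The row-at-a-time behavior described above can be sketched as a toy Python model. This is a simplified illustration, not a description of the MK4116's actual circuitry: the class and method names are invented, and timing, charge leakage, and the sense amplifiers' analog behavior are ignored.

```python
ROWS = COLS = 128  # the 16,384 cells form a 128x128 matrix

class Dram:
    """Toy DRAM model: one bit per cell, with no timing or analog effects."""

    def __init__(self):
        self.cells = [[0] * COLS for _ in range(ROWS)]

    def _open_row(self, row):
        # Energizing a row select line reads all 128 bits of that row at once.
        return list(self.cells[row])

    def read(self, row, col):
        sensed = self._open_row(row)
        self.cells[row] = sensed  # writing the row back refreshes it
        return sensed[col]        # only one of the 128 bits leaves the chip

    def write(self, row, col, bit):
        sensed = self._open_row(row)
        sensed[col] = bit         # replace one bit, keep the other 127
        self.cells[row] = sensed

    def refresh(self):
        # Refreshing the chip means reading and rewriting every row.
        for row in range(ROWS):
            self.cells[row] = self._open_row(row)

ram = Dram()
ram.write(5, 17, 1)
print(ram.read(5, 17))  # 1
print(ram.read(5, 18))  # 0
```

The point of the sketch is that every access touches a whole row internally, even though only a single bit is visible externally.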
The magnified photo below shows some of the storage cells, densely packed together. It's a bit hard to visualize what's going on because the chip is constructed from multiple layers. The bottom layer is the grayish silicon die. On top of the silicon are two layers of polysilicon. Above this is the metal wiring, which was removed for this photo. The photo shows three sense lines (data in/out) in the silicon, with bulb-shaped storage cells connected on either side. Vertical strips of polysilicon (poly 1) over the storage cells implement capacitors: the silicon forms the lower plate, while the polysilicon forms the upper plate. The second layer of polysilicon (poly 2) is arranged in diagonal regions to implement the selection transistors. Square notches in the poly 1 layer allow the poly 2 layer to approach the silicon to form transistors. Horizontal metal wiring (not visible) is connected to the poly 2 regions to select a row by driving the transistors. Note that the rows are staggered and interlocking due to the highly-optimized layout. At the time, fitting this much memory on a chip was a challenge that pushed the limits of integrated circuit technology.
Apple was a major customer of these memory modules, using them in the Apple III computer (1980). The Apple III was marketed as a business computer to follow the popular Apple II. Unfortunately, the Apple III was a business failure due to reliability issues and competition from the IBM PC introduced a year later.
As was usual for the time, the Apple III's memory board3 was stuffed with memory chips to achieve more capacity. An unusual part of the design is that it used three rows of memory chips (instead of a power of two), mixing 16-kilobit and 32-kilobit memory chips to achieve 128 kilobytes of storage. (The Apple III's case was designed before the boards, so the boards had to be designed to fit the available space.) In the photo below, the top row holds MK4332 memory modules, while the bottom two rows hold 16-kilobit MK4116 chips.4
Memory is an under-appreciated part of computing. The CPU usually gets the attention, but memory was often the limiting factor. The problem with memory is that storing a single bit is easy, but most approaches are impractical when you try to scale up to thousands or millions of bits.
The early ENIAC computer (1946) used vacuum tubes for storage, but these were bulky and expensive, limiting ENIAC to just 20 words (of 10 digits) stored in its accumulators. Early computers such as EDSAC (1949) used mercury delay lines for memory, sending pulse trains of sound waves through tubes of mercury. Although EDSAC could store 512 words, you had to wait for bits to circulate serially through the mercury. An improvement was the random-access Williams tube which stored data as spots on a cathode-ray tube screen. Although they were temperamental, Williams tubes were used in the Manchester Mark 1 (1949) and the commercial IBM 701 (1952).
The introduction of core memory revolutionized computing, providing fast, cheap, and reliable storage, storing each bit in a tiny magnetized ferrite ring. Core memory was introduced in the Whirlwind computer (1953) and used in most computers of the late 1950s and 1960s. However, since each bit required a separate physical ferrite core, memory sizes were limited to a few megabytes for even the largest customers. For example, memory cabinets for the IBM System/360 (1969) held 256 kilobytes but weighed over a ton each (below).
Semiconductor memory led to another dramatic shift. At first, semiconductor memory was costly and had very small capacity; Intel's first product was a memory chip holding just 64 bits and costing $99.50. In 1968, Dennard at IBM invented cost-effective dynamic RAM and semiconductor DRAM technology advanced quickly at various companies. Intel introduced the first commercially available DRAM chip in 1970, the i1103 holding 1K bits. This chip was nicknamed the "core killer" because of its impact on the magnetic core memory industry.
Computer storage rapidly moved from core memory to DRAM as the capacity of DRAM increased and the price fell.5 Mostek introduced the 4-kilobit MK4096 chip in 1973, followed by the 16-kilobit MK4116 in 1976. In 1978, Fujitsu introduced the first commercial 64-kilobit DRAM chip and Japan took the lead in DRAM manufacturing.6 Intel left the DRAM industry in 1985 due to decreasing market share and profits, followed by the remaining US DRAM manufacturers.
Fifty years after the introduction of DRAM, it is still the dominant technology for main storage, a remarkably long lifetime. Compared to the 16-kilobit chip I described, Samsung's recent 16-gigabit DRAMs are a factor of a million larger, showing the incredible increase in density. It remains to be seen if anything will challenge the long storage leadership of DRAM.
For details on the construction of the memory modules, see Rectangular chip-carriers double memory-board density, Electronics, 1982. ↩
Early dynamic RAMs such as the Intel 1103 used three transistors per cell and used separate lines for reading and writing data. Improvements in memory technology shrank the circuit to a single transistor and a single data line. ↩
The Apple III memory board pictured is the "12 volt memory board", given that name because the memory chips required 12 volts (as well as +5 and -5). It was upgraded by the "5 volt memory board", which used only a 5 volt supply. The 5 volt memory board used more modern 64-kilobit memory chips (4864) giving it a larger capacity of 128 or 256 kilobytes. Inconveniently, the power supply required a 12-volt load to operate, so the 5-volt memory board has a power resistor to draw 0.4 amps from the otherwise-unused 12-volt supply. Details are in the Apple III reference manual. ↩
The Apple III memory board was also available in a lower-cost 96-kilobyte module. In that configuration, the 4332 memory modules were replaced with the 16-kilobit (MK4116) chips used on the rest of the board. One clever feature of the 4332 module is that the two "extra" select pins are on the end of the package. The result is that a memory board (such as the Apple III's) can be designed to accept either the 16-pin 16-kilobit chips or the 18-pin 32-kilobit modules, depending on how much memory is desired. With the smaller chips, the two extra pins are unused. It's strange, however, that the Apple III memory board only accepted the larger modules in one of the three rows of chips. ↩
The industry switch from magnetic core memory to semiconductor memory wasn't as straightforward as superior semiconductor memory overthrowing inferior core memory. Instead, there was a time period where they co-existed, due to tradeoffs. For instance, in 1972, a customer could select core memory, semiconductor memory, or a mixture for the D-112 minicomputer (a PDP-8 clone); semiconductor memory was 5 times faster, but core memory supplied four times the capacity per board. By 1973, industry publications were reporting that "Semiconductor memories are taking over data-storage applications". As late as 1980, core memory manufacturers were advertising the benefits of core memory, battling the "myths" that semiconductor was better.
Was the overthrow of magnetic core by semiconductor memory inevitable? My view is that "technological determinism" acts in some ways; the development of DRAM memory was almost unavoidable following the development of MOS transistors. However, "economic determinism" was more responsible for the success of semiconductor memory: if magnetic core had remained the lower-cost option, it probably would have remained dominant. As a counterexample, CCD (charge-coupled device) memory and bubble memory were hyped as storage technologies of the future, but couldn't achieve the price-performance to dislodge either semiconductor memory or hard disks. ↩
Note that the capacity of memory chips increased by a factor of 4 each generation (1-, 4-, 16-, 64-kilobit) rather than a factor of 2. The reason is that each address pin was multiplexed to provide two address bits, so each additional address pin resulted in a factor of four increase. By reusing each address pin for both a row address and a column address, the number of address pins was kept low so compact 16-pin packages could be used even as memory sizes expanded to 256-kilobit. Conveniently, as technology improved, memory chips required fewer voltages, freeing up pins formerly used for power. One consequence, though, was that the ordering of address pins on the chip was essentially random, as new address pins were assigned based on which pins were available, rather than sequentially. The multiplexed address system was introduced in the Mostek MK4096 chip and meant that the 256-kilobit 41256 chip used fewer pins than the original 1-kilobit Intel 1103 (16 pins vs 18). ↩
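The factor-of-four growth follows directly from the multiplexing arithmetic, which can be sketched in Python. The function names are made up for illustration, and the choice of which half of the address is sent as the row is an assumption; the pin counts in the loop are simply the values that yield the capacities mentioned above.

```python
def capacity_bits(address_pins):
    """Each pin carries one row bit and one column bit, so n pins address 2**(2*n) cells."""
    return 2 ** (2 * address_pins)

def split_address(address, address_pins):
    """Split a full cell address into the (row, column) halves sent over the shared pins.

    Assumption: the upper half of the address is used as the row.
    """
    mask = (1 << address_pins) - 1
    row = (address >> address_pins) & mask
    col = address & mask
    return row, col

for pins in range(6, 10):
    print(pins, "pins ->", capacity_bits(pins) // 1024, "kilobit")
# 6 pins -> 4 kilobit
# 7 pins -> 16 kilobit
# 8 pins -> 64 kilobit
# 9 pins -> 256 kilobit

print(split_address(16383, 7))  # (127, 127)
```

Each extra pin adds one bit to both the row and the column address, which is why capacity quadruples rather than doubles.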
The 8008 was Intel's first 8-bit microprocessor, introduced in 1972. While primitive by today's standards, the 8008 is historically important because it essentially started the microprocessor revolution and is the ancestor of the modern x86 processor family. I've been studying the 8008's silicon die under the microscope and reverse-engineering its circuitry.
The die photo below shows the main functional blocks1 including the registers, instruction decoder, and on-chip stack storage. The 8-bit arithmetic logic unit (ALU) is on the left. Above the ALU is the carry-lookahead generator, which improves performance by computing the carries for addition, before the addition takes place. It's a bit surprising to see carry lookahead implemented in such an early microprocessor. This blog post explains how the carry circuit is implemented.
Most of what you see in the die photo is the greenish-white wiring of the metal layer on top. Underneath the metal is polysilicon wiring, providing more connections as well as implementing transistors. The chip contains about 3500 tiny transistors, which appear as brighter yellow. The underlying silicon substrate is mostly obscured; it is purplish-gray. Around the edges of the die are 18 rectangular pads; these are connected by tiny bond wires to the external pins of the integrated circuit package (below).
The 8008 was sold as a small 18-pin DIP (dual inline package) integrated circuit. 18 pins is an inconveniently small number of pins for a microprocessor, but Intel was committed to small packages at the time.2 In comparison, other early microprocessors typically used 40 pins, making it much easier to connect the data bus, address bus, control signals, and power to the processor.
The heart of a processor is the arithmetic-logic unit (ALU), the functional block that performs arithmetic operations (such as addition or subtraction) and logical operations (such as AND, OR, and XOR). Addition was the most challenging operation to implement efficiently because of the need for carries.3
Consider how you add two decimal numbers such as 8888 and 1114, with long addition. Starting at the right, you add each pair of digits (8 and 4), write down the sum (2), and pass any carry (1) along to the left. In the next column, you add the pair of digits (8 and 1) along with the carry (1), writing down the sum (0) and passing the carry (1) to the next column. You repeat the process right-to-left, ending up with the result 10002. Note that you have to add each position before you can compute the next position.
Binary numbers can be added in a similar way with a circuit called a ripple-carry adder that was used in many early microprocessors. Each bit is computed by a full adder, which takes two input bits and a carry and produces the sum bit and a carry-out. For instance, adding binary 1 + 1 with no carry-in yields 10, for a sum bit of 0 and a carry-out of 1. Each carry-out is added to the bit position to the left, just like decimal long addition.
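As a rough software analogy, a ripple-carry adder can be sketched in Python as below. The function names are invented for this sketch, and a loop in software only mirrors the structure of the circuit, not its timing.

```python
def full_adder(x, y, carry_in):
    """Add two bits and a carry; return (sum_bit, carry_out)."""
    total = x + y + carry_in
    return total & 1, total >> 1

def ripple_add(a, b, width=8):
    """Add two integers bit by bit, right to left, passing each carry to the left."""
    carry = 0
    result = 0
    for i in range(width):
        sum_bit, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= sum_bit << i
    return result, carry  # the final carry is the carry out of the top bit

print(full_adder(1, 1, 0))        # (0, 1)
print(ripple_add(0b11111111, 1))  # (0, 1)
```

Each pass through the loop needs the carry from the previous pass, which is the serial dependency that makes ripple carry slow in hardware.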
The problem with ripple carry is that if you add, say, 11111111 + 1, you need to wait as the carry "ripples" through the sum from right to left. This makes addition a slow serial operation instead of a parallel operation. Even though the 8008 only performs addition on 8-bit numbers, this delay would slow the processor too much. The solution was a carry lookahead circuit that rapidly computes the carries for all eight bit positions. Then the sum can be calculated in parallel without waiting for carries to ripple through the bits. According to 8008 designer Hal Feeney, "We built the carry look-ahead logic because we needed the speed as far as the processor is concerned. So carry look ahead seemed like something we could integrate and have fairly low real estate overhead and, as you see, the whole carry look ahead is just a very small portion of the chip."
The idea of carry lookahead is that if you can compute all the carry values in advance, then you can rapidly add all the bit positions in parallel. But how can you compute the carries without performing the addition? The solution in the 8008 was to build a separate circuit for each bit position to compute the carry based on the inputs.
The diagram below zooms in on the carry lookahead circuitry and the arithmetic-logic unit (ALU). The two 8-bit arguments and a carry-in arrive at the top. These values flow vertically through the carry lookahead circuit, generating carry values for each bit along the way. Each ALU block receives two input bits and a carry bit and produces one output bit. The carry lookahead has a triangular layout because successive carry bits require more circuitry, as will be explained. The 8-bit ALU has an unusual layout in order to make the most of the triangular space. Almost all microprocessors arrange the ALU in a rectangular block; an 8-bit ALU would have 8 similar slices. But in the 8008, the slices of the ALU are scattered irregularly; some slices are even rotated sideways. I've written about the 8008's ALU before if you want more details.
To understand how carry lookahead works, consider three addition cases. First, adding 0+0 cannot generate a carry, even if there is a carry in; the sum is 0 (if there is no carry in) or 1 (with carry in). The second case is 0+1 or 1+0. In this case, there will be a carry out only if there is a carry in. (With no carry-in the result is 1, while with carry-in the result is 10.) This is the "propagate" case, since the carry-in is propagated to carry-out. The final case is 1+1. In this case, there will be a carry-out, regardless of the carry-in. This is the "generate" case, since a new carry is generated.
The circuit below computes the carry-out when adding two bits (X and Y) along with a carry-in. This circuit is built from an OR gate on the left, two AND gates in the middle, and an OR gate on the right. (Although this circuit looks complex, it can be implemented efficiently in hardware.) To see how it operates, consider the three cases. If X and Y are both 0, the carry output will be 0. Otherwise, the first OR gate will output 1. If carry-in is 1, the upper AND gate will output 1 and the carry-out will be 1. (This is the propagate case.) Finally, if both X and Y are 1, the lower AND gate will output 1, and the carry-out will be 1. (This is the generate case.)
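The gate structure described above can be written out directly in Python as a quick check. The function and variable names are chosen for illustration; the loop simply confirms that the four gates give the same carry as ordinary addition for every input combination.

```python
def carry_out(x, y, carry_in):
    first_or = x | y                 # OR gate on the left
    propagate = first_or & carry_in  # upper AND gate: pass the carry-in along
    generate = x & y                 # lower AND gate: create a new carry
    return propagate | generate      # OR gate on the right

# Compare against plain arithmetic for all eight input combinations.
for x in (0, 1):
    for y in (0, 1):
        for c in (0, 1):
            assert carry_out(x, y, c) == (x + y + c) >> 1

print(carry_out(0, 0, 1))  # 0: 0+0 never carries
print(carry_out(0, 1, 1))  # 1: the carry-in is propagated
print(carry_out(1, 1, 0))  # 1: a carry is generated
```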
To compute the carry into a higher-order position, multiple instances of this circuit can be chained together. For instance, the circuit below computes the carry into bit position 2 (C2). The gate block on the left computes C1, the carry into bit position 1, from the carry-in (C0) and low-order bits X0 and Y0, as explained above. The gates on the right apply the same process to the next bits, generating the carry into position 2. For other bit positions, the same principle is used but with additional blocks of gates. For instance, the carry into position 7 is computed by a chain of seven blocks of gates. Since the circuit for each successive bit is one unit longer, the carry structure has the triangular structure seen on the die.
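The chaining idea can be sketched in Python as follows. This models only the logical structure (one chain of gate blocks per bit position, each chain one block longer than the last); it says nothing about the circuit's speed, and the function names are made up for this sketch.

```python
def carry_block(x, y, carry_in):
    """One block of gates: propagate the carry-in or generate a new carry."""
    return ((x | y) & carry_in) | (x & y)

def carries(a, b, c0=0, width=8):
    """Return [C0, C1, ..., C7], the carry into each bit position."""
    result = [c0]
    for position in range(1, width):
        c = c0
        for i in range(position):  # a chain of `position` blocks
            c = carry_block((a >> i) & 1, (b >> i) & 1, c)
        result.append(c)
    return result

def add(a, b, c0=0, width=8):
    """With all carries known up front, each sum bit is computed independently."""
    cs = carries(a, b, c0, width)
    total = 0
    for i in range(width):
        total |= (((a >> i) & 1) ^ ((b >> i) & 1) ^ cs[i]) << i
    return total

print(carries(0b11111111, 0b00000001))  # [0, 1, 1, 1, 1, 1, 1, 1]
print(add(100, 55))                     # 155
```

The nested loop is the software counterpart of the triangular layout: the chain for position 1 has one block, and the chain for position 7 has seven.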
The diagram below shows how the carry circuit for bit 2 is implemented on the die; the circuit for other bits is similar, but with more repeated blocks. In the photograph, the metal wiring on top of the die is silverish. Underneath this, the polysilicon wiring is yellow. At the bottom, the silicon is grayish. The transistors are brighter yellow; several are indicated. The schematic underneath shows the wiring of the transistors; the layout of the schematic is close to the physical layout.
I'll give a brief outline of how the circuit works. The 8008 is implemented with a type of transistor called a PMOS transistor. You can think of a PMOS transistor as turning on if the input is 0, and off if the input is 1.4 Instead of standard logic gates, this circuit uses a technique called dynamic logic, which takes advantage of capacitance. In the first step, the precharge signal connects -9 volts to the circuitry, precharging it. In the second step, the input signals (top) are applied, turning on various transistors. If there is a path through the transistors from the +5 supply to the output, the output will be pulled high. Otherwise, the output remains at the precharge level; the capacitance of the wires holds the -9 volts. I won't trace out the entire circuit, but the upper X/Y transistor pairs implement an OR gate since if either one is on, the carry can get through. The lower X/Y transistors implement an AND gate; if both are on, the +5 signal will get through, generating a 1.
You might wonder why this carry lookahead circuit is any faster than a plain ripple-carry adder, since the carry signal has to go through up to seven large gates to generate the last carry bit. The trick is that the entire circuit is electrically a single large gate due to the dynamic design. All the transistors are activated in parallel, and then the 5-volt signal can pass through them all rapidly.5 Although there is still a delay as this signal travels through the circuit, the circuit is faster than the standard ripple carry adder which activates transistors in sequence.
The efficient handling of carries was an issue back to the earliest days of mechanical calculation. The mathematician Blaise Pascal created a mechanical calculator in 1645. This calculator used a mechanical ripple carry mechanism powered by gravity that rapidly propagated the carry from one digit to the next. Almost two centuries later, Charles Babbage designed the famous difference engine (1819-1842). It used a slow ripple carry; after the addition cycle, spiral levers on a rotating shaft activated each digit's carry in sequence. Babbage spent years designing a better carry mechanism for his ambitious Analytical Engine (1837), developing an "anticipating carriage" to perform all carries in parallel. With the anticipating carriage, each digit wheel had a sliding shaft that moved into position when a digit was 9. When a digit triggered a carry by moving from 9 to 0, it raised the stack of shafts, incrementing all the appropriate digits in parallel (see video).
The first digital computers used ripple carry. The designer of the Bell Labs relay computer (1939) states that "the carry circuit was complicated" due to the use of binary-coded decimal (BCD). The groundbreaking ENIAC (1946) used decimal counters with ripple carry. Early binary electronic computers such as EDSAC (1949) and SEAC (1950) were serial, operating on one bit at a time, so they computed carries one bit at a time too. Early computers with parallel addition such as the 1950 SWAC (the fastest computer at the time) and the commercial IBM 701 (1952) used ripple carry.
As computers became faster in the 1950s, ripple carry limited performance so alternatives were developed. In 1956, the National Bureau of Standards patented a 53-bit adder using vacuum tubes. This design introduced the important carry-lookahead concept, as well as the idea of using a hierarchy of carry lookahead (two levels in this case). The diagram below illustrates the complexity of this adder.
The development of supercomputers led to new carry techniques. The transistorized Atlas was built by the University of Manchester, Ferranti and Plessey in 1962. It used the influential Manchester carry chain technique, described in 1959. The Atlas vied with the IBM Stretch (1961) for the title of the world's fastest computer. The Stretch introduced high-speed techniques including the carry-select adder and the patented carry save adder for multiplication.
As with mainframes, microprocessors started with simple adders but required improved carry techniques as performance demands increased. Most early microprocessors used ripple carry, such as the 6502, Z-80, and ARM1. Carry-skip was often used for the program counter (as in the 6502 and Z-80); ripple carry was fast enough for 8-bit words but too slow for the 16-bit program counter. The ALU of the Intel 8086 (1978) used a Manchester carry chain as well as carry skip. The large transistor counts of VLSI chips permitted more complex adders, fed by research in parallel-prefix adders. The DEC Alpha 21064 (1992) combined multiple techniques: Manchester carry chain, carry lookahead, conditional sum, and carry select (details). The Hewlett-Packard PA-8000 (1995) contained over 20 adders for various purposes, including a Ling adder, a type developed at IBM in 1966 (details). The Pentium II (1997) used a 72-bit Kogge-Stone adder while the Pentium 4 (2000) used a Han-Carlson adder.6
This history shows that carry propagation was an important performance problem in the 1950s and remains an issue today with continuing research and improvements. Many different solutions have been developed, first in mainframes and later in microprocessors, growing more complex as technology advances. These approaches have tradeoffs of die area, cost, and speed, so different processors choose different implementations.
If you're interested in the 8008, I have other articles about it describing its architecture, its ALU, its on-chip stack, bootstrap loads, and its unusual history. I announce my latest blog posts on Twitter, so follow me at @kenshirriff. I also have an RSS feed.
The functional blocks of the 8008 processor are documented in the datasheet (below). The layout of this diagram closely matches the physical layout on the die. I've highlighted the carry lookahead block.
According to Federico Faggin's oral history, the 8008 team was lucky to be allowed to even use an 18-pin package for the 8008. "It was a religion in Intel" to use 16-pin packages, even though other manufacturers commonly used 40- or 48-pin packages. When Intel was forced to move to 18-pin packages for the 1103 RAM chip, it "was like the sky had dropped from heaven. I never seen so [many] long faces at Intel". The move to 18 pins was beneficial for the 8008 team, which had been forced to use 16 pins for the earlier 4004. However, even 18 pins was impractically small considering the chip used 14-bit addresses. The result was that address and data signals were multiplexed over 8 data pins. This both slowed the processor and made use of the chip more complicated. Intel soon gave up on small packages, using a standard 40-pin package for the 8080 processor in 1974. ↩
I'm ignoring subtraction in this discussion because it was implemented by addition, adding a two's complement value. Multiplication and division were not implemented by early microprocessors. Interestingly, even the earliest mainframe computers implemented multiplication and division in hardware. ↩
Most of the "classic" microprocessors were implemented with NMOS transistors. If you're familiar with NMOS gates, everything is backwards with PMOS. Although PMOS has worse performance than NMOS, it was easier to manufacture at first, so the Intel 4004 and 8008 used PMOS. PMOS required fairly large negative voltages, which is why the diagram shows -9 volts and +5 volts. ↩
I'm hand-waving over the timing of the carry lookahead circuit. An accurate analysis of the timing would require considering the capacitance of each stage, which might add an O(n²) term.
Also note that this carry lookahead circuit is a bit unusual. A typical carry lookahead circuit (as in the 74181 ALU chip) expands out the gates, yielding much larger but flatter circuits to minimize propagation delays. On the other hand, the 8008's circuit has a lot in common with a Manchester carry chain, which uses a similar technique of passing the incoming carry through a chain of pass transistors, or potentially generating a carry at each stage. A Manchester carry chain, however, uses a single N-stage chain rather than the 8008's triangle of separate chains for each bit. A Manchester carry chain can tap each bit's carry from each stage of the chain, so only one chain is required. The 8008's carry circuit, however, lacks the transistors that block a carry from propagating backwards, so its intermediate values may not be valid.
In any case, the 8008's carry lookahead circuit was sufficiently fast for Intel's needs. ↩
Back in the late 1970s, the most popular memory chip was Mostek's MK4116, holding a whopping (for the time) 16 kilobits. It provided storage for computers such as the Apple II, TRS-80, ZX Spectrum, Commodore PET, IBM PC, and Xerox Alto as well as video games such as Defender and Missile Command. To see how the chip is implemented I opened one up and reverse-engineered it. I expected the circuitry to be similar to other chips of the era, using standard NMOS gates, but it was much more complex than I expected, built from low-power dynamic logic. The MK4116 also used advanced manufacturing processes to fit 16,384 high-density memory cells on the chip.12
I created the die photo below from multiple microscope images. The white lines are the metal wiring on top of the chip, while the silicon underneath appears dark red. The two large rectangular regions are the 16,384 memory cells, arranged as a 128×128 matrix split in two. In between the two memory arrays are the amplifiers and selection circuits. The control and interface circuitry is at the left and right, connected to the external pins via tiny bond wires.
In dynamic RAM, each bit is stored in a capacitor with the bit's value, 0 or 1, represented by the voltage on the capacitor.3 The advantage of dynamic RAM is that each memory cell is very small, so a lot of data can be stored on one chip.4 The downside of dynamic RAM is that the charge on a capacitor leaks away after a few milliseconds. To avoid losing data, dynamic RAM must be constantly refreshed: bits are read from the capacitors, amplified, and then written back to the capacitors. For the MK4116, all the data must be refreshed every two milliseconds.
The diagram below illustrates four of the 16,384 memory cells. Each memory cell has a capacitor, along with a transistor that connects the capacitor to the associated bit line. To read or write data, a row select line is energized, turning on the transistors in that row. The row's capacitors are connected to the bit lines, allowing the bits in that row to be accessed.
One of Mostek's key innovations was to multiplex the address pins.6 Earlier memory chips used a separate pin for each address bit; as memory sizes increased, so did the number of address pins. This forced Intel's 4096-bit memory chip, for instance, to use a large, more costly 22-pin package.5 Mostek cut the number of address pins in half by using each address pin twice, first for a "row" address, and then a "column" address. This approach became the industry standard, allowing memory chips to fit into inexpensive 16-pin packages.
Externally, the chip stores a single bit for 16,384 different addresses. (Typically, eight of these chips were used in parallel to store bytes.) Internally, however, the chip is implemented as a 128×128 matrix of storage cells. The row address selects a row of 128 cells7 and then the column address selects one of these 128 cells to read or write.8 Meanwhile, the entire row of 128 cells is refreshed by amplifying the signals and storing them back in the capacitors.
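The row-then-column access can be summarized with a small behavioral model. This is only a sketch of the externally visible behavior described above, not of the chip's circuitry. The class and function names are hypothetical, and the choice of which half of the 14-bit address is sent as the row is an assumption (in a real system that is decided by the computer's multiplexing hardware). The refresh interval simply divides the 2 ms refresh period evenly over the 128 rows.

```python
ROWS = COLS = 128                 # 128 x 128 = 16,384 one-bit cells
REFRESH_PERIOD_US = 2000          # all data must be refreshed every 2 ms
ROW_REFRESH_INTERVAL_US = REFRESH_PERIOD_US / ROWS   # 15.625 us if spread evenly


def split_address(address):
    """Split a 14-bit address into 7-bit (row, column) halves.

    Assumption: the low 7 bits are sent as the row address.
    """
    return address & 0x7F, (address >> 7) & 0x7F


class Dram16K:
    def __init__(self):
        self.cells = [[0] * COLS for _ in range(ROWS)]
        self.row = None            # row latched by the most recent RAS
        self.refreshed = set()     # rows rewritten so far (for illustration)

    def ras(self, row_addr):
        """Row address strobe: sense all 128 cells in the row and write them back."""
        self.row = row_addr & 0x7F
        self.cells[self.row] = list(self.cells[self.row])  # stand-in for amplify + rewrite
        self.refreshed.add(self.row)

    def cas_read(self, col_addr):
        """Column address strobe: pick one bit out of the latched row."""
        return self.cells[self.row][col_addr & 0x7F]

    def cas_write(self, col_addr, bit):
        self.cells[self.row][col_addr & 0x7F] = bit & 1


ram = Dram16K()
row, col = split_address(0x1234)
ram.ras(row)
ram.cas_write(col, 1)
ram.ras(row)
print(ram.cas_read(col))          # 1
```

Note that every `ras` call touches the whole row, which is why an ordinary access to any cell also refreshes its 127 neighbors.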
The die image above is labeled with the main functional blocks.9 The chip's 16 pins are labeled around the perimeter,10 including the seven address pins (A0-A6). The Row Address Strobe pin (RAS) is used to indicate the row address is ready, while the Column Address Strobe pin (CAS) indicates that the column address is ready. The two memory arrays are in the center; I've cut out most of the cells to keep the diagram compact. The column select circuitry and sense amplifiers are between the two memory arrays. At the right, the row decode circuitry selects a row based on the address pins, while the column address circuitry buffers the address for the column select circuitry. At the left, the clock circuits generate the chip's timing pulses, triggered by the RAS, CAS, and WRITE pins. Finally, the Data Out and Data In pins provide access to the selected data bit.
The key to the DRAM chip is the memory storage cell, designed to be as compact as possible. The highly magnified photo below shows some of the storage cells, densely packed together. It's a bit hard to visualize what's going on because the chip is constructed from multiple layers. The bottom layer is the grayish silicon die. On top of the silicon are two layers of polysilicon, a special type of deposited silicon used for transistor gates, capacitors, and wiring. The top layer of the chip is the metal wiring, which was removed for this photo. The photo shows three bit lines in the silicon, with bulb-shaped storage cells connected on either side. Vertical strips of polysilicon (poly 1) over the storage cells implement capacitors: the silicon forms the lower plate, while the polysilicon forms the upper plate. The second layer of polysilicon (poly 2) is arranged in diagonal regions to implement the selection transistors, where square notches in the poly 1 layer allow the poly 2 layer to approach the silicon.
The cross-section diagram below shows the three-dimensional, layered structure of a memory cell. At the bottom is the silicon (brown); the bit line (dark brown) is made from doped silicon. Above the silicon are the two polysilicon layers (red) and the metal layer (purple), separated by insulating silicon dioxide (gray). At the far left, the poly 1 layer and underlying silicon form a capacitor. In between the capacitor and the bit line, the poly 2 layer forms the gate of the transistor. At the left, the poly 2 layer is connected to the metal of the word line, which turns the transistor on, connecting the capacitor to the bit line.
The diagram below illustrates how bits are addressed in the storage matrix. The arrangement is somewhat confusing because columns of cells are offset and interlocked like zippers. A row select line is connected to the centers of diagonal poly 2 regions, so each region controls two transistors on neighboring bit lines. (For instance, in the upper left, the poly region connected to row select 0 forms transistors 0A and 0B.) The result is that each row select line activates 128 cells, one for each bit line in a staggered arrangement.
A key feature of the MK4116 memory chip is that it uses almost no power when it is sitting idle. Although it consumes 462 milliwatts when active, it uses just 20 milliwatts in standby mode. Although low-power circuitry is straightforward to build with modern CMOS technology, the 4116 used earlier NMOS transistors. Most NMOS integrated circuits constructed logic gates with load transistors, a simple technique with the disadvantage of wasting power. Instead, the MK4116 memory chip uses dynamic logic, which is considerably more complex but saves power while idle.
A typical dynamic logic gate (below) operates in two phases. In the first phase, a clock signal turns on the upper transistor, precharging the output to +12 volts, the "1" state. The upper transistor then turns off, but the output remains high due to the capacitance of the wire. In the second phase, the lower transistors can pull the output low. In particular, if either input is 1, the corresponding transistor turns on and pulls the output low, so the circuit implements a NOR gate. This circuit doesn't consume any static power, just a small current to charge the wire capacitance when switching. (The inputs must be carefully timed so they don't overlap with the precharge clock.) The use of dynamic circuitry makes the 4116 much more complex than it would be otherwise since the gates are controlled by clock signals, which need to be generated.
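The two-phase behavior is easy to model in software. The sketch below is a logical model only (the class name is hypothetical, and voltages are reduced to 0 and 1); it treats the output wire as a stored charge that the precharge phase sets and the inputs can only remove.

```python
class DynamicNor:
    """Logical model of a precharged NOR gate; voltages are reduced to 0/1."""

    def __init__(self):
        self.output = 0          # no charge on the output wire yet

    def precharge(self):
        """Phase 1: the clocked upper transistor charges the output to '1'."""
        self.output = 1

    def evaluate(self, *inputs):
        """Phase 2: any input at 1 turns on a lower transistor and discharges the output."""
        if any(inputs):
            self.output = 0
        return self.output       # otherwise the wire capacitance holds the charge


gate = DynamicNor()
for a in (0, 1):
    for b in (0, 1):
        gate.precharge()
        print(a, b, gate.evaluate(a, b))
```

The model also shows why the precharge clock is essential: once `evaluate` has discharged the output, nothing in the gate can raise it again until the next `precharge`.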
The purpose of the row-select circuitry is to decode the 7 address bits and energize the corresponding row select line (out of 128) to read one row of memory. In the first step, 32 5-input NOR gates decode address bits A0 through A4. These NOR gates are implemented in the compact circuit shown below. Each NOR gate takes a different combination of non-inverted and inverted address bits and matches a particular 5-bit address. These NOR gates use dynamic logic, first pulled high and then discharged to ground, except for the selected address which remains high. Next, each NOR output is split into four, based on A5 and A6. The result is that one of 128 row select lines is activated, turning on the transistors for that row in the matrix.
The NOR gates are implemented in several compact blocks; one block of three NOR gates is shown below. Each NOR gate is a horizontal stripe of doped silicon, with ground above and below it. Each NOR gate has transistors (pink stripes) connected to ground alternating above and below it. A transistor will pull the NOR gate low if the connected address line is high. The precharge transistors at the left pull the NOR gates to +12 volts, while the output control transistors control the flow of the decoded outputs to the rest of the circuitry.
The small greenish blobs at the end of a transistor gate (pink stripe) are connections (vias) between a transistor gate and an address line. The address lines are represented as vertical yellow stripes (since the metal layer was removed). Note that each transistor gate has an address line at the right and the inverted address line at the left; thus, the NOR gates all have the same basic layout, but with the contacts changed to match a particular address. For instance, the upper NOR gate has transistors connected to A0, A2, A1, A3, and A4, so it will be active for address 00000; any other address will pull it low.
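The two-step row decode can be sketched as follows. This models the logic only; the function name is hypothetical, and the numbering of the 128 output lines (line index equal to the 7-bit address) is an assumption for readability, since the physical ordering on the die may differ.

```python
def row_select_lines(address):
    """Return 128 values with exactly one 1: the energized row select line."""
    low5 = address & 0x1F          # A0-A4 feed the 32 NOR gates
    high2 = (address >> 5) & 0x3   # A5-A6 split each NOR output four ways

    nor_outputs = []
    for k in range(32):
        # Gate k connects each transistor to the true or inverted address line
        # so that no transistor conducts only when A0-A4 equals k.
        transistor_inputs = [((low5 >> bit) ^ (k >> bit)) & 1 for bit in range(5)]
        # Dynamic NOR: precharged high, discharged if any transistor turns on.
        nor_outputs.append(0 if any(transistor_inputs) else 1)

    lines = [0] * 128
    for k, out in enumerate(nor_outputs):
        for sel in range(4):
            # Assumed line numbering: index = (A6 A5) << 5 | (A4..A0)
            lines[(sel << 5) | k] = 1 if (out and sel == high2) else 0
    return lines


lines = row_select_lines(0b0000000)
print(lines.index(1), sum(lines))  # 0 1
```

For gate 0, every transistor is wired to a non-inverted address line, which corresponds to the upper NOR gate in the photo that stays high only for address 00000.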
The sense amplifiers are one of the most challenging parts of designing a memory chip. The job of the sense amplifier is to take the tiny voltage from a capacitor and amplify it into a binary 0 or 1.11 The challenge is that even though 12 volts is stored in a capacitor, the signal from the capacitor is very small, only 100 millivolts or so. (Because the bit line is much larger than the tiny memory cell capacitor, the capacitor causes a very small voltage swing.)12 It is critically important for the sense amplifier to operate accurately, even in the presence of noise or voltage fluctuations, because any error will corrupt the data. The sense amplifier circuit must also be compact and low power since there are 128 sense amplifiers.
The chip's 128 sense amplifiers, one for each column, are located between the two memory arrays as shown above. During a read, 128 values in a row are accessed in parallel and amplified by the sense amplifiers. These 128 values are then written back to refresh the values in the capacitor. For a write operation, one of the bits is updated with the new value before they are written back.
Each sense amplifier (above) is a very simple circuit. It takes two inputs and compares them, pulling the lower one to 0.13 It is built from two cross-coupled transistors, each trying to pull the other one low. Whichever transistor has the higher voltage to start with will "win", forcing the other side low.14 The sense amplifier is sensitive to very small voltage differentials, allowing it to distinguish the small signals from a storage cell.
Locating the sense amplifiers between the two memory arrays isn't arbitrary, but the key to their operation: this is the "divided bit line" architecture introduced in 1972. The idea is that one input to the sense amp is the voltage from the desired memory cell, while the other input is a threshold voltage from a "dummy cell" in the opposite memory array. Dummy cells are constructed and precharged like real memory cells except the capacitor is half-sized, so they provide a voltage midway between a 0 bit and a 1 bit.3 If the voltage from the real memory cell is lower, the sense amp outputs a 0, and if higher, it outputs a 1.
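A back-of-the-envelope charge-sharing calculation shows why a half-sized dummy capacitor lands between the two cases. The sketch below uses the capacitance figures quoted in the footnotes (roughly 800 fF for the bit line and 40 fF for a storage cell) and a +12 volt precharge. It assumes an ideal capacitor model, assumes the dummy cell starts at 0 volts, and follows the text's simplified convention that a 1 is stored as 12 volts. Treat the resulting numbers as idealized illustrations, not measurements of the chip.

```python
V_PRECHARGE = 12.0      # volts on the bit line before the cell is connected
C_BIT_LINE = 800.0      # femtofarads (figure quoted in the footnotes)
C_CELL = 40.0           # femtofarads (figure quoted in the footnotes)
C_DUMMY = C_CELL / 2    # dummy cell capacitor is half-sized


def share_charge(v_line, c_line, v_cell, c_cell):
    """Voltage after connecting two capacitors: total charge / total capacitance."""
    return (v_line * c_line + v_cell * c_cell) / (c_line + c_cell)


def sense(bit):
    """Return (output bit, real-side voltage, dummy-side voltage) before amplification."""
    v_cell = V_PRECHARGE if bit else 0.0
    real = share_charge(V_PRECHARGE, C_BIT_LINE, v_cell, C_CELL)
    dummy = share_charge(V_PRECHARGE, C_BIT_LINE, 0.0, C_DUMMY)  # assumed 0 V dummy
    # Cross-coupled pair: the side that starts higher wins; the other is pulled to 0.
    return (1 if real > dummy else 0), real, dummy


for bit in (0, 1):
    out, real, dummy = sense(bit)
    print(bit, out, round(real, 3), round(dummy, 3))
```

In this idealized model the dummy side settles at about 11.71 volts, between roughly 11.43 volts for a stored 0 and 12 volts for a stored 1, so the comparison resolves correctly in both cases even though the difference is only a small fraction of the supply voltage.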
The dummy cells are located on the edges of the memory arrays, as shown above. They consist of capacitors and transistors (similar to real memory cells), but with a separate line to charge them. The advantage of the dummy cell approach is that manufacturing differences or fluctuations during operation will (hopefully) affect the real cells and dummy cells equally, so the voltage from the dummy cell will remain at the correct level to distinguish between a 0 and a 1. Address bit A0 controls which half of the array provides real data to the bit lines and which half connects dummy cells to the bit lines.
The purpose of the column select circuitry is to select one column out of the 128-bit row; this is the bit that is read or written. Each column select circuit is twice as wide as a memory cell, so they only decode one of 64 columns. The result is that two bits are selected at a time, and circuitry elsewhere selects one of the two bits. Like the row select circuitry, the column select circuitry is implemented by numerous NOR gates, each matching one address. For column select, address bits A0 through A5 select one of 64 lines, selecting two columns at a time. These two bit lines are connected to data lines transmitting the signals to the I/O circuitry. (Since the bit lines for the upper and lower halves of the matrix are separate, there are actually four bit lines selected by the column select circuit.) As with the row select circuitry, dynamic logic is used, controlled by various timing signals. Note that each NOR gate is physically split into two parts with the sense amp in the middle.
The schematic below shows how the column decoder works with the sense amplifier. The diagram shows two bit lines and the top half of the column decoder and sense circuitry; it is mirrored for the lower array. At the top, the sense precharge circuit pulls all the bit lines high. At the bottom, the sense amplifiers amplify and refresh the signals as explained above. The column decoder matches a particular 6-bit address, so one of the 64 decoders will activate the associated sense select circuit, connecting the chip's I/O circuitry to four bit lines (two from the upper memory array as shown here and two from the lower memory array).
At this point, four bit lines have been selected for use and their signals are passed to the input/output circuitry; the column select circuitry only decoded 1-of-64, while there are 128 columns, and each half of the array has separate bit lines. Column address bit A6 provides the final selection between the two columns. The selected bit is sent to the data-out pin for a read. For a write, the value on the data-in pin is sent back through the appropriate wire to overwrite the value in the sense amplifier. This circuitry is implemented using dynamic logic and latches, controlled by various timing signals. Much of the circuitry is duplicated, with one copy for the upper half of the memory array and one copy for the lower half. Row address bit A0 distinguishes which half of the matrix is active and which half is providing dummy data. (Note that row address bit A0 was already used to select a particular row, but the circuitry has "lost track" of which was the real row and which was the dummy row, so it must make the selection again.)
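The narrowing from four bit lines down to one can be sketched in a few lines. This is a logical model with a hypothetical function name. The column numbering within a pair and the mapping of row bit A0 to the "upper" or "lower" half are assumptions made for illustration.

```python
def select_bit_line(row_addr, col_addr):
    """Return (the four candidate bit lines, the one finally selected)."""
    pair = col_addr & 0x3F            # column A0-A5: one of 64 decoders
    within_pair = (col_addr >> 6) & 1 # column A6: final choice between two columns

    # Each decoder connects two columns in each half of the array to the I/O circuitry.
    # Assumed numbering: decoder n covers columns 2n and 2n + 1.
    candidates = [(half, 2 * pair + k) for half in ("upper", "lower") for k in (0, 1)]

    # Row A0 says which half holds real data; the other half supplies dummy cells.
    # Assumption: A0 = 0 means the upper half is the real one.
    real_half = "upper" if (row_addr & 1) == 0 else "lower"
    return candidates, (real_half, 2 * pair + within_pair)


candidates, chosen = select_bit_line(row_addr=0b0000001, col_addr=0b1000011)
print(candidates)
print(chosen)                         # ('lower', 7)
```

Sweeping the column address over all 128 values reaches every column exactly once, which is the point of combining the 1-of-64 decode with the single A6 selection.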
The chip requires many timing signals for the various steps in a memory operation. The memory chip doesn't use an external clock, unlike a CPU, but generates its own timing signals internally. The diagram below illustrates the clock generators, using buffers to create a delay between each successive clock output. The first set of timing signals is triggered by the row-access strobe (RAS), indicating that the computer has put the row address on the address pins. The next set of timing signals is triggered by the column-access strobe (CAS), indicating the column address is on the address pins. Other timing signals are triggered by the WRITE pin.
The real clock circuitry is much more complex than the diagram indicates, consisting of dozens of transistors in multiple chains, feeding back in complex ways to shape the pulses. (Among other things, using dynamic logic requires each buffer to have both an input that pulls it high and an input that pulls it low, forming almost a circular problem.) These gates are mostly built from large transistors, as shown below, to provide enough current to drive the circuitry, and to increase the gate delay sufficiently. The clock circuitry also uses many capacitors, probably bootstrap loads to pull signals up sharper. I'm not going to describe the clocks in detail since it's a complicated mess.
The chip uses surprisingly complex circuits for the address pins and the data input pin. Mostek's earlier memory chip had problems due to noise margins on the inputs, so the MK4116 uses a complex circuit with an analog threshold, capacitor drive, and multiple controls and latches.
The diagram below shows the threshold generation circuit, which generates a 1.5-volt reference. It uses many tiny transistors in series to generate the voltage level. Conceptually, it is similar to a resistor divider between power and ground to produce an output voltage. However, resistors are both power-hungry and difficult to build in integrated circuits, so transistors are used instead. Since this circuit is always active, the designers needed to minimize its current; this was achieved by using many transistors in series.
The voltage on the input pin and the threshold voltage are fed into a differential amplifier/comparator, conceptually similar to the sense amplifiers. Each side tries to pull the other side low, ending up with a 1 for the "winning" side and 0 for the "losing" side. Thus, the input is converted into a binary value. The result from the comparator is stored in a latch. Multiple timing signals gate the input signal, precharge the circuitry, and control the latch.
The photo above shows the input circuit for the data-in pin. Next to the pin's bond wire are the threshold circuit and latch; the two capacitors are the large rectangles of metal. The voltage reference circuit is next; the data-in voltage reference is similar to the address voltage reference described above. (I left the metal layer on for this photo; the polysilicon and silicon underneath are obscured by the oxide layer.)
This memory chip was much more complex than I expected. I studied a simple Intel memory chip earlier so I assumed this DRAM would be larger but not much more complicated. Instead, the MK4116 has complex circuitry with over 1000 transistors controlling it, in addition to the 16,384 transistors for the memory cells and about 1500 transistors for the column selects and sense amps. A cause of the complexity is that the design needed to optimize multiple axes: density, speed, and power efficiency.16
The table below shows that each generation of DRAM chips required substantial technological changes and new developments. Memory designers don't just sit around waiting for Moore's Law to increase the memory capacity; they have to constantly develop new techniques because DRAM storage cells are fundamentally analog. Fortunately, DRAM designers have continued to solve memory scaling problems; 16-gigabit DRAMs recently went into production, an amazing factor of a million larger than the 16-kilobit MK4116 DRAM chip of 1976.
A brief history of memory innovations is here. For detailed information on DRAM circuits, see this 1990 thesis on sense amplifier design. For history, Storage array and sense/refresh circuit for single-transistor memory cells (1972) introduced the concepts of dummy cells and cross-coupled sense amplifiers. Intel's chip is discussed in A 16 384-Bit Dynamic RAM (1976) while Mostek's chip is discussed in A 16K × 1 bit dynamic RAM (1977) and 16K - The new generation dynamic RAM (1977). Inconveniently, I found most of these references after I had this blog post nearly completed. ↩
An unusual characteristic of the chip is that it doesn't use "buried contacts". The issue is how to connect a polysilicon wire to a silicon circuit. In integrated circuits of the 1960s, polysilicon couldn't be connected to silicon directly, so a via connected the polysilicon wire to the metal layer, which had a short connection to a second via that connected down to the silicon. In 1968 at Fairchild, Federico Faggin invented the buried contact, a way to connect the polysilicon and silicon directly. This was much more convenient, so all the NMOS chips that I have examined use buried contacts.
However, the 4116 doesn't use buried contacts. Instead, it uses the obsolete connections through the metal layer. It's a mystery why they did this. Perhaps the metal wiring density was low enough that the additional segments weren't a problem and they could eliminate one masking and processing step. (Another theory is maybe there were patent issues, but I'm not aware of any patent on the buried contact.) But this illustrates that technological progress isn't consistently linear. Even an advanced chip like the 4116 can use obsolete techniques in some areas. ↩
In the MK4116, a 0 bit is represented by storing 12 volts on the capacitor, while a 1 bit is represented by 0 volts on the capacitor. This is backward from what you might expect, but probably saved an inverter somewhere in the circuitry. To avoid confusion, I ignore this in the text. ↩↩
Early dynamic RAMs such as the Intel 1103 used three transistors per cell and used separate lines for reading and writing data. Improvements in memory technology shrunk the circuit to a single transistor and a single data line. Static RAM, in comparison, often requires 6 transistors per bit, but has the advantage of not needing to be refreshed. ↩
For example, Intel's 2107 4096-bit DRAM required 22 pins, as did the 2101 256×4 static RAM chip. It's ironic that Intel used larger packages for these memory chips because a few years earlier, Intel had steadfastly refused to go beyond 16 pins, forcing the Intel 4004 microprocessor to use a 16-pin package. The 8008 microprocessor was barely allowed 18 pins, when 24 pins would have been more convenient. This made the 8008 slower and harder to use. ↩
Although multiplexing the address pins might seem trivial, Mostek claims that they bet the company on this idea. The problem is how to implement multiplexing without making memory accesses wait while both parts of the address are loaded. (The time to read memory was a key factor in computer design, so every nanosecond counted.) In Mostek's solution, first the row address is put on the address pins, and the row-access strobe (RAS) is activated. While the chip is reading that row from memory, the computer puts the column address on the address pins and activates the column-access strobe (CAS). By the time the 128 bits of the storage row have been read, the column address is available and the desired bit is selected from the row of 128 bits. In other words, reading of the row is overlapped with loading of the column address, so multiplexing doesn't slow the system. However, careful timing is required to make this multiplexing work; much of the chip is devoted to clock circuitry to generate the necessary timing pulses. ↩
The RAM chip operates on memory a row at a time, and then selects one entry from the row. This isn't the obvious way to access memory. In comparison, magnetic core memory also holds memory cells (cores) in a matrix, but accesses a single cell using X and Y select lines. A key reason for a DRAM to operate a row at a time is so the entire row can be refreshed at once, dramatically reducing the performance overhead from refresh operations. ↩
You might wonder if it's possible to read multiple bits from a row without repeating the entire row-read operation. The chip designers thought of that and provided several techniques to boost efficiency. The page-read and page-write functions let you rapidly access multiple bits in a 128-bit page (i.e. row). A read-modify-write sequence lets you read a row, modify bits in it, and write it back without repeating the row-read. A RAS-only refresh operation lets you read and refresh a row without providing a column address. The point of this is that the chip designers implemented clever features so customers could squeeze as much performance out of their memory system as possible. See the datasheet for details. ↩
The block diagram below shows the main functional blocks of the 4116. Many parts of this block diagram didn't make sense to me until after I had reverse-engineered the chip, such as the clock generator, dummy cells, and "1 of 2 data bus select". Many datasheets present a somewhat abstracted view of how the chip operates, but the 4116 datasheet accurately matches the implementation.
One inconvenient feature of the memory chip is it requires three different voltages: +12 volts, +5 volts, and -5 volts. Almost all the circuitry runs on 12 volts. The 5-volt supply is used only to provide a standard TTL voltage level for the data out pin. The -5 volts is a substrate bias, connected to the underlying silicon die to improve the characteristics of the transistors. Later chips implemented a charge pump circuit to generate the bias voltage, eliminating the need for an external bias voltage. Later memory chips also eliminated the need for +12 volts. This simplified use of the chips, since only a single-voltage power supply was required. A less-obvious benefit is that this made two of the chip's 16 pins available for other uses. Specifically, these pins were used as additional address bits in the next two generations of memory chips, the 64-kilobit and 256-kilobit chips. As a side effect, the address pins are in a somewhat scrambled order, due to the location of the available pins. ↩
It's not a coincidence that the input to the sense amp is very small, just enough to be reliably amplified. This is a consequence of economics: if the DRAM produced a large voltage difference, the designers would shrink the cells to save money. But if the voltage difference was too small for reliability, the designers would need to increase the cells. The result is a design where the voltage difference is just barely large enough to be reliably amplified by advanced circuitry. (We noticed the same thing when using a vintage 1960s IBM core memory (video); we were just barely able to read the core values. The cause is the same: if the cores had produced nice clean pulses, they were larger than they needed to be.) ↩
When the capacitor is connected to the bit line, the resulting voltage will depend on the relative capacitances of the capacitor and the bit line. The bit line capacitance is said to be 800 fF, while the storage cell has 40 fF capacitance, for a 20:1 ratio. Thus, the resulting voltage will be very close to the +12V precharge voltage on the bit line, but perturbed by a few hundred millivolts. ↩
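The charge-sharing arithmetic can be sketched in a few lines of Python. The capacitances and the +12V precharge come from the figures above; treating the cell as either fully discharged (0 V) or fully charged (12 V) is my assumption, so the result is an upper bound on the perturbation rather than a measured value.

```python
# Charge sharing between a DRAM storage cell and its bit line.
# Capacitances and precharge voltage are the figures quoted in the text;
# the two cell voltages (0 V and 12 V) are idealized assumptions.

def shared_voltage(c_bit_f, v_bit, c_cell_f, v_cell):
    """Voltage after two capacitors are connected (total charge is conserved)."""
    total_charge = c_bit_f * v_bit + c_cell_f * v_cell
    return total_charge / (c_bit_f + c_cell_f)

C_BIT = 800e-15      # bit line capacitance, 800 fF
C_CELL = 40e-15      # storage cell capacitance, 40 fF
V_PRECHARGE = 12.0   # bit line precharge voltage

for v_cell in (0.0, 12.0):
    v = shared_voltage(C_BIT, V_PRECHARGE, C_CELL, v_cell)
    shift_mv = 1000 * (v - V_PRECHARGE)
    print(f"cell at {v_cell:4.1f} V: bit line settles near {v:.2f} V "
          f"(shift of about {abs(shift_mv):.0f} mV)")
```

With an ideal fully discharged cell, the bit line drops by roughly 570 mV; a real cell stores less than the full swing, so the actual perturbation is smaller.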
The sense amplifier can only pull a signal low, not raise it, so you might wonder where the amplification happens. Both sides are precharged to +12 volts and the memory cell capacitance only pulls the sides down by 100 millivolts or so. The "winning" side will remain very close to 12 volts, while the other side is pulled to 0 by the sense amp. Thus a 1 bit is pulled higher by the precharge, while a 0 bit is pulled lower by the sense amp. ↩
The diagram below shows the sense amplifier voltages during operation of a prototype DRAM sense amp. First, the two sides of the sense amp are precharged to the same voltage. Next, a DRAM storage node is selected on one side and a dummy node on the other. Note that the voltage difference between the two sides is very small, maybe 200 millivolts. Finally, the difference is amplified, forcing the higher side up and the lower side down. In this case, the storage node held a 1 so it started slightly higher. If it held a 0, it would start slightly lower and the two lines would diverge in opposite directions. The point is that the sense amp takes a very small voltage differential and amplifies it into a large binary signal.
One difference between this sense amp and the MK4116 is that this circuit is precharged to a midpoint voltage, while the MK4116's is precharged to +12 volts. In this sense amp, one signal must be pulled high, while in the MK4116 both signals start near +12V and one is forced low. ↩
Robert Proebsting, co-founder of Mostek and developer of address multiplexing, has an oral history that provides some information on the 4116. He discusses why the column decoder selects one of 64 columns and the selection between the pair happens earlier. The reason is they wanted the noise from the address lines to be equal on both sides of the sense amp, so they have three address line pairs on each side. ↩
Intel produced 16,384-bit DRAM chips before Mostek, the 2116 and others, but Mostek's chips beat Intel in the marketplace. Interestingly, the internal structure was completely different from the MK4116. The 2116 contained four memory arrays internally and was structured as two independent 8-kilobit memories. This saved on power since the unused half could be left unpowered during a memory access. Moreover, if a 2116 chip had a manufacturing flaw in one half, Intel repackaged it as an 8-kilobit 2108 chip with either the upper or lower half operational. The user had to set address bit A6 appropriately to get the working half. ↩
Texas Instruments introduced the first commercial single-chip computer in 1974, combining the CPU, RAM, ROM, and I/O into one chip. This family of 4-bit processors was called the TMS1000.1 A 4-bit processor now seems very limited, but it was a good match for calculators, where each decimal digit fit into four bits. This microcontroller was also used in hand-held games2 and simple control applications such as microwave ovens.3 Since its software was in ROM, the TMS1000 needed to be custom-manufactured for each application, but it was inexpensive and sold for $2-$4 in quantity. It became very popular and was said to be the best-selling "computer on a chip".
The die photo above shows the main functional blocks of the TMS1000. One thing that distinguishes the TMS1000 (and most microcontrollers) from regular processors is the "Harvard architecture", where code and data are stored and accessed separately. In the TMS1000, code and data even have different sizes: instructions were 8 bits and stored in a 1-kilobyte ROM, while data was 4 bits and stored in a 64×4 (256-bit) RAM.4 Since the space for RAM was limited, Texas Instruments developed new circuits for RAM. In this blog post, I look at how the TMS1000 and later TI chips implemented their on-chip RAM.
Dynamic RAM revolutionized memory storage in the early 1970s; its low cost and high density rapidly made magnetic core memory obsolete. Dynamic RAM uses a tiny capacitor to store each bit, with a 0 or 1 represented by a low or high voltage stored in the capacitor. The problem with dynamic RAM is that the charge leaks away after a few milliseconds, so the values need to be constantly refreshed by reading the data, amplifying the voltages, and storing the values back in the capacitors.5 Texas Instruments developed a new dynamic RAM circuit for the TMS1000 to avoid the complexity of an external refresh circuit. Instead, each memory cell uses a clock signal to refresh itself internally.
The diagram below zooms in on the TMS1000 die photo, showing the 16×16 grid of RAM storage cells. The inset at the right shows a single storage cell. This photo shows the chip's metal layer; the transistors are underneath.
The TMS1000 is constructed from a type of transistor called PMOS, shown below. At the bottom, two regions of silicon (red) are doped to make them conductive, forming the source and drain of the transistor. A metal strip in between forms the gate, separated from the silicon by a thin layer of insulating oxide. (These layers—Metal, Oxide, Semiconductor—give the MOS transistor its name.) The transistor can be considered a switch between the source and drain, controlled by the gate. The metal layer also provides the main wiring of the integrated circuit, although the silicon layer is also used for some wiring.
The diagram below shows a closeup of one bit of storage in the TMS1000. The first die photo shows the yellowish metal layer. The metal layer both connects the circuitry and forms the gates of the transistors. The second photo shows the die after the metal has been dissolved with acid to reveal the silicon underneath. The conductive doped silicon appears pinkish, while the transistors are yellow squares. The black spot in the lower left is a via connecting the silicon to the metal above. Since the photo is hard to interpret, I created the diagram at the right, clarifying the components. The five white squares are the transistors, between pink silicon regions. There are also two capacitors (labeled) created by overlapping the metal and silicon.
The schematic below corresponds to the above circuit, with the transistors in their approximate physical locations. To write a bit, the bit is placed on the data I/O line and the address line is activated.8 This turns on transistor Q4 and allows the bit to flow to point A, where it is maintained (temporarily) by the capacitor there. The bit can be read out the same way, by activating the address line. In a typical dynamic RAM chip, each cell consists of just this transistor and capacitor, but the TMS1000 uses the additional transistors to refresh the voltage on the capacitor.
The TMS1000 refresh circuit is driven by two clock signals, clock phase 1 (Φ1) and clock phase 5 (Φ5).7 Activating clock phase 5 turns on Q3 and allows the bit to flow to point C, the gate of transistor Q1. Large transistor Q1 is the key component of the refresh circuit, as it amplifies the signal C. Next, during clock phase 1, the amplified signal at B flows through Q2, restoring the original bit stored at A. This circuit is repeated 256 times for the 256 bits of RAM storage in the chip. These clock signals are activated at about 80 kilohertz, ensuring the bit is refreshed before it can drain away.
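To get a feel for the refresh margin, here is a rough calculation. The 80 kilohertz rate is from the text; the 2 ms retention time is a hypothetical stand-in for "a few milliseconds", not a measured figure for this chip.

```python
# Rough refresh margin for the self-refreshing RAM cell.
# REFRESH_HZ is from the text; ASSUMED_RETENTION_S is a made-up
# placeholder for how long a cell holds its charge.

REFRESH_HZ = 80_000
ASSUMED_RETENTION_S = 2e-3  # hypothetical value

refresh_period_s = 1 / REFRESH_HZ
refreshes_per_retention = ASSUMED_RETENTION_S / refresh_period_s

print(f"refresh period: {refresh_period_s * 1e6:.1f} microseconds")
print(f"refreshes within the assumed retention time: {refreshes_per_retention:.0f}")
```

Under that assumption, each bit is rewritten on the order of a hundred times before the charge would have leaked away, so the margin is comfortable.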
CMOS (Complementary MOS) is a type of circuitry that combines NMOS and PMOS transistors to reduce power consumption. In 1978, TI began building CMOS calculator chips, starting with the TP0310 and TP0320 chips.6 These chips were used in calculators such as the TI-30-II (below), TI-35, and TI-50. The switch to CMOS coincided with TI's switch from power-hungry LED or vacuum fluorescent displays (VFD) to low-power LCD (details). These improvements led to better battery life. TI also used CMOS to implement "Constant Memory™", preserving calculator data even when the calculator was off; CMOS's low power consumption meant that the memory could be continuously powered without draining the battery.
CMOS has a long history, starting with its invention in 1963. RCA did a lot of early development of CMOS, introducing the 4000-series of integrated circuits in 1968 and the first CMOS processor, the RCA 1802, in 1974. RCA was unfortunately a decade too early for market success with CMOS; although CMOS's lower power consumption made it useful for niche aerospace markets, NMOS processors dominated the microprocessor industry. Eventually, however, mainstream microprocessors switched to CMOS with the Intel 80386 in 1985 and Motorola's 68030 in 1987, and CMOS is the dominant technology today.
TI's move from metal-gate PMOS to CMOS in 1978 is unusual. Other manufacturers (such as Intel) switched from metal-gate transistors to the much superior silicon-gate transistors around 1971, and then moved from PMOS to NMOS around 1974. It's unclear why Texas Instruments continued using inferior metal-gate PMOS circuitry for several years; perhaps calculators didn't need the improved performance so it wasn't cost-effective to switch. But then Texas Instruments skipped over the NMOS generation entirely, jumping to CMOS a decade before the mainstream microprocessor industry. This decision is easier to justify, since low-power CMOS was a clear advantage for battery-powered calculators. Curiously, TI continued to use inferior metal-gate transistors, even after moving to CMOS.
This history illustrates that technological progress isn't a straightforward path with new and improved technologies replacing older technologies. Instead, a new technology like CMOS may take years to catch on, becoming successful in particular markets but not making headway in others until economic factors and engineering tradeoffs change.
Getting back to the TP0320, the die photo below shows the TP0320 die, zooming in on the RAM array. This 32×24 array holds 768 bits, a significant upgrade from the TMS1000. The closeup at the right zooms in on a single bit. The bit cell has a different layout from the TMS1000 RAM. The design switched from dynamic RAM to static RAM, eliminating the capacitors and the need for refresh. In this section, I'll explain how this RAM cell is implemented.
The diagram below shows how two inverters can be connected in a loop to store either a 0 or a 1. If the upper signal is 1, the inverter on the right outputs a 0 on the bottom, and the inverter on the left outputs a 1 at the top, reinforcing the original signal. Alternatively, the top signal can be a 0 as shown on the right. The key difference between this static circuit and the previous dynamic circuit is that the static circuit will hold a bit for an arbitrarily long time. The bit won't leak out of a capacitor as in a dynamic RAM, so refresh is not needed.
To make a usable storage cell, an addressing mechanism is added to the inverter circuit above. When the address select line is activated, the transistors connect the inverters to the data lines. For a read, the value of the cell is read from the data line. For a write, the desired bit and its complement are applied to the data lines, overpowering the value stored in the inverters and switching them to the new bit value. This type of storage cell is used to implement registers in many processors, including the Zilog Z80 and the Intel 8085.
The diagram below shows how a CMOS inverter is constructed from two transistors. The upper transistor is a PMOS transistor, while the lower transistor is an NMOS transistor. With a 0 input, the PMOS transistor turns on, connecting the output to the positive voltage (1). With a 1 input, the NMOS transistor turns on, connecting the output to ground (0). Thus, the output is the opposite of the input, as you'd expect from an inverter.
Putting this all together yields the schematic below. Transistors Q1 and Q3 implement one inverter, while transistors Q2 and Q4 implement the second inverter. Transistors Q5 and Q6 select the cell based on the address. The transistors are arranged on the schematic to match their physical locations.
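The behavior of this cell can be sketched as a small logic-level model. This is only an illustration of the two cross-coupled inverters and the select transistors described above; it ignores voltages and timing, and the class and method names are made up for the example.

```python
# Logic-level sketch of the six-transistor static RAM cell: two cross-coupled
# CMOS inverters (Q1/Q3 and Q2/Q4) plus two select transistors (Q5 and Q6).
# Names are hypothetical; analog behavior is not modeled.

def cmos_inverter(value: int) -> int:
    """PMOS pulls the output to 1 on a 0 input; NMOS pulls it to 0 on a 1 input."""
    return 0 if value else 1


class StaticCell:
    def __init__(self, bit: int = 0):
        self.node = bit                      # output of one inverter
        self.node_bar = cmos_inverter(bit)   # output of the other inverter

    def settle(self) -> None:
        """Each inverter re-drives the other node, reinforcing the stored bit."""
        self.node_bar = cmos_inverter(self.node)
        self.node = cmos_inverter(self.node_bar)

    def write(self, select: bool, data: int) -> None:
        """With select active, the data lines overpower the stored value."""
        if select:
            self.node, self.node_bar = data, cmos_inverter(data)
        self.settle()

    def read(self, select: bool):
        """With select active, return the bit and its complement; otherwise None."""
        self.settle()
        return (self.node, self.node_bar) if select else None


cell = StaticCell()
cell.write(select=True, data=1)
cell.write(select=False, data=0)   # ignored: the cell is not selected
for _ in range(1000):              # no refresh needed; the loop holds its state
    cell.settle()
print(cell.read(select=True))
```

The unselected write leaves the cell untouched, and repeated settling never changes the stored bit, which is the property that distinguishes this static cell from the dynamic one.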
The die photos below show how the storage cell is implemented in the TP0320 processor. The first photo shows three vertical metal traces that wire the cell together. In the second photo, the metal was removed with acid to reveal the silicon underneath. The upper section holds the PMOS transistors (Q1 and Q2) while the lower section holds the NMOS transistors (Q3 to Q6). The transistors appear as whitish rectangles, while the doped silicon appears as greenish or reddish lines. The black spots are vias connecting the silicon to the metal above. The diagram can be compared with the schematic above.
The photo below zooms out a bit to show how the NMOS and PMOS transistors are arranged. Note the "P ring" that surrounds the NMOS transistors. This forms a tub of P-type silicon that holds the NMOS transistors. (This P ring is the horizontal green line below Q2 in the die photo above.) The chip contains many of these tubs, separating the PMOS and NMOS transistors.
In 1981, Texas Instruments introduced a more powerful architecture, the TP0455, followed shortly by the TP0456. The TP0456 chip was used in calculators such as the TI-55-II scientific calculator, TI-35, and TI-60, as well as educational toys such as Little Professor and Spelling B.
The die photo below shows the TP0456. The RAM array is in the lower-left corner, while the ROM is in the lower-right. The TP0456's RAM array is 32 cells wide and 16 cells tall, providing 512 bits of storage, less than the 768 bits of the TP0320.
The TP0456 uses almost the same static cell structure as the earlier CMOS chips, but the layout was changed slightly. In particular, the select line runs between the two inverter lines, rather than on the side. I don't know why they made this change, as it doesn't appear to change the density. The static RAM circuit is the same as the TP0320's, described earlier, so I won't discuss it here.
While RAM storage may seem trivial, early microcontrollers required new ways to fit storage into the limited space on a die. Even just 256 bits took up a substantial fraction of the chip. Texas Instruments developed new dynamic RAM circuits for the TMS1000 microcontroller, followed by a completely different static circuit when they switched to CMOS microcontrollers.
Decades later, microcontrollers still have limited memory capacity. The Arduino Uno, for example, has 32 kilobytes of flash for program storage and 2 kilobytes of RAM. Modern high-end microcontrollers can have megabytes for program storage and hundreds of kilobytes of RAM, but this is still orders of magnitude less than a typical microcomputer. The constraints of fitting everything onto a single chip still limit capacity and still require novel solutions, just as in the TMS1000.
I announce my latest blog posts on Twitter, so follow me at kenshirriff. I also have an RSS feed. Thanks to Joerg Woerner at Datamath for suggesting this topic and thanks to Sean Riddle for die photos.
Texas Instruments is considered the inventor of the microcontroller for developing the TMS0100 (different from the TMS1000) in 1971. While the TMS0100 has the characteristics of a microcontroller, it was marketed as a "calculator-on-a-chip". The TMS1000, however, was marketed as a "single-chip computer" for both calculator-type applications and small to medium control applications. ↩
The famous Speak & Spell used a TMS1100 microcontroller, a version of the TMS1000 with twice as much RAM and ROM. A die photo is here; it is nearly identical to the TMS1000 die except the RAM and ROM regions are stretched vertically to increase the capacity.
The architecture of the TMS1000 is rather unusual due to its roots as a calculator chip. It has just four input lines, designed to be connected to a grid of buttons. The outputs are also unusual: it has 8 "O" output lines, but these are not individually controllable. Instead, a 5-bit value is converted to the eight outputs by a customizable PLA decoder. The motivation behind this is to drive a 7-segment display. The microcontroller also has 11 "R" outputs, which are typically used to multiplex the LED display and to scan the keyboard. Another curious feature of the TMS1000 is that the instruction set was somewhat customizable.
In comparison, Intel's microcontrollers such as the popular 8048 (1976) and 8051 (1980) were much more like standard 8-bit microprocessors. Unlike the TMS1000, the Intel microcontrollers had familiar features such as an 8-bit CPU, 8-bit I/O ports, interrupts, a stack, and a fixed instruction set with Boolean operations (AND, OR, XOR) and shifts. Looking at the TMS1000 instruction set, it seems slightly alien, while the 8048's instruction set is similar to microprocessors of the time. ↩
The TMS1000 is implemented with complex logic circuitry, using a five-phase clock. The TMS1000 uses a mixture of depletion loads, gated loads, and precharge logic, for power savings. I'm not sure why the TMS1000 uses a five-phase clock. Four-phase logic was a logic design methodology at the time, but the TMS1000 circuitry doesn't appear to use four-phase principles. Among other things, the TMS1000 phases are irregular and Φ4 pulses twice per cycle. ↩
TI's Random access memory cell patent (1974) describes the memory cell used in the TMS1000. The layout in the patent is similar but not identical to the actual layout. Transistor Q5 appears in the circuit but not the patent. It pulls point B to 0 when clock phase 5 is active, making sure that a 0 bit at C is restored to a stronger 0 bit.
While most patents don't provide much useful information, Texas Instruments' calculator patents are unusually detailed and informative, providing schematics, source code, and clear explanations; they seem like they were written by engineers. (I feel that I should give TI credit for the quality of their patents.) ↩
In 1969, Sharp introduced the first calculator built from high-density MOS chips, the QT-8D, followed by the handheld Sharp EL-8, the world's smallest calculator at the time.1 These calculators were high-end products, selling for $345 (about $1800 today). Integrated circuits at the time couldn't fit the entire calculator on one chip, so these calculators contained five ICs: an arithmetic chip, a decimal point chip, a keypad/display chip, a control chip, and a clock chip.
This blog post discusses the clock chip and how it generated the unusual four-phase clock signals required by the calculator. The die photo below, provided by calculator researcher Francois Gueissaz, shows the silicon die of the clock chip. The silicon substrate has a purple tint while the doped, conductive silicon is green. The metal layer on top is white. Around the edges, seven thin bond wires connect the die to the external pins.2 This chip has about 200 transistors and implements just a dozen moderately complex logic gates. While the density of this chip is absurdly low by modern standards, it illustrates the progress of MOS integrated circuits in the late 1960s.
Although computers now all use MOS integrated circuits, the path to MOS was rocky, with MOS integrated circuits viewed as slow and unreliable in the 1960s.4 Handheld calculators were a good match for the characteristics of MOS, though: they needed to be compact and lightweight with low power consumption, but computational speed was not important. In 1969, the Japanese calculator company Sharp signed a $30 million deal with Rockwell for this MOS-based calculator chipset, the largest MOS order in history at the time. The five chips were implemented by the Autonetics division of Rockwell.3
Although the Sharp calculator (above) was handheld, you can see that it was rather thick and chunky, with unusual 8-segment vacuum fluorescent display tubes for its display. The photo below shows the circuit board inside the calculator. The board is dominated by the four large integrated circuits with circular golden lids. These integrated circuits were packaged as 42-pin ceramic ICs with staggered pins. Unlike modern printed circuit boards, the traces on this board are curved, showing its hand-drawn layout.
The clock IC is packaged in the small 10-pin metal can, marked with a blurry Rockwell logo (the inset shows the logo). The part number is CG1121, probably standing for Clock Generator. The date code 7047 indicates this IC was manufactured in the 47th week of 1970, i.e. late November.
Cutting the top off the metal can integrated circuit reveals the tiny silicon die. Although the metal can has 10 pins, only seven pins are wired to the die. The metal tab at the top of the photo indicates pin 1 of the integrated circuit.
Why do the calculator chips require a complex four-phase clock? In 1966, Autonetics invented a technique for building logic circuits called four-phase logic. Unlike standard static logic gates, these logic gates held values dynamically using the capacitance of the wiring. The four-phase clock stepped the gates through sequences of precharging and then computing the logic function. This sounds complicated, but four-phase logic had ten times the density of standard logic gates, as well as using 1/10 the power and having 10 times the speed. As a result, many early high-density MOS chips used four-phase logic.5
Transistors are the key component of the chip. The diagram below shows a metal-gate PMOS transistor, the (somewhat primitive) type of transistor used in this IC. At the bottom, two regions of silicon (green) are doped to make them conductive, forming the source and drain of the transistor. The gate is formed by a metal strip between the silicon regions, separated from the silicon by a thin layer of insulating oxide. (These three layers, Metal, Oxide, and Semiconductor, give the MOS transistor its name.) The transistor can be considered a switch between the source and drain, controlled by the gate. To simplify the behavior, a PMOS transistor turns on when the gate is pulled negative (-25 volts), while the transistor turns off when the gate is at 0 volts. (These early PMOS transistors required an inconveniently large negative voltage.)
The photos below show transistors on the die as they appear under a microscope. The silicon and metal layers match the diagram above; the doped silicon is greenish while the metal layer on top is white. The gate is formed where the metal and silicon overlap, with a faint oval where the oxide is thinned. These transistors are three different sizes: the wider transistors allow higher current. The transistors are carefully sized in the circuits based on the required current.
The next important component is the resistor; the photo below shows three resistors. These resistors may look like transistors, and that's because they are transistors. While the transistors above were widened to support more current, these transistors are made longer so the long path reduces the current flow through the transistors. This makes them act as resistors. The metal gate of these transistors is tied to -25 volts, so the transistors are always on, rather than operating as switches.
The final important component of the integrated circuit is the capacitor. A capacitor is formed by using metal for one plate and doped silicon (green) for the other plate, separated by the insulating oxide layer. The photo below shows two small capacitors and one large capacitor, at the same scale. The large capacitor is used in the output circuitry; the metal stripes above and below it are transistors that drive it.
With these components, logic gates can be constructed. The schematic below shows how an inverter is implemented in the IC. The layout of the schematic matches the die image underneath, so hopefully the transistors and capacitor can be recognized. If the input is low, the input transistor turns on, pulling the output to ground (i.e. high). If the input is high, the input transistor turns off and the "bootstrap load", the tricky circuit on the right, pulls the output to -25V (i.e. low). Thus, the circuit inverts the input.
Conceptually, you can think of the bootstrap load as a pull-down resistor. The implementation is complex to compensate for the poor characteristics of transistors at the time. The capacitor acts as a charge pump, providing a necessary voltage boost when the circuit switches. (For more details on bootstrap loads, see my earlier article.)
The implementation of a NAND gate is similar to the inverter above, but with multiple input transistors in parallel. If any input is low, the corresponding input transistor turns on, pulling the output to ground (i.e. high), as required by a NAND gate.
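Since the voltage levels are easy to mix up (ground is "high" and -25V is "low"), the following sketch models the gates at the logic level. It treats the bootstrap load as a plain pull-down to -25V and uses a midpoint switching threshold; both are simplifications of mine, not details taken from the chip.

```python
# Logic-level sketch of the PMOS inverter and NAND gate.
# Ground (0 V) is "high" and -25 V is "low". The bootstrap load is
# modeled as a simple pull-down, and the midpoint threshold is an assumption.

HIGH_V = 0.0     # ground
LOW_V = -25.0    # negative supply

def pmos_is_on(gate_v: float) -> bool:
    """A PMOS transistor conducts when its gate is pulled negative."""
    return gate_v < (HIGH_V + LOW_V) / 2

def pmos_nand(*input_voltages: float) -> float:
    """Parallel input transistors pull the output to ground if any input is low."""
    if any(pmos_is_on(v) for v in input_voltages):
        return HIGH_V
    return LOW_V   # all input transistors off: the load pulls the output low

def pmos_inverter(input_v: float) -> float:
    """An inverter is the single-input case of the NAND gate."""
    return pmos_nand(input_v)

for a in (LOW_V, HIGH_V):
    for b in (LOW_V, HIGH_V):
        print(f"inputs {a:6.1f} V, {b:6.1f} V: output {pmos_nand(a, b):6.1f} V")
```

The output is low only when both inputs are high, which is the NAND function.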
The die photo below shows the functional blocks of the clock chip. Eight NAND gates (red) form an oscillating 4-bit shift register. Four gates (yellow) generate the four-phase clock signals from the shift register outputs. Finally, four output driver circuits (orange) amplify these signals to produce high-current outputs.
The main building block of the clock chip is a NAND gate that has a delay when its output goes low. This delay creates the timing of the clock signal.6 The diagram below shows how the gate is constructed; the schematic corresponds to the layout of the circuit on the die. The delay makes this circuit somewhat complex and partially analog, but I'll try to explain it.
The NAND circuit is in the upper right; two input transistors and a bootstrap load implement the NAND circuit described earlier. The output of the NAND gate goes through a resistor-capacitor circuit. This delays the output as the capacitor slowly charges through the resistor. The speed of the clock is controlled by the bias pin, which sets a threshold voltage. This voltage controls the point in the resistor-capacitor curve when the level switching transistor turns on.7 By lowering the voltage on the bias pin, the transistor switches sooner, increasing the clock speed. The typical clock speed is 60 kHz, a slow clock even compared to early microprocessors, but calculators didn't require much speed.
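To illustrate how a switching threshold sets the delay along a resistor-capacitor curve, here is a minimal sketch. The resistor and capacitor values are hypothetical (I haven't measured them on the die), and the threshold is expressed as a fraction of the full voltage swing to sidestep the negative-voltage polarity.

```python
import math

# Delay of an RC node as a function of the switching threshold.
# R_OHMS and C_FARADS are hypothetical values chosen for illustration,
# not measurements from the chip.

def rc_delay(r_ohms: float, c_farads: float, threshold_fraction: float) -> float:
    """Time for an RC node to cover the given fraction of its final swing."""
    return -r_ohms * c_farads * math.log(1.0 - threshold_fraction)

R_OHMS = 100e3      # hypothetical resistance
C_FARADS = 10e-12   # hypothetical capacitance

for fraction in (0.3, 0.5, 0.7):
    delay_us = rc_delay(R_OHMS, C_FARADS, fraction) * 1e6
    print(f"threshold at {fraction:.0%} of the swing: {delay_us:.2f} microseconds")
```

Moving the threshold earlier on the curve shortens the delay, which is the mechanism by which the bias pin changes the clock speed.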
When the level switching transistor turns on, it pulls the buffer high,8 driving the inverter's output low. The inverter has a bootstrap load to provide sufficient output current. Finally, the output is fed back to the bias circuit, probably to sharpen the transition and provide hysteresis. To summarize, this complex circuit implements a delayed NAND gate. It is the key functional block of the chip, repeated eight times.
The clock is built from a 4-stage shift register. The idea is that each stage of the shift register shifts its bit to the right, after a delay. The bit on the right is inverted and shifted into the left side of the shift register. Thus, the shift register implements a ring counter, first shifting in 1's at the left and then shifting in 0's: the bit pattern is 0000, 1000, 1100, 1110, 1111, 0111, 0011, 0001, and back to 0000. This complete cycle corresponds to one 60 kilohertz clock cycle for the calculator.
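This arrangement, with the last bit inverted and fed back to the first stage, is commonly called a twisted ring counter or Johnson counter. The sequence can be reproduced with a short simulation. The per-step time assumes all eight steps take equal time, which is a simplification since the real delays are analog.

```python
from itertools import islice

# Simulation of the 4-stage shift register with inverted feedback.
# CLOCK_HZ is from the text; equal step times are an assumption.

def twisted_ring_states(stages: int = 4):
    """Yield successive register states as strings, leftmost stage first."""
    state = [0] * stages
    while True:
        yield "".join(str(bit) for bit in state)
        # Shift right; the new leftmost bit is the inverse of the old rightmost bit.
        state = [1 - state[-1]] + state[:-1]

CLOCK_HZ = 60_000
states = list(islice(twisted_ring_states(), 8))
step_time_s = 1 / (CLOCK_HZ * len(states))

print(", ".join(states))
print(f"about {step_time_s * 1e6:.2f} microseconds per step if the steps are equal")
```

A register with four stages passes through eight states before repeating, so each latch switches just once per half cycle.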
The schematic below shows how the shift register is built from eight cross-coupled NAND gates with delay, using the circuit described earlier. Each pair of NAND gates forms a latch, storing either a 0 or a 1. The latch outputs are labeled Q0 through Q3 while the inverted outputs are labeled Q̅0 through Q̅3. The outputs from each latch are connected to the inputs of the next stage, so the bits are shifted to the right. Note that the wires from the last stage back to the first stage are crossed; this causes the bit to be inverted. If the delay is decreased (through the bias pin), the speed of the shift register increases, increasing the clock speed.
The shift register must be initialized to the proper state, which is the job of the reset gate. When the shift register is powered up, the reset gate initializes the latches to hold zeros by pulling the lower inputs to the latches low.
The output circuitry generates the four clock phase outputs from the shift register values. Two phases come from the last shift register stage and its complement. The other two phases are more complex. An unwired "select" pin selects between two outputs for these pins; presumably this pin was wired in other versions of the clock chip to provide different clock signals for a different calculator. In the normal case, these clock outputs are formed by NANDing together two shift register outputs to produce a shorter pulse.
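As a sketch of how four phases could be decoded from the shift register, the code below takes the two main phases from the last stage and its complement, and forms each shorter pulse by NANDing two outputs. Which outputs feed each NAND gate is a guess on my part (I picked Q1 so the short pulse covers the first half of the main phase); the real wiring may differ.

```python
# Hypothetical decoding of four active-low clock phases from the shift register.
# The choice of Q1 as the second NAND input is an assumption for illustration.

def nand(a: int, b: int) -> int:
    return 0 if (a and b) else 1

def clock_phases(state):
    """Return (main_a, short_a, main_b, short_b) for a state (Q0, Q1, Q2, Q3)."""
    q1, q3 = state[1], state[3]
    main_a = q3                      # low while the last stage holds 0
    main_b = 1 - q3                  # complement: low while the last stage holds 1
    short_a = nand(1 - q1, 1 - q3)   # low during the first half of main_a
    short_b = nand(q1, q3)           # low during the first half of main_b
    return main_a, short_a, main_b, short_b

state = (0, 0, 0, 0)
rows = []
for _ in range(8):
    rows.append(clock_phases(state))
    state = (1 - state[3],) + state[:3]   # shift right with inverted feedback

for step, row in enumerate(rows):
    print(step, row)
```

In this idealized model the two main phases are exact complements. In the real chip, the asymmetric gate delays keep their active (low) periods from overlapping.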
The photo below shows one of the output buffers. The output signal enters at the left, travels through the buffer circuitry, and exits the chip through the bond wire on the right. The right half consists of two large transistors to provide the high output currents: one transistor pulls the output up to ground, while the other transistor pulls the output down to -25V. The remainder of the circuitry amplifies the small internal signal so it can drive the output transistors. Note the large bootstrap capacitor near the center; it helps drive one of the output transistors. There are also much smaller bootstrap capacitors in the upper left. This output buffer circuit is repeated four times, once for each output pin.
The output buffer transistors must be large due to an unusual characteristic of four-phase logic. Normal clocked logic uses the clock signals for timing, while the logic gates are connected to power and ground. In four-phase logic, however, the clock signals provide the power for the logic gates; there are no separate power and ground connections. When the gates are precharged and discharged by the clock signals, this provides the power for the gates. Thus, four-phase logic requires relatively high-current clock signals, since they are powering the circuits.9
To see the chip in action, the oscilloscope trace below shows the four clock outputs as measured from the chip. The yellow and blue traces are the main phases; note that the active (low) parts do not overlap. The magenta and green outputs are active during the first part of the yellow and blue phases, respectively. These clocks are used to precharge the logic circuits. (The clock phases match those on Wikipedia's four-phase article, except the polarity is reversed because of the PMOS transistors.)
Rockwell fit a calculator onto five chips, making the handheld calculator possible. However, Texas Instruments, Mostek, and other companies soon fit all the circuitry onto a single chip, creating the calculator-on-a-chip. Selling calculators was highly profitable for a short time and 11 million calculators were sold in the US in 1974. Although calculators sold for hundreds of dollars in 1969, competition and the improvements in technology caused calculator prices to plummet to $15 by 1975. The profit margin collapsed during the "calculator wars"; Texas Instruments alone lost $16 million in 1975.4
Although the calculator market was risky, the massive sales of calculators provided an important boost to MOS chip technology in the early 1970s, and thus the computer industry. In particular, microprocessors started with the Intel 4004, a chip designed for a calculator. And microcontrollers were created out of Texas Instruments' line of calculator chips. While a chip such as the CG1121 clock generator is trivial by modern standards with about 200 transistors, it provides a historical window into how chips were constructed in the early days of MOS ICs.
Thanks to Francois Gueissaz for doing all the hard work of obtaining the calculator ICs, decapping them, and providing me with die photos and other information. I announce my latest blog posts on Twitter, so follow me at kenshirriff. I also have an RSS feed.
Measuring the die photo, I believe this chip uses a 15 µm process, so the transistors and features are very large by modern standards. (This is why five chips were required to implement the calculator.) In comparison, many modern chips use a 14 nm process, so the width of a modern transistor is roughly 1000 times smaller, and the area is roughly a million times smaller. This shows the amazing progress in silicon technology described by Moore's Law. ↩
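The scaling comparison is simple arithmetic. The 15 µm figure is my estimate from the die photo, and process node names such as "14 nm" don't correspond exactly to a physical dimension, so treat the result as an order-of-magnitude check.

```python
# Order-of-magnitude scaling comparison between the two process sizes.
# Both feature sizes are nominal figures, not exact transistor dimensions.

OLD_FEATURE_M = 15e-6   # estimated feature size of this chip
NEW_FEATURE_M = 14e-9   # nominal modern process size

linear_ratio = OLD_FEATURE_M / NEW_FEATURE_M
area_ratio = linear_ratio ** 2

print(f"linear scaling: about {linear_ratio:.0f}x")
print(f"area scaling: about {area_ratio / 1e6:.1f} million x")
```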
It's hard to follow the spin-offs and acquisitions of the companies involved. Autonetics was founded as the research laboratory for North American Aviation in 1945. Among other things, Autonetics developed guidance computers for the Minuteman missile. Although North American Aviation is mostly forgotten now, it was a major aerospace company, building everything from the P-51 Mustang in World War II to the command and service module for the Apollo landing. It merged with Rockwell in 1967, becoming North American Rockwell. In 1970, about 800 employees from Autonetics were split off to form North American Rockwell MicroElectronics to develop and manufacture commercial integrated circuits. This later became Rockwell Semiconductor, then spun off into Conexant, which was later acquired by Synaptics. Rockwell was sold to Boeing in 1996.
Sharp, on the other hand, started as Hayakawa Metal Works in 1924, eventually being renamed Sharp Corporation in 1970. (The name came from the Ever-Sharp mechanical pencil, one of Hayakawa's early inventions.) Foxconn bought the majority of Sharp in 2016; Foxconn, also known as Hon Hai Precision Industry, is a Taiwanese electronics manufacturer. Although best known for manufacturing the iPhone for Apple, Foxconn is estimated to manufacture 40% of the world's consumer electronics. ↩
Much of the historical information in this post comes from the books To the Digital Age and History of Semiconductor Engineering. These books provide a detailed look at the rise of MOS integrated circuits. ↩↩
One of the main proponents of four-phase logic was Lee Boysel, who founded a company Four-Phase Systems around it. The company built 24-bit computers, which were some of the earliest MOS-based computers. Boysel's EECS presentation describes the advantages of four-phase logic. ↩
One important characteristic of the delayed NAND gate is that the delay is much larger when the output goes low than when the output goes high. This ensures that the output clock phases do not overlap while active (low). This is necessary for four-phase logic to ensure that logic gates don't conflict with each other. ↩
The level switching transistor (like other PMOS transistors) will turn on when the gate voltage is lower than the source voltage by Vt (the transistor's threshold voltage). Thus, by controlling the bias voltage on the transistor's source, the transistor can be made to turn on sooner or later, controlling the frequency. ↩
Note that the buffer circuit is constructed "backward" compared to a standard PMOS inverter. A PMOS inverter has the transistor connected to ground with a load resistor to -25V, while the buffer has the transistor connected to -25V and the load resistor to ground. I think it is constructed this way to shift the voltage levels from the level switching transistor. ↩
Although the four-phase clocks power the logic gates, the chips also have regular power and ground connections. These power the output pins since the current demands are too large to be reasonably satisfied by the clocks. ↩
In 1969, high-density MOS integrated circuits were still new and logic circuits were constructed in a variety of ways. One technique was "four-phase logic", which provided ten times the speed and density of standard logic gates while using 1/10 the power.1 One notable application of four-phase logic was calculators. In 1969, Sharp introduced the first calculator built from high-density MOS chips, the QT-8D, followed by the world's smallest calculator, the handheld EL-8. These calculators were high-end products, selling for $345 (about $1800 today).
Integrated circuits at the time weren't dense enough to implement an entire calculator on one chip so these calculators split the functionality across five ICs. These five chips were created for Sharp by the Autonetics division of Rockwell. Autonetics invented four-phase logic in the mid-1960s, so this logic family was a natural choice for the calculator chips.
In this blog post, I reverse-engineer the keypad/display chip shown above. This photo shows the tiny silicon die under a microscope. The silicon substrate has a purple tint while the doped, conductive silicon is green. The metal layer on top is white. Around the edges, thin bond wires connect the die to the 42 external pins. The chip contains roughly 500 transistors implementing 100 logic gates. While the density of this chip is absurdly low by modern standards, it illustrates the progress of MOS integrated circuits in the late 1960s.
The photo below shows the circuit board inside the calculator. The board is dominated by the four large integrated circuits with circular golden lids. These integrated circuits were packaged as 42-pin ceramic ICs with staggered pins, an arrangement that provided more room for the PCB traces. Unlike modern printed circuit boards, the traces on this board are curved, showing its hand-drawn layout.
These four chips have different functions: an arithmetic chip, a decimal point chip, a keypad/display chip, and a control chip. This blog post focuses on the keypad/display chip (NRD2256) in the upper left. The fifth chip is the clock chip in the small metal can that provides the four-phase timing pulses. The system clock runs at about 60 kilohertz, very slow by microprocessor standards, but fast enough for a calculator.
One function of the keypad/display chip is to handle keypresses, converting a digit key into a 4-bit serial binary value. (Unexpectedly, non-digit keypresses are handled by other chips.) Its second main function is to display digits on the display. Like most calculators, this calculator multiplexes the display; it displays one digit at a time, repeated rapidly enough that the display appears uniform. It does this by activating one display tube at a time and energizing the appropriate segments to produce the desired digit.2
The four main chips communicate serially, sending each decimal digit as four BCD (binary-coded decimal) bits. Each communication cycle consists of 8 digits plus a ninth unused spot, forming a 36-bit "packet".3 The basic timing comes from the 60-kilohertz clock chip; one bit is sent each clock cycle. The keypad/display chip produces additional timing signals to keep everything synchronized. First, it divides the clock by 4, generating a "digit clock" signal that indicates each 4-bit digit. The keypad/display chip cycles through the display digits, one digit every four clocks; it transmits signals to the other chips to keep track of the current digit. Thus, as the keypad/display chip cycles through the digits of the display, it receives the binary value of each digit at the right time.
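The packet arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name `bcd_packet`, the LSB-first bit order, and the placement of the unused slot at the end are assumptions, not details confirmed by the chip.

```python
CLOCK_HZ = 60_000  # approximate system clock, one bit per clock cycle


def bcd_packet(digits, slots=9):
    """Flatten up to 8 decimal digits into a list of bits, 4 bits per slot.

    Assumptions for illustration: bits are sent least-significant first
    and the unused ninth slot comes last, filled with zeros.
    """
    if len(digits) > slots - 1:
        raise ValueError("at most 8 digits fit in one packet")
    padded = list(digits) + [0] * (slots - len(digits))
    bits = []
    for digit in padded:
        if not 0 <= digit <= 9:
            raise ValueError("BCD digits must be 0-9")
        bits.extend((digit >> i) & 1 for i in range(4))
    return bits


packet = bcd_packet([1, 2, 3, 4, 5, 6, 7, 8])
digit_clock_hz = CLOCK_HZ / 4                    # one tick per 4-bit digit
packet_time_us = len(packet) / CLOCK_HZ * 1e6    # duration of one 36-bit packet
print(len(packet), digit_clock_hz, packet_time_us)
```

With a 60-kilohertz clock, the digit clock works out to 15 kilohertz and a full packet takes about 600 microseconds.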
The diagram below shows the functional units in the keypad/display chip. The "digit scan" circuitry scans through the eight digit drive lines D1-D8. The "decimal point" circuitry deserializes the decimal point input "dp" and energizes the decimal point segment when the specified digit is active. The "digit serialize" circuit converts a digit keypress into four serial bits. The "wiring" section is simply wiring between the upper half of the chip and the lower half, showing how much space is wasted by signal routing. In the lower half, the "9-segment decoder" illuminates the appropriate segments to display a digit; this digit is serialized by the "digit latch" circuit. The "clk÷4" circuit divides the input clock by four to produce the digit clock. Finally, the "key encode" circuit converts a keypress (0-9) into the four-bit value used by the "digit serialize" circuit. As will be seen, these functional blocks are not very complex, consisting of maybe 20 gates each.
The calculator chip is built from metal-gate PMOS transistors. This type of transistor was easy to manufacture in the 1960s, but rapidly became obsolete. These transistors required large negative voltages, -25 volts for the calculator chip. (For simplicity, I will view the signals as active-low; 0V is a logical 0 and -25V is a logical 1.) Another problem with metal-gate transistors is that most of the chip was occupied by silicon and metal wiring, so the density of transistors was very low.
The diagram below illustrates a metal-gate PMOS transistor. At the bottom, two regions of silicon (green) are doped to make them conductive, forming the source and drain of the transistor. The gate is formed by a metal strip between the silicon regions, separated from the silicon by a thin layer of insulating oxide. These layers (Metal, Oxide, Semiconductor) give the MOS transistor its name. The transistor can be considered a switch between the source and drain, controlled by the gate. To simplify the behavior, a PMOS transistor turns on when the gate is pulled negative (-25 volts), while the transistor turns off when the gate is at 0 volts.
The image below shows how a transistor appears on the die. The gate is formed by the metal overlapping the doped silicon (vertical green strip). Inconveniently, a contact that connects the metal layer to the silicon looks very similar to a transistor in this chip—the metal layer in a transistor almost touches the silicon, while the metal layer in a contact touches the silicon. A contact and a transistor can be distinguished with effort; a contact is more square-shaped while a transistor is more oval-shaped and slightly blurrier. As will be explained below, four-phase logic often uses transistors where both the gate and the drain are connected to the same clock; this type of connection appears at the bottom of the diagram. By recognizing the transistors, the circuitry can be reverse-engineered.
Four-phase logic is a technique for building logic gates, such as NAND gates. At the time, the standard way of building a logic gate was called "static logic", because the output remained constant as long as the inputs didn't change. A disadvantage of static logic was that it required a large "load transistor" that continuously used current, resulting in high power consumption.
A solution to this problem was "dynamic logic". Instead of providing a steady output from the gate, the gate's output was controlled by a clock signal. The gate's value would be computed and then stored by the circuit's capacitance, instead of requiring a continuous current. Developing with dynamic logic can be tricky, however, because of its dependence on timing. (It also has the disadvantage that the output values rapidly leak away, rather than being stable as with static logic.) Dynamic logic is still used in modern CPUs, in the form of domino logic.
Four-phase logic is a specific type of dynamic logic, designed to simplify the design process. Its timing is controlled by four clock signals (below), the source of the name "four-phase".4 In the calculator, these clock signals repeated at 60 kilohertz.
The diagram below shows how an inverter is implemented in four-phase logic. In the first clock phase, φ1 is high causing the capacitor to get charged. In the second clock phase, the gate's value is determined. If the input is 0, the capacitor keeps its previous value (1). But if the input is 1, the capacitor discharges through the lower transistors so the output is 0. Thus, the circuit inverts the input.5 The capacitor holds the output for the remainder of the clock cycle, so the gate also acts as a latch. (This is an important feature of four-phase logic, simplifying many circuits.)
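The precharge-then-evaluate behavior can be modeled at the logic level. The sketch below is a simplification under stated assumptions: it ignores voltages, leakage, and the exact clock waveforms, and the class name `FourPhaseInverter` is invented for illustration.

```python
class FourPhaseInverter:
    """Logic-level model of one four-phase inverter (no voltages or leakage)."""

    def __init__(self):
        self.capacitor = 0  # value currently held on the output capacitor

    def precharge(self):
        # First phase: the capacitor is charged to logical 1.
        self.capacitor = 1

    def evaluate(self, input_bit):
        # Second phase: a 1 input discharges the capacitor;
        # a 0 input leaves the precharged 1 in place.
        if input_bit == 1:
            self.capacitor = 0

    def cycle(self, input_bit):
        self.precharge()
        self.evaluate(input_bit)
        return self.capacitor


inverter = FourPhaseInverter()
outputs = [inverter.cycle(bit) for bit in (0, 1, 1, 0)]
held = inverter.capacitor  # the result stays stored until the next precharge
print(outputs, held)
```

The `held` value illustrates the latch behavior: nothing changes the stored output between evaluations.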
More complex gates are built in a similar manner. For a NAND gate, multiple input transistors are put in series; if all inputs are 1, the capacitor will discharge and the output will be 0. For a NOR gate, multiple input transistors are put in parallel; any 1 input will yield a 0. As will be seen later, complex gates can be created with a mixture of series and parallel transistors.
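The series and parallel arrangements can be captured with one small evaluator. In this sketch (the function name and the list-of-chains representation are assumptions made for illustration), a gate is a list of parallel paths, each path being a chain of series transistors; the precharged output discharges if any path conducts.

```python
def evaluate_gate(paths, inputs):
    """Evaluate a precharged gate.

    paths:  list of parallel discharge paths; each path is a list of
            input names whose transistors are in series.
    inputs: dict mapping input name to 0 or 1.
    """
    output = 1  # precharge: the capacitor starts at logical 1
    for chain in paths:
        if all(inputs[name] for name in chain):
            output = 0  # a fully conducting path discharges the capacitor
            break
    return output


NAND = [["a", "b"]]          # two transistors in series
NOR = [["a"], ["b"]]         # two transistors in parallel
MIXED = [["a", "b"], ["c"]]  # NOT((a AND b) OR c)

for a in (0, 1):
    for b in (0, 1):
        values = {"a": a, "b": b}
        print(a, b, evaluate_gate(NAND, values), evaluate_gate(NOR, values))
```

The same representation covers the mixed series/parallel gates mentioned above without any extra code.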
The gate described above only uses two phases,6 so why four-phase logic? The problem with the above circuit is that if you connect two gates together, during step 2 the output of the first gate will be changing while the second gate is using this value. This could cause the second gate to erroneously discharge, yielding the wrong answer. The solution is for the second gate to wait until the first gate is stable. Specifically, the first gate operates during time periods 1 and 2, while the second gate operates during time periods 3 and 4. The second gate can then be safely connected to another gate operating during time periods 1 and 2. A circuit that alternates the two types of gates will operate safely.7
The diagram below shows how a four-phase inverter appears on the die. The schematic is the same as before, but the circuit is stretched vertically, with a layout that is tall and skinny. The inverter consists of a doped silicon line (green) running vertically, crossed by metal wiring. The gate is implemented by three small transistors. The large capacitor in the middle holds the output voltage. Dynamic logic is often built to use the stray capacitance of the wiring, but this chip uses many large capacitors (perhaps due to leakage or the slow clock speed).
In the next sections, I'll describe how some of the calculator IC's circuits are implemented using four-phase logic.
This chip uses shift registers to convert a serial input signal into a parallel binary value. One shift register is used for the decimal point position input while another shift register handles the digit to be displayed. The basic implementation of the shift register is a chain of inverters with two inverters per stage. Because four-phase logic is clocked, a bit will advance through the two inverters every clock cycle. (One inverter operates during φ1/φ2 and the second inverter during φ3/φ4.) This is an advantage of four-phase logic; standard logic requires a flip-flop at each stage to hold the bits, making the circuit much more complex. Each stage has an additional inverter to output the uncomplemented value. To keep both outputs synchronized, these inverters use special timing, precharging on φ3 and reading on φ1.7
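A logic-level sketch shows why two inverters per stage are enough. The class name and the four-stage length are assumptions for illustration, and the extra output inverters with their special timing are left out.

```python
class FourPhaseShiftRegister:
    """Serial-in, parallel-out shift register with two inverters per stage."""

    def __init__(self, stages=4):
        self.first = [1] * stages   # outputs of the phase-1/2 inverters
        self.second = [0] * stages  # outputs of the phase-3/4 inverters (stored bits)

    def clock(self, serial_in):
        # Phases 1 and 2: each first inverter samples the bit held by the
        # previous stage (or the serial input) and stores its complement.
        sources = [serial_in] + self.second[:-1]
        self.first = [1 - bit for bit in sources]
        # Phases 3 and 4: each second inverter complements again, restoring
        # the polarity and holding the bit for the rest of the cycle.
        self.second = [1 - bit for bit in self.first]
        return list(self.second)


register = FourPhaseShiftRegister(stages=4)
for bit in (1, 0, 1, 1):
    parallel = register.clock(bit)
print(parallel)  # most recent bit first
```

Because each inverter holds its own output, no separate flip-flop is needed to keep a stage from racing through to the next one.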
The diagram below shows how the shift register for the decimal point position is implemented on the die. It shows nine inverters, implemented with 27 transistors. Each vertical green line of doped silicon is one gate, while the white metal wiring is mostly horizontal. Note that this circuitry, just nine gates, takes up a large fraction of the die. While the gates are tightly packed side-to-side, they are very tall, so the die holds just two rows of gates. The density of transistors is very low, with most of the area consumed by wiring. Even so, four-phase logic was considered a dense way of creating gates, since other techniques were even worse. (A couple of years later, microprocessors used an additional layer of polysilicon wiring, which made signal routing much easier and greatly increased the density.)
Examples of transistors and capacitors are indicated on the diagram. At the bottom, the arrow shows one of the connections between two inverters. The short horizontal wire is connected to the inverter on the left, and forms the gate of the inverter on the right. Other wires are longer as they connect inverters to other parts of the circuitry.
The chip converts each digit keypress into a binary encoding, using the NAND gates shown below. The calculator's buttons are magnets, closing reed switches. These switches are connected to the inputs on the right. When a key is pressed, the input goes low and the circuit generates the corresponding 4-bit binary output at the bottom.
Each vertical green line corresponds to a NAND gate. (These gates are tall like the previous ones, but I'm only showing the interesting part.) The interesting thing about the encoder is that the binary representation is visible in the transistor pattern. For instance, the "1" bit output is connected to alternating inputs, while the "4" bit output is activated by keys 4 through 7. The unlabeled lines are used to determine if any key is pressed.
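The encoder's structure can be reproduced with a NAND per output bit. In this sketch, the function name, the output polarity, and the use of a single NAND over all ten lines for the "any key" signal are assumptions; the chip's actual signals may be inverted or timed differently.

```python
def nand(values):
    return 0 if all(values) else 1


def encode_key(pressed):
    """Return (binary value, any_key) for a digit key 0-9, or None for no key."""
    # Key lines idle at 1 and go to 0 when the key is pressed (active-low).
    lines = [0 if digit == pressed else 1 for digit in range(10)]
    value = 0
    for weight in (1, 2, 4, 8):
        # Each output bit is a NAND over the keys whose digit has that bit set,
        # so the bit becomes 1 when any of those keys is pressed.
        inputs = [lines[digit] for digit in range(10) if digit & weight]
        value |= weight * nand(inputs)
    any_key = nand(lines)  # needed because key 0 sets no output bits
    return value, any_key


wiring_1 = [d for d in range(10) if d & 1]  # keys wired to the "1" bit gate
wiring_4 = [d for d in range(10) if d & 4]  # keys wired to the "4" bit gate
print(wiring_1, wiring_4, encode_key(7))
```

The wiring lists match the pattern visible on the die: alternating keys for the "1" bit and keys 4 through 7 for the "4" bit.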
The desktop QT-8D calculator uses an unusual 9-segment display with curved segments, while the handheld EL-8 used an 8-segment display (omitting segment i, which provided a tail on the 4). These produce curved digits, unlike the blocky 7-segment digits seen in most calculators. The zero is particularly unusual: it is half-height. The calculator doesn't suppress leading zeroes, so the half-height zeros are less obtrusive. (1234, for instance, appears as oooo1234.)
The role of the segment decoder is to take a binary value and drive the appropriate segments, labeled a through i. The circuit below is the interesting part of the decoder circuit. The bit values and their complements enter on the right from the shift register. Most of the segments are decoded by AND-NOR gates; an AND-NOR gate consists of several AND terms with the results NOR'd together. An AND-NOR gate is implemented in four-phase logic as a single gate with a separate vertical strip for each AND term. The strips are tied together at the top and bottom so if any strip is activated, the gate is discharged; this provides the NOR action. As a result, the physical structure of the gate maps directly to its logical structure.
The gate for segment f is indicated on the diagram by an arrow. It has two vertical strips, so two AND terms. Studying the transistor connections, this gate implements: bit1 NOR (NOT bit3 AND NOT bit2), where the second term uses the complemented bit3 and bit2 inputs. Evaluating this expression shows that f will be active for the digits 4, 5, 8, and 9. Looking at the display, you can verify that these are the digits that use segment f. Similar expressions are used to generate the other segments. For instance, segment h has four AND terms.
Segment i is activated by a NOR gate, which has two parallel vertical segments with three transistors in between. If any transistor is activated, it will connect the segments and discharge the gate, providing the NOR action. NOR gates are rare on the chip, probably because they require twice the width of a NAND gate. Segment i is NOR(bit0, NOT bit2, bit1), using the complemented bit2 input, so it is activated only for the number 4; this segment provides a short tail on the displayed 4.
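The two segment expressions can be checked by evaluating them for every decimal digit. This sketch assumes that bit0 is the least-significant bit and that the bit3 and bit2 inputs to f's AND term, and the bit2 input to segment i, are the complemented signals; that is the combination that reproduces the digits listed above.

```python
def bits_of(digit):
    """Return (bit0, bit1, bit2, bit3) with bit0 as the least-significant bit."""
    return tuple((digit >> i) & 1 for i in range(4))


def segment_f(digit):
    b0, b1, b2, b3 = bits_of(digit)
    # AND-NOR gate with two strips: bit1 alone, and (NOT bit3 AND NOT bit2).
    return int(not (b1 or ((not b3) and (not b2))))


def segment_i(digit):
    b0, b1, b2, b3 = bits_of(digit)
    # NOR gate with three transistors: bit0, NOT bit2, bit1.
    return int(not (b0 or (not b2) or b1))


f_digits = [d for d in range(10) if segment_f(d)]
i_digits = [d for d in range(10) if segment_i(d)]
print(f_digits, i_digits)
```

Only the ten decimal digits are checked; the values 10 through 15 never occur in BCD, so the gates are free to produce anything for them.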
One of the tasks of this chip is to display the decimal point, which is more complex than you might expect. The decimal point is encoded as a 4-bit value, transmitted serially to the chip. Three bits indicate the position of the decimal point (0 to 7), while the fourth bit enables or disables the decimal point. A shift register (described earlier) converts the serial bits to a 4-bit value. A remarkably complex gate (below) is used to determine when the active digit matches the specified decimal point position. At that time, the decimal point segment is activated, causing the correct decimal point to light up.
The circuit is implemented in four-phase logic as a single gate. The gate can be viewed as an 8-to-1 multiplexer that selects one of the eight digit (D) lines based on the bit value. This gate also includes a latch to hold the multiplexed value. Note that if the digit clock is 0, the AND gate at the bottom will cycle the output value (through an inverter, not shown), holding the value. When the digit clock is 1 (i.e. a digit has been read in), a new value from the multiplexer tree will be read. The branching tree structure is visible in the silicon structures above.
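The multiplexer-plus-latch behavior can be sketched as follows. Several details here are assumptions made for illustration: the class name, the mapping of position 0 to line D1, the LSB-first order of the position bits, and the point at which the enable bit is applied.

```python
class DecimalPointGate:
    """Logic-level model: 8-to-1 multiplexer feeding a latch."""

    def __init__(self):
        self.latched = 0

    def update(self, position_bits, enable, digit_lines, digit_clock):
        """position_bits: (b0, b1, b2), LSB first.
        digit_lines: eight values for D1-D8, one of which is active (1)."""
        if digit_clock == 1:
            # Read a new value from the multiplexer tree.
            b0, b1, b2 = position_bits
            position = b0 | (b1 << 1) | (b2 << 2)
            self.latched = enable & digit_lines[position]
        # With the digit clock at 0, the previous value is simply held.
        return self.latched


gate = DecimalPointGate()
lit = []
for active in range(8):
    digit_lines = [1 if i == active else 0 for i in range(8)]
    lit.append(gate.update((1, 1, 0), 1, digit_lines, digit_clock=1))
print(lit)  # decimal point lights only while digit position 3 is active
```

As the digit scan steps through the eight lines, the output is 1 only for the one digit whose position matches the stored decimal point position.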
I won't describe the remainder of the circuits on the chip in detail. They were implemented using similar techniques, in particular shift registers. The keypress is converted to serial data with a latch and shift register, built from AND-NOR gates. The digit scan circuit is also a latch and shift register, with a gate to start a 1 value. This shift register is triggered by the digit clock, so it shifts every 4 cycles. The circuit that divides the clock by 4 is a shift register to count four cycles.
Although Sharp managed to fit the calculator circuitry onto five chips, it was soon overshadowed by single-chip calculators. In a few years, calculators shrank from the handheld but blocky Sharp EL-8 to credit-card-sized. The calculator market was highly profitable for a short time until the "calculator wars" caused calculator prices to drop from hundreds of dollars to a few dollars. Most of the hundreds of calculator manufacturers left the market, leaving Texas Instruments, Hewlett-Packard, Sharp, and Casio as the dominant manufacturers.
As for four-phase logic, its success peaked in the 1970s. Most notably, the company Four-Phase Systems created a 24-bit desktop computer in 1971 using four-phase logic, and Motorola bought the company in 1982. For the most part, though, microprocessors of the 1970s used static NMOS logic rather than four-phase logic. I haven't been able to find an explanation of why four-phase logic wasn't more widely used. My suspicion is that improvements in semiconductor technology in the early 1970s reduced the benefits of four-phase logic, specifically the introduction of depletion-load NMOS logic.
I plan to analyze the remaining three calculator chips so follow me on Twitter @kenshirriff for updates. I also have an RSS feed. Thanks to François Gueissaz for doing all the hard work of obtaining the calculator ICs, decapping them, and providing me with die photos and other information.
The advantages of four-phase logic are discussed in a talk by Lee Boysel, an early proponent of MOS circuitry and four-phase logic. He founded the company Four-Phase Systems, which built a powerful desktop computer using four-phase logic. His interesting video on MOS history is here. ↩
The calculator display uses vacuum fluorescent display (VFD) tubes, developed as a lower-cost alternative to Nixie tubes to avoid paying patent royalties to Burroughs. Nixie tubes are similar to neon bulbs; there are 10 cathodes, each shaped like a digit, and applying 170 volts to a cathode causes the digit to light up with a neon glow.
The multi-segment VFD was invented in 1967 by Noritake Itron Corp. VFD tubes are vacuum tubes, sort of a cross between a triode and a low-voltage CRT. Unlike the "cold cathode" of Nixie tubes, the VFD's cathode is heated, causing electrons to boil off. These electrons are accelerated toward an anode by applying 25 volts to the anode, and cause a phosphor to light up when they hit the anode. A grid between the cathode and anode controls the electron flow; this is how a single tube is selected for multiplexing. The voltage in a VFD is much lower than a CRT, 25 volts instead of 25,000 volts. Another difference is that a CRT deflects the electron beam with deflection coils to create a pattern on the screen, while the VFD uses individual anodes that light up separately for each segment.
These Sharp calculators were the first calculators to use VFD tubes. The EL-8 calculator uses eight-segment Itron type DG10L tubes while the QT-8D calculator uses nine-segment DG10B tubes. The driver board has nine driver integrated circuits to interface between the calculator chips and the display tubes. ↩
I'm skipping over a bunch of details of the calculator. For instance, some signals are active-high, while others are active-low, and some signals are shifted by half a clock. (The design is optimized to minimize the hardware, rather than being conceptually clean.) In this blog post, I'm describing the concepts of the circuitry rather than the cycle-exact details. ↩
I haven't found many publications explaining four-phase logic. One is the article Four-phase logic is practical (1977). The 1969 master's thesis Basic design of MOSFET, four-phase, digital integrated circuits has a lot of information. The book MOS integrated circuits and their applications (1970) has a chapter on four-phase logic. See also Low-power VLSI implementation by NMOS 4-phase dynamic logic, published at the surprisingly late date of 2000. ↩
Note that the gate is powered only by the clock; there are no power or ground connections. Although the four-phase gates are powered through the clock, the chip does have connections for power (-25V) and ground. Power and ground are used by the output pins so they can provide static signals with more substantial current. Ground is also used for the gate capacitors. ↩
Most of the classic 1970s microprocessors used a two-phase clock. They used dynamic circuitry, typically for temporary data storage and timing, but the logic was typically static. The Intel 8086 used dynamic logic in a few places, such as the ALU, probably for performance reasons. ↩
In most cases, four-phase circuitry alternates between φ1φ2 gates and φ3φ4 gates. A problem arises, however, if one path to a gate has an odd number of gates and another has an even number of gates. The solution is two more types of gates, one that precharges on phase 1 and samples on phase 3, and one that precharges on phase 3 and samples on phase 1. These gates are slower, but can interface between the earlier two types. Thus, four-phase logic has four types of gates, distinguished by the clock phases they use. Following the simple interconnection rules ensures that the circuit operates correctly.
The four types of four-phase gates are illustrated in A mathematical model characterizing four-phase MOS circuits for logic simulation (1968) and Four-phase logic is practical (1977). (I'm pretty sure the second article has some errors in Figure 2 though.)
Only certain combinations of four-phase gates can be connected. The diagram below shows that, for instance, the output from a type 1 gate can connect to the input of a type 2 or type 3 gate. A typical circuit alternates between type 1 and type 3. The calculator chip uses a few type 2 and type 4 gates, for example when an extra inversion is required.
I recently received a vintage display box used by IBM to illustrate the progress of computer technology. This display case, created by IBM Germany1 in 1986, included technologies ranging from vacuum tubes and magnetic core memory to IBM's latest (at the time) memory chips and processor modules. In this blog post, I describe these items in detail and how they fit into IBM's history.
IBM is older than you might expect. It was founded (under the name CTR) in 1911 and produced punched card equipment for data processing, among other things. By the 1930s, IBM was producing complex electromechanical accounting machines for data processing, controlled by plugboards and relays.
The so-called first generation of electronic computers started around 1946 with the use of vacuum tubes, which were orders of magnitude faster than electromechanical systems. Appropriately, the first artifact in the box is an IBM pluggable tube module. The pluggable module combined a vacuum tube along with its associated resistors and capacitors. These modules could be tested before being assembled into the system, and also replaced in the field by service engineers. Pluggable modules were also innovative because they packed the electronics efficiently into three-dimensional space, compared to mounting tubes on a flat chassis.
The pluggable tube module is from an IBM 604 Electronic Calculating Punch (1948). This large machine was not quite a computer, but it could add, subtract, multiply, and divide. It read 100 punch cards per minute, performed operations, and then punched the results onto new punch cards. It was programmed through a plugboard and could perform up to 60 operations per card. The IBM 604 was a popular product, with over 5600 produced. A typical application was payroll, where the 604 could compute various tax rates through multiplication.
The 604 used many different types of tube modules. A typical module implemented an inverter, which could be used in an OR or AND gate.2 The tube module in the display box, however, is a thyratron driver, type MS-7A. The thyratron tube isn't exactly a vacuum tube since it is filled with xenon. This tube acts as a high-current switch; when activated, the xenon ionizes and passes the current. In the 604, thyratron tubes were used to drive relay coils or magnet coils in the card punch.3
Although the 604 wasn't quite a computer, IBM went on to build various vacuum-tube computers in the 1950s. These machines used larger pluggable tube modules that each held 8 tubes.4 The box didn't include one of these modules—probably due to their size—but I've included a photo below because of their historical importance.
With the development of transistors in the 1950s, computers moved into the second generation, replacing vacuum tubes with smaller and more reliable transistors. IBM based its transistorized computers on pluggable cards called Standard Modular System (SMS) cards. These cards were the building block of IBM's transistorized computers including the compact IBM 1401 (1959), and the larger 7000-series mainframe systems. A computer used thousands of SMS cards, manufactured in large numbers by automated machines.
The photo below shows the SMS card from the box.5 The card is a printed circuit board, about the size of a playing card, with components and jumpers on one side and wiring on the back. A typical SMS card had a few transistors and implemented a simple function such as a gate. The cards used germanium transistors in metal cans as silicon transistors weren't yet popular. I've written about SMS cards before if you want more details.
In 1964, IBM introduced the System/360 line of mainframe computers. The revolutionary idea behind System/360 was to use a single architecture for the full circle (360°) of applications: from business to scientific computing, and from low-end to high-end systems. (Prior to System/360, different models of computers had completely different architectures and instruction sets, so each system required its own software.) The System/360 line was highly successful and cemented IBM's leadership in mainframe computers for many years.
Although other manufacturers used integrated circuits for their third generation computers, IBM used modules called SLT (Solid Logic Technology), which were not quite integrated circuits. Each thumbnail-sized SLT module contained a few discrete transistors, diodes, and resistors on a square ceramic substrate. An SLT module was capped with a square metal case, giving it a distinct appearance. Although an SLT module doesn't achieve the integration of an IC, it provides a density improvement over individual components. Each small SLT module was roughly equivalent to a complete SMS card, but much more reliable.7 By 1966, IBM was producing over 100 million SLT modules per year at a cost of 40 cents per module.6
The board below is a logic board using 24 SLT modules. These modules implement AND-OR-INVERT logic gates, the primary logic circuit used in System/360. This board was probably part of the CPU.
The photo below shows the circuitry inside an SLT module. This module has four transistors (the tiny gray squares). SLT modules typically include thick-film resistors, but none are visible in this module.
The box also has an SLT card with analog circuitry (maybe for the computer's core memory or power supply). This card has one SLT module, a simple module that contains four transistors (number 361457). I don't know why this board has so many discrete transistors; perhaps they are higher-power transistors than SLT modules provided.
For a few years, IBM used SLT modules while other computer manufacturers used integrated circuits. Eventually, though, IBM moved to integrated circuits, which they called Monolithic System Technology (MST). An MST module looks like an SLT module from the outside, but inside it contains a monolithic die (i.e. an integrated circuit) rather than the discrete components of SLT. MST was first used in 1969 for the low-end System/3 computer.
The photo above shows the box's MST module. The silicon die is the tiny shiny rectangle in the middle, connected to the 16 pins of the module. The chip was mounted upside down, soldered directly to the substrate. This upside-down mounting is unusual; most other manufacturers used ceramic or plastic packages for integrated circuits, with the silicon die connected to the pins via bond wires.
The box contains a core memory plane; most computers from the 1950s until the early 1970s used magnetic core memory for their main memory.8 This plane holds 8704 bits and is from a System/360 Model 20, the lowest-cost and most popular computer in the System/360 line.9
In core memory, each bit is stored in a tiny magnetized ferrite ring. The ferrite rings are organized into a matrix; by energizing a pair of wires, one bit is selected for reading or writing. Multiple core planes were stacked together to store words of data. Because each bit required a separate ferrite ring, magnetic core memory was limited in scalability. This opened the door for alternative storage approaches.
IBM was an innovator in semiconductor memory and this is reflected in the numerous artifacts in the box that show off memory technology.10 Modern computers use a type of memory chip called DRAM (dynamic RAM), storing each bit in a tiny capacitor. DRAM was invented at IBM in 1966 and IBM continued to make important innovations in semiconductor memory.
Although magnetic core memory was the dominant RAM storage technique in the 1960s, IBM decided in 1968 to focus on semiconductor memory instead of magnetic core. The first computer to use semiconductor chips for its main memory12 was the IBM System/370 Model 145 mainframe (1970). Each chip in that computer held just 128 bits, so a computer might need tens of thousands of these chips.11 Fortunately, memory density rapidly increased, as shown by the dies below. I'll discuss the 2-kilobit chip in detail; my die photos of the others are in the footnotes13.
The photo below shows the 2-kilobit die14 under a microscope. It is a static RAM chip from 1973, not as dense as DRAM since it uses six transistors per bit. The tiny white lines on the chip are the metal layer on top of the silicon, wiring the circuitry together. Around the outside of the die are 26 solder bumps for attaching the chip to the substrate. Note that this chip is mounted upside down ("flip-chip") on the substrate, unlike most integrated circuits that use bond wires. The chip is covered with a protective yellowish film, except where the solder bumps are located.
To increase the density of storage, four of these chips were mounted in a two-layer MST module, yielding an 8-kilobit module. The module in the box (below) has the square metal case removed, showing the silicon dies inside. These memory modules provided the main memory for the IBM System/370 models 115 and 125, as well as the memory expansion for the models 158 and 168 (1972).
Each memory card (below) contained 32 of these modules to provide 32 kilobytes of storage. In the photo below, you can see the double-height memory modules along with shorter modules for support circuitry. A four-megabyte main memory unit held 144 of these cards in a frame about 3 feet × 3 feet × 1 foot, so semiconductor memory was still fairly bulky in 1972.
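The capacity figures above can be checked with a little arithmetic. The sketch below is only a sanity check: the function names are illustrative, it assumes a kilobit is 1024 bits, and it assumes (this is not stated in the text) that the gap between the 144 cards' raw capacity and four megabytes is a ninth parity bit per byte.

```python
KILOBIT = 1024  # bits; assumes binary kilobits

def module_bits(chips_per_module=4, bits_per_chip=2 * KILOBIT):
    """Bits in one two-layer module holding four 2-kilobit chips."""
    return chips_per_module * bits_per_chip

def card_bytes(modules_per_card=32):
    """Raw bytes on one memory card of 32 modules."""
    return modules_per_card * module_bits() // 8

def frame_bytes(cards=144):
    """Raw bytes in a memory unit of 144 cards."""
    return cards * card_bytes()

print(module_bits() // KILOBIT)      # kilobits per module
print(card_bytes() // 1024)          # kilobytes per card
print(frame_bytes() / 2**20)         # raw megabytes in the frame
print(frame_bytes() * 8 / 9 / 2**20) # megabytes if 1 bit in 9 is parity (assumption)
```

Under the parity assumption, the 4.5 raw megabytes in the frame work out to exactly four megabytes of data.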
Moving along to some different memory chips, the box includes two silicon wafers holding memory dies, a 5" wafer and a 4" wafer.
The smaller four-inch wafer (1982) holds 288-kilobit dynamic RAM chips, an unusual size as it isn't a power of 2.15 The explanation is that the chip holds 32 kilobytes of 9-bit bytes (8 + parity). In the die photo, you can see that the memory array is mostly obscured by complex wiring on top of the die. This wiring is due to another unusual part of the chip's design: for the most efficient layout, the memory bit lines have a different spacing from the bit decode lines. As a result, irregular wiring is required to connect the parts of the chip together, forming the pattern visible on top of the chip. Because this die is on the wafer, you can see the alignment marks and test circuitry around the outside of the chip.
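As a quick check of the odd chip size, the sketch below multiplies out 32 kilobytes of 9-bit bytes and confirms the result is not a power of 2. The helper name `is_power_of_two` is illustrative, and a kilobit is assumed to be 1024 bits.

```python
def is_power_of_two(n):
    """True if n is a positive integer with exactly one bit set."""
    return n > 0 and n & (n - 1) == 0

bytes_stored = 32 * 1024        # 32 kilobytes
bits_per_byte = 9               # 8 data bits + 1 parity bit
total_bits = bytes_stored * bits_per_byte

print(total_bits // 1024)             # size in kilobits
print(is_power_of_two(total_bits))    # whether the size is a power of 2
print(is_power_of_two(bytes_stored * 8))  # the same chip without parity
```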
The five-inch wafer holds 1-megabit memory chips16 that were used in the IBM 3090 mainframe17 (1985). This computer used circuit cards with 32 of these chips, providing four megabytes of storage per card, a huge improvement over the 32-kilobyte card described earlier. The 3090 used multiple memory cards, providing up to 256 megabytes of main storage. The die photo below shows how the chip consists of 16 rectangular subarrays, each holding 64 kilobits.
The photo below shows how this die is mounted upside-down on the ceramic substrate with the solder bumps connected to the 23 pins of the module. This module (not part of the box) was used in the IBM PS/2 personal computer.18 The die below looks green, unlike the die above, but that's just due to the lighting.
The photo below compares three memory modules from the technology box. The first module is the 8-kilobit module containing four 2-kilobit chips, described earlier. The second module is a much wider 512-kilobit module, built from four 128-kilobit dies. The third module contains a 1-megabit chip (the one in the 4-chip display, not from the wafer). These megabit modules were used in the IBM 3090 mainframe's secondary storage.
The box contains a segment of a 14" IBM disk platter, used in disk storage systems from minicomputers to mainframes. IBM was a pioneer in hard disks, starting with the IBM RAMAC (1956), which weighed over a ton and held 5 million characters on a stack of 24" platters. IBM switched to 14" platters in 1961 and by 1980 the IBM 3380 disk system held up to 2.5 gigabytes in a large cabinet of 14" platters.19 The 14" platter was also popular in the low-cost, removable disk cartridges (1965) used with many minicomputers. The 14" disk platter was finally replaced by an 11" platter with the introduction of the IBM 3390 disk drive in 1989. Nowadays, laptops typically use 2.5" platters; amazingly, disk capacity kept increasing as disk diameter steeply decreased.
At the time of the box's creation, the 3090 mainframe was IBM's new high-performance computer (below), so the box has several artifacts that show off the technology in this computer. Although the IBM 3090 (1985) had top-of-the-line performance at the time, by 1998 an Intel Pentium II Xeon microprocessor had comparable performance,20 illustrating the remarkable improvements of microprocessor technology.
In 1980, IBM introduced the thermal conduction module (TCM), an advanced way to package integrated circuits at high density, while removing the heat that they generate.21 A TCM starts with a multi-chip module with about 100 high-speed integrated circuits mounted on a ceramic substrate, as shown below. This substrate contains dozens of wiring layers to connect the integrated circuits.22 To remove the heat, the ceramic substrate is packaged in a TCM, which has a metal piston contacting each silicon die. These pistons are surrounded by helium (which conducts heat better than air), and the whole TCM package is water-cooled. Finally, nine TCMs are mounted on a printed circuit board.
This incredibly complex heat-removal system was required because the 3090 used emitter-coupled logic (ECL), the same type of circuitry used in the Cray-1 supercomputer. Although ECL is a very fast logic family, it is also power-hungry and generates much more heat than the MOS transistors used in microprocessors.
The photo above shows the ceramic substrate. Normally, the substrate has 100 silicon dies mounted on it, but this sample has just a single die. The box also includes a cross-section slice of the ceramic substrate (below). This shows the 38 layers of wiring inside the substrate, as well as the pins on the underside.
Each TCM had 1800 pins so it could be plugged into a printed circuit board and connected to the rest of the system. Each board held 9 TCMs and was powered with an incredible 1400 amps. The box includes a PCB sample, showing its multi-layer construction (below), and the dense grid of holes to receive the ceramic substrate.
Finally, here's a nice cutaway of a TCM from the detailed IBM 3090 brochure. At the bottom, it shows the silicon dies mounted on the ceramic substrate. The dies are contacted by the heat sink pistons in the middle. The connections on top are for the cooling water.
This technology exhibit box was created 35 years ago. Looking at it from the present provides a perspective on the history of both IBM and the computer industry. The box's date, 1986, marks the peak of IBM's success and influence,23 right before microcomputers decimated the mainframe market and IBM's dominance. What I find interesting is that the technology box focuses on mainframes and lacks any artifacts from the IBM PC (1981), which ended up having much more long-term impact.24 This neglect of microcomputers reflects IBM's corporate focus on the mainframe market rather than the PC market (which, ironically, IBM created).
In the bigger historical picture, the technology box covers a time of great upheaval as electromechanical accounting machines were replaced by three generations of computers in rapid succession: vacuum tubes, then transistors, and finally integrated circuits. In contrast to this period of rapid change, nothing has replaced integrated circuits over the past 50 years. Instead, integrated circuits have remained, but improved by many orders of magnitude, as described by Moore's Law. (Compared to the room-filling IBM 3090 mainframe, an iPhone has 1000 times the performance and 50 times the RAM.) Will integrated circuits continue their dominance for the next 50 years or will some new technology replace them? It remains to be seen.
The box was apparently created in Stuttgart, Germany. The components are protected by a piece of plexiglass, with labels in German for all the components, such as Mehrschicht-Keramiktrager for multi-layer ceramic substrate. The labels are listed here if you're interested.
The box originally included several German books on computer technology but since they are missing I had to do some research and come up with my own narrative.
For more information on the pluggable tube modules, see the schematics of IBM's pluggable units (which lack the box's MS-7A module). (I suspect the MS-7A was selected for the box because it is more compact than most of the pluggable modules, having one layer of circuitry below the tube, rather than two.) ↩
People sometimes think that an 8-tube module held a byte. This is wrong for two reasons. First, bytes didn't exist back then. IBM's early scientific computers used 36-bit words, while the business computers were based on characters of 6 bits plus parity. Second, 8 tubes didn't correspond to 8 bits because circuits often required multiple tubes. For instance, a tube module could implement three bits of register storage. ↩
SLT was controversial, since other companies used more-advanced integrated circuits rather than hybrid modules. In typical IBM fashion, the vice president in charge of SLT was demoted in 1964, only to be reinstated in 1966 when SLT proved successful. My view is that integrated circuit technology was too immature when the System/360 was released, so IBM's choice to use SLT made the System/360 possible. However, it only took a year before integrated circuits became practical, as shown by their use in competing mainframes. I think IBM stuck with SLT modules longer than necessary. Integrated circuits rapidly increased in complexity (Moore's Law), while SLT modules could only increase density through hacks such as putting resistors on the underside (SLD) and using two layers of ceramic (ASLT). ↩
Curiously, this card is labeled in the box as an MST card, but checking the part numbers shows it has SLT modules. Specifically, it contains the following types of SLT modules (click for details): 361453 AND-OR-Invert, 361454 inverters, 361456 AND-OR-extender, and 361479 inverters. The SLT modules are also documented in IBM's manual.
The schematic above shows one of the SLT modules. (IBM had their own symbol for transistors; T1 is an NPN transistor.) This gate is built from diode-transistor-logic, so it's more primitive than the TTL logic that became popular in the late 1960s. The "Extend" pins are used to connect modules together to build larger gates, so the modules provide a lot of flexibility. This module inconveniently requires three voltages. This SLT module contained one transistor die, three dual-diode dies, and three thick-film resistors. During manufacturing, the resistors were sand-blasted to obtain accurate resistances, an advantage over the inaccurate resistances on integrated circuit dies. ↩
The System/360 line was designed as a single 32-bit architecture for all the models. The Model 20, however, is a stripped-down, 16-bit version of System/360, incompatible with the other machines. (Some people don't consider the Model 20 a "real" System/360 for this reason.) But due to its low price, the Model 20 was the most popular System/360 with more than 7,400 in operation by the end of 1970. ↩
This core memory plane from a System/360 Model 20 is a 128×68 grid. Note that this isn't a power of 2: the plane provided 8192 bits of main memory storage as well as 512 bits for registers. Using the same core plane for memory and registers hurt performance but saved money. The computer used five of these planes to make a 4-kilobyte memory module, or 10 planes for an 8-kilobyte module. For details, see the Model 20 Field Engineering manual. ↩
For an extensive list of references on DRAM chips, see the thesis Impact of processing technology on DRAM sense amplifier design (1990). For a history of memory development at IBM through 1980, from ferrite core to DRAM, see Solid state memory development in IBM. ↩
The System/370 Model 145 was the first computer with semiconductor main memory. Each thumbnail-sized MST module held four 128-bit chips; 24 modules fit onto a 12-kilobit storage card. A shoebox-sized Basic Storage Module held 36 cards, providing 48 kilobytes of storage with parity. By modern standards this storage is incredibly bulky, but it provided twice the density of the magnetic core memory used by contemporary systems. The computer's storage consisted of up to 16 of these boxes in a large cabinet (or two), providing 112 kilobytes to 512 kilobytes of RAM.
IBM had used monolithic memory for special purposes earlier, holding the "storage protect" data in the IBM 360/91 (1966) and providing a memory cache in the System/360 Model 85. ↩
I wasn't able to find exact details on the 64-kilobit, 256-kilobit, and 1-megabit chips from the display, but I took die photos.
The 64-kilobit chip is shown above. The solder balls are the most visible part of the chip. The article A 64K FET Dynamic Random Access Memory: Design Considerations and Description (1980) describes IBM's experimental 64-kilobit DRAM chip, but the chip they describe doesn't entirely match the chip in the box. There were probably some significant design changes between the prototype chip and the production chip.
The 256-kilobit die is shown above. The diagonal lines on the die are similar, but not identical, to the die in A 256K NMOS DRAM (1984). That chip was designed at IBM Laboratories in Böblingen, Germany, and could provide 1, 2, or 4 bits in parallel.
The 1-megabit die is shown above. IBM was the first company to begin volume production of 1-megabit memory chips and the first company to use them in mainframe computers. This chip was used in the IBM 3090 mainframe, but was later replaced by the faster and smaller "second-generation" 1-megabit chip on the 5" wafer. One interesting feature of this die is the "eagle" logo, shown below.
The box includes a 1-megabit MST module (below) that uses this chip. Because the chip's solder balls are along its center, the module omits the center three pins to make room for the connections to the chip.
This memory card and its 2-kilobit chips are described in detail in A High Performance Low Power 2048-Bit Memory Chip in MOSFET Technology and Its Application (1976). These modules were used in the main memory of the IBM System/370 models 115 (1973) and 125 (1972) as well as upgraded memory for the models 158 (1972) and 168 (1972). The IBM System/370 Model 138 (1976) and Model 148 (1976) also used 2K MOSFET chips, presumably the same ones. The 2-kilobit chip was developed at IBM Laboratories in Böblingen, Germany; this may have motivated its inclusion in this German display box.
The closeup of the 2-kilobit die shows some of the decoder circuitry (left) and the storage cells (right). Two solder balls are in the lower left; the rest of the die is covered with a protective yellow film, probably polyimide. Each storage cell consists of six transistors. The chip is built with metal-gate NMOS transistors. ↩
The 288-kilobit chip is described in detail in A 288Kb Dynamic RAM.
The closeup die photo above shows some of the memory cells (at the top and bottom), wired into bit lines. One unusual feature of this chip is that it has redundancy to work around faults. In particular, four redundant word lines can be substituted for faulty ones, by blowing configuration fuses. I think the large boxes with circles in the middle are four of the fuses.
The photo above shows the chip's part number; BTV refers to IBM's Burlington / Essex Junction, VT semiconductor plant where the chip was designed. This plant was acquired by GlobalFoundries in 2015. This photo also shows the complex geometrical wiring, unlike the regular matrix in most memory chips. ↩
Note that there are two 1-megabit chips in the box. The chip on the 4-chip display is an older chip than the one on the 5" wafer. The 1-megabit memory chip on the wafer is described in An Experimental 80-ns 1-Mbit DRAM with Fast Page Operation (1985). It uses a single 5-volt power supply. The chip is structured as four 256-kbit quadrants, each subdivided into four 64-kbit subarrays. It has two redundant bit lines per quadrant for higher yield. The horizontal solder balls through the middle of the chip are the common connections for each quadrant, while the vertical connections along the left and right edges provide the signals specific to each quadrant. This quadrant structure allows the chip to be accessed as 256K×4 or 1M×1. ↩
IBM's overview of the 3090 family provides details on the hardware, including the memory and TCM modules. Page 10 discusses IBM's memory technology as of 1987 and has a picture of their "second generation" 1-megabit chip, which matches the die on the 5" wafer. ↩
The 1-megabit memory chips were used in the IBM 3090 mainframe, but I think the faulty ones were used in the IBM PS/2 personal computer. You can see the unusual metal MST packages on many PS/2 cards. Specifically, if one of the four quadrants in the memory chip had a fault, the memory chip was used as a 3/4-megabit chip. These had four part numbers, depending on the faulty quadrant: 90X0710ESD through 90X0713ESD (ESD probably stands for Electrostatic Sensitive Device). The PS/2 2-megabyte memory card (90X7391) had 24 chips providing 2 megabytes with parity. The board used chips with alternating bad banks so the memory regions fit together. ↩
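The numbers for the PS/2 card are consistent with each chip contributing three good 256-kilobit quadrants. The sketch below checks this. It is a back-of-the-envelope calculation, not a description of how the card is actually wired: the function name is illustrative, and it assumes 9 bits are stored per byte (8 data bits plus parity).

```python
MEGA = 1024 * 1024

def card_capacity_bytes(chips=24, good_quadrants=3, bits_per_byte=9):
    """Data bytes on a card built from partially-good 1-megabit chips."""
    quadrant_bits = MEGA // 4                    # one 256-kilobit quadrant
    total_bits = chips * good_quadrants * quadrant_bits
    return total_bits // bits_per_byte           # parity takes 1 bit in 9 (assumption)

print(card_capacity_bytes() / MEGA)  # → 2.0 (megabytes)
```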
Since several of the artifacts in the box came from the IBM 3090 mainframe, and the 3380 disk system was used with the 3090 mainframe, my suspicion is that the platter is from the 3380 disk system, shown below.
It's difficult to precisely compare different computers, especially since the 3090 supported multiple processors and vector units. I looked at benchmarks from 2001 comparing various computers on a linear algebra benchmark. The IBM 3090 performed at 97 to 540 megaflops for configurations of 1 to 6 processors respectively. An Intel Pentium II Xeon performed at 295 megaflops, a bit faster than the 3-processor IBM 3090. To compare clock speeds, the IBM 3090 ran at 69 MHz, while the Pentium ran at 450 MHz. An IBM 3090 cost $4 million while a Pentium II system was $7,000 to $20,000. The IBM 3090 came with 64 to 128 megabytes of RAM while people complained about the Pentium II's initial 512-megabyte limit. The point of this is that while the IBM 3090 was a powerful mainframe in 1985, microprocessors caught up in about 13 years, thanks to Moore's Law. ↩
The table below compares characteristics of the Thermal Conduction Modules used in the IBM 3081 (1980), IBM 3090 (1985), and IBM S/390 (1990) computers. The board-level technology progressed similarly. For instance, a 3081 board took up to 500 amps, while a 3090 board took 1400 amps, and an S/390 board took 3400 amps.
The IBM 4300-series processors (1979) used a ceramic multi-chip module that held 36 chips, but it used an aluminum heat sink and air cooling instead of the more complex water-cooled TCM. The IBM 4381's smaller multi-chip module is often erroneously called a TCM by online articles, but it's a multilayer ceramic multichip module (MLC MCM). For more information about IBM's chip packaging, see this detailed web page. ↩
Desktop computer sales first exceeded mainframe computer sales in 1984. Counting the number of employees, IBM peaked in 1985 and declined until 1994 (source). 1985 was also a peak year for IBM's revenue and profits, according to The Decline and Rise of IBM. By 1991, IBM's problems were discussed by the New York Times. After heavy losses, IBM regained profitability and growth in the 1990s, but never regained its dominance of the computer industry. ↩
Perhaps one reason that the technology box ignores IBM's personal computers is that these computers didn't contain IBM-specific hardware that they could show off: Intel built the 80x86 processor, while companies such as Texas Instruments built the memory and support integrated circuits. The lack of IBM-specific technology in these personal computers is one factor that led to IBM losing control of the PC-compatible market. ↩
The quartz oscillator is an important electronic circuit, providing highly-accurate timing signals at a low cost. A quartz crystal has the special property of piezoelectricity, changing its electrical properties as it vibrates. Since a crystal can be cut to vibrate at a very precise frequency, quartz oscillators are useful for many applications. Quartz oscillators were introduced in the 1920s and provided accurate frequencies for radio stations. Wristwatches were revolutionized in the 1970s by the use of highly-accurate quartz oscillators. Computers use quartz oscillators to generate their clock signals, from ENIAC in the 1940s to modern computers.1
A quartz crystal requires additional circuitry to make it oscillate, and this analog circuitry can be tricky to design. In the 1970s, crystal oscillator modules became popular, combining the quartz crystal, an integrated circuit, and discrete components into a compact, easy-to-use module. Curious about the contents of these modules, I opened one up and reverse-engineered the chip inside. In this blog post, I discuss how the module works and examine the tiny CMOS integrated circuit that runs the oscillator. There's more happening in the module than I expected, so I hope you find it interesting.
I examined the oscillator module from an IBM PC card.2 The module is packaged in a rectangular 4-pin metal can that protects the circuitry from electrical noise. (It is the "Rasco Plus" rectangular can on the right, not the square IBM integrated circuit.) This module produced a 4.7174 MHz clock signal, as indicated by the text on the package.
I cut open the can to reveal the hybrid circuitry inside. I was expecting a gem-like quartz crystal inside, but found that oscillators use a very thin disk of quartz. (I damaged the crystal while opening the package, so the upper part is missing.) The quartz crystal is visible on the left, with metal electrodes attached to either side of the crystal. The electrodes are attached to small pegs, raising the crystal above the surface so it can oscillate freely.
On the right side of the module is a tiny CMOS integrated circuit die. It is mounted on the ceramic substrate and connected to the circuitry by tiny golden bond wires. A surface-mount capacitor (3 nF) and a film resistor (10Ω) on the substrate filter out noise from the power pin.
The photo below shows the tiny integrated circuit die under a microscope, with the pads and main functional blocks labeled. The brownish-green regions are the silicon that forms the integrated circuit. A metal layer (yellowish white) wires up the components of the IC. Below the metal, reddish polysilicon implements transistors, but it is mostly obscured by the metal layer. Around the outside of the chip, bond wires are connected to pads, wiring the chip to the rest of the oscillator module. Two pads (select and disable) are left unconnected. The chip was manufactured by Motorola, with a 1986 date. I couldn't find any information on the part number SC380003.
The IC has two functions. First, its analog circuitry drives the quartz crystal to produce oscillations. Second, the IC's digital circuitry divides the frequency by 1, 2, 4, or 8, and produces a high-current clock output signal. (The division factor is selected by the two select pins on the IC.)
The oscillator is implemented with a circuit (below) called a Colpitts oscillator, which is more complex than the usual quartz oscillator circuit.43 The basic idea is that the crystal and the two capacitors oscillate at the desired frequency. The oscillations would rapidly die out, however, except for the feedback boost from the drive transistor.
In more detail, as the voltage across the crystal increases, the transistor turns on, feeding current into the capacitors and boosting the voltage across the capacitors (and thus the crystal). But as the voltage across the crystal decreases, the transistor turns off and the current sink (circle with arrow) pulls current out of the capacitors, reducing the voltage across the crystal. Thus, the feedback from the drive transistor strengthens the crystal's oscillations to keep them going.
The bias voltage and current circuits are an important part of this circuit. The bias voltage sets the drive transistor's gate midway between "on" and "off", so the voltage oscillations on the crystal will turn it on and off. The bias current is set midway between the drive transistor's on and off currents so the current flowing in and out of the capacitors balances out.5 (I'm saying "on" and "off" for simplicity; the signal will be a sine wave.)
A large part of the integrated circuit is occupied by five capacitors. One is the upper capacitor in the schematic, three are paralleled to form the lower capacitor in the schematic, and one stabilizes the voltage bias circuit. The die photo below shows one of the capacitors after dissolving the metal layer on top. The red and green region is polysilicon, which forms the upper plate of the capacitor, along with the metal layer. Underneath the polysilicon, the pinkish region is probably silicon nitride, forming the insulating dielectric layer. The doped silicon (not visible underneath) forms the bottom plate of the capacitor.
Curiously, the capacitors aren't connected together on the chip, but are connected to three pads that are wired together by bond wires. Perhaps this provides flexibility; the capacitance in the circuit can be modified by omitting the wire to a capacitor.
The right side of the chip contains digital circuitry to divide the crystal's output frequency by 1, 2, 4, or 8. This lets the same crystal provide four different frequencies. The divider is implemented by three flip-flops in series. Each one divides its input pulses by 2. A 4-to-1 multiplexer selects between the original clock pulses, or the output from one of the flip-flops. The choice is made through the wiring to the two select pads on the right side of the die, fixing the ratio at manufacturing time. Four NAND gates (along with inverters) are used to decode these pins and generate four control signals to the multiplexer and flip-flops.
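The divider chain can be modeled in a few lines. The sketch below is a behavioral simulation, not the chip's actual gate-level circuit: it treats each flip-flop as toggling on the rising edge of its input, and the mapping from the `select` value (0 to 3) to the division ratio is an assumption, since the real pin encoding isn't given here.

```python
def toggle_stage(signal):
    """One flip-flop: output toggles on each rising edge of the input."""
    q, prev, out = 0, 0, []
    for level in signal:
        if prev == 0 and level == 1:
            q ^= 1
        out.append(q)
        prev = level
    return out

def divided_clock(clock, select):
    """Return the clock divided by 2**select (select = 0..3, assumed encoding)."""
    taps = [list(clock)]              # tap 0: the undivided clock
    for _ in range(3):                # three flip-flops in series
        taps.append(toggle_stage(taps[-1]))
    return taps[select]               # the 4-to-1 multiplexer

def rising_edges(signal):
    """Count 0-to-1 transitions, treating the level before the first sample as 0."""
    return sum(1 for a, b in zip([0] + signal[:-1], signal) if a == 0 and b == 1)

clock = [0, 1] * 16                   # 16 clock pulses
for select in range(4):
    print(2 ** select, rising_edges(divided_clock(clock, select)))
```

Each stage halves the number of pulses, so 16 input pulses become 16, 8, 4, or 2 output pulses depending on the selected tap.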
The chip is built with CMOS logic (complementary MOS), which uses two types of transistors, NMOS and PMOS, working together. The diagram below shows how an NMOS transistor is constructed. The transistor can be considered a switch between the source and drain, controlled by the gate. The source and drain (green) consist of regions of silicon doped with impurities to change its semiconductor properties; these regions are called N+ silicon. The gate consists of a special type of silicon called polysilicon, separated from the underlying silicon by a very thin insulating oxide layer. The NMOS transistor turns on when the gate is pulled high.
A PMOS transistor has the opposite construction from NMOS: the source and drain consist of P+ silicon embedded in N silicon. The operation of a PMOS transistor is also opposite from the NMOS transistor: it turns on when the gate is pulled low. Typically PMOS transistors pull the drain (output) high, while NMOS transistors pull the drain low. In CMOS, the transistors act in a complementary fashion, pulling the output high or low as needed.
The diagram below shows how a NAND gate is implemented in CMOS. If an input is 0, the corresponding PMOS transistor (top) will turn on and pull the output high. But if both inputs are 1, the NMOS transistors (bottom) will turn on and pull the output low. Thus, the circuit implements the NAND function.
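The NAND gate's behavior can be expressed as a switch-level model: two PMOS transistors in parallel form the pull-up network and two NMOS transistors in series form the pull-down network. The sketch below is an idealized logical model with illustrative names; it ignores voltages, timing, and everything analog.

```python
def cmos_nand(a, b):
    """Switch-level model of a 2-input CMOS NAND gate (inputs are 0 or 1)."""
    # PMOS transistors conduct when their gate is low; they are in parallel,
    # so either one can pull the output high.
    pull_up = (a == 0) or (b == 0)
    # NMOS transistors conduct when their gate is high; they are in series,
    # so both must conduct to pull the output low.
    pull_down = (a == 1) and (b == 1)
    # Complementary operation: exactly one network conducts at a time.
    assert pull_up != pull_down
    return 1 if pull_up else 0

for a in (0, 1):
    for b in (0, 1):
        print(a, b, cmos_nand(a, b))
```

The internal assertion captures the "complementary" property: for every input combination, exactly one of the two networks conducts, so the output is never floating or shorted.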
The diagram below shows how a NAND gate appears on the die. The transistors have complex, meandering shapes, unlike the rectangular layouts that appear in textbooks. The left side holds the PMOS transistors, while the right side holds the NMOS transistors. The polysilicon that forms the gates is the slightly reddish wiring on top of the silicon. Most of the underlying silicon is doped, making it conductive and slightly darker than the non-conductive undoped silicon along the left and right edges and in the center. For this photo, the metal layer was removed wi