Design of a 16-bit Non-pipelined RISC CPU in a Two Phase Drive Adiabatic Dynamic CMOS Logic

Yasuhiro Takahashi, Member, IACSIT, Toshikazu Sekine, and Michio Yokoyama

Abstract—We propose a design of a 16-bit RISC CPU core using an adiabatic logic called two phase drive adiabatic dynamic CMOS logic (2PADCL). The proposed adiabatic RISC CPU is non-pipelined with a latency of three cycles and consists of six blocks: an arithmetic and logic unit (ALU), a program counter, a register file, an instruction decoder unit, a multiplexer and a clock control unit. Using SPICE simulation, the 2PADCL CPU was evaluated with a 0.35 μm standard CMOS library and compared with a CMOS CPU. The simulation results show that the power consumption of the adiabatic CPU is about 1/4 of that of the CMOS CPU.

Index Terms—Adiabatic logic, RISC, CPU, non-pipelined.

I. INTRODUCTION

As operating frequencies and circuit densities have increased, energy dissipation and power flux have become problematic in a wide variety of digital devices, ranging from small portable systems (e.g., laptops and personal digital assistants), where battery size, weight and operational life are critical, to large computing machines. Power consumption and dissipation within digital electronic devices are largely attributable to switching activities occurring within the components of such devices. In recent years, adiabatic switching has been proposed as a method of reducing switching activities [11]–[9]. In [8] and [9], we proposed a new topology for the adiabatic dynamic circuit, which is called two phase drive adiabatic dynamic CMOS logic (2PADCL). The 2PADCL achieves ultra-low energy dissipation by restricting current to flow across devices with a low voltage drop and by recycling the energy stored in internal capacitors.

In this paper, we describe a design of a reduced instruction set computer (RISC) CPU using 2PADCL circuit technology. The basis of the 2PADCL is presented in Section II. In Section III, we design a 16-bit 2PADCL RISC CPU. The proposed CPU is non-pipelined with a latency of three cycles. The CPU consists of six blocks: an arithmetic and logic unit (ALU), a program counter (PC), a register file (REG), an instruction decoder unit (IDU), a multiplexer (MUX), and a clock control unit (CCU). Section IV compares the performance of the proposed adiabatic CPU with that of the static CMOS CPU. The conclusions are summarized in Section V.

II. ADIABATIC LOGIC

A. Conventional vs. Adiabatic Switching

Conventional switching can be understood by using a simple CMOS inverter. The CMOS inverter can be considered to consist of pull-up and pull-down networks connected to a load (or internal) capacitance $C$. The pull-up and pull-down networks are actually MOS transistors in series with the same load $C$. Both transistors can be modeled by an ideal switch in series with a resistor which is equal to the corresponding channel resistance of the transistor in the saturation mode, as shown in Fig. 1(a). When a conventional CMOS inverter is set into a logical “1” state, a charge $Q = CV_{DD}$ is delivered to the load and the energy which the supply delivers is $E_{\text{applied}} = QV_{DD} = CV_{DD}^2$. The energy stored in the load $C$ is half of the supplied energy:

$$E_{\text{stored}} = \frac{1}{2}CV_{DD}^2.$$  (1)

The same amount of energy is dissipated during the discharge process in the NMOS pull-down network, because no energy can enter the ground rail ($Q \times V_{\text{gnd}} = Q \times 0 = 0$).

From the energy conservation law, a conventional CMOS logic emits heat and, in this way, it wastes energy in every charge-discharge cycle:

$$E_{\text{total}} = E_{\text{charge}} + E_{\text{discharge}} = \frac{1}{2}CV_{DD}^2 + \frac{1}{2}CV_{DD}^2 = CV_{DD}^2.$$  (2)

If the logic is driven at a frequency $f = 1/T$, where $T$ is the period of the signal, then the power of the CMOS gate is given by:

$$P_{\text{total}} = \frac{E_{\text{total}}}{T} = CV_{DD}^2 f.$$  (3)
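As a rough numerical illustration of Eqs. (2) and (3), the following sketch evaluates the switching energy and power of a single node. The load capacitance, supply voltage and frequency are assumed values chosen only for illustration; they are not taken from the design described here.

```python
def cmos_switching_energy(c_load, v_dd):
    """Energy dissipated per charge-discharge cycle, Eq. (2): C * V_DD^2."""
    return c_load * v_dd ** 2


def cmos_switching_power(c_load, v_dd, freq):
    """Average switching power, Eq. (3): C * V_DD^2 * f."""
    return cmos_switching_energy(c_load, v_dd) * freq


# Assumed illustrative values (hypothetical, not from the paper).
C_LOAD = 10e-15  # load capacitance in farads (10 fF)
V_DD = 3.3       # supply voltage in volts
FREQ = 10e6      # switching frequency in hertz (10 MHz)

e_total = cmos_switching_energy(C_LOAD, V_DD)
p_total = cmos_switching_power(C_LOAD, V_DD, FREQ)
print(f"E_total = {e_total * 1e15:.1f} fJ per cycle")
print(f"P_total = {p_total * 1e6:.3f} uW")
```

With these assumed values, half of `e_total` is dissipated while charging the node and the other half while discharging it.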

The main idea of adiabatic switching, shown in Fig. 1(b), is that transitions are made sufficiently slow so that heat is not emitted significantly. This is made possible by replacing the DC power supply by a resonance LC driver, an oscillator, a clock generator, etc. If a constant current source delivers the charge $Q = CV_{DD}$ during the time period $\Delta T$, the energy dissipation in the channel resistance $R$ is given by

$$E_{\text{diss}} = \xi \frac{R Q^2}{\Delta T} = \xi \left( \frac{CV_{DD}}{\Delta T} \right)^2 R \Delta T,$$

where $\xi$ is a shape factor which depends on the shape of the clock edges. It takes on the minimum value $\xi = 1$ if the load capacitor is charged by a constant (DC) current. For a sinusoidal current, $\xi = \pi^2 / 8 \approx 1.23$. The above equation indicates that when the charging period $\Delta T$ is infinitely long, in theory, the energy dissipation is reduced to zero. This is called adiabatic switching [1].
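The dependence on $\Delta T$ can be made concrete with a short numerical sketch. It evaluates the dissipation $\xi (CV_{DD}/\Delta T)^2 R \Delta T$ for several charging periods and compares it with the $\frac{1}{2}CV_{DD}^2$ lost in one conventional charging event. The component values are assumptions chosen for illustration only, and the expression is meaningful only when $\Delta T$ is much longer than $RC$.

```python
import math


def adiabatic_dissipation(c_load, v_dd, r_channel, delta_t, xi=1.0):
    """Energy dissipated in R while charging C over delta_t:
    xi * (C * V_DD / delta_t)^2 * R * delta_t."""
    current = c_load * v_dd / delta_t  # average charging current
    return xi * current ** 2 * r_channel * delta_t


XI_SINE = math.pi ** 2 / 8  # shape factor for a sinusoidal current

# Assumed illustrative values (hypothetical, not from the paper).
C_LOAD = 10e-15    # 10 fF
V_DD = 3.3         # volts
R_CHANNEL = 10e3   # 10 kOhm channel resistance, so R*C = 0.1 ns

conventional = 0.5 * C_LOAD * V_DD ** 2  # loss of one conventional charging event

ratios = []
for delta_t in (1e-9, 10e-9, 100e-9):
    e_diss = adiabatic_dissipation(C_LOAD, V_DD, R_CHANNEL, delta_t, XI_SINE)
    ratios.append(e_diss / conventional)
    print(f"dT = {delta_t * 1e9:5.0f} ns: E_diss / (C*V^2/2) = {ratios[-1]:.4f}")
```

Each tenfold increase in the charging period reduces the dissipation by a factor of ten, which is the trade-off between speed and energy that adiabatic logic exploits.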

B. Proposed 2PADCL

The 2PADCL inverter is shown at the top of Fig. 2(a); it is operated with complementary phases of the power supply signals. The supply waveform consists of two modes, ‘evaluation’ and ‘hold,’ as shown at the bottom of Fig. 2(a). Let us consider the adiabatic mode. When $V_p$ and $V_p'$ are in evaluation mode, there are conducting path(s) in either the PMOS devices or the NMOS devices. The output node may go from low to high, go from high to low, or remain unchanged, which resembles a CMOS circuit. Thus, there is no need to restore the node voltage to 0 (or $V_{DD}$) every cycle. When $V_p$ and $V_p'$ are in hold mode, the output node holds its value even though $V_p$ and $V_p'$ are changing. This can be seen by observing the function of the diodes and the fact that the inputs of a gate have a different phase from the output. Circuit nodes therefore do not necessarily charge and discharge every clock cycle, which reduces the node switching activity substantially, as shown in Fig. 2(b).
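The effect on switching activity can be illustrated with a simple counting model. This is only a behavioral sketch under stated assumptions, not a circuit simulation: the first function models a hypothetical logic style whose output node is restored to 0 after every evaluation, and the second models a node that holds its value between evaluations, as described above.

```python
def transitions_with_restore(values):
    """Node is restored to 0 after every evaluation, so each logical 1
    costs one charging and one discharging transition."""
    return 2 * sum(values)


def transitions_with_hold(values, initial=0):
    """Node holds its value between evaluations, so it toggles only
    when the evaluated value differs from the held one."""
    count = 0
    previous = initial
    for value in values:
        if value != previous:
            count += 1
        previous = value
    return count


# Hypothetical sequence of evaluated output values, one per clock cycle.
outputs = [1, 1, 1, 0, 0, 1, 1, 1]

print("restore every cycle:", transitions_with_restore(outputs))
print("hold between cycles:", transitions_with_hold(outputs))
```

For output sequences that change rarely, the holding node makes far fewer transitions, which is the reduction in node switching activity the text refers to.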

![Fig. 1 RC tree model.](image1)

![Fig. 1 RC tree model.](image2)

![Fig. 1 RC tree model.](image3)

... has only been shown in simulation. As with the inverter, the energy consumption of an adiabatic circuit can be estimated. The energy dissipation in the transistors, from which the energy dissipation of the 2PADCL is calculated after taking the clock timing into account, is as follows:

$$E_{\text{2PADCL}} = \frac{0.11 \times \xi}{0.8} = \frac{0.11 \times \pi^2 / 8}{0.8} \approx 0.17 \text{ pJ/cycle}.$$

III. DESIGN OF 16-BIT RISC CPU

A. Architecture

The architecture of the proposed CPU is a three-cycle non-pipelined implementation. It is characterized by a RISC-typical, uniform 16-bit instruction format [10]. It has a load/store architecture, i.e., communication with main memory is accomplished only by the LOAD and STORE instructions. Operations are performed only on registers, not on memory locations. The bus protocol is designed for static RAMs, as the CPU is a classical von Neumann architecture with just one common memory bus for instructions and data.

To fit within a limited silicon area, the 2PADCL CPU supports only a subset of the instruction set, comprising 26 instructions as shown in Table I. We reduce the instruction width to 16 bits and the data-path width to 8 bits. Both the op-code and the function code are also reduced to 5 bits. The proposed CPU consists of six blocks: ALU, PC, REG, IDU, MUX and CCU. The data-path of the proposed CPU is explained as follows:

1) ALU

The arithmetic and logic unit (ALU) performs arithmetic and logic operations as well as rotations and shifts by a variable distance. The proposed ALU contains three sub-modules, ARITHMETIC, LOGIC, and SHIFT, as shown in Fig. 4. In module ARITHMETIC, additive ALU operations are computed, including the arithmetic carry and overflow flags. Module LOGIC performs logical operations. In SHIFT, rotation and shift operations are executed and the shift carry flag is computed.
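The division of the ALU into three sub-modules can be sketched as a behavioral model. This is not the authors' Verilog-HDL; it is a minimal illustration that assumes a 16-bit word (matching the register width), treats the carry of a subtraction as a borrow flag, and defines the shift carry as the last bit moved out of the word. All function names and flag conventions are assumptions.

```python
WIDTH = 16                 # assumed word width
MASK = (1 << WIDTH) - 1
SIGN = 1 << (WIDTH - 1)


def arithmetic(op, d, s):
    """ARITHMETIC: add or subtract with carry and signed-overflow flags."""
    if op == "ADD":
        raw = d + s
        result = raw & MASK
        carry = raw > MASK
        # Overflow: operands share a sign that differs from the result's sign.
        overflow = bool(~(d ^ s) & (d ^ result) & SIGN)
    elif op == "SUB":
        raw = d - s
        result = raw & MASK
        carry = raw < 0  # treated here as a borrow flag (assumption)
        # Overflow: operands differ in sign and the result's sign differs from d.
        overflow = bool((d ^ s) & (d ^ result) & SIGN)
    else:
        raise ValueError(op)
    return result, {"carry": carry, "overflow": overflow}


def logic(op, d, s):
    """LOGIC: bitwise operations; no flags are produced."""
    results = {"AND": d & s, "OR": d | s, "XOR": d ^ s}
    return results[op], {}


def shift(op, x, n=1):
    """SHIFT: shift or rotate by n positions (1 <= n < WIDTH)."""
    if not 1 <= n < WIDTH:
        raise ValueError("distance out of range")
    if op == "SL":
        result, carry = (x << n) & MASK, (x >> (WIDTH - n)) & 1
    elif op == "SR":
        result, carry = x >> n, (x >> (n - 1)) & 1
    elif op == "RL":
        result = ((x << n) | (x >> (WIDTH - n))) & MASK
        carry = result & 1
    elif op == "RR":
        result = ((x >> n) | (x << (WIDTH - n))) & MASK
        carry = (result >> (WIDTH - 1)) & 1
    else:
        raise ValueError(op)
    return result, {"shift_carry": bool(carry)}


def alu(op, d, s=0, n=1):
    """Dispatch an operation to the matching sub-module."""
    if op in ("ADD", "SUB"):
        return arithmetic(op, d, s)
    if op in ("AND", "OR", "XOR"):
        return logic(op, d, s)
    return shift(op, d, n)


print(alu("ADD", 0x7FFF, 1))
print(alu("RL", 0x8001))
```

Operands are assumed to be unsigned integers in the range 0 to `MASK`; the overflow flag interprets them as two's-complement values.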

2) PC

The program counter (PC) is a 16-bit latch that holds the memory address from which the next machine language instruction will be fetched. The address at the input of the PC is written into the PC on a leading edge of its write clock. The output of the PC can be used as the read/write address for accessing main memory. The proposed PC is the largest sub-block and second to the control unit in complexity. It has an 8-bit register in a master-slave configuration and performs only two functions: incrementing and loading. For most instructions, the PC is simply incremented in preparation for the following instruction or following instruction nibbles.
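The two PC functions and the master-slave behavior can be sketched as follows. This is a behavioral illustration only: the class name, the method names and the default 16-bit width are assumptions, and the master-slave pair is modeled simply as a pending value that becomes visible on the next write clock.

```python
class ProgramCounter:
    """Behavioral sketch of a PC that can only increment or load."""

    def __init__(self, width=16):
        self.mask = (1 << width) - 1
        self.value = 0   # slave stage: the address visible at the output
        self._next = 0   # master stage: the address waiting to be written

    def increment(self):
        """Prepare the address of the following instruction."""
        self._next = (self.value + 1) & self.mask

    def load(self, address):
        """Prepare a new address, e.g. a branch target."""
        self._next = address & self.mask

    def clock(self):
        """Leading edge of the write clock: the pending address becomes visible."""
        self.value = self._next


pc = ProgramCounter()
pc.increment()
pc.clock()
print(pc.value)
```

The output changes only when `clock()` is called, so a value prepared by `increment()` or `load()` does not disturb the address currently being used for a memory access.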

3) REG

The register file consists of eight 16-bit general-purpose registers. It is fully visible to the programmer. Register addresses are 3 bits (1 bit for future file extensions). The register file has a read/write port and an independent read port. Two accesses are possible per port per clock cycle.
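A register file with one read/write port and one independent read port can be modeled in a few lines. The class and method names below are hypothetical, and timing (the number of accesses per clock cycle) is not modeled.

```python
class RegisterFile:
    """Behavioral sketch: eight 16-bit registers, port A read/write, port B read-only."""

    def __init__(self, count=8, width=16):
        self.mask = (1 << width) - 1
        self.regs = [0] * count

    def read_a(self, address):
        """Read through the read/write port."""
        return self.regs[address]

    def write_a(self, address, value):
        """Write through the read/write port; the value is truncated to the register width."""
        self.regs[address] = value & self.mask

    def read_b(self, address):
        """Read through the independent read port."""
        return self.regs[address]


reg = RegisterFile()
reg.write_a(3, 0x1234)
print(hex(reg.read_a(3)), hex(reg.read_b(3)))
```

Two read paths allow both source operands of an instruction such as ADD d, s to be supplied to the ALU at once.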

4) IDU

Our instruction set is simple yet comprehensive. Since the op-code is only 5 bits wide, we decided to keep the number of supported instructions within 32 for easier implementation. The instruction set is summarized in Table I.
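A decoder for the instructions listed in Table I can be sketched as a simple lookup. The binary codes below are copied from Table I; treating their five leading bits as the op-code field is an assumption made for this illustration, as are the names `TABLE_I`, `OPCODES` and `decode`.

```python
# Binary codes as listed in Table I.
TABLE_I = {
    "MOV": "000000000000",
    "AND": "000010000000",
    "OR":  "000100000000",
    "XOR": "000110000000",
    "ADD": "001000000000",
    "SUB": "001010000000",
    "SL":  "010000000000",
    "RL":  "010100000000",
    "SR":  "010010000000",
    "RR":  "010110000000",
    "SWP": "011000000000",
    "LHI": "111010000000",
    "LLI": "111100000000",
}

OPCODE_BITS = 5  # assumed position of the op-code: the five leading bits

# Map each 5-bit op-code to its mnemonic.
OPCODES = {code[:OPCODE_BITS]: name for name, code in TABLE_I.items()}


def decode(code):
    """Return the mnemonic whose op-code matches the leading bits of `code`."""
    opcode = code[:OPCODE_BITS]
    if opcode not in OPCODES:
        raise ValueError(f"unknown op-code {opcode}")
    return OPCODES[opcode]


print(decode("001000000000"))
print(len(OPCODES), "distinct op-codes out of", 2 ** OPCODE_BITS, "possible")
```

Under this assumption every listed instruction has a distinct 5-bit op-code, which is consistent with the limit of 32 instructions.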

5) MUX

The proposed CPU includes two multiplexers, MUX1 and MUX2. A 4-to-1 multiplexer (MUX1) selects the instruction for the IDU in the next clock cycle. The selection depends on whether there is a cache hit or a memory access. If the REG does not report a hit but the PC does, the instruction is taken from the PC. If the REG reports a hit, the delay instruction from the REG is transferred to the IDU. In all remaining cases, the last instruction, which comes from the IDU, is taken again.

A 2-to-1 multiplexer (MUX2) selects the address of the instruction to be fetched in the next clock. The selection depends on whether a new step begins, whether a PC value from is to be taken, whether a branch has to be corrected, or whether a branch was detected.

6) Clock Control Unit

Efficient phase scheduling is required to optimize the throughput and the energy consumption of the adiabatic CPU system. In this paper, we propose a clock control unit (CCU) which is tasked with efficient phase scheduling. The proposed CCU operates in the following three steps, as shown in Fig. 5.

- Clock (a): The instruction decode stage ID decodes the instruction and the execute stage EX executes the ALU operation, if the IDU is active at the rising clock edge.
- Clock (b): The write-back stage WB writes the results into the register file.
- Clock (c): In the instruction fetch stage IF, the instruction is fetched from the memory address given by the PC if the REG is enabled. The PC then points to the next instruction, and the loaded instruction is transferred to the next stage ID.
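The three-step schedule above can be sketched as a loop that spends three clocks on each instruction. This is a simplified behavioral model under stated assumptions: instructions are written as hypothetical `(mnemonic, d, s)` tuples rather than binary words, only three register-to-register operations are modeled, and the enable conditions on the IDU and REG are omitted.

```python
MASK = 0xFFFF  # assumed 16-bit registers

# Register-to-register operations used in this sketch.
OPS = {
    "MOV": lambda d, s: s,
    "ADD": lambda d, s: (d + s) & MASK,
    "SUB": lambda d, s: (d - s) & MASK,
}


def run(program, regs):
    """Run a program with the three-clock schedule; return the clock count."""
    pc = 0
    clocks = 0
    # An initial fetch primes the ID stage before the first clock (a).
    instr = program[pc] if program else None
    pc += 1
    while instr is not None:
        # Clock (a): ID decodes the instruction, EX executes the ALU operation.
        op, d, s = instr
        result = OPS[op](regs[d], regs[s])
        clocks += 1
        # Clock (b): WB writes the result into the register file.
        regs[d] = result
        clocks += 1
        # Clock (c): IF fetches the next instruction and the PC advances.
        instr = program[pc] if pc < len(program) else None
        pc += 1
        clocks += 1
    return clocks


registers = [0] * 8
registers[1] = 5
registers[2] = 7
program = [("MOV", 0, 1), ("ADD", 0, 2), ("SUB", 0, 1)]

total_clocks = run(program, registers)
print("r0 =", registers[0], "after", total_clocks, "clocks")
```

Because the stages never overlap, each instruction occupies all three clocks, which is what gives the non-pipelined CPU its latency of three cycles.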


TABLE I: LIST OF THE INSTRUCTION SET IN 2PADCL CPU

<table>
<thead>
<tr>
<th>Instruction</th>
<th>Binary Code</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>MOV</td>
<td>000000000000</td>
<td>MOV d, s → s</td>
</tr>
<tr>
<td>AND</td>
<td>000010000000</td>
<td>AND d, s → d &amp; s</td>
</tr>
<tr>
<td>OR</td>
<td>000100000000</td>
<td>OR d, s → d | s</td>
</tr>
<tr>
<td>XOR</td>
<td>000110000000</td>
<td>XOR d, s → d ⊕ s</td>
</tr>
<tr>
<td>ADD</td>
<td>001000000000</td>
<td>ADD d, s → d + s</td>
</tr>
<tr>
<td>SUB</td>
<td>001010000000</td>
<td>SUB d, s → d - s</td>
</tr>
<tr>
<td>SL</td>
<td>010000000000</td>
<td>SL d, s → d &lt;&lt; 1</td>
</tr>
<tr>
<td>RL</td>
<td>010100000000</td>
<td>RL d, s → d [14:0], s [15]</td>
</tr>
<tr>
<td>SR</td>
<td>010010000000</td>
<td>SR d, s → d &gt;&gt; 1</td>
</tr>
<tr>
<td>RR</td>
<td>010110000000</td>
<td>RR d, s → s [0], s [15:1]</td>
</tr>
<tr>
<td>SWP</td>
<td>011000000000</td>
<td>SWP d, s → s [7:0], s [15:8]</td>
</tr>
<tr>
<td>LHI</td>
<td>111010000000</td>
<td>LHI d, N → d [15:8], N [7:0]</td>
</tr>
<tr>
<td>LLI</td>
<td>111100000000</td>
<td>LLI d, N → d [7:0], N [7:0]</td>
</tr>
</tbody>
</table>

In order to evaluate the power consumption, we designed the proposed adiabatic CPU. First, we used synthesis software to map the proposed CPU to a target library. The target library includes generic and/or technology mapping information. The tool used for mapping the Verilog-HDL components to the cell library is Synopsys Design Compiler. It mapped the Verilog-HDL components to a Rohm 0.35 μm ASIC standard cell library. The extracted netlist was then converted into the 2PADCL netlist. The 2PADCL netlist includes the diode model for adiabatic operation. Finally, HSPICE simulations were carried out using the converted netlist.

Table II compares the power consumption and the top clock frequency of the 2PADCL and CMOS designs. From this table we can see that the power consumption of the 2PADCL CPU is only about 25% of that of the conventional CMOS CPU. However, because the proposed CPU operates in the adiabatic mode, the top clock frequency of the 2PADCL CPU is 20% lower than that of the conventional static CMOS CPU.
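The two figures quoted above can be combined into a rough energy-per-cycle comparison. The sketch below assumes that each power figure was obtained with the CPU running at its own top clock frequency; this assumption is not stated in the text, so the result should be read as an estimate only.

```python
POWER_RATIO = 0.25       # 2PADCL power relative to CMOS (from the text)
FREQ_RATIO = 1.0 - 0.20  # 2PADCL top clock frequency relative to CMOS (from the text)

# Energy per clock cycle is power divided by frequency.
energy_per_cycle_ratio = POWER_RATIO / FREQ_RATIO
print(f"2PADCL energy per cycle is about {energy_per_cycle_ratio:.2%} of CMOS")
```

Under this assumption the energy saving per clock cycle is somewhat smaller than the power saving, because part of the power reduction comes from running at a lower frequency.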

V. CONCLUSION

We have presented a design of a 16-bit RISC CPU core using the 2PADCL. The architecture of the proposed CPU is a three-cycle non-pipelined implementation. A conventional static CMOS CPU with the same structure has also been designed for an energy comparison. We have performed power and functional simulations using the netlists extracted from the layout. The power consumption of the proposed CPU is reduced by a factor of four compared to that of the conventional static CMOS CPU.

ACKNOWLEDGMENT

The custom circuits in this study have been designed with Synopsys CAD tools through the chip fabrication program of the VLSI Design and Education Center (VDEC), the University of Tokyo, in collaboration with Rohm Corporation.

Part of this work was supported by a grant from the LSI IP design award committee in Japan.

REFERENCES

Yasuhiro Takahashi was born in Yamagata, Japan, in July 1977. He received the B.E., M.E. and Ph.D. degrees in electronic engineering from Yamagata University, Japan, in 2000, 2002 and 2005, respectively. From 2005 to 2007, he was with the Department of Electrical and Electronic Engineering, Faculty of Engineering, Gifu University, as a Research Associate, and he is currently an Assistant Professor at the same university. His research interests include VLSI architectures for communications systems, CAD techniques for the implementation of high-performance DSP functions, and low-power VLSI design with particular emphasis on digital logic.

Dr. Takahashi is a member of IEEE, IEEJ, IEICE, and IAENG.

Toshikazu Sekine received the B.E., M.E. and Ph.D. degrees in electronic engineering from Yamagata University in 1974, 1976 and 2002, respectively. Since 1976, he has been with the Faculty of Engineering, Gifu University, where he is currently an Associate Professor. His current research interests include electromagnetic compatibility (EMC) analysis, lossy transmission line modeling, microwave systems, and high-speed PCB signal integrity analysis.

Dr. Sekine is a member of IEEE and IEICE.

Michio Yokoyama received the B.E. degree in electrical engineering from Yamagata University in 1989, and the M.E. degree in electrical and communication engineering and the Ph.D. degree in electronic engineering from Tohoku University, Japan, in 1991 and 1994, respectively. From 1994 to 2001, he was with the Research Institute of Electrical Communication, Tohoku University, where he was engaged in research on the design and development of RF-CMOS devices such as power amplifier modules for digital cellular phone systems. Since 2001, he has been with Yamagata University, where he is engaged in the development of RF CMOS circuits and RF system packages.

Dr. Yokoyama is a member of IEEJ, JIEP and JSAP.