VLSI 2017

A 0.45 V 147–375 nW ECG Compression Processor With Wavelet Shrinkage and Adaptive Temporal Decimation Architectures

Abstract:
This paper presents a real-time electrocardiogram (ECG) data compression processor with improved energy efficiency while maintaining high accuracy and real-time operation. Wavelet shrinkage is exploited to filter the noise and achieve sparse ECG signal representation. Adaptive temporal decimation is proposed to achieve configurable processing to adaptively reduce the data amount and computational activities for further power reduction. Modified Huffman and run-length wavelet source coding (MHRLC) is also designed to represent wavelet coefficients with optimized average code length and reduced memory requirement. Fabricated in 0.18-μm CMOS, the ECG processor is implemented with customized near-threshold digital logics for minimum energy operation. The prototype was fully validated with the MIT-BIH Arrhythmia database. With a power consumption of 147-375 nW at 0.45 V, the proposed ECG processor exhibits a wide compression ratio ranging from 2.89 to 26.91, corresponding to a percentage-RMS-distortion from 0% to 3.11%.
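
Illustrative sketch (not taken from the paper): the two figures of merit quoted above, compression ratio and percentage-RMS-distortion, can be computed as shown below. The abstract does not give the exact formula, so this assumes the common definition without mean removal; the function names are hypothetical.

```python
import math


def compression_ratio(original_bits: int, compressed_bits: int) -> float:
    """Ratio of the original bit count to the compressed bit count."""
    return original_bits / compressed_bits


def prd_percent(original, reconstructed) -> float:
    """Percentage RMS difference between a signal and its reconstruction.

    Assumed definition: 100 * sqrt(sum((x - y)^2) / sum(x^2)).
    """
    num = sum((x - y) ** 2 for x, y in zip(original, reconstructed))
    den = sum(x ** 2 for x in original)
    return 100.0 * math.sqrt(num / den)
```

A lossless reconstruction gives a distortion of 0%, which corresponds to the low end of the compression-ratio range reported above.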


A 92-dB DR, 24.3-mW, 1.25-MHz BW Sigma–Delta Modulator Using Dynamically Biased Op Amp Sharing

Abstract:
A 2-2 cascaded switched-capacitor ΣΔ modulator is presented for design of low-voltage, low-power, broadband analog-to-digital conversion. To reduce power dissipation in both analog and digital circuits and ensure low-voltage operation, a half-sample delayed-input feedforward architecture is employed in combination with 4-bit quantization, which results in reduced integrator output swings and relaxed timing constraint in the feedback path. The integrator power is further reduced by sharing an op amp in the two integrators in each stage and periodically changing the op amp bias condition between a high-current and a low-current mode using a fast low-power high-precision charge pump circuit. Implemented in a 0.18-μm CMOS technology, the experimental prototype achieves a 92-dB dynamic range, a 91-dB peak signal-to-noise ratio, and an 84-dB peak signal-to-noise plus distortion ratio for a signal bandwidth of 1.25 MHz. Operated at a 40-MHz sampling rate, the modulator dissipates 24.3 mW from a 1 V supply.


Energy-Efficient TCAM Search Engine Design Using Priority-Decision in Memory Technology

Abstract:
Ternary content-addressable memory (TCAM)-based search engines generally need a priority encoder (PE) to select the highest priority match entry for resolving the multiple match problem due to the don’t care (X) features of TCAM. In contemporary network security, TCAM-based search engines are widely used in regular expression matching across multiple packets to protect against attacks, such as by viruses and spam. However, the use of PE results in increased energy consumption for pattern updates and search operations. Instead of using PEs to determine the match, our solution is a three-phase search operation that utilizes the length information of the matched patterns to decide the longest pattern match data. This paper proposes a promising memory technology called priority-decision in memory (PDM), which eliminates the need for PEs and removes restrictions on ordering, implying that patterns can be stored in an arbitrary order without sorting their lengths. Moreover, we present a sequential input-state (SIS) scheme to disable the mass of redundant search operations in state segments on the basis of an analysis distribution of hex signatures in a virus database. Experimental results demonstrate that the PDM-based technology can improve update energy consumption of nonvolatile TCAM (nvTCAM) search engines by 36%-67%, because most of the energy in these search engines is used to reorder. By adopting the SIS-based method to avoid unnecessary search operations in a TCAM array, the search energy reduction is around 64% of nvTCAM search engines.
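
Illustrative sketch (not the paper's circuit or its three-phase operation): the idea of choosing the longest match from stored length information, rather than from entry order, can be modeled in software as below. The entry layout, the use of the number of specified symbols as the "length", and the function names are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical entry: (ternary pattern over '0', '1', 'X', stored length).
Entry = Tuple[str, int]


def matches(pattern: str, key: str) -> bool:
    """A ternary pattern matches when every position is 'X' or equals the key."""
    return len(pattern) == len(key) and all(
        p in ("X", k) for p, k in zip(pattern, key)
    )


def longest_match(entries: List[Entry], key: str) -> Optional[int]:
    """Return the index of the matching entry with the largest stored length.

    Entries may be stored in any order; no priority encoder ordering is assumed.
    """
    best = None
    for index, (pattern, length) in enumerate(entries):
        if matches(pattern, key) and (best is None or length > entries[best][1]):
            best = index
    return best
```

Because the decision uses the stored length, inserting a new pattern in this model does not require re-sorting existing entries, which is the property the abstract attributes to PDM.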


Sense Amplifier Half-Buffer (SAHB): A Low-Power High-Performance Asynchronous Logic QDI Cell Template

Abstract:
We propose a novel asynchronous logic (async) quasi-delay-insensitive (QDI) sense-amplifier half-buffer (SAHB) cell design approach, with emphases on high operational robustness, high speed, and low power dissipation. There are five key features of our proposed SAHB. First, the SAHB cell embodies the async QDI 4-phase (4φ) signaling protocol to accommodate process-voltage-temperature variations. Second, the sense amplifier (SA) block in SAHB cells embodies a cross-coupled latch with a positive feedback mechanism to speed up the output evaluation. Third, the evaluation block in the SAHB comprises both nMOS pull-up and pull-down networks with minimum transistor sizing to reduce the parasitic capacitance. Fourth, both the evaluation block and SA block are tightly coupled to reduce redundant internal switching nodes. Fifth, the SAHB cell is designed in CMOS static logic and hence appropriate for full-range dynamic voltage scaling operation for VDD ranging from nominal voltage (1 V) to subthreshold voltage (~0.3 V). When six library cells embodying our proposed SAHB are compared with those embodying the conventional async QDI precharged half-buffer (PCHB) approach, the proposed SAHB cells collectively feature simultaneous ~64% lower power, ~21% faster, and ~6% smaller IC area; the PCHB cell is inappropriate for subthreshold operation. A prototype 64-bit Kogge-Stone pipeline adder based on the SAHB approach (at 65 nm CMOS) is designed. For a 1-GHz throughput and at nominal VDD, the design based on the SAHB approach simultaneously features ~56% lower energy and ~24% lower transistor count advantages than its PCHB counterpart. When benchmarked against the ubiquitous synchronous logic counterpart, our SAHB dissipates ~39% lower energy at the 1-GHz throughput.


A 100-mA, 99.11% Current Efficiency, 2-mVpp Ripple Digitally Controlled LDO With Active Ripple Suppression

Abstract:
Digital low-dropout (DLDO) regulators are gaining attention due to their design scalability for distributed multiple voltage domain applications required in state-of-the-art system-on-chips. Due to the discrete nature of the output current and the discrete-time control loop, the steady-state response of the DLDO has inherent output voltage ripple. A hybrid DLDO (HD-LDO) with fast response and stable operation across a wide load range while reducing the output voltage ripple is proposed. In the HD-LDO, a DLDO and a low current analog ripple cancellation amplifier (RCA) work in parallel. The output dc of the RCA is sensed by a 2-bit analog-to-digital converter, and the digitized linear stage current is fed into the DLDO as an error signal. During load transients, a gear-shift controller enables fast transient response using dynamic load estimation. The DLDO suppresses the output dc of the RCA within its current resolution. With this arrangement, a majority of the dc load current is provided by the DLDO and the RCA supplies ripple cancellation current. The HD-LDO is designed and fabricated in a 180-nm CMOS technology, and occupies 0.697 mm² of the die area. The HD-LDO operates with an input voltage range of 1.43-2.0 V and an output voltage range of 1.0-1.57 V. At 100-mA load current, the HD-LDO achieves a peak current efficiency of 99.11% and a settling time of 15 clock periods with a 0.5-MHz clock for a current switching between 10 and 90 mA. The RCA suppresses fundamental, second, and third harmonics of the switching frequency by 13.7, 13.3, and 14.1 dB, respectively.


Temporarily Fine-Grained Sleep Technique for Near- and Subthreshold Parallel Architectures

Abstract:
This paper presents a design approach for improving energy-efficiency and throughput of parallel architectures in near- and subthreshold voltage circuits. The focus is to suppress leakage energy dissipation of the idle portions of circuits during active modes, which can allow us to wholly transform the throughput improvement from parallel architectures into energy savings via deep voltage scaling. We begin by investigating the efficacy of parallel and pipeline architectures in the near- and subthreshold circuits. The investigation reveals that active energy dissipation largely undermines the ability of deep voltage scaling to transform excessive throughput into energy savings. Techniques, such as power-gating switches (PGSs), can mitigate active-leakage power dissipation; however, the overhead for entering and exiting sleep modes can offset the energy savings provided by sleep mode, particularly if sleep time is fine grained for suppressing active leakage. Therefore, in this paper, we propose a PGS design technique, inspired by the so-called zigzag supercutoff CMOS, in order to optimize the overheads of mode transitions of PGS in near- and subthreshold circuits. The proposed technique allows circuits to stay in sleep mode for as short as a single clock cycle with negligible energy and delay overheads. We apply our proposed design to parallel multiplier-based test circuits operating at near- and subthreshold voltages. Simulations show a significant improvement in energy-efficiency over baselines at the same throughput.


A Fault Tolerance Technique for Combinational Circuits Based on Selective-Transistor Redundancy

Abstract:
With fabrication technology reaching nanolevels, systems are becoming more prone to manufacturing defects with higher susceptibility to soft errors. This paper is focused on designing combinational circuits for soft error tolerance with minimal area overhead. The idea is based on analyzing random pattern testability of faults in a circuit and protecting sensitive transistors, whose soft error detection probability is relatively high, until desired circuit reliability is achieved or a given area overhead constraint is met. Transistors are protected based on duplicating and sizing a subset of transistors necessary for providing the protection. In addition, a novel gate-level reliability evaluation technique is proposed that provides similar results to reliability evaluation at the transistor level (using SPICE) with orders-of-magnitude reduction in CPU time. LGSynth’91 benchmark circuits are used to evaluate the proposed algorithm. Simulation results show that the proposed algorithm achieves better reliability than other transistor sizing-based techniques and the triple modular redundancy technique with significantly lower area overhead for 130-nm process technology at ground level.


A 65-nm CMOS Constant Current Source With Reduced PVT Variation

Abstract:
This paper presents a new nanometer-based low-power constant current reference that attains a small value in the total process-voltage-temperature variation. The circuit architecture is based on the embodiment of a process-tolerant bias current circuit and a scaled process-tracking bias voltage source for the dedicated temperature-compensated voltage-to-current conversion in a preregulator loop. Fabricated in a UMC 65-nm CMOS process, it consumes 7.18 μW with a 1.4 V supply. The measured results indicate that the current reference achieves an average temperature coefficient of 119 ppm/°C over 12 samples in a temperature range from -30 °C to 90 °C without any calibration. Besides, a low line sensitivity of 180 ppm/V is obtained. This paper offers a better sensitivity figure of merit with respect to the reported representative counterparts.


An All-MOSFET Sub-1-V Voltage Reference With a −51-dB PSR up to 60 MHz

Abstract:
This paper presents a voltage reference (VR) with a power supply rejection (PSR) better than 50 dB for frequencies of up to 60 MHz, and uses MOSFETs in strong inversion. Another innovation is a compact MOSFET low-pass filter, which was developed along with a feedback technique for a wide-bandwidth PSR not achieved in previous works. The proposed all-MOSFET VR was fabricated using a standard 0.18 μm CMOS process. It achieves a minimum temperature coefficient of 6.5 ppm/°C for temperatures from -30 °C to 110 °C. The line regulation is 0.076%/V for a step from 0.8 to 2.2 V supply voltage with 360 nW power consumption at room temperature and an area of 0.0143 mm².


Conditional-Boosting Flip-Flop for Near-Threshold Voltage Application

Abstract:
A conditional-boosting flip-flop is proposed for ultralow-voltage application where the supply voltage is scaled down to the near-threshold region. The proposed flip-flop adopts voltage boosting to provide low latency with reduced performance variability in the near-threshold voltage region. It also adopts conditional capture to minimize the switching power consumption by eliminating redundant boosting operations. Experimental results in a 65-nm CMOS process indicated that the proposed flip-flop provided up to 72% lower latency with 75% less performance variability due to process variation, and up to 67% improved energy-delay product at 25% switching activity compared with conventional precharged differential flip-flops.


A 0.1–2-GHz Quadrature Correction Loop for Digital Multiphase Clock Generation Circuits in 130-nm CMOS

Abstract:
A 100-MHz-2-GHz closed-loop analog in-phase/quadrature correction circuit for digital clocks is presented. The proposed circuit consists of a phase-locked loop-type architecture for quadrature error correction. The circuit corrects the phase error to within 1.5° up to 1 GHz and to within 3° at 2 GHz. It consumes 5.4 mA from a 1.2 V supply at 2 GHz. The circuit was designed in UMC 0.13-μm mixed-mode CMOS with an active area of 102 μm×95 μm. The impact of duty cycle distortion has been analyzed. High-frequency quadrature measurement related issues have been discussed. The proposed circuit was used in two different applications for which the functionality has been verified.


A High-Speed and Power-Efficient Voltage Level Shifter for Dual-Supply Applications

Abstract:
This brief presents a fast and power-efficient voltage level-shifting circuit capable of converting extremely low levels of input voltages into high output voltage levels. The efficiency of the proposed circuit is due to the fact that not only the strength of the pull-up device is significantly reduced when the pull-down device is pulling down the output node, but the strength of the pull-down device is also increased using a low-power auxiliary circuit. Postlayout simulation results of the proposed circuit in a 0.18-μm technology demonstrate a total energy per transition of 157 fJ, a static power dissipation of 0.3 nW, and a propagation delay of 30 ns for input frequency of 1 MHz, low supply voltage level of VDDL = 0.4 V, and high supply voltage level of VDDH = 1.8 V.


Probability-Driven Multibit Flip-Flop Integration With Clock Gating

Abstract:
Data-driven clock gating (DDCG) and multibit flip-flops (MBFFs) are two low-power design techniques that are usually treated separately. Combining these techniques into a single grouping algorithm and design flow enables further power savings. We study MBFF multiplicity and its synergy with FF data-to-clock toggling probabilities. A probabilistic model is implemented to maximize the expected energy savings by grouping FFs in increasing order of their data-to-clock toggling probabilities. We present a front-end design flow, guided by physical layout considerations for a 65-nm 32-bit MIPS and a 28-nm industrial network processor. It is shown to achieve power savings of 23% and 17%, respectively, compared with designs with ordinary FFs. About half of the savings was due to integrating the DDCG into the MBFFs.
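
Illustrative sketch (not the paper's model): a simplified version of the grouping idea can be written as below. It assumes that FF toggling events are independent, that a k-bit MBFF shares one gated clock that is enabled whenever any of its FFs toggles, and that cost is counted as the expected number of clocked bits per cycle. These assumptions and all function names are hypothetical.

```python
from math import prod


def group_in_order(probs, k):
    """Sort toggling probabilities in increasing order and cut into groups of k."""
    ordered = sorted(probs)
    return [ordered[i:i + k] for i in range(0, len(ordered), k)]


def enable_probability(group):
    """Probability that at least one FF in the group toggles (independence assumed)."""
    return 1.0 - prod(1.0 - p for p in group)


def expected_clocked_bits(groups):
    """Expected number of FF bits receiving a clock pulse per cycle."""
    return sum(len(g) * enable_probability(g) for g in groups)
```

Under this toy cost, grouping [0.01, 0.02] and [0.8, 0.9] clocks fewer bits on average than mixing low- and high-activity FFs in the same MBFF, which is the intuition behind ordering by toggling probability.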


Area and Energy-Efficient Complementary Dual-Modular Redundancy Dynamic Memory for Space Applications

Abstract:
The limited size and power budgets of space-bound systems often contradict the requirements for reliable circuit operation within high-radiation environments. In this paper, we propose the smallest solution for soft-error tolerant embedded memory yet to be presented. The proposed complementary dual-modular redundancy (CDMR) memory is based on a four-transistor dynamic memory core that internally stores complementary data values to provide an inherent per-bit error detection capability. By adding simple, low-overhead parity, an error-correction capability is added to the memory architecture for robust soft-error protection. The proposed memory was implemented in a 65-nm CMOS technology, displaying as much as a 3.5× smaller silicon footprint than other radiation-hardened bitcells. In addition, the CDMR memory consumes between 48% and 87% less standby power than other considered solutions across the entire operating region.
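
Illustrative sketch (a behavioral model, not the paper's circuit): storing each bit together with its complement flags an upset wherever the two stored values agree, and a single word parity bit then fixes the flagged bit. The sketch assumes at most one upset per word and an intact parity bit; the function names are hypothetical.

```python
def encode(bits):
    """Store each bit with its complement, plus one even-parity bit per word."""
    cells = [(b, 1 - b) for b in bits]
    parity = sum(bits) % 2
    return cells, parity


def decode(cells, parity):
    """Detect a per-bit upset (true == complement) and correct it via parity."""
    bad = [i for i, (t, c) in enumerate(cells) if t == c]
    if len(bad) > 1:
        raise ValueError("more than one flagged bit; not correctable in this sketch")
    bits = [t for t, _ in cells]
    if bad:
        i = bad[0]
        # Parity of all other bits, then pick the value that restores word parity.
        others = (sum(bits) - bits[i]) % 2
        bits[i] = others ^ parity
    return bits
```

The complementary pair only locates the faulty bit; the parity bit supplies the information needed to decide which of the two stored values was correct.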


Delay Analysis for Current Mode Threshold Logic Gate Designs

Abstract:
Current mode is a popular CMOS-based implementation of threshold logic functions, where the gate delay depends on the sensor size. This paper presents a new implementation of current mode threshold functions for improved gate delay and switching energy. An analytical method is also proposed in order to quickly identify the sensor size that minimizes the gate delay. Simulation results on different gates implemented using the optimum sensor size indicate that the proposed current mode implementation method consistently outperforms the existing implementations in delay as well as switching energy.


Analysis and Design of a Low-Voltage Low-Power Double-Tail Comparator

Abstract:
The need for ultra low-power, area efficient, and high speed analog-to-digital converters is pushing toward the use of dynamic regenerative comparators to maximize speed and power efficiency. In this paper, an analysis on the delay of the dynamic comparators is presented and analytical expressions are derived. From the analytical expressions, designers can obtain an intuition about the main contributors to the comparator delay and fully explore the tradeoffs in dynamic comparator design. Based on the presented analysis, a new dynamic comparator is proposed, where the circuit of a conventional double-tail comparator is modified for low-power and fast operation even at small supply voltages. Without complicating the design and by adding a few transistors, the positive feedback during the regeneration is strengthened, which results in remarkably reduced delay time. Post-layout simulation results in a 0.18-μm CMOS technology confirm the analysis results. It is shown that in the proposed dynamic comparator both the power consumption and delay time are significantly reduced. The maximum clock frequency of the proposed comparator can be increased to 2.5 and 1.1 GHz at supply voltages of 1.2 and 0.6 V, while consuming 1.4 mW and 153 μW, respectively. The standard deviation of the input-referred offset is 7.8 mV at 1.2 V supply.


10T SRAM Using Half-VDD Precharge and Row-Wise Dynamically Powered Read Port for Low Switching Power and Ultralow RBL Leakage

Abstract:
We present, in this paper, a new 10T static random access memory cell having single ended decoupled read-bitline (RBL) with a 4T read port for low power operation and leakage reduction. The RBL is precharged at half the cell’s supply voltage, and is allowed to charge and discharge according to the stored data bit. An inverter, driven by the complementary data node (QB), connects the RBL to the virtual power rails through a transmission gate during the read operation. RBL increases toward the VDD level for a read-1, and discharges toward the ground level for a read-0. Virtual power rails have the same value as the RBL precharging level during the write and the hold mode, and are connected to true supply levels only during the read operation. Dynamic control of virtual rails substantially reduces the RBL leakage. The proposed 10T cell in a commercial 65 nm technology is 2.47× the size of 6T with β = 2, provides 2.3× read static noise margin, and reduces the read power dissipation by 50% compared with 6T. The value of RBL leakage is reduced by more than 3 orders of magnitude and (ION/IOFF) is greatly improved compared with the 6T BL leakage. The overall leakage characteristics of 6T and 10T are similar, and competitive performance is achieved.


Low-Power Design for a Digit-Serial Polynomial Basis Finite Field Multiplier Using Factoring Technique

Abstract:
In CMOS-based application-specific integrated circuit (ASIC) designs, total power consumption is dominated by dynamic power, where dynamic power consists of two major components, namely, switching power and internal power. In this paper, we present a low-power design for a digit-serial finite field multiplier in GF(2^m). In the proposed design, a factoring technique is used to minimize switching power. To the best of our knowledge, the factoring method has not previously been reported in the literature for the design of a finite field multiplier at an architectural level. Logic gate substitution is also utilized to reduce internal power. Our proposed design along with several existing similar works have been realized for GF(2^233) on ASIC platform, and a comparison is made between them. The synthesis results show that the proposed multiplier design consumes at least 27.8% lower total power than any previous work in comparison.
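
Illustrative sketch (a plain software baseline, not the paper's architecture): digit-serial polynomial-basis multiplication consumes one operand d bits at a time, most significant digit first, and reduces after each step. Field elements are held as Python integers whose bit i is the coefficient of x^i. The factoring technique and gate substitution described above are not modeled, and the function names are hypothetical.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less (GF(2)) polynomial product of a and b."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def reduce_poly(v: int, f: int, m: int) -> int:
    """Reduce v modulo the degree-m polynomial f (f includes the x^m term)."""
    for i in range(v.bit_length() - 1, m - 1, -1):
        if (v >> i) & 1:
            v ^= f << (i - m)
    return v


def digit_serial_mul(a: int, b: int, f: int, m: int, d: int) -> int:
    """Multiply a*b in GF(2^m), processing b in d-bit digits, MSD first."""
    n_digits = -(-m // d)  # ceil(m / d)
    mask = (1 << d) - 1
    acc = 0
    for j in range(n_digits - 1, -1, -1):
        digit = (b >> (j * d)) & mask
        # Shift the accumulator by one digit, add the partial product, reduce.
        acc = reduce_poly((acc << d) ^ clmul(a, digit), f, m)
    return acc
```

For example, in GF(2^4) with f = x^4 + x + 1 (0b10011), multiplying x by x^3 yields x + 1. For GF(2^233), one widely used reduction polynomial is the NIST trinomial x^233 + x^74 + 1; the abstract does not say which polynomial the authors used.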


Multicast-Aware High-Performance Wireless Network-on-Chip Architectures

Abstract:
Today’s multiprocessor platforms employ the network-on-chip (NoC) architecture as the preferable communication backbone. Conventional NoCs are designed predominantly for unicast data exchanges. In such NoCs, the multicast traffic is generally handled by converting each multicast message to multiple unicast transmissions. Hence, applications dominated by multicast traffic experience high queuing latencies and significant performance penalties when running on systems designed with unicast-based NoC architectures. Various multicast mechanisms such as XY-tree multicast and path multicast have already been proposed to enhance the performance of the traditional wireline mesh NoC incorporating multicast traffic. However, even with such added features, the multihop nature of the wireline mesh NoC leads to high network latencies and thus limits the achievable system performance. In this paper, to sustain the high-bandwidth and high-throughput requirements of emerging applications, we propose the design of a wireless NoC (WiNoC) architecture incorporating necessary multicast support. By integrating congestion-aware multicast routing with network coding, the WiNoC is able to efficiently handle heavy multicast injections. For applications running with a broadcast-heavy Hammer cache coherence protocol, the proposed multicast-aware WiNoC achieves an average of 47% reduction in message latency compared with the XY-tree-based multicast-aware mesh NoC. This network level improvement translates into a 26% saving in full-system energy delay product.


Reordering Tests for Efficient Fail Data Collection and Tester Time Reduction

Abstract:
During fail data collection, a tester collects information that is useful for defect diagnosis. If fail data collection can be terminated early, the tester time as well as the volume of fail data will be reduced. Test reordering can enhance the ability to terminate the process early without affecting the quality of diagnosis. In this paper, test reordering targets logic defects based on information that is derived during defect diagnosis. The defect diagnosis procedure is enhanced to identify tests that are useful for defect diagnosis across a sample of faulty instances of a circuit. Tests that are determined to be useful for more faulty instances of a circuit are placed earlier in the test set based on the expectation that the same tests will be useful for other faulty instances of the circuit. The experimental results for logic defects in benchmark circuits support the effectiveness of this approach and indicate that test reordering helps to terminate fail data collection early without impacting the diagnosis quality.
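
Illustrative sketch (not the paper's procedure): the ordering rule described above, placing tests that were useful for more faulty instances earlier, reduces to a stable sort on a usefulness count. How a test is judged "useful" during diagnosis is not modeled here; the input format and the function name are assumptions.

```python
from collections import Counter


def reorder_tests(tests, useful_per_instance):
    """Order tests by how many faulty instances found them useful (descending).

    tests: tests in their original order.
    useful_per_instance: one collection of useful tests per faulty instance.
    Ties keep the original order because sorted() is stable.
    """
    counts = Counter(t for useful in useful_per_instance for t in set(useful))
    return sorted(tests, key=lambda t: -counts[t])
```

With the reordered set, a tester that stops after the first few failing tests is more likely to have already applied the tests that diagnosis relies on.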


COMEDI: Combinatorial Election of Diagnostic Vectors From Detection Test Sets for Logic Circuits

Abstract:
Although the modern automatic test pattern generation (ATPG) tools can efficiently produce near-optimal test sets with high fault-coverage for a circuit-under-test, a diagnostic test set (DTS), which is needed for fault localization, is much more challenging to construct. The DTS is used to analyze the responses of failing chips during manufacturing test for the purpose of identifying the root cause of observed errors. In this paper, a novel technique for selecting a powerful DTS for stuck-at faults from a pool of ATPG detection vectors is proposed. Unlike existing methods, this technique does not use any diagnostic test generation, circuit modification, or miter-based approach. It constructs a combinatorial cover of the pool to determine a test set with high diagnostic coverage (DC). Two variants of the covering algorithm are proposed based on this technique. The experimental results on several combinational and scan-based benchmark circuits demonstrate the effectiveness of our method in terms of the size of the DTS, DC, and CPU time.
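
Illustrative sketch (a generic greedy cover, not the COMEDI algorithm or either of its variants): selecting diagnostic vectors from a detection pool can be framed as covering fault pairs, where a vector covers a pair if the two faults produce different responses under it. The dictionary layout of the simulated responses and the function names are assumptions.

```python
from itertools import combinations


def distinguished_pairs(responses):
    """Fault pairs whose simulated responses differ under one vector."""
    faults = sorted(responses)
    return {(f, g) for f, g in combinations(faults, 2) if responses[f] != responses[g]}


def greedy_diagnostic_set(pool):
    """Pick vectors greedily until every distinguishable fault pair is covered.

    pool: {vector_name: {fault_name: response}} from fault simulation.
    """
    pairs_by_vec = {v: distinguished_pairs(r) for v, r in pool.items()}
    remaining = set().union(*pairs_by_vec.values())
    chosen = []
    while remaining:
        best = max(pairs_by_vec, key=lambda v: len(pairs_by_vec[v] & remaining))
        chosen.append(best)
        remaining -= pairs_by_vec[best]
    return chosen
```

Only vectors already in the pool are used, so no diagnostic test generation or circuit modification is involved, in line with the constraint stated in the abstract.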


Design of Power and Area Efficient Approximate Multipliers

Abstract:
Approximate computing can decrease the design complexity with an increase in performance and power efficiency for error resilient applications. This brief deals with a new design approach for approximation of multipliers. The partial products of the multiplier are altered to introduce varying probability terms. Logic complexity of approximation is varied for the accumulation of altered partial products based on their probability. The proposed approximation is utilized in two variants of 16-bit multipliers. Synthesis results reveal that two proposed multipliers achieve power savings of 72% and 38%, respectively, compared to an exact multiplier. They have better precision when compared to existing approximate multipliers. Mean relative error figures are as low as 7.6% and 0.02% for the proposed approximate multipliers, which are better than the previous works. Performance of the proposed multipliers is evaluated with an image processing application, where one of the proposed models achieves the highest peak signal to noise ratio.


Time-Encoded Values for Highly Efficient Stochastic Circuits

Abstract:
Stochastic computing (SC) is a promising technique for applications that require low area overhead and fault tolerance, but can tolerate relatively high latency. In the SC paradigm, logical computation is performed on randomized bit streams. In prior work, streams were generated with linear feedback shift registers; these contributed heavily to the hardware cost and consumed a significant amount of power. This paper introduces a new approach for encoding signal values: computation is performed on analog periodic pulse signals. Exploiting pulse width modulation, time-encoded signals corresponding to specific values are generated by adjusting the frequency and duty cycles of pulse width modulated (PWM) signals. With this approach, the latency, area, and energy consumption are all greatly reduced. Experimental results on image processing applications show up to 99% performance speedup, 98% saving in energy dissipation, and 40% area reduction compared to prior stochastic approaches. Circuits synthesized with the proposed approach can work as fast and energy-efficiently as a conventional binary design while retaining the fault-tolerance and low-cost advantages of conventional stochastic designs.
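
Illustrative sketch (a discrete-time model, not the paper's analog circuit): with values encoded as duty cycles, an AND gate acts as a multiplier, just as it does for independent stochastic bit streams. The sketch assumes integer periods that are coprime, so that over one common period every phase combination occurs exactly once; that choice of periods and the function names are assumptions.

```python
from fractions import Fraction
from math import gcd


def pwm(high: int, period: int, t: int) -> int:
    """Sample of a PWM signal that is high for the first `high` ticks of each period."""
    return 1 if t % period < high else 0


def and_multiply(high_a: int, period_a: int, high_b: int, period_b: int) -> Fraction:
    """Fraction of time the AND of two PWM signals is high over one common period."""
    assert gcd(period_a, period_b) == 1, "sketch assumes coprime periods"
    total = period_a * period_b
    ones = sum(
        pwm(high_a, period_a, t) & pwm(high_b, period_b, t) for t in range(total)
    )
    return Fraction(ones, total)
```

For duty cycles 3/5 and 2/7 the AND output is high for 6/35 of the time, the exact product, and it is obtained after 35 ticks rather than after a long random bit stream.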


Soft Error Rate Reduction of Combinational Circuits Using Gate Sizing in the Presence of Process Variations

Abstract:
Soft errors in combinational logic circuits are emerging as a significant reliability concern for nanoscale VLSI designs. This paper presents a novel sensitivity-based gate sizing methodology to reduce the soft error rate (SER) of combinational circuits in the presence of process variations. The proposed method is based on modeling the statistics of SER of the circuit gates as a random variable to formulate a statistical optimization problem. A backward traversing algorithm with capability for incremental analysis is developed for computing the distribution of circuit gates of SER random variables. We present a gate resizing algorithm in which the gates with the most contribution to the circuit SER are selected in a candidate set using a statistical ordering approach. The proposed algorithm trades off SER reduction and area overheads. The experimental results show that using the proposed methodology, the circuit statistical SER can be reduced by up to 56.4% compared with the 14.8% SER reduction of a circuit obtained using the worst case methodology at the expense of 10% area overhead under 10% process variation ratio. The results also show that the proposed method achieves about 40% more SER reduction compared with that obtained using closed-form analysis for statistical soft error rate estimation (CASSER), the most recently published similar work, in the same experimental conditions. Comparing the runtime of the proposed optimization algorithm with the optimization based on CASSER, it is observed that the proposed method is two orders of magnitude faster than CASSER due to its incremental analysis property.


An FPGA-Based Hardware Accelerator for Traffic Sign Detection

Abstract:
Traffic sign detection plays an important role in a number of practical applications, such as intelligent driver assistance and roadway inventory management. In order to process the large amount of data from either real-time videos or large off-line databases, a high-throughput traffic sign detection system is required. In this paper, we propose an FPGA-based hardware accelerator for traffic sign detection based on cascade classifiers. To maximize the throughput and power efficiency, we propose several novel ideas, including: 1) rearranged numerical operations; 2) shared image storage; 3) adaptive workload distribution; and 4) fast image block integration. The proposed design is evaluated on a Xilinx ZC706 board. When processing high-definition (1080p) video, it achieves the throughput of 126 frames/s and the energy efficiency of 0.041 J/frame.


Dual-Quality 4:2 Compressors for Utilizing in Dynamic Accuracy Configurable Multipliers

Abstract:
In this paper, we propose four 4:2 compressors, which have the flexibility of switching between the exact and approximate operating modes. In the approximate mode, these dual-quality compressors provide higher speeds and lower power consumptions at the cost of lower accuracy. Each of these compressors has its own level of accuracy in the approximate mode as well as different delays and power dissipations in the approximate and exact modes. Using these compressors in the structures of parallel multipliers provides configurable multipliers whose accuracies (as well as their powers and speeds) may change dynamically during the runtime. The efficiencies of these compressors in a 32-bit Dadda multiplier are evaluated in a 45-nm standard CMOS technology by comparing their parameters with those of the state-of-the-art approximate multipliers. The results of comparison indicate, on average, 46% and 68% lower delay and power consumption in the approximate mode. Also, the effectiveness of these compressors is assessed in some image processing applications.
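For readers unfamiliar with the building block, the following is a minimal sketch of a conventional exact 4:2 compressor, assuming the textbook construction from two chained full adders. It does not reproduce the four dual-quality designs of the paper, whose approximate modes are not described in the abstract; the function names are illustrative only.

```python
from itertools import product


def full_adder(a, b, c):
    """One-bit full adder: returns (sum, carry)."""
    s = a ^ b ^ c
    carry = (a & b) | (b & c) | (a & c)
    return s, carry


def exact_compressor_4_2(x1, x2, x3, x4, cin):
    """Textbook exact 4:2 compressor (assumed structure: two chained full adders).

    Returns (sum, carry, cout) such that
    x1 + x2 + x3 + x4 + cin == sum + 2 * (carry + cout).
    """
    s1, cout = full_adder(x1, x2, x3)       # first stage compresses three inputs
    total, carry = full_adder(s1, x4, cin)  # second stage absorbs x4 and carry-in
    return total, carry, cout


def check_all_inputs():
    """Exhaustively verify the arithmetic identity over all 32 input patterns."""
    for bits in product((0, 1), repeat=5):
        s, carry, cout = exact_compressor_4_2(*bits)
        if sum(bits) != s + 2 * (carry + cout):
            return False
    return True
```

An approximate mode would relax this identity for some input patterns in exchange for a shorter critical path and lower switching activity; which patterns are affected is specific to each design.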


Energy-Efficient Reduce-and-Rank Using Input-Adaptive Approximations

Abstract:
Approximate computing is an emerging design paradigm that exploits the intrinsic ability of applications to produce acceptable outputs even when their computations are executed approximately. In this paper, we explore approximate computing for a key computation pattern, reduce-and-rank (RnR), which is prevalent in a wide range of workloads, including video processing, recognition, search, and data mining. An RnR kernel performs a reduction operation (e.g., distance computation, dot product, and L1-norm) between an input vector and each of a set of reference vectors, and ranks the reduction outputs to select the top reference vectors for the current input. We propose three complementary approximation strategies for the RnR computation pattern. The first is interleaved reduction-and-ranking, wherein the vector reductions are decomposed into multiple partial reductions and interleaved with the rank computation. Leveraging this transformation, we propose the use of intermediate reduction results and ranks to identify future computations that are likely to have a low impact on the output, and can, hence, be approximated. The second strategy, input-similarity-based approximation, exploits the spatial or temporal correlation of inputs (e.g., pixels of an image or frames of a video) to identify computations that are amenable to approximation. The third strategy, reference vector reordering, rearranges the order in which the reference vectors are processed such that vectors that are relatively more critical in evaluating the correct output are processed at the beginning of the RnR operation. The number of these critical reference vectors is usually small, which renders a substantial portion of the total computation amenable to approximation.
These strategies address a key challenge in approximate computing, namely the identification of which computations to approximate, and may be used to drive any approximation mechanism, such as computation skipping or precision scaling, to realize performance and energy improvements. A second key challenge in approximate computing is that the extent to which computations can be approximated varies significantly from application to application, and across inputs for even a single application. Hence, input-adaptive approximation, or the ability to automatically modulate the degree of approximation based on the nature of each individual input, is essential for obtaining optimal energy savings. In addition, to enable quality configurability in RnR kernels, we propose a kernel-level quality metric that correlates well to application-level quality, and identify key parameters that can be used to tune the proposed approximation strategies dynamically. We develop a runtime framework that modulates the identified parameters during the execution of RnR kernels to minimize their energy while meeting a given target quality. To evaluate the proposed concepts, we designed quality-configurable hardware implementations of six RnR-based applications from the recognition, mining, search, and video processing application domains in 45-nm technology. Our experiments demonstrate a 1.13×-3.18× reduction in energy consumption with virtually no loss in output quality (<0.5%) at the application level. The energy benefits further improve up to 3.43× and 3.9× when the quality constraints are relaxed to 2.5% and 5%, respectively.
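The RnR pattern described above can be illustrated with a small exact baseline. This is only a sketch of the unapproximated kernel, assuming an L1-distance reduction and a smallest-distance-first ranking; the function names are hypothetical, and none of the three approximation strategies from the paper is implemented here.

```python
def l1_distance(u, v):
    """Reduction step: L1 distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def reduce_and_rank(query, references, top_k):
    """Exact reduce-and-rank: score every reference, then keep the best top_k.

    Returns the indices of the top_k reference vectors closest to the query,
    ordered from closest to farthest (ties broken by lower index).
    """
    scores = [(l1_distance(query, ref), idx) for idx, ref in enumerate(references)]
    scores.sort()  # rank step: ascending distance, then ascending index
    return [idx for _, idx in scores[:top_k]]
```

In this baseline every reference vector is fully reduced before ranking. The strategies in the abstract target exactly that cost, for example by ranking on partial reductions and approximating the work that is unlikely to change the top results.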


RoBA Multiplier: A Rounding-Based Approximate Multiplier for High-Speed yet Energy-Efficient Digital Signal Processing

Abstract:
In this paper, we propose an approximate multiplier that is high speed yet energy efficient. The approach is to round the operands to the nearest power of two. This way, the computationally intensive part of the multiplication is omitted, improving speed and energy consumption at the price of a small error. The proposed approach is applicable to both signed and unsigned multiplications. We propose three hardware implementations of the approximate multiplier: one for unsigned and two for signed operations. The efficiency of the proposed multiplier is evaluated by comparing its performance with those of some approximate and accurate multipliers using different design parameters. In addition, the efficacy of the proposed approximate multiplier is studied in two image processing applications, i.e., image sharpening and smoothing.
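A rough software sketch of the rounding idea follows. The abstract does not give the formula, so this assumes the product is approximated as Ar×B + A×Br − Ar×Br, where Ar and Br are the operands rounded to the nearest power of two; this drops the term (Ar − A)(Br − B) and leaves only shifts and additions. The tie-breaking rule (round up) and the function names are also assumptions.

```python
def round_to_power_of_two(x):
    """Round a positive integer to the nearest power of two (ties round up, by assumption)."""
    if x <= 0:
        raise ValueError("x must be positive")
    lower = 1 << (x.bit_length() - 1)  # largest power of two <= x
    upper = lower << 1                 # smallest power of two > x
    return lower if (x - lower) < (upper - x) else upper


def approx_multiply(a, b):
    """Approximate a*b as Ar*b + a*Br - Ar*Br using only shifts and adds.

    The error equals (Ar - |a|) * (Br - |b|), so the result is exact
    whenever either operand is already a power of two.
    """
    if a == 0 or b == 0:
        return 0
    sign = -1 if (a < 0) != (b < 0) else 1
    a, b = abs(a), abs(b)
    ka = round_to_power_of_two(a).bit_length() - 1  # Ar == 1 << ka
    kb = round_to_power_of_two(b).bit_length() - 1  # Br == 1 << kb
    magnitude = (b << ka) + (a << kb) - (1 << (ka + kb))
    return sign * magnitude
```

For instance, with both operands equal to 6 the rounded values are 8, and the sketch returns 32 instead of 36; the difference of 4 is the omitted (8 − 6)×(8 − 6) term.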


A Dual-Clock VLSI Design of H.265 Sample Adaptive Offset Estimation for 8k Ultra-HD TV Encoding

Abstract:
Sample adaptive offset (SAO) is a newly introduced in-loop filtering component in H.265/High Efficiency Video Coding (HEVC). While SAO contributes to a notable coding efficiency improvement, the estimation of SAO parameters dominates the complexity of in-loop filtering in HEVC encoding. This paper presents an efficient VLSI design for SAO estimation. Our design features a dual-clock architecture that processes statistics collection (SC) and parameter decision (PD), the two main functional blocks of SAO estimation, at high- and low-speed clocks, respectively. Such a strategy reduces the overall area by 56% by addressing the heterogeneous data flows of SC and PD. To further improve the area and power efficiency, algorithm-architecture co-optimizations are applied, including a coarse range selection (CRS) and an accumulator bit width reduction (ABR). CRS shrinks the range of finely processed bands for the band offset estimation. ABR further reduces the area by narrowing the accumulators of SC. Together they achieve another 25% area reduction. The proposed VLSI design is capable of 8k encoding at 120 frames/s. It occupies 51k logic gates, only one-third of the circuit area of the state-of-the-art implementations.


Energy-Efficient VLSI Realization of Binary64 Division With Redundant Number Systems

Abstract:
VLSI realizations of digit-recurrence binary division usually use redundant representation of partial remainders and quotient digits. The former allows for fast carry-free computation of the next partial remainder, and the latter leads to fewer required divisor multiples. In studying the previous relevant works, we have noted that the binary carry-save (CS) number system is prevalent in the representation of partial remainders, and redundant high radix representation of quotient digits is popular in order to reduce the cycle count. In this paper, we explore a design space containing four division architectures. These are based on binary CS or radix-16 signed digit (SD) representations of partial remainders, and they use either full or partial precomputation of divisor multiples. The latter uses a smaller multiplexer at the cost of two extra adders, where one of the operands is constant within all cycles. The quotient digits are represented by radix-16 [-9, 9] SDs. Our synthesis-based evaluation of VLSI realizations of the best previous relevant work and the four proposed designs shows reduced power and energy figures in the proposed designs at the cost of more silicon area and delay. However, our energy-delay product is 26%-35% less than that of the reference work.


Antiwear Leveling Design for SSDs With Hybrid ECC Capability

Abstract:
With the joint considerations of reliability and performance, hybrid error correction code (ECC) becomes an option in the designs of solid-state drives (SSDs). Unfortunately, wear leveling (WL) might cause early performance degradation in SSDs, which is common with a limited number of P/E cycles, due to the efforts to delay the bit-error-rate growth. In this paper, an anti-WL design is proposed to avoid such a performance problem so that the performance of SSDs with hybrid ECC capability can be improved without sacrificing their reliability. The capability of the proposed design was evaluated by a series of experiments, which showed that the proposed design could improve the read and write performance of SSDs by up to 50% without affecting the endurance of the investigated SSDs, compared with traditional approaches.


Low-Complexity Transformed Encoder Architectures for Quasi-Cyclic Nonbinary LDPC Codes Over Subfields

Abstract:
Quasi-cyclic low-density parity-check (QC-LDPC) codes are adopted in many digital communication and storage systems. The encoding of these codes is traditionally done by multiplying the message vector with a generator matrix consisting of dense circulant submatrices. To reduce the encoder complexity, this paper introduces two schemes making use of finite Fourier transform. We focus on QC-LDPC codes whose circulant submatrices are of dimension (2^r - 1) × (2^r - 1) and the entries are elements of GF(2^p), where p divides r, and hence, GF(2^p) is a subfield of GF(2^r). These cover a broad range of codes, and binary LDPC codes are a special case. Making use of conjugacy constraints, low-complexity architectures are developed for finite Fourier and inverse transforms over subfields in this paper. In addition, composite field arithmetic is exploited to eliminate the computations associated with message mapping and reduce the complexity of Fourier transform. For a (2016, 1074) nonbinary QC-LDPC code whose generator matrix consists of circulants of dimension 63 × 63 with GF(2^2) entries, the proposed encoders achieve 22% area reduction compared with the conventional encoders without sacrificing the throughput.


FPGA Realization of Low Register Systolic All-One-Polynomial Multipliers Over GF(2^m) and Their Applications in Trinomial Multipliers

Abstract:
Systolic all-one-polynomial (AOP) multipliers usually suffer from the problem of high register complexity, especially in field-programmable gate array (FPGA) platforms where the register resources are not abundant. In this paper, we have shown that the AOP-based systolic multipliers can easily achieve low register-complexity implementations and the proposed architectures can be employed as computation cores to derive efficient implementations of systolic Montgomery multipliers based on trinomials. First, we propose a novel data broadcasting scheme in which the register complexity involved within existing AOP-based systolic multipliers is significantly reduced. We have found that the modified AOP-based structure can be packed as a standard computation core. Next, we propose a novel Montgomery multiplication algorithm that can fully employ the proposed AOP-based computation core. The proposed Montgomery algorithm employs a novel precomputed-modular operation, and the systolic structures based on this algorithm fully inherit the advantages brought from the AOP-based core (low register complexity, low critical-path delay, and low latency) except for some marginal hardware overhead brought by a precomputation unit. The proposed architectures are then implemented with Xilinx ISE 14.1, and the results show that the proposed designs achieve at least 61.8% lower area-delay product and 47.6% lower power-delay product than the best of the competing designs.


Sign-Magnitude Encoding for Efficient VLSI Realization of Decimal Multiplication

Abstract:
Decimal X × Y multiplication is a complex operation, where intermediate partial products (IPPs) are commonly selected from a set of precomputed radix-10 X multiples. Some works require only [0, 5] × X via recoding digits of Y to one-hot representation of signed digits in [-5, 5]. This reduces the selection logic at the cost of one extra IPP. Two's complement signed-digit (TCSD) encoding is often used to represent IPPs, where dynamic negation (via one XOR per bit of X multiples) is required for the recoded digits of Y in [-5, -1]. In this paper, despite generating 17 IPPs for 16-digit operands, we manage to start the partial product reduction (PPR) with 16 IPPs, which enhances the VLSI regularity. Moreover, we save 75% of negating XORs by representing precomputed multiples with sign-magnitude signed-digit (SMSD) encoding. For the first-level PPR, we devise an efficient adder, with two SMSD input numbers, whose sum is represented with TCSD encoding. Thereafter, multilevel TCSD 2:1 reduction leads to two TCSD accumulated partial products, which collectively undergo a special early initiated conversion scheme to arrive at the final binary-coded decimal product. As such, a VLSI implementation of a 16 × 16-digit parallel decimal multiplier is synthesized, where evaluations show some performance improvement over previous relevant designs.


Hybrid LUT/Multiplexer FPGA Logic Architectures

Abstract:
Hybrid configurable logic block architectures for field-programmable gate arrays that contain a mixture of lookup tables and hardened multiplexers are evaluated toward the goal of higher logic density and area reduction. Multiple hybrid configurable logic block architectures, both nonfracturable and fracturable, with varying MUX:LUT logic element ratios are evaluated across two benchmark suites (VTR and CHStone) using a custom tool flow consisting of LegUp-HLS, Odin-II front-end synthesis, ABC logic synthesis and technology mapping, and VPR for packing, placement, routing, and architecture exploration. Technology mapping optimizations that target the proposed architectures are also implemented within ABC. Experimentally, we show that for nonfracturable architectures, without any mapper optimizations, we naturally save up to ~8% area post-place-and-route, accounting for both complex logic block and routing area, while maintaining mapping depth. With architecture-aware technology mapper optimizations in ABC, additional area is saved post-place-and-route. For fracturable architectures, experiments show that only marginal gains of up to ~2% are seen after place-and-route. For both nonfracturable and fracturable architectures, we see minimal impact on timing performance for the architectures with the best area efficiency.


Low-Complexity Digit-Serial Multiplier Over GF(2^m) Based on Efficient Toeplitz Block Toeplitz Matrix–Vector Product Decomposition

Abstract:
In this paper, we have shown that a regular Toeplitz matrix-vector product (TMVP) can be transformed into a Toeplitz block TMVP (TBTMVP) using a suitable permutation matrix. Based on the TBTMVP representation, we have proposed a new (a,b)-way TBTMVP decomposition algorithm for implementing a digit-serial multiplication. Moreover, it is shown that, based on iterative block recombination, we can improve the space complexity of the proposed TBTMVP decomposition. From the synthesis results, we have shown that the proposed TBTMVP-based multiplier involves less area, less area-delay product, and higher throughput compared with the existing digit-serial multipliers.


Efficient Soft Cancelation Decoder Architectures for Polar Codes

Abstract:
The flooding belief propagation (FO-BP) and the soft-cancelation (SCAN) algorithms are the two most popular soft-output BP algorithms for the decoding of capacity-achieving polar codes. The FO-BP algorithm has high throughput at the cost of performance degradation in high signal-to-noise ratio (SNR) region or with large block length. The SCAN algorithm has much better decoding performance while suffering from long decoding latency and low throughput. In this paper, an improved BP algorithm, named reduced complexity soft-cancelation (RCSC) algorithm, is proposed. Compared with the SCAN algorithm, the number of memory entries required by the RCSC algorithm is reduced by more than 50% in general, while achieving comparable or even better (e.g., when block size N = 2^15) decoding performance. When block size is large (e.g., N ≥ 2^15), the proposed RCSC algorithm reduces the required memory entries by more than 23% compared with the state-of-the-art FO-BP algorithm. The numerical results show that the error performance improvement of the RCSC algorithm is more significant when the SNR increases. For a different tradeoff, a reduced latency soft-cancelation (RLSC) algorithm is proposed to reduce the decoding latency and increase the throughput of the RCSC algorithm while slightly sacrificing decoding performance. Finally, the optimized VLSI architectures are presented for the RCSC and RLSC algorithms, respectively. The synthesis results demonstrate the efficiency of the proposed algorithms and architectures.


Hybrid Hardware/Software Floating-Point Implementations for Optimized Area and Throughput Tradeoffs

Abstract:
Hybrid floating-point (FP) implementations improve software FP performance without incurring the area overhead of full hardware FP units. The proposed implementations are synthesized in 65-nm CMOS and integrated into small fixed-point processors with a RISC-like architecture. Unsigned, shift carry, and leading zero detection (USL) support is added to a processor to augment an existing instruction set architecture and increase FP throughput with little area overhead. The hybrid implementations with USL support increase software FP throughput per core by 2.18× for addition/subtraction, 1.29× for multiplication, 3.07-4.05× for division, and 3.11-3.81× for square root, and use 90.7-94.6% less area than dedicated fused multiply-add (FMA) hardware. Hybrid implementations with custom FP-specific hardware increase throughput per core over a fixed-point software kernel by 3.69-7.28× for addition/subtraction, 1.22-2.03× for multiplication, 14.4× for division, and 31.9× for square root, and use 77.3-97.0% less area than dedicated FMA hardware. The circuit area and throughput are found for 38 multiply-add, 8 addition/subtraction, 6 multiplication, 45 division, and 45 square root designs. Thirty-three multiply-add implementations are presented, which improve throughput per core versus a fixed-point software implementation by 1.11-15.9× and use 38.2-95.3% less area than dedicated FMA hardware.


ENFIRE: A Spatio-Temporal Fine-Grained Reconfigurable Hardware

Abstract:
Field programmable gate arrays (FPGAs) are well-established as fine-grained reconfigurable computing platforms. However, FPGAs demonstrate poor scalability in advanced technology nodes due to the large negative impact of the elaborate programmable interconnects (PIs). The need for such vast PIs arises from two key factors: 1) fine-grained bit-level data manipulation in the configurable logic blocks and 2) the purely spatial computing model followed in the FPGAs. In this paper, we propose ENFIRE, a novel memory-based spatio-temporal framework designed to provide the flexibility of reconfigurable bit-level information processing while improving scalability and energy efficiency. Dense 2-D memory arrays serve as the main computing elements storing not only the data to be processed but also the functional behavior of the application mapped into lookup tables. Computing elements are spatially distributed, communicating as needed over a hierarchical bus interconnect, while the functions are evaluated temporally inside each computing element. A custom software framework facilitates application mapping to the framework. By leveraging both spatial and temporal computing, ENFIRE significantly reduces the interconnect overhead when compared with FPGA. Simulation results show an improvement of 7.6× in energy, 1.6× in energy efficiency, 1.1× in leakage, and 5.3× in unified energy efficiency, a metric that considers energy and area together, compared with comparable FPGA implementations.

