Abstract : This paper presents a fixed-point reconfigurable parallel VLSI hardware architecture for real-time Electrical Capacitance Tomography (ECT). It is modular and consists of a front-end module which performs precise capacitance measurements in a time multiplexed manner using Capacitance to Digital Converter (CDC) technique. Another FPGA module performs the inverse steps of the tomography algorithm. A dual port built-in memory banks store the sensitivity matrix, the actual value of the capacitances, and the actual image. A two dimensional (2D) core multi-processing elements (PE) engine intercommunicates with these memory banks via parallel buses. A Hardware-software co-design methodology was conducted using commercially available tools in order to concurrently tune the algorithms and hardware parameters. Hence, the hardware was designed down to the bit-level in order to reduce both the hardware cost and power consumption, while satisfying real-time constraint. Quantization errors were assessed against the image quality and bit-level simulations demonstrate the correctness of the design. Further simulations indicate that the proposed architecture achieves a speed-up of up to three orders of magnitude over the software version when the reconstruction algorithm runs on 2.53 GHz-based Pentium processor or DSP Ti’s Delphino TMS320F32837 processor. More specifically, a throughput of 17.241 Kframes/sec for both the Linear-Back Projection (LBP) and modified Landweber algorithms and 8.475 Kframes/sec for the Land weber algorithm with 200 iterations could be achieved. This performance was achieved using an array of   processing units. This satisfies the real-time constraint of many industrial applications. To the best of the authors’ knowledge, this is the first embedded system which explores the intrinsic parallelism which is available in modern FPGA for ECT tomography.
Abstract: While long polar codes can achieve the capacity of arbitrary binary-input discrete memoryless channels when decoded by a low complexity successive-cancellation (SC) algorithm, the error performance of the SC algorithm is inferior for polar codes with finite block lengths. The cyclic redundancy check (CRC)-aided SC list (SCL) decoding algorithm has better error performance than the SC algorithm. However, current CRC-aided SCL decoders still suffer from long decoding latency and limited throughput. In this paper, a reduced latency list decoding (RLLD) algorithm for polar codes is proposed. Our RLLD algorithm performs the list decoding on a binary tree, whose leaves correspond to the bits of a polar code. In existing SCL decoding algorithms, all the nodes in the tree are traversed, and all possibilities of the information bits are considered. Instead, our RLLD algorithm visits much fewer nodes in the tree and considers fewer possibilities of the information bits. When configured properly, our RLLD algorithm significantly reduces the decoding latency and, hence, improves throughput, while introducing little performance degradation. Based on our RLLD algorithm, we also propose a high throughput list decoder architecture, which is suitable for larger block lengths due to its scalable partial sum computation unit. Our decoder architecture has been implemented for different block lengths and list sizes using the TSMC 90-nm CMOS technology. The implementation results demonstrate that our decoders achieve significant latency reduction and area efficiency improvement compared with the other list polar decoders in the literature.
Abstract: A dynamic functional verification method that compares untimed simulations versus timed simulations for synthesizable [high-level synthesis (HLS)] behavioral descriptions (ANSI-C) is presented in this paper. This paper proposes a method that automatically inserts a set of probes into the untimed behavioral description. These probes record the status of internal signals of the behavioral description during an initial untimed simulation. These simulation results are subsequently used as golden outputs for the verification of the internal signals during a timed simulation once the behavioral description has been synthesized using HLS. Our proposed method reports any simulation mismatches and accurately pinpoints any discrepancies between the functional Software (SW) simulation and the timed simulation at the original behavioral description (source code). Our method does not only determine where to place the probes, but is also able to insert different type of probes based on the specified HLS synthesis options in order not to interfere with the HLS process, minimizing the total number of probes and the size of the data to be stored in the trace file in order to minimize the running time. Results show that our proposed method is very effective and extremely simple to use as it is fully automated.
Abstract: This paper presents the VLSI architecture and hardware implementation of a highly efficient De-blocking Filter for High Efficiency Video Coding (HEVC) systems. In order to reduce the number of data accesses and thus to enhance the timing efficiency, novel data structures and memory access schemes for image pixels are proposed. Furthermore, a novel edge-fetching order is presented to strike a balance between the processing throughput and complexity. Based on the proposed structure and access pattern, a six-stage pipelined, two-line De-blocking Filter engine with low-latency data access sequence is designed, aiming to achieve high processing throughput while at the same time maintaining low complexity. The detailed storage structure and data access scheme are illustrated and VLSI architecture for the De-blocking Filter engine is depicted in this paper. In addition, the proposed De-blocking Filter is implemented using TSMC 90nm standard cell library. Experimental results based on post-layout estimations show that the proposed design can achieve 60 frames per second for frame resolution of 4096×2048 pixels (Ultra HD resolution) assuming an operating frequency of 100MHz. Moreover, this design occupies area complexity of 466.5 kGE with power consumption of 26.26 mW. In comparison with prior arts targeting on similar system specification and throughput, the proposed design results in a significantly reduced area complexity.
Abstract: In this paper, we present a novel channelization architecture, which can simultaneously process two channels of complex input data and provide up to 1024 independent channels of complex output data. The proposed architecture is highly modular and generic, so that parameters of each output channel can be dynamically changed even at runtime in terms of the bandwidth, center frequency, output sampling rate, and so on. It consists of one tunable pipelined frequency transform (TPFT)-based coarse channelization block, one tuning unit, and one resampling filter. Based on the analysis of the data dependence between the subbands, a novel channel splitting scheme is proposed to enable multiple subbands to share the proposed TPFT block. The proposed Farrow-based resampling filter does not require division operation and dual-port RAMs resulting in significant area saving. Finally, we implement the proposed channelization architecture in a single field-programmable gate array. The experiment results indicate that our design provides the flexibility associated with the existing works, but with greater resource efficiency.
Abstract: A multilevel per cell (MLC) technique significantly improves the storage density, but also poses serious data integrity challenge for NAND flash memory. This consequently makes the low-density parity-check (LDPC) code and the soft-decision memory sensing become indispensable in the next-generation flash-based solid-state storage devices. However, the use of LDPC codes inevitably increases memory read latency and, hence, degrades speed performance. Motivated by the observation of intracell unbalanced bit error probability and data dependence in the MLC NAND flash memory, this paper proposes two techniques, i.e., intracell data placement interleaving and intracell data dependence aware LDPC decoding, to efficiently improve the LDPC decoding throughput and energy efficiency for the MLC NAND flash-based storage in a mobile device. Experimental results show that, by exploiting the intracell bit-error characteristics, the proposed techniques together can improve the LDPC decoding throughput by up to 84.6% and reduce the energy consumption by up to 33.2% while only incurring less than 0.2% silicon area overhead.
Abstract: Emulations of cellular nonlinear networks on digital reconfigurable hardware are renowned for an efficient computation of massive data, exceeding the accuracy and flexibility of full-custom designs. In this contribution, a digital implementation with polynomial coupling weight functions is proposed for the first time, establishing novel fields of application, e.g., in the medical signal processing and in the solution of partial differential equations. We present an architecture that is capable of processing large-scale networks with a high degree of parallelism, implemented on state-of-the-art field-programmable gate arrays.
Abstract: This paper presents a novel algorithm based on trellis min–max for decoding non-binary low-density parity check (NB-LDPC) codes. This decoder reduces the number of messages exchanged between check node and variable node processors, which decreases the storage resources and the wiring congestion and, thus, increases the throughput of the decoder. Our frame error rate performance simulations show that the proposed algorithm has a negligible performance loss for high rate codes with GF(16) and GF(32), and a performance loss smaller than 0.07 dB for high-rate codes over GF(64). In addition, a layered decoder architecture is presented and implemented on a 90-nm CMOS process for the following high-rate NB-LDPC codes: (2304, 2048) over GF(16), (837, 726) over GF(32), and (1536, 1344) over GF(64). In all cases, the achieved throughput is higher than 1 Gb/s.
Abstract: Although Latin square is a well-known algorithm to construct low-density parity-check (LDPC) codes for satisfying long code length, high code-rate, good correcting capability, and low error floor, it has a drawback of large sub matrix that the hardware implementation will be suffered from large barrel shifter and worse routing congestion in fitting NAND flash applications. In this paper, a top-down design methodology, which not only goes through code construction and optimization, but also hardware implementation to meet all the critical requirements, is presented. A two-step array dispersion algorithm is proposed to construct long LDPC codes with a small sub matrix size. Then, the constructed LDPC code is optimized by masking matrix to obtain better bit-error rate (BER) performance and lower error floor. In addition, our LDPC codes have a diagonal-like structure in the parity-check matrix leading to a proposed hybrid storage architecture, which has the advantages of better area efficiency and large enough data bandwidth for high decoding throughput. To be adopted for NAND flash applications, an (18 900, 17 010) LDPC code with a code-rate of 0.9 and sub matrix size of 63 is constructed and the field-programmable gate array simulations show that the error floor is successfully suppressed down to BER of 10 −12. An LDPC decoder using normalized min-sum variable-node-centric sequential scheduling decoding algorithm is implemented in UMC 90-nm CMOS process. The post layout result shows that the proposed LDPC decoder can achieve a throughput of 1.58 Gb/s at six iterations with a gate count of520k under a clock frequency of166.6 MHz. It meets the throughput requirement of both NAND flash memories with Toggle double data rate 1.0 and open NAND flash interface 2.3NANDinterfaces.
Abstract: This paper presents an algorithm and a VLSI architecture of a configurable joint detection and decoding (CJDD) scheme for multi-input multi-output (MIMO) wireless communication systems with convolutional codes. A novel tree-enumeration strategy is proposed such that the MIMO detection and decoding of convolutional codes can be conducted in single stage using a tree-searching engine. Moreover, this design can be configured to support different combinations of quadrature amplitude modulation (QAM) schemes as well as encoder code rates, and thus can be more practically deployed to real-world MIMO wireless systems. A formal outline of the proposed algorithm will be given and simulation results for 16-QAM and 64-QAM with rate-1/2 and rate-1/3 codes will be presented showing that, compared with the conventional separate scheme, the CJDD algorithm can greatly improve bit error rate (BER) performance with different system settings. In addition, the VLSI architecture and implementation of the CJDD approach will be illustrated. The architectures and circuits are designed to support configurability and flexibility while maintaining high efficiency and low complexity. The post layout experimental results for 16-QAM and 64-QAM with rate-1/2 and rate-1/3 codes show that, compared with the previous configurable design, this architecture can achieve reduced or comparable complexity with improved BER performance.
Abstract: Engineers must consider performance, power consumption, and cost when designing embedded digital systems; furthermore, memory is a key factor in such systems. Code compression is a technique used in embedded systems to reduce the memory usage. Bit Mask-based code compression is a modified version of dictionary-based code compression. In this paper, we applied a small separated dictionary, and variable mask numbers were used with the Bit Mask algorithm to reduce the codeword length of high frequency instructions. In addition, a novel dictionary selection algorithm was proposed to increase the instruction match rates. The fully separated dictionary method was used to improve the performance of the decompression engine without affecting the compression ratio (CR) (the compressed code size divided by original code size). Based on the experimental results, the proposed method can achieve a 7.5% improvement in the CR with nearly no hardware overhead.
Abstract: In this paper, an exportable application-specific instruction-set elliptic curve cryptography processor based on redundant signed digit representation is proposed. The processor employs extensive pipelining techniques for Karatsuba–Ofman method to achieve high throughput multiplication. The proposed processor performs singlepoint multiplication employing points in affine coordinates in 2.26 ms and runs at a maximum frequency of 160 MHz in Xilinx Virtex 5 (XC5VLX110T) field-programmable gate array.
Abstract: This paper presents the design of a fully integrated electrocardiogram (ECG) signal processor (ESP) for the prediction of ventricular arrhythmia using a unique set of ECG features and a naive Bayes classifier. Real-time and adaptive techniques for the detection and the delineation of the P-QRS-T waves were investigated to extract the fiducial points. Those techniques are robust to any variations in the ECG signal with high sensitivity and precision. Two databases of the heart signal recordings from the MIT PhysioNet and the American Heart Association were used as a validation set to evaluate the performance of the processor. Based on application-specified integrated circuit (ASIC) simulation results, the overall classification accuracy was found to be 86% on the out-of-sample validation data with 3-s window size. The architecture of the proposed ESP was implemented using 65-nm CMOS process. It occupied 0.112-mm2 area and consumed 2.78-µW power at an operating frequency of 10 kHz and from an operating voltage of 1 V.
Abstract: In this paper, we propose a low-power technique, called RF power gating, which consists in varying the active time ratio (ATR) of the RF front end at a symbol time scale. This technique is especially well suited to adapt the power consumption of the receiver to the performance needs without changing its architecture. The effect of this technique on the bit error rate (BER) performances is studied for a basic estimator in the specific case of minimum-shift keying signaling. A system-level energy model is also derived and discussed to estimate precisely the power reduction based on the characteristics and the power consumption of each block. This model allows highlighting the different contributors of the power reduction. The BER results and the energy model are finally merged to determine the best ATR meeting the design constraints. Applying this technique to the IEEE 802.15.4 standard, this paper shows that an ATR of 20% is a good tradeoff to meet the packet error rate constraint while maximizing the energy reduction ratio. Using typical block power consumptions, an energy reduction ratio around 20% can be reached. Even better energy reduction ratios (∼60%) are also achievable when most of the blocks are power-gated.
Abstract: A high-speed, low-power, and highly reliable frequency multiplier is proposed for a delay-locked loop-based clock generator to generate a multiplied clock with a high frequency and wide frequency range. The proposed edge combiner achieves a high-speed and highly reliable operation using a hierarchical structure and an overlap canceller. In addition, by applying the logical effort to the pulse generator and multiplication-ratio control logic design, the proposed frequency multiplier minimizes the delay difference between positive- and negative-edge generation paths, which causes a deterministic jitter. Finally, a numerical analysis is performed to analyze and compare the performance of the proposed frequency multiplier with that of previous frequency multipliers. The proposed frequency multiplier is fabricated using a 0.13-µm CMOS process technology, and has the multiplication ratios of 1, 2, 4, 8, and 16, and an output range of 100 MHz–3.3 GHz. The frequency multiplier achieves a power consumption to a frequency ratio of 2.9 µW/MHz.
Abstract: Ring oscillators (ROs) are popular due to their small area, modest power, wide tuning range, and ease of scaling with process technology. However, their use in many applications is limited due to poor phase noise and jitter performance. Thermal noise and flicker noise contribute jitter that decreases inversely with oscillation frequency. This paper describes a frequency boost technique to reduce jitter in ROs. We boost the internal oscillation frequency and introduce a frequency divider following the oscillator to maintain the desired output frequency. This approach offers reduced jitter as well as the opportunity to trade off output jitter with power for dynamic performance management. The oscillator has 32 operating modes, corresponding to different values for the ring size and frequency division. In a 0.5-µm CMOS process, the highest oscillation frequency achieved is 25 MHz with a root-mean-square period jitter of 54 ps and a power consumption of 817 µW at 5 V supply. A jitter model for current-starved oscillators was derived and verified by measurement; a direct relationship between oscillation frequency and jitter was derived and measured. Compared with other oscillators, this design achieves the highest performance in terms of jitter per unit interval and figure-of-merit. The performance is expected to improve in more advanced technologies. The results are summarized to offer design guidance based on the frequency boost technique.
Abstract: Design of a low-energy power-ON reset (POR) circuit is proposed to reduce the energy consumed by the stable supply of the dual supply static random access memory (SRAM), as the other supply is ramping up. The proposed POR circuit, when embedded inside dual supply SRAM, removes its ramp-up constraints related to voltage sequencing and pin states. The circuit consumes negligible energy during ramp-up, does not consume dynamic power during operations, and includes hysteresis to improve noise immunity against voltage fluctuations on the power supply. The POR circuit, designed in the 40-nm CMOS technology within 10.6-µm2 area, enabled 27× reduction in the energy consumed by the SRAM array supply during periphery power-up in typical conditions.
Abstract: Emerging nonvolatile memories (NVMs), such as MRAM, PRAM, and RRAM, have been widely investigated to replace SRAM as the configuration bits in field-programmable gate arrays (FPGAs) for high security and instant power ON. However, the variations inherent in NVMs and advanced logic process bring reliability issue to FPGAs. This brief introduces a low-power variation-tolerant nonvolatile lookup table (nvLUT) circuit to overcome the reliability issue. Because of large ROFF/RON , 1T1R RRAM cell provides sufficient sense margin as a configuration bit and a reference resistor. A single-stage sense amplifier with voltage clamp is employed to reduce the power and area without impairing the reliability. Matched reference path is proposed to reduce the parasitic RC mismatch for reliable sensing. Evaluation shows that 22% reduction in delay, 38% reduction in power, and the tolerance of variations of 2.5× typical RON or ROFF in reliability are achieved for proposed nvLUT with six inputs.
Abstract: Advanced computing systems embed spintronic devices to improve the leakage performance of conventional CMOS systems. High speed, low power, and infinite endurance are important properties of magnetic tunnel junction (MTJ), a spintronic device, which assures its use in memories and logic circuits. This paper presents a PentaMTJ-based logic gate, which provides easy cascading, self-referencing, less voltage headroom problem in precharge sense amplifier and low area overhead contrary to existing MTJ-based gates. PentaMTJ is used here because it provides guaranteed disturbance free reading and increased tolerance to process variations along with compatibility with CMOS process. The logic gate is validated by simulation at the 45-nm technology node using a VerilogA model of the PentaMTJ.
Abstract: A duty-cycle correction technique using a novel pulse width modification cell is demonstrated across a frequency range of 100 MHz–3.5 GHz. The technique works at frequencies where most digital techniques implemented in the same technology node fail. An alternative method of making time domain measurements such as duty cycle and rise/fall times from the frequency domain data is introduced. The data are obtained from the equipment that has significantly lower bandwidth than required for measurements in the time domain. An algorithm for the same has been developed and experimentally verified. The correction circuit is implemented in a 0.13-µm CMOS technology and occupies an area of 0.011 mm2. It corrects to a residual error of less than 1%. The extent of correction is limited by the technology at higher frequencies.
Abstract: The previously proposed average-8T static random access memory (SRAM) has a competitive area and does not require a write-back scheme. In the case of average-8T SRAM architecture, a full-swing local bit line (BL) that is connected to the gate of the read buffer can be achieved with a boosted word line (WL) voltage. However, in the case of an average-8T SRAM based on an advanced technology, such as a 22-nm FinFET technology, where the variation in threshold voltage is large, the boosted WL voltage cannot be used, because it degrades the read stability of the SRAM. Thus, a full-swing local BL cannot be achieved, and the gate of the read buffer cannot be driven by the full supply voltage (VDD), resulting in a considerably large read delay. To overcome the above disadvantage, in this paper, a differential SRAM architecture with a full-swing local BL is proposed. In the proposed SRAM architecture, full swing of the local BL is ensured by the use of cross-coupled pMOSs, and the gate of the read buffer is driven by a full VDD, without the need for the boosted WL voltage. Various configurations of the proposed SRAM architecture, which stores multiple bits, are analyzed in terms of the minimum operating voltage and area per bit. The proposed SRAM that stores four bits in one block can achieve a minimum voltage of 0.42 V and a read delay that is 62.6 times lesser than that of the average-8T SRAM based on the 22-nm FinFET technology.
Abstract: This paper presents a robust energy/area-efficient receiver fabricated in a 28-nm CMOS process. The receiver consists of eight data lanes plus one forwarded-clock lane supporting the hypertransport standard for high-density chip-to-chip links. The proposed all-digital clock and data recovery (ADCDR) circuit, which is well suited for today’s CMOS process scaling, enables the receiver to achieve low power and area consumption. The ADCDR can enter into open loop after lock-in to save power and avoid clock dithering phenomenon. Moreover, to compensate the open loop, a phase tracking procedure is proposed to enable the ADCDR to track the phase drift due to the voltage and temperature variations. Furthermore, the all-digital delay-locked loop circuit integrated in the ADCDR can generate accurate multiphase clocks with the proposed calibrated locking algorithm in the presence of process variations. The precise multiphase clocks are essential for the half-rate sampling and Alexander-type phase detecting. Measurement results show that the receiver can operate at a data rate of 6.4 Gbits/s with a bit error rate.
Abstract: In this paper, a new design procedure has been proposed for realization of logarithmic function via three phases: 1) differentiation; 2) division; and 3) integration for any arbitrary analog signal. All the basic building blocks, i.e., differentiator, divider, and integrator, are realized by operational transconductance amplifier, a current mode device. Realization of exponential, power law and hyperbolic function as the design examples claims that the proposed synthesis procedure has the potential to design a log-based nonlinear system in a systematic and hierarchical manner. The performance of all the proposed circuits has been verified with SPICE simulation.
Abstract: A novel 8-transistor (8T) static random access memory cell with improved data stability in subthreshold operation is designed. The proposed single-ended with dynamic feedback control 8T static RAM (SRAM) cell enhances the static noise margin (SNM) for ultralow power supply. It achieves write SNM of 1.4× and 1.28× as that of isoarea 6T and read-decoupled 8T (RD-8T), respectively, at 300 mV. The standard deviation of write SNM for 8T cell is reduced to 0.4× and 0.56× as that for 6T and RD-8T, respectively. It also possesses another striking feature of high read SNM ∼2.33×, 1.23×, and 0.89× as that of 5T, 6T, and RD-8T, respectively. The cell has hold SNM of 1.43×, 1.23×, and 1.05× as that of 5T, 6T, and RD-8T, respectively. The write time is 71% lesser than that of single-ended asymmetrical 8T cell. The proposed 8T consumes less write power 0.72×, 0.6×, and 0.85× as that of 5T, 6T, and isoarea RD-8T, respectively. The read power is 0.49× of 5T, 0.48× of 6T, and 0.64× of RD-8T. The power/energy consumption of 1-kb 8T SRAM array during read and write operations is 0.43× and 0.34×, respectively, of 1-kb 6T array. These features enable ultralow power applications of 8T.
Abstract: This brief proposes an on-line transparent test technique for detection of latent hard faults which develop in first-input first-output buffers of routers during field operation of NoC. The technique involves repeating tests periodically to prevent accumulation of faults. A prototype implementation of the proposed test algorithm has been integrated into the router-channel interface and on-line test has been performed with synthetic self-similar data traffic. The performance of the NoC after addition of the test circuit has been investigated in terms of throughput while the area overhead has been studied by synthesizing the test hardware. In addition, an on-line test technique for the routing logic has been proposed which considers utilizing the header flits of the data traffic movement in transporting the test patterns.
Abstract: This paper presents an automatic speech–speaker recognition (ASSR) system implemented in a chip which includes a built-in extraction, recognition, and training (ERT) core. For VLSI design (here, ASSR system), the hardware cost and time complexity are always the important issues which are improved in this proposed design in two levels: 1) algorithmic and 2) architecture. At the algorithm level, a newly binary-halved clustering (BHC) is proposed to achieve low time complexity and low memory requirement. In addition, at the architecture level, a new ERT core is proposed and implemented based on data dependence and reuse mechanism to reduce the time and hardware cost as well. Finally, the chip implementation is synthesized, placed, and routed using TSMC 90-nm technology library. To verify the performance of the proposed BHC method, a case study is performed based on nine speakers. Moreover, the validation of the ASSR system is examined in two parts: 1) speech recognition and 2) speaker recognition. The results show that the proposed system can achieve 93.38% and 87.56% of recognition rates during speech and speaker recognition, respectively. Furthermore, the proposed ASSR chip includes 396k gate counts, and consumes power in 8.74 mW. Such results demonstrate that the performance of the proposed ASSR system is superior to the conventional systems.
Abstract: Integral histogram image can accelerate the computing process of feature algorithm in computer vision, but exhibits high computation complexity and inefficient memory access. In this paper, we propose a configurable parallel architecture to improve the computing efficiency of integral histogram. Based on the configurable design in the architecture, multiple integral objects for integral histogram image, such as image intensity, image gradient, and local binary pattern, are well supported. Meanwhile, by means of the proposed strip-based memory partitioning mechanism, this architecture processes the integral histogram quickly with maximal parallelism in a pipeline manner. Besides, in this architecture, the proposed data correlation memory compression mechanism effectively solves the expansion problem of integral histogram memory caused by storing the histogram data. It fully reduces the data redundancy in the integral histograms, and saves a lot of memory resources. Experiments using Cyclone IV-based field-programmable gate array platform and 65-nm technology-based post synthesis show that our architecture improves the average computing speed by 8.6 times with high power efficiency compared with the state-of-the-art works.
Abstract: The field of approximate computing has received significant attention from the research community in the past few years, especially in the context of various signal processing applications. Image and video compression algorithms, such as JPEG, MPEG, and so on, are particularly attractive candidates for approximate computing, since they are tolerant of computing imprecision due to human imperceptibility, which can be exploited to realize highly power-efficient implementations of these algorithms. However, existing approximate architectures typically fix the level of hardware approximation statically and are not adaptive to input data. For example, if a fixed approximate hardware configuration is used for an MPEG encoder (i.e., a fixed level of approximation), the output quality varies greatly for different input videos. This paper addresses this issue by proposing a reconfigurable approximate architecture for MPEG encoders that optimizes power consumption with the goal of maintaining a particular Peak Signal-to-Noise Ratio (PSNR) threshold for any video. We propose two heuristics for automatically tuning the approximation degree of the RABs in these two modules during runtime based on the characteristics of each individual video. Experimental results show that our approach of dynamically adjusting the degree of hardware approximation based on the input video respects the given quality bound (PSNR degradation of 1%–10%) across different videos while achieving a power saving up to 38% over a conventional nonapproximated MPEG encoder architecture. Note that although the proposed reconfigurable approximate architecture is presented for the specific case of an MPEG encoder, it can be easily extended to other DSP applications.
Abstract: Nowadays, many applications require simultaneous computation of multiple independent fast Fourier transform (FFT) operations with their outputs in natural order. Therefore, this brief presents a novel pipelined FFT processor for the FFT computation of two independent data streams. The proposed architecture is based on the multipath delay commutator FFT architecture. It has an N/2-point decimation in time FFT and an N/2-point decimation in frequency FFT to process the odd and even samples of two data streams separately. The main feature of the architecture is that the bit reversal operation is performed by the architecture itself, so the outputs are generated in normal order without any dedicated bit reversal circuit. The bit reversal operation is performed by the shift registers in the FFT architecture by interleaving the data. Therefore, the proposed architecture requires a lower number of registers and has high throughput.
Abstract: In many digital signal processing applications, some parts of a word stored in the embedded static random access memories (SRAMs) are more important than other parts of the word. Due to the differences in importance, memory failures that occur in more important bit locations generally give rise to relatively larger system performance degradation than those in less important locations. This brief presents a low-complexity unequal-error-protection error correcting code (UEEP-ECC) approach for the embedded memories in digital signal processor. In the proposed UEEP-ECC, repetition code is combined with the Bose–Chaudhuri–Hocquenghem code to selectively provide stronger error correction capabilities on more important data portions without a large hardware overhead. An efficient UEEP-ECC generation algorithm that can find the UEEP-ECC code with a minimum power of memory core and ECC logics is also presented. The experimental results show that the UEEP-ECC scheme achieves considerable power savings and data quality improvements in both of the H.264 and fast Fourier transform applications.
Abstract: Soft errors pose a reliability threat to modern electronic circuits. This makes protection against soft errors a requirement for many applications. Communications and signal processing systems are no exceptions to this trend. For some applications, an interesting option is to use algorithmic-based fault tolerance (ABFT) techniques that try to exploit the algorithmic properties to detect and correct errors. Signal processing and communication applications are well suited for ABFT. One example is fast Fourier transforms (FFTs) that are a key building block in many systems. Several protection schemes have been proposed to detect and correct errors in FFTs. Among those, probably the use of the Parseval or sum of squares check is the most widely known. In modern communication systems, it is increasingly common to find several blocks operating in parallel. Recently, a technique that exploits this fact to implement fault tolerance on parallel filters has been proposed. In this brief, this technique is first applied to protect FFTs. Then, two improved protection schemes that combine the use of error correction codes and Parseval checks are proposed and evaluated. The results show that the proposed schemes can further reduce the implementation cost of protection.
Abstract: Transpose form finite-impulse response (FIR) filters are inherently pipelined and support multiple constant multiplications (MCM) technique that results in significant saving of computation. However, transpose form configuration does not directly support the block processing unlike direct form configuration. In this paper, we explore the possibility of realization of block FIR filter in transpose form configuration for area-delay efficient realization of large order FIR filters for both fixed and reconfigurable applications. Based on a detailed computational analysis of transpose form configuration of FIR filter, we have derived a flow graph for transpose form block FIR filter with optimized register complexity. A generalized block formulation is presented for transpose form FIR filter. We have derived a general multiplier-based architecture for the proposed transpose form block filter for reconfigurable applications. A low-complexity design using the MCM scheme is also presented for the block implementation of fixed FIR filters. The proposed structure involves significantly less area delay product (ADP) and less energy per sample (EPS) than the existing block implementation of direct-form structure for medium or large filter lengths, while for the short-length filters, the block implementation of direct-form FIR structure has less ADP and less EPS than the proposed structure. Application specific integrated circuit synthesis result shows that the proposed structure for block size 4 and filter length 64 involves 42% less ADP and 40% less EPS than the best available FIR filter structure proposed for reconfigurable applications. For the same filter length and the same block size, the proposed structure involves 13% less ADP and 12.8% less EPS than that of the existing direct-form block FIR structure.
Abstract: Hardware acceleration has been proved an extremely promising implementation strategy for the digital signal processing (DSP) domain. Rather than adopting a monolithic application-specific integrated circuit design approach, in this brief, we present a novel accelerator architecture comprising flexible computational units that support the execution of a large set of operation templates found in DSP kernels. We differentiate from previous works on flexible accelerators by enabling computations to be aggressively performed with carry-save (CS) formatted data. Advanced arithmetic design concepts, i.e., recoding techniques, are utilized enabling CS optimizations to be performed in a larger scope than in previous approaches. Extensive experimental evaluations show that the proposed accelerator architecture delivers average gains of up to 61.91% in area-delay product and 54.43% in energy consumption compared with the state-of-art flexible data paths.
Abstract: Transistor network optimization represents an effective way of improving VLSI circuits. This paper proposes a novel method to automatically generate networks with minimal transistor count, starting from an irredundant sum-of-products expression as the input. The method is able to deliver both series–parallel (SP) and non-SP switch arrangements, improving speed, power dissipation, and area of CMOS gates. Experimental results demonstrate expected gains in comparison with related approaches.
Abstract: In this paper, we analyze the contents of lookup tables (LUTs) of distributed arithmetic (DA)-based block least mean square (BLMS) adaptive filter (ADF) and based on that we propose intra-iteration LUT sharing to reduce its hardware resources, energy consumption, and iteration period. The proposed LUT optimization scheme offers a saving of 60% LUT content for block size 8 and still higher saving for larger block sizes over the conventional design approach. It offers a saving of 60% LUT-update per output and 59% LUT access per output over the recently proposed DA-based BLMS ADF structure for block size 8 and filter length 64. Besides, the proposed structure involves nearly 30% saving in the iteration period over the other for 16-bit coefficient word length. Application specific integrated circuit (ASIC) synthesis result shows that the proposed structure for block size 8 offers a saving of 48% area-delay product (ADP) and 53% energy per sample (EPS) over the existing DA-based BLMS ADF structure on average for different filter lengths, and offers 30% higher sampling rate due to its shorter iteration period. Compared with the existing DA-based LMS ADF structure, the proposed structure involves 68% less ADP and 1.6×less EPS.
Abstract: This paper proposes an efficient pipelined architecture of elliptic curve scalar multiplication (ECSM) over GF(2m). The architecture uses a bit-parallel finite field (FF) multiplier accumulator (MAC) based on the Karatsuba–Ofman algorithm. The Montgomery ladder algorithm is modified for better sharing of execution paths. The data path in the architecture is well designed, so that the critical path contains few extra logic primitives apart from the FF MAC. In order to find the optimal number of pipeline stages, scheduling schemes with different pipeline stages are proposed and the ideal placement of pipeline registers is thoroughly analyzed. We implement ECSM over the five binary fields recommended by the National Institute of Standard and Technology on Xilinx Virtex-4 and Virtex-5 field-programmable gate arrays. The three-stage pipelined architecture is shown to have the best performance, which achieves a scalar multiplication over GF(2163) in 6.1µs using 7354 Slices on Virtex-4. Using Virtex-5, the scalar multiplication form=163, 233, 283, 409, and 571 can be achieved in 4.6, 7.9, 10.9, 19.4, and 36.5 µs, respectively, which are faster than previous results.
Abstract: Timing-error-detection (TED)-based systems have been shown to reduce power consumption or increase yield due to reduced margins. This paper shows that the increased adaptability can be a great advantage in the system design in addition to the well-known mitigated susceptibility to ambient and internal variations. Specifically, the design tolerances of the power management are relaxed to enable even greater system-level energy savings than what can be achieved in the logic alone. In addition, the system is simultaneously able to operate near the minimum error point. Here, the power management is a simplified dc–dc converter and the TED is based on time borrowing. The target application is a single-chip system on chip without external discrete components; thus, switched capacitors are used for the dc–dc. The system achieves 7.9% energy reduction at the minimum energy point simultaneously with a 36.4% energy–delay product decrease and a 15% increase in dc–dc efficiency. In addition, the effect of local variations on average system performance is reduced by 12%.
Abstract: Hybrid configurable logic block architectures for field-programmable gate arrays that contain a mixture of lookup tables and hardened multiplexers are evaluated toward the goal of higher logic density and area reduction. Multiple hybrid configurable logic block architectures, both nonfracturable and fracturable with varying MUX:LUT logic element ratios are evaluated across two benchmark suites (VTR and CHStone) using a custom tool flow consisting of LegUp-HLS, Odin-II front-end synthesis, ABC logic synthesis and technology mapping, and VPR for packing, placement, routing, and architecture exploration. Technology mapping optimizations that target the proposed architectures are also implemented within ABC. Experimentally, we show that for nonfracturable architectures, without any mapper optimizations, we naturally save up to∼8% area postplace and route; both accounting for complex logic block and routing area while maintaining mapping depth. For fracturable architectures, experiments show that only marginal gains are seen after place-and-route up to∼2%. For both nonfracturable and fracturable architectures, we see minimal impact on timing performance for the architectures with best area-efficiency.
Abstract: In this paper, we design a hardware and energy-efficient stochastic lower–upper decomposition (LUD) scheme for multiple-input multiple-output receivers. By employing stochastic computation, the complex arithmetic operations in LUD can be performed with simple logic gates. With proposed dual partition computation method, the stochastic multiplier and divider exhibit high computation accuracy with relative short length stochastic stream. We have designed and synthesized the stochastic LUD with CMOS 130-nm technology. According to the post layout report, the hardware efficiency of the stochastic LUDisashighas1.5×compared with the exiting LUD methods, and the energy efficiency is also higher than the state-of-the-art LUD when the matrix dimension is 8×8andlarger.