1.

A compact FPGA-based processor for the Secure Hash Algorithm SHA-256

Rommel García Ignacio Algredo-Badillo Miguel Morales-Sandoval Claudia Feregrino-Uribe René Cumplido 《Computers & Electrical Engineering》2014

This work reports an efficient and compact FPGA processor for the SHA-256 algorithm. The novel processor architecture is based on a custom datapath that exploits module reuse, with a previously unreported 4-input Arithmetic-Logic Unit as its main component. This ALU was designed by studying the types of operations in the SHA algorithm, their execution sequence and the associated dataflow. The processor hardware architecture was modeled in VHDL and implemented in FPGAs. Results from the implementation in a Virtex5 device show that the proposed design uses fewer resources while achieving higher performance and efficiency than previous compact designs in the literature, saving around 60% of FPGA slices with increased throughput (Mbps) and efficiency (Mbps/Slice). The proposed SHA processor is well suited to applications such as Wi-Fi, TMP (Trusted Mobile Platform), and MTM (Mobile Trusted Module), where the data transfer speed is around 50 Mbps.
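To make "the type of operations in the SHA algorithm" concrete, the following is a minimal software sketch of SHA-256 as standardized, not the paper's hardware design. The helper names (`rotr`, `icbrt`, `first_primes`, `compress`, `sha256`) are illustrative choices. The round constants are derived from the fractional parts of square and cube roots of the first primes, using integer arithmetic so no floating-point rounding is involved.

```python
import hashlib
from math import isqrt

MASK = 0xFFFFFFFF  # all arithmetic is modulo 2**32


def rotr(x, n):
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK


def icbrt(n):
    """Exact integer cube root (floor) by binary search."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


def first_primes(count):
    primes, c = [], 2
    while len(primes) < count:
        if all(c % p for p in primes):
            primes.append(c)
        c += 1
    return primes


PRIMES = first_primes(64)
# Initial hash values: fractional bits of sqrt(p) for the first 8 primes.
H0 = [isqrt(p << 64) & MASK for p in PRIMES[:8]]
# Round constants: fractional bits of cbrt(p) for the first 64 primes.
K = [icbrt(p << 96) & MASK for p in PRIMES]


def compress(state, block):
    """Process one 64-byte block and return the updated 8-word state."""
    w = [int.from_bytes(block[i:i + 4], "big") for i in range(0, 64, 4)]
    for t in range(16, 64):
        s0 = rotr(w[t - 15], 7) ^ rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
        s1 = rotr(w[t - 2], 17) ^ rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
        # The message schedule adds four 32-bit operands.
        w.append((w[t - 16] + s0 + w[t - 7] + s1) & MASK)
    a, b, c, d, e, f, g, h = state
    for t in range(64):
        big_s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        t1 = (h + big_s1 + ch + K[t] + w[t]) & MASK
        big_s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t2 = (big_s0 + maj) & MASK
        h, g, f, e, d, c, b, a = g, f, e, (d + t1) & MASK, c, b, a, (t1 + t2) & MASK
    return [(x + y) & MASK for x, y in zip(state, (a, b, c, d, e, f, g, h))]


def sha256(msg: bytes) -> str:
    """Pad the message, run the compression function, return a hex digest."""
    padded = msg + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)
    padded += (8 * len(msg)).to_bytes(8, "big")
    state = H0
    for i in range(0, len(padded), 64):
        state = compress(state, padded[i:i + 64])
    return "".join(f"{x:08x}" for x in state)
```

The algorithm uses only rotations, shifts, bitwise logic and modular additions of several operands. How the paper's 4-input ALU maps onto these operations is not stated in the abstract, so no such mapping is implied here.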

2.

Job scheduling is more important than processor allocation for hypercube computers

Krueger P. Ten-Hwang Lai Dixit-Radiya V.A. 《Parallel and Distributed Systems, IEEE Transactions on》1994,5(5):488-497

Managing computing resources in a hypercube entails two steps. First, a job must be chosen to execute from among those waiting (job scheduling). Next a particular subcube within the hypercube must be allocated to that job (processor allocation). Whereas processor allocation has been well studied, job scheduling has been largely neglected. The goal of this paper is to compare the roles of processor allocation and job scheduling in achieving good performance on hypercube computers. We show that job scheduling has far more impact on performance than does processor allocation. We propose a new family of scheduling disciplines, called Scan, that have particular performance advantages. We show that performance problems that cannot be resolved through careful processor allocation can be solved by using Scan job-scheduling disciplines. Although the Scan disciplines carry far less overhead than is incurred by even the simplest processor allocation strategies, they are far more able to improve performance than even the most sophisticated strategies. Furthermore, when Scan disciplines are used, the abilities of sophisticated processor allocation strategies to further improve performance are limited to negligible levels. Consequently, a simple O(n) allocation strategy can be used in place of these complex strategies.

3.

Exploiting Narrow Values for Soft Error Tolerance

Oguz Ergin Osman Unsal Xavier Vera Antonio Gonzalez 《Computer Architecture Letters》2006,5(2):12-12

Soft errors are an important challenge in contemporary microprocessors. Particle hits on the components of a processor are expected to create an increasing number of transient errors with each new microprocessor generation. In this paper we propose simple mechanisms that effectively reduce the vulnerability to soft errors in a processor. Our designs are generally motivated by the fact that many of the produced and consumed values in the processors are narrow and their upper order bits are meaningless. Soft errors caused by any particle strike to these higher order bits can be avoided by simply identifying these narrow values. Alternatively, soft errors can be detected or corrected on the narrow values by replicating the vulnerable portion of the value inside the storage space provided for the upper order bits of these operands. We offer a variety of schemes that make use of narrow values and analyze their efficiency in reducing the soft error vulnerability of the level-1 data cache of the processor.
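The replication idea can be illustrated with a small software model. This is a sketch under stated assumptions, not the paper's scheme: a 32-bit storage slot, a value counted as "narrow" when it fits in 16 unsigned bits, and one extra flag bit per slot. The names `is_narrow`, `store` and `load` are hypothetical.

```python
HALF_BITS = 16                      # assumed width of a "narrow" value
HALF_MASK = (1 << HALF_BITS) - 1


def is_narrow(value):
    """Assumption: narrow means the upper 16 bits of the 32-bit value are zero."""
    return 0 <= value <= HALF_MASK


def store(value):
    """Return (word, narrow_flag) as the value would sit in a 32-bit slot.

    A narrow value is duplicated into the otherwise unused upper half.
    """
    if is_narrow(value):
        return (value << HALF_BITS) | value, True
    return value & 0xFFFFFFFF, False


def load(word, narrow_flag):
    """Return (value, error_detected).

    For narrow values the two halves are compared; any single bit flip
    in either half makes them differ and is therefore detected.
    """
    if not narrow_flag:
        return word, False
    low, copy = word & HALF_MASK, word >> HALF_BITS
    return low, low != copy
```

With only two copies a mismatch can be detected but not attributed to one half. Correction would need additional redundancy, which this sketch does not model.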

4.

Real-time blind audio source separation: performance assessment on an advanced digital signal processor

Danilo Pani Alessandro Pani Luigi Raffo 《The Journal of supercomputing》2014,70(3):1555-1576

Many on-line blind audio source separation (BASS) algorithms have been presented so far to the scientific community, but only a few of them have been evaluated in terms of their real-time performance. In this paper we consider a well-established BASS method (oriented to voice separation) and evaluate the separation quality allowed by a real-time embedded computing implementation, also considering novel and state-of-the-art improvements to it. To this aim, the algorithm has been implemented and ported for real-time execution onto an advanced low-power digital signal processor targeted for complex-domain applications. The optimized embedded implementation is able to perform up to five iterations of the gradient for any input frame of data, achieving good separation levels (up to 11.8 dB of signal to interference ratio on custom recording in real environments). The proposed solution doubles the performance of a C-optimized version running on a traditional PC processor, achieving a better separation result with lower power requirements.

5.

Automatic Design of Application Specific Instruction Set Extensions Through Dataflow Graph Exploration

Clark Nathan Zhong Hongtao Tang Wilkin Mahlke Scott 《International journal of parallel programming》2003,31(6):429-449

General-purpose processors are often incapable of achieving the challenging cost, performance, and power demands of high-performance applications. To meet these demands, most systems employ a number of hardware accelerators to off-load the computationally demanding portions of the application. As an alternative to this strategy, we examine customizing the computation capabilities of a processor for a particular application. The processor is extended with hardware in the form of a set of custom function units and instruction set extensions. To effectively identify opportunities for creating custom hardware, a dataflow graph design space exploration engine heuristically identifies candidate computation subgraphs without artificially constraining their size or shape. The engine combines estimates of performance gain, cost, and inherent limitations of the processor to grow candidate graphs in profitable directions while pruning unprofitable paths. This paper describes the dataflow graph exploration engine and evaluates its effectiveness across a set of embedded applications.

6.

Energy Efficiency of a Multi-Core Processor by Tag Reduction

郑龙 董冕雄 Kaoru Ota 金海 马俊 《计算机科学技术学报》2011,26(3):491-503

We consider the energy saving problem for caches on a multi-core processor. In previous research on low-power processors, various methods have been proposed to reduce power dissipation. Tag reduction is one of them. This paper extends the tag reduction technique from a single-core processor to a multi-core processor and investigates the potential for energy saving on multi-core processors. We formulate our approach as an equivalent problem: find an assignment of all instruction pages in physical memory to a set of cores such that the tag-reduction conflicts for each core are avoided or reduced as far as possible. We then propose three algorithms using different heuristics for this assignment problem. We provide experimental results collected from a real operating system instead of the traditional approach of using a processor simulator, which cannot simulate operating system functions and the full memory hierarchy. Experimental results show that our proposed algorithms can save up to 83.93% of total energy on an 8-core processor and 76.16% on a 4-core processor on average, compared with the case where tag reduction is not used. They also significantly outperform the tag-reduction-based algorithm for a single-core processor.

7.

The inhibition spectrum and the achievement of causal consistency

Carol Critchlow Kim Taylor 《Distributed Computing》1996,10(1):11-27

Summary. We consider the problem of distinguishing causally-consistent global states in asynchronous distributed systems. Such states are fundamental to asynchronous systems, because they correspond to possible simultaneous global states; their detection arises in a variety of distributed applications, including global checkpointing, deadlock detection, termination detection, and broadcasting. A consistent-cut protocol is a protocol which in every run will designate for each processor a state, in such a way that these states together form a consistent cut. We analyze the cost of achieving causal consistency in terms of the extent to which a consistent-cut protocol delays events of the underlying system. We refer to the delaying action of a protocol as inhibition. A protocol using local inhibition may cause the delay of some of a processor’s events until that processor has performed some number of local actions; a protocol using global inhibition may force the delay of some of a processor’s events until that processor has received some communication from other processors. Based on a variety of system and protocol characteristics, including the ability to locally or globally inhibit particular types of events, we give three impossibility results and examine some existing protocols. We are then able to present a thirty-six case summary of protocols and impossibility results for the determination of causally-consistent states as a function of those characteristics. In particular, we demonstrate that local inhibition is necessary and sufficient to solve this problem for general FIFO systems, while global send inhibition is necessary and sufficient for general non-FIFO systems. Received: January 1993 / Accepted: January 1996
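What makes a cut consistent can be checked mechanically if each processor's designated state carries a vector clock. Vector clocks are not mentioned in the abstract; they are an assumption made here purely for illustration, and `is_consistent_cut` is a hypothetical helper name. The check uses the standard condition that no processor's chosen state may have seen more of processor i's events than processor i's own chosen state includes.

```python
def is_consistent_cut(cut):
    """Return True if the per-processor states form a consistent cut.

    cut[i] is the vector clock of the state chosen for processor i,
    where cut[i][k] counts the events of processor k known at that state.
    The cut is consistent iff cut[j][i] <= cut[i][i] for all i, j:
    nobody has received a message that its sender has not yet sent
    according to the sender's chosen state.
    """
    n = len(cut)
    return all(cut[j][i] <= cut[i][i] for i in range(n) for j in range(n))
```

For example, with two processors where P1 has received a message sent by P0's first event, the states `[[1, 0], [1, 1]]` are consistent, whereas `[[0, 0], [1, 1]]` are not: the second cut records the receipt of a message but not its sending.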

8.

The Design and Performance Evaluation of the DI-Multicomputer

Lynn Choi Andrew A. Chien 《Journal of Parallel and Distributed Computing》1996,36(2):119

In this paper, we propose a new multicomputer node architecture, the DI-multicomputer, which uses packet routing on a uniform point-to-point interconnect for both local memory access and internode communication. This is achieved by integrating a router into each processor chip and eliminating the memory bus interface. Since communication resources such as pins and wires are allocated dynamically via packet routing, the DI-multicomputer is able to maximize the available communication resources, providing much higher performance for both intranode and internode communication. Multi-packet handling mechanisms are used to implement a high performance memory interface based on packet routing. The DI-multicomputer network interface provides efficient communication for both short and long messages, decoupling the processor from the transmission overhead for long messages while achieving minimum latency for short messages. Trace-driven simulations based on a suite of message passing applications show that the communication mechanisms of the DI-multicomputer can achieve up to four times speedup when compared to existing architectures.

9.

Automatic generation of microprocessor functional verification programs targeting pipeline hazards

虞志益 顾震宇 沈泊 章倩苓 《小型微型计算机系统》2004,25(7):1212-1215

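Functional verification is a key problem in processor design, and simulation driven by stimulus vectors is the mainstream functional verification technique; its difficulty lies in how to generate efficient test programs. This work studies a method for automatically generating test programs that target pipeline hazards. Compared with conventional techniques, the method is suited to deeply pipelined processors with complex instruction sets, and offers a high degree of automation and well-targeted tests. The method has been applied to the verification of a 32-bit RISC processor with good results.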

10.

A fast inner product processor based on equal alignments

S. P. Smith H. C. Torng 《Journal of Parallel and Distributed Computing》1985,2(4):376-390

Inner product computation is an important operation, invoked repeatedly in matrix multiplications. A high-speed inner product processor can be very useful (among many possible applications) in real-time signal processing. This paper presents the design of a fast inner product processor, with appreciably reduced latency and cost. The inner product processor is implemented with a tree of carry-propagate or carry-save adders; this structure is obtained with the incorporation of three innovations in the conventional multiply/add tree: (1) The leaf-multipliers are expanded into adder subtrees, thus achieving an O(log Nb) latency, where N denotes the number of elements in a vector and b the number of bits in each element. (2) The partial products, to be summed in producing an inner product, are reordered according to their “minimum alignments.” This reordering brings approximately a 20% saving in hardware, including adders and data paths. The reduction in adder widths also yields savings in carry propagation time for carry-propagate adders. (3) For trees implemented with carry-save adders, the partial product reordering also serves to truncate the carry propagation chain in the final propagation stage by 2 log b − 1 positions, thus significantly reducing the latency further. A form of the Baugh and Wooley algorithm is adopted to implement two's complement notation with changes only in peripheral hardware.
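Innovation (1) can be pictured in software: rather than multiplying each pair and then adding the products, every bit-level partial product of every pair is fed into a single balanced adder tree. The sketch below assumes unsigned operands and models only that restructuring. It does not model the alignment-based reordering, carry-save adders, or the two's complement handling, and the names `partial_products` and `tree_sum` are illustrative.

```python
def partial_products(xs, ys, bits):
    """List the shifted copies of x selected by each set bit of y, for all pairs."""
    terms = []
    for x, y in zip(xs, ys):
        for j in range(bits):
            if (y >> j) & 1:
                terms.append(x << j)
    return terms


def tree_sum(terms):
    """Add the terms pairwise, level by level, as a balanced adder tree would."""
    terms = list(terms) or [0]
    while len(terms) > 1:
        paired = [terms[i] + terms[i + 1] for i in range(0, len(terms) - 1, 2)]
        if len(terms) % 2:
            paired.append(terms[-1])  # odd term passes through to the next level
        terms = paired
    return terms[0]
```

With at most N·b partial products at the leaves, the tree has about log2(N·b) levels, which is where the O(log Nb) latency figure in the abstract comes from.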

11.

Mesoscale performance simulation of multicore processor systems

Peter Altevogt Tibor Kiss Mike Kistler Ram Rangan 《Software and Systems Modeling》2013,12(4):731-744

Modern microprocessor design relies heavily on detailed full-chip performance simulations to evaluate complex trade-offs. Typically, different design alternatives are tried out for a specific sub-system or component, while keeping the rest of the system unchanged. We observe that full-chip simulations for such studies are overkill. This paper introduces mesoscale simulation, which employs high-level modeling for the unchanged parts of a design and uses detailed cycle-accurate simulations for the components being modified. This combination of high-level and low-level modeling enables accuracy on par with detailed full-chip modeling while achieving much higher simulation speeds than detailed full-chip simulations. Consequently, mesoscale models can be used to quickly explore vast areas of the design space with high fidelity. We describe a proof-of-concept mesoscale implementation of the memory subsystem of the Cell/B.E. processor and discuss results from running various workloads.

12.

Permutation-based range-join algorithms on N-dimensional meshes

Shao Dong Chen Hong Shen Topor R. 《Parallel and Distributed Systems, IEEE Transactions on》2002,13(4):413-431

We present four efficient parallel algorithms for computing a nonequijoin, called range-join, of two relations on N-dimensional mesh-connected computers. Range-joins of relations R and S are an important generalization of conventional equijoins and band-joins and are solved by permutation-based approaches in all proposed algorithms. In general, after sorting all subsets of both relations, the proposed algorithms permute every sorted subset of relation S to each processor in turn, where it is joined with the local subset of relation R. To permute the subsets of S efficiently, we propose two data permutation approaches, namely, the shifting approach which permutes the data recursively from lower dimensions to higher dimensions and the Hamiltonian-cycle approach which first constructs a Hamiltonian cycle on the mesh and then permutes the data along this cycle by repeatedly transferring data from each processor to its successor. We apply the shifting approach to meshes with different storage capacities which results in two different join algorithms. The basic shifting join (BASHJ) algorithm can minimize the number of subsets stored temporarily at a processor, but requires a large number of data transmissions, while the buffering shifting join (BUSHJ) algorithm can achieve a high parallelism and minimize the number of data transmissions, but requires a large number of subsets stored at each processor.
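The Hamiltonian-cycle approach can be simulated sequentially by treating the processors as a ring. This is a sketch under assumptions: the abstract does not define the join predicate, so `in_range` below assumes pairs qualify when e1 <= |r - s| <= e2, and the mesh is abstracted away into a list of processors already ordered along the cycle. The function names are hypothetical.

```python
def in_range(r, s, e1, e2):
    """Assumed range-join predicate; the abstract does not spell it out."""
    return e1 <= abs(r - s) <= e2


def ring_range_join(r_parts, s_parts, e1, e2):
    """Simulate p processors on a cycle, each holding one subset of R and S.

    In each of p steps every processor joins its local R subset with the
    S subset it currently holds, then passes that S subset to its successor.
    """
    p = len(r_parts)
    held = [sorted(part) for part in s_parts]
    result = []
    for _ in range(p):
        for proc in range(p):
            result.extend(
                (r, s)
                for r in r_parts[proc]
                for s in held[proc]
                if in_range(r, s, e1, e2)
            )
        # Processor proc now receives what its predecessor was holding.
        held = [held[(proc - 1) % p] for proc in range(p)]
    return sorted(result)
```

After p steps every S subset has visited every processor exactly once, so the result matches a nested-loop join over the whole relations while each processor only ever stores one S subset at a time.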

13.

Solving the generalized eigenvalue problem on a synchronous linear processor array

《Parallel Computing》1986,3(2):153-166

We present a parallel method to solve the generalized eigenvalue problem on a linear array of processors, each connected to their nearest neighbors and operating synchronously. We also include a wrap-around connection from end to end. Our method is based on the well-known QZ algorithm of Moler and Stewart, which simultaneously reduces two n × n matrices to upper triangular form by orthogonal or unitary transformations. We show how this algorithm may be partitioned and distributed over n + 1 processors, achieving a speed-up over the serial algorithm of O(n). We use the concept of windows to describe the action of each processor at each step. We show how to incorporate single shifts, and how to apply orthogonal plane rotations on either side of a matrix without the need to transpose the matrix itself.
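The last point, applying plane rotations on either side without transposing, can be sketched serially: a left rotation mixes two rows in place and a right rotation mixes two columns in place. This is a generic Givens-rotation illustration on nested Python lists, not the paper's distributed scheme; `givens`, `rotate_rows` and `rotate_cols` are hypothetical names.

```python
from math import hypot


def givens(a, b):
    """Return (c, s) such that the rotation maps the pair (a, b) to (r, 0)."""
    r = hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0
    return a / r, b / r


def rotate_rows(A, i, k, c, s):
    """Left-multiply by a plane rotation: mix rows i and k in place."""
    for j in range(len(A[i])):
        x, y = A[i][j], A[k][j]
        A[i][j] = c * x + s * y
        A[k][j] = -s * x + c * y


def rotate_cols(A, i, k, c, s):
    """Right-multiply by a plane rotation: mix columns i and k in place."""
    for row in A:
        x, y = row[i], row[k]
        row[i] = c * x + s * y
        row[k] = -s * x + c * y
```

Choosing `c, s = givens(A[i][j], A[k][j])` before `rotate_rows(A, i, k, c, s)` zeroes `A[k][j]`; choosing `c, s = givens(A[r][i], A[r][k])` before `rotate_cols(A, i, k, c, s)` zeroes `A[r][k]`. Both sweeps index the same storage layout, so no transposed copy is needed.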

14.

Design configuration selection for hard-error reliable processors via statistical rules

Ying Zhang Lide Duan Bin Li Lu Peng Xin Fu 《Microprocessors and Microsystems》2014

Lifetime reliability is becoming a first-order concern in processor manufacturing, alongside conventional design goals such as performance, power consumption and thermal features, now that semiconductor technology has entered the deep submicron era. This requires computer architects to carefully examine each design option and evaluate its reliability, in order to prolong the lifetime of the target processor. However, the complex wear-out mechanisms which cause processor failure and their interactions with varying microarchitectural configurations are still far from well understood, making the early optimization for chip reliability a challenging problem. To address this issue, we investigate the relationship between processor reliability and the design configuration by exploring a large processor design space in this paper. We employ a rule search strategy to generate a set of rules to identify the optimal configurations for reliability and its tradeoff with other design goals.

15.

Job scheduling and processor allocation for grid computing on metacomputers

《Journal of Parallel and Distributed Computing》2005,65(11):1406-1418

Scheduling is a fundamental issue in achieving high performance on metacomputers and computational grids. For the first time, the job scheduling problem for grid computing on metacomputers is studied as a combinatorial optimization problem. A cost model is proposed for modeling communication heterogeneity on computational grids. A processor allocation algorithm is developed which always finds an optimal processor allocation that minimizes the effective execution time of a job when the job is being scheduled. It is proven that the list scheduling (LS) algorithm can achieve a reasonable worst-case performance bound in grid environments supporting distributed supercomputing with large applications. We compare the performance of various job scheduling and processor allocation algorithms for grid computing on metacomputers. We evaluate the performance of 128 combinations of two job scheduling algorithms, four initial job ordering strategies, four processor allocation algorithms, and four metacomputers by extensive simulation. It is found that the combination of largest job first (LJF) initial job ordering and minimum effective execution time (MEET) or largest machine first (LMF) processor allocation algorithm yields the best average-case performance, and the choice of FCFS and LS depends on the range of job sizes. It is also observed that communication heterogeneity does have significant impact on schedule lengths.
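The effect of the initial job ordering on list scheduling can be seen in a deliberately simplified setting. The sketch below is the classic textbook form, in which each job needs exactly one machine and all machines are identical. It is not the paper's model, where jobs are parallel and communication costs are heterogeneous, and "largest" here means longest duration rather than job size. The name `list_schedule` is hypothetical.

```python
import heapq


def list_schedule(durations, machines):
    """Greedy list scheduling on identical machines.

    Jobs are taken in the given list order; each is started on the machine
    that becomes free earliest. Returns (assignment, makespan), where
    assignment holds (job_index, machine, start_time) tuples.
    """
    free_at = [(0, m) for m in range(machines)]
    heapq.heapify(free_at)
    assignment = []
    for job, duration in enumerate(durations):
        start, machine = heapq.heappop(free_at)
        assignment.append((job, machine, start))
        heapq.heappush(free_at, (start + duration, machine))
    makespan = max(t for t, _ in free_at)
    return assignment, makespan
```

On two machines, the arrival order `[1, 1, 1, 1, 4]` gives a makespan of 6, while the same jobs sorted largest first, `[4, 1, 1, 1, 1]`, give a makespan of 4. The ordering strategy and the scheduling rule are separate choices, which is why the abstract evaluates them in combination.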

16.

Microwave tomography for breast cancer detection on Cell broadband engine processors

Meilian Xu Parimala Thulasiraman Sima Noghanian 《Journal of Parallel and Distributed Computing》2012

Microwave tomography (MT) is a safe screening modality that can be used for breast cancer detection. The technique uses the dielectric property contrasts between different breast tissues at microwave frequencies to determine the existence of abnormalities. Our proposed MT approach is an iterative process that involves two algorithms: Finite-Difference Time-Domain (FDTD) and Genetic Algorithm (GA). It is a compute intensive problem: (i) the number of iterations can be quite large to detect small tumors; (ii) many fine-grained computations and discretizations of the object under screening are required for accuracy. In our earlier work, we developed a parallel algorithm for microwave tomography on CPU-based homogeneous, multi-core, distributed memory machines. The performance improvement was limited due to communication and synchronization latencies inherent in the algorithm. In this paper, we exploit the parallelism of microwave tomography on the Cell BE processor. Since FDTD is a numerical technique with regular memory accesses, intensive floating point operations and SIMD type operations, the algorithm can be efficiently mapped on the Cell processor achieving significant performance. The initial implementation of FDTD on Cell BE with 8 SPEs is 2.9 times faster than an eight node shared memory machine and 1.45 times faster than an eight node distributed memory machine. In this work, we modify the FDTD algorithm by overlapping computations with communications during asynchronous DMA transfers. The modified algorithm also orchestrates the computations to fully use data between DMA transfers to increase the computation-to-communication ratio. We see 54% improvement on 8 SPEs (27.9% on 1 SPE) for the modified FDTD in comparison to our original FDTD algorithm on Cell BE. We further reduce the synchronization latency between GA and FDTD by using mechanisms such as double buffering. 
We also propose a performance prediction model based on DMA transfers, number of instructions and operations, the processor frequency and DMA bandwidth. We show that the execution time from our prediction model is comparable (within 0.5 s difference) with the execution time of the experimental results on one SPE.

17.

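A hardware prefetching mechanism and a study of its impact on the system (total citations: 1; self-citations: 0; citations by others: 1)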

邓让钰 谢伦国 肖立权 《计算机工程与科学》2001,23(6):70-72

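Memory access latency has become one of the key obstacles to realizing the full performance of high-performance microprocessors. Prefetching is an important means of hiding memory access latency. The usual approach is to explicitly execute instructions that fetch data to a location close to the microprocessor before the data is actually used, but this increases the burden on programmers. This paper proposes a hardware prefetching method, in which a VPFB mechanism is designed into the memory controller to hide memory access latency, and analyzes its effect through simulation.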

18.

The impact of programming paradigms on the efficiency of an individual-based simulation model

David J. Barnes Tim R. Hopkins 《Simulation Modelling Practice and Theory》2003,11(7-8):557-569

We look in detail at an individual-based simulation of the spread of barley yellow dwarf virus. The need for a very large number of individual plants and aphids, along with multiple runs using different model parameters, means that it is important to keep memory and processor requirements within reasonable bounds. We present implementations of the model in both imperative and object-oriented programming languages, particularly noting aspects relating to ease of implementation and runtime performance. Finally, we attempt to quantify the cost of some of the decisions made in terms of their memory and processor time requirements.

19.

A new direction for computer architecture research

《Computer》1998,31(11):24-32

In the past few years, two important trends have evolved that could change the shape of computing: multimedia applications and portable electronics. Together, these trends will lead to a personal mobile-computing environment, a small device carried all the time that incorporates the functions of the pager, cellular phone, laptop computer, PDA, digital camera, and video game. The microprocessor needed for these devices is actually a merged general-purpose processor and digital-signal processor, with the power budget of the latter. Yet for almost two decades, architecture research has focused on desktop or server machines. We are designing processors of the future with a heavy bias toward the past. To design successful processor architectures for the future, we first need to explore future applications and match their requirements in a scalable, cost-effective way. The authors describe Vector IRAM, an initial approach in this direction, and challenge others in the very successful computer architecture community to investigate architectures with a heavy bias for the future.

20.

Challenges and trends in processor design

《Computer》1998,31(1):39-48

Chip architects from Sun, Cyrix, Motorola, Mips, Intel and Digital see challenges rather than walls in microprocessor design. They share their insights in this virtual roundtable. Tremblay discusses the conflicting goals of improving how much work a processor does per cycle and at the same time shortening the cycle time. Grohoski says we need to reduce the processor complexity to spend less time debugging that complexity. Burgess thinks tightly interwoven designs will better support focused applications. Killian is confident the industry will solve foreseeable problems. He sees “big data” problems as key design drivers. Colwell sees a convergence of factors that make validation a big concern. He foresees future computers as communication enhancement devices. Rubinfeld names five issues as important to processor design and discusses some challenges specific to high-speed processor design. Despite the competitiveness of their field, these six architects shared several insights of interest to those not intimately connected with processor design.