期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Numerical simulation of pollutant transport in a shallow-water system on the Cell heterogeneous processor

Carlos H. González Basilio B. Fraguela Diego Andrade José A. García Manuel J. Castro 《The Journal of supercomputing》2013,65(3):1089-1103

相似文献

2.

Worst‐case execution time analysis for a Java processor

Martin Schoeberl Wolfgang Puffitsch Rasmus Ulslev Pedersen Benedikt Huber 《Software》2010,40(6):507-542

In this paper, we propose a solution for a worst‐case execution time (WCET) analyzable Java system: a combination of a time‐predictable Java processor and a tool that performs WCET analysis at Java bytecode level. We present a Java processor, called JOP, designed for time‐predictable execution of real‐time tasks. The execution time of bytecodes, the instructions of the Java virtual machine, is known to cycle accuracy for JOP. Therefore, JOP simplifies the low‐level WCET analysis. A method cache, which fills whole Java methods into the cache, simplifies cache analysis. The WCET analysis tool is based on integer linear programming. The tool performs the low‐level analysis at the bytecode level and integrates the method cache analysis. An integrated data‐flow analysis performs receiver‐type analysis for dynamic method dispatches and loop‐bound analysis. Furthermore, a model checking approach to WCET analysis is presented where the method cache can be exactly simulated. The combination of the time‐predictable Java processor and the WCET analysis tool is evaluated with standard WCET benchmarks and three real‐time applications. The WCET friendly architecture of JOP and the integrated method cache analysis yield tight WCET bounds. Comparing the exact, but expensive, model checking‐based analysis of the method cache with the static approach demonstrates that the static approximation of the method cache is sufficiently tight for practical purposes. Copyright © 2010 John Wiley & Sons, Ltd. 相似文献

3.

Automatic instruction-set architecture synthesis for VLIW processor cores in the ASAM project

《Microprocessors and Microsystems》2017

相似文献

4.

CellJoin: a parallel stream join operator for the cell processor

Buğra Gedik Rajesh R. Bordawekar Philip S. Yu 《The VLDB Journal The International Journal on Very Large Data Bases》2009,18(2):501-519

Low-latency and high-throughput processing are key requirements of data stream management systems (DSMSs). Hence, multi-core processors that provide high aggregate processing capacity are ideal matches for executing costly DSMS operators. The recently developed Cell processor is a good example of a heterogeneous multi-core architecture and provides a powerful platform for executing data stream operators with high-performance. On the down side, exploiting the full potential of a multi-core processor like Cell is often challenging, mainly due to the heterogeneous nature of the processing elements, the software managed local memory at the co-processor side, and the unconventional programming model in general. In this paper, we study the problem of scalable execution of windowed stream join operators on multi-core processors, and specifically on the Cell processor. By examining various aspects of join execution flow, we determine the right set of techniques to apply in order to minimize the sequential segments and maximize parallelism. Concretely, we show that basic windows coupled with low-overhead pointer-shifting techniques can be used to achieve efficient join window partitioning, column-oriented join window organization can be used to minimize scattered data transfers, delay-optimized double buffering can be used for effective pipelining, rate-aware batching can be used to balance join throughput and tuple delay, and finally single-instruction multiple-data (SIMD) optimized operator code can be used to exploit data parallelism. Our experimental results show that, following the design guidelines and implementation techniques outlined in this paper, windowed stream joins can achieve high scalability (linear in the number of co-processors) by making efficient use of the extensive hardware parallelism provided by the Cell processor (reaching data processing rates of ≈13 GB/s) and significantly surpass the performance obtained form conventional high-end processors (supporting a combined input stream rate of 2,000 tuples/s using 15 min windows and without dropping any tuples, resulting in ≈8.3 times higher output rate compared to an SSE implementation on dual 3.2 GHz Intel Xeon). 相似文献

5.

Prediction of performance and processor requirements in real-timedata flow architectures

Som S. Mielke R.R. Stoughton J.W. 《Parallel and Distributed Systems, IEEE Transactions on》1993,4(11):1205-1216

Presents a new data flow graph model for describing the real-time execution of iterative control and signal processing algorithms on multiprocessor data flow architectures. Identified by the acronym ATAMM, for Algorithm to Architecture Mapping Model, the model is important because it specifies criteria for a multiprocessor operating system to achieve predictable and reliable performance. Algorithm performance is characterized by execution time and iteration period. For a given data flow graph representation, the model facilitates calculation of greatest lower bounds for these performance measures. When sufficient processors are available, the system executes algorithms with minimum execution time and minimum iteration period, and the number of processors required is calculated. When only limited processors are available or when processors fail, performance is made to degrade gracefully and predictably. The user off-line is able to specify tradeoffs between increasing execution time or increasing iteration period. The approach to achieving predictable performance is to control the injection rate of input data and to modify the data flow graph precedence relations so that a processor is always available to execute an enabled graph node. An implementation of the ATAMM model in a four-processor architecture based on Westinghouse's VHSIC 1750A Instruction Set Processor is described 相似文献

6.

An FPGA-based low-cost VLIW floating-point processor for CNC applications

《Microprocessors and Microsystems》2017

In the high-speed free-form surface machining, the real-time motion planning and interpolation is a challenging task. This paper presents the design and implementation of a dedicated processor for the interpolation task in computerized numerical control (CNC) machine tools. The jerk-limited look-ahead motion planning and interpolation algorithm has been integrated in the interpolation processor to achieve smooth motion in the high-speed machining. The processor features a compactly designed floating-point parallel computing architecture, which employs a 3-stage pipelined reduced instruction set computer (RISC) core and a very long instruction word (VLIW) floating-point arithmetic unit. A new asynchronous execution mechanism has been employed in the processor to allow multi-cycle instructions to be performed in parallel. The proposed processor has been verified on a low-cost field programmable gate array (FPGA) chip in a prototype controller. Experimental result has demonstrated the significant improvement of the computing performance with the interpolation processor in the free-form surface machining. 相似文献

7.

The use of vector instructions of a processor architecture for emulating the vector instructions of another processor architecture

K. A. Batuzov 《Programming and Computer Software》2017,43(6):366-372

The complexity of software is ever increasing, and it requires more and more computational resources for its execution. A way to satisfy these requirements is the use of vector instructions that can operate with fixed-length vectors of data of the same. A method for representing vector instructions of one processor architecture in terms of the vector instructions of another architecture during the dynamic binary translation is proposed. An implementation of this method that includes the translation of vector addition and memory access increased the performance of the QEMU emulator by a factor greater than three on an artificial example and 12% on a real-life application. 相似文献

8.

A programmable array processor architecture for flexible approximate string matching algorithms

Panagiotis D. Michailidis Konstantinos G. Margaritis 《Journal of Parallel and Distributed Computing》2007

Approximate string matching problem is a common and often repeated task in information retrieval and bioinformatics. This paper proposes a generic design of a programmable array processor architecture for a wide variety of approximate string matching algorithms to gain high performance at low cost. Further, we describe the architecture of the array and the architecture of the cell in detail in order to efficiently implement for both the preprocessing and searching phases of most string matching algorithms. Further, the architecture performs approximate string matching for complex patterns that contain don’t care, complement and classes symbols. We also simulate and evaluate the proposed architecture on a field programmable gate array (FPGA) device using the JHDL tool for synthesis and the Xilinx Foundation tools for mapping, placement, and routing. Finally, our programmable implementation achieves about 8–340 times faster execution than a desktop computer with a Pentium 4 3.5 GHz for all algorithms when the length of the pattern is 1024. 相似文献

9.

Simulation and verification of associative processor arrays

A. W. G. Duller R. Storer 《Parallel Computing》1992,18(12):1403-1414

This work is based on the design of a VLSI processor array comprising single bit processing elements combined with Content Addressable Memory (CAM) [1,2]. The processors are connected in a linear array with 64 currently being combined on a chip. Each processor is linked to 64 bits of data CAM and 4 bits of subset CAM (used for marking subsets of the array for subsequent processing). The architecture is targeted at image applications including pixel based processing as well as higher level symbolic manipulation and incorporates a data shift register linking all of the processing elements which allows data loading and processing to occur concurrently.

The current situation is that an extensive functional simulation package has been written [3] which allows algorithms to be coded and executed on a system which comprises an arbitrary number of array chips together with its controlling hardware. This allows algorithms to be investigated, and tuned to the architecture. A reduced design has been fabricated and the chips are undergoing parametric testing. A full version of the processor array chip will then be produced allowing a complete image system to be tested.

The VLSI design work undertaken so far [2] shows that the blocks which constitute the design can easily be replicated an arbitrary number of times (subject to chip size constraints) to create an application specific CAM array. The need for this type of flexibility has been borne out by the algorithmic work that has been carried out by a number of workers. In order to make the design of application specific arrays possible it is vital that the simulation tools are fast enough to allow adequate testing to be performed on the new design. It is for this reason that the original simulation package, written in C, has been transferred onto a transputer array.

This paper looks at the way in which the simulation is mapped onto the transputers in such a way that an arbitrary number can be used. In addition the problems of verification and validation of the simulator and the VLSI design are addressed. Results are given for a number of different applications which show very encouraging speed-ups. In many ways it has been found that the efficiency with which the simulation can be carried out with a large number of transputers mirrors the efficiency of the processor array in terms of communications overhead. 相似文献

10.

Synthesis of halftone 3D images on an array processor

M.A. Berezovsky A.K. Yablonsky 《Computers & Graphics》1988,12(3-4):433-440

A fast generation of shaded images of CSG-defined objects could be accomplished by employing a general-purpose array processor. To fully exploit the computing power of such a machine, the visualization algorithm should be tailored to that specific kind of architecture. A parallel adaptation of the ray-tracing method and algorithm kernel operations is introduced. The performance analysis shows suitability of the approach for interactive CAD applications. Test pictures are presented. 相似文献

11.

AEGIS: A single-chip secure processor

《Information Security Technical Report》2005,10(2):63-73

This article presents the AEGIS secure processor architecture, which enables new applications by ensuring private and authentic program execution even in the face of physical attack. Our architecture uses two new primitives to achieve physical security. First, we describe Physical Random Functions which reliably protect and share secrets in a manner that is cheaper and more secure than existing solutions based on non-volatile memory. Second, off-chip memory protection mechanisms ensure the integrity and the privacy of off-chip memory. Our processor, with its new protection mechanisms, has been implemented on an FPGA, and is fully functional. We briefly assess the cost of the security mechanisms in our processor and show that it is reasonable. 相似文献

12.

Design and construction of an array processor

AM Ahmed YK Abaas 《Microprocessors and Microsystems》1980,4(3):83-88

Design and construction of an array processor that performs autocorrelation functions is presented. The architecture of this system offers speed and avoids complexity. Random access memory with shifting across zero techniques and a high speed address generator are used. Performance is measured for different sizes of arrays and compared with required time of processing the same arrays using software. 相似文献

13.

A continuation-based noninterruptible multithreading processor architecture

Satoshi Amamiya Makoto Amamiya Ryuzo Hasegawa Hiroshi Fujita 《The Journal of supercomputing》2009,47(2):228-252

Current trend of research on multithreading processors is toward the chip multithreading (CMT), which exploits thread level parallelism (TLP) and improves performance of softwares built on traditional threading components, e.g., Pthread. There exist commercially available processors that support simultaneous multithreading (SMT) on multicore processors. But they are basically based on the conventional sequential execution model, and execute multiple threads in parallel under the control of OS that handles interruptions. Moreover, there exist few languages or programming techniques to utilize the multicore processors effectively. We are taking another approach to develop a multithreading processor, which is dedicated to TLP. Our processor, named Fuce, is based on the continuation-based multithreading. A thread is defined as a block of sequentially ordered instructions which are executed without interruption. Every thread execution is triggered only by the event called continuation. This paper first introduces the continuation-based multithread execution model and its processor architecture then gives multithreaded programming techniques and the continuation-based multithreading language system CML. Last, the performance of the Fuce processor is evaluated by means of the clock-level software simulation. 相似文献

14.

超标量处理器中引入SMT技术的性能分析研究

下载免费PDF全文

史莉雯樊晓桠黄小平《计算机工程与应用》2009,45(5):13-15

同时多线程（SMT）是一种允许多个独立的线程每周期发射多条指令的技术,这种技术充分利用了可能存在的指令级并行和线程级并行,提高了有限资源的利用率。文章以西北工业大学航空微电子中心自主研发的32位超标量处理器“龙腾R2”为基础,引入SMT技术,在基本不改变内部结构大小、不增加执行功能部件、仅做一些必要修改的前提条件下进行研究。通过仿真不同的线程数和各种线程组合,进行性能分析。尽管存在制约性能提升的一些因素,引入SMT技术后依然获得了最高约50%的性能增加。相似文献

15.

基于VPM和随机激励的处理器核仿真建模

下载免费PDF全文

许彤张仕健吕涛《计算机工程》2010,36(20):19-21

为提高处理器核仿真模型的效率,提出基于SimpleScalar架构对龙芯1号处理器进行虚拟处理器模型行为建模,IPC平均误差为2.3%,速度达到每秒1 000 000条指令。基于可控随机事件机制实现的总线功能模型可以为片上系统(SoC)设计提供激励主动生成方案和片上互连验证功能。实验结果证明,该方法对处理器IP仿真建模具有普适意义,能够被无缝融入SoC流程中。相似文献

16.

容错处理器阵列的多逻辑列并行重构算法

章子凯武继刚姜文超刘竹松《计算机工程与科学》2018,40(1):24-33

处理器阵列的容错重构技术是片上网络多核、众核高性能体系结构的可靠性技术之一。现有的最大逻辑阵列并行重构技术仅对单条逻辑列的构造实现了并行化,而对多条逻辑列的同步并行仍未见可行算法。依据处理器阵列的潜在并行性,在分治策略的基础上,提出了一种阵列分块的并行重构算法。算法对处理器阵列实施横向分块划分,对每个阵列块进行并行重构,并对所得逻辑子阵列进行归并,实现了多条逻辑列的同步并行重构。与现有的并行算法相比,新算法同样能够生成最大逻辑列,并且减少了通信开销与计算中的数据冗余,有效提高了运行速度。实验结果表明,在物理阵列大小为64×64的处理器阵列上,运行速度比现有并行算法提高39.55%,并且具有良好的可扩展性。相似文献

17.

Design of a vector processor

下载免费PDF全文

Qi Lin 《计算机科学技术学报》1986,1(1):26-34

This paper discusses the inherent parallelism limits on several applications for vector computers, the parallel capabilities of several architectures and two ways (traditional instruction control flow and data control flow) by which the capabilities can be used. Then a scheme for a pipelined vector processor of multi-processing units is presented. The basic system structure and its function on highly sparse vector processing are described. A vector cache system and a distributed main memory are also considered, which are intended to sustain extremely high access rates for the processor. A microprocessor based vector processor is constructed, which can simulate the high performance version of the processor. 相似文献

18.

UNIRED II: The high performance inference processor for the parallel inference machine PIE64

Kentaro Shimada Hanpei Koike Hidehiko Tanaka 《New Generation Computing》1993,11(3-4):251-269

UNIRED II is the high performance inference processor for the parallel inference machine PIE64. It is designed for the committed choice language Fleng, and for use as an element processor of parallel machines. Its main features are: 1) a tag architecture, 2) three independent memory buses (instruction fetching, data reading, and data writing), 3) a dedicated instruction set for efficient execution of Fleng, 4) multi-context processing for reducing pipeline interlocking and overhead of interprocessor synchronization. With the multi-context processing mechanism, the internal pipeline is shared by several independent instruction streams (contexts), and which contexts are to be executed is determined cycle by cycle. So, UNIRED II acts as a shared-pipeline MIMD processor. In this paper, several architectural features including the multi-context processing and the instruction set are described. Performance measurement results by simulation are also presented. On a 10MHz UNIRED II, 920K Reduction Per Second has been achieved, and it is shown that the multi-context processing mechanism is very effective for improved performance. 相似文献

19.

Parallel exact inference on the Cell Broadband Engine processor

Yinglong Xia Viktor K. Prasanna 《Journal of Parallel and Distributed Computing》2010

We present the design and implementation of a parallel exact inference algorithm on the Cell Broadband Engine (Cell BE) processor, a heterogeneous multicore architecture. Exact inference is a key problem in exploring probabilistic graphical models, where the computation complexity increases dramatically with the network structure and clique size. In this paper, we exploit parallelism in exact inference at multiple levels. We propose a rerooting method to minimize the critical path for exact inference, and an efficient scheduler to dynamically allocate SPEs. In addition, we explore potential table representation and layout to optimize DMA transfer between local store and main memory. We implemented the proposed method and conducted experiments on the Cell BE processor in the IBM QS20 Blade. We achieved speedup up to 10 × on the Cell, compared to state-of-the-art processors. The methodology proposed in this paper can be used for online scheduling of directed acyclic graph (DAG) structured computations. 相似文献

20.

A compact FPGA-based processor for the Secure Hash Algorithm SHA-256

Rommel García Ignacio Algredo-Badillo Miguel Morales-Sandoval Claudia Feregrino-Uribe René Cumplido 《Computers & Electrical Engineering》2014

This work reports an efficient and compact FPGA processor for the SHA-256 algorithm. The novel processor architecture is based on a custom datapath that exploits the reusing of modules, having as main component a 4-input Arithmetic-Logic Unit not previously reported. This ALU is designed as a result of studying the type of operations in the SHA algorithm, their execution sequence and the associated dataflow. The processor hardware architecture was modeled in VHDL and implemented in FPGAs. The results obtained from the implementation in a Virtex5 device demonstrate that the proposed design uses fewer resources achieving higher performance and efficiency, outperforming previous approaches in the literature focused on compact designs, saving around 60% FPGA slices with an increased throughput (Mbps) and efficiency (Mbps/Slice). The proposed SHA processor is well suited for applications like Wi-Fi, TMP (Trusted Mobile Platform), and MTM (Mobile Trusted Module), where the data transfer speed is around 50 Mbps. 相似文献