1.

Efficient mapping and acceleration of AES on custom multi‐core architectures

Amit Pande Joseph Zambreno 《Concurrency and Computation》2011,23(4):372-389

Multi‐core processors can deliver significant performance benefits for multi‐threaded software by adding processing power with minimal latency, given the proximity of the processors. Cryptographic applications are inherently complex and involve large computations. Most cryptographic operations can be translated into logical operations, shift operations, and table look‐ups. In this paper we design a novel processor (called mu‐core) with a reconfigurable Arithmetic Logic Unit, and design custom two‐dimensional multi‐core architectures on top of it to accelerate cryptographic kernels. We propose an efficient mapping of instructions from the multi‐core grid to the individual processor cores and illustrate the performance of the AES‐128E algorithm over custom‐sized grids. The model was developed using Simulink, and the performance analysis suggests a positive trend towards development of large multi‐core (or multi‐µ‐core) architectures to achieve high throughputs in cryptographic operations. Copyright © 2010 John Wiley & Sons, Ltd.
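As a rough illustration of the remark that cryptographic operations reduce to logical operations and shifts, the sketch below shows two AES building blocks in plain Python: multiplication in GF(2^8) done with shifts and XORs, and the ShiftRows byte rotation. The function names and the row-major state layout are assumptions made for this example; it is not the mu‐core instruction mapping described in the paper.

```python
def xtime(b):
    """Multiply a byte by {02} in the AES field GF(2^8)."""
    b <<= 1                  # shift left by one bit
    if b & 0x100:            # a bit fell off the top: reduce
        b ^= 0x11B           # AES polynomial x^8 + x^4 + x^3 + x + 1
    return b & 0xFF


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) using only shift and XOR steps."""
    result = 0
    while b:
        if b & 1:            # add (XOR) the current multiple of a
            result ^= a
        a = xtime(a)         # next multiple: a * x
        b >>= 1
    return result


def shift_rows(state):
    """Rotate row r of a 4x4 state (a list of four rows) left by r bytes."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]
```

The S-box step, not shown here, would typically be a 256-entry table look-up, which is the third kind of operation the abstract mentions.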

2.

Towards pattern‐based architectures for event processing systems

Ralf Bruns Jürgen Dunkel 《Software》2014,44(11):1395-1416

Recently, event processing (EP) has gained considerable attention as an individual discipline in computer science. From a software engineering perspective, EP systems still lack the maturity of well‐established software architectures. For the development of industrial EP systems, generally accepted software architectures based on proven design patterns and principles are still missing. In this article, we introduce a catalog of design patterns that supports the development of event‐driven architectures (EDAs) and complex EP systems. The design principles originate from experiences reported in publications as well as from our own experiences in building EP systems with industrial and academic partners. We present several patterns on different layers of abstractions that define the overall structure as well as the building blocks for EP systems. Architectural patterns that determine the top‐level structure of an EDA can be distinguished from design patterns that specify the basic mechanisms of EP. The practical application of the catalog of patterns is described by the pattern‐based design of a sample EDA for a sensor‐based energy control system. Finally, we propose a coherent and general reference architecture for EP derived from the proposed patterns. Copyright © 2013 John Wiley & Sons, Ltd.

3.

Parallel BVH Construction using Progressive Hierarchical Refinement

J. Hendrich D. Meister J. Bittner 《Computer Graphics Forum》2017,36(2):487-494

We propose a novel algorithm for construction of bounding volume hierarchies (BVHs) for multi‐core CPU architectures. The algorithm constructs the BVH by a divisive top‐down approach using a progressively refined cut of an existing auxiliary BVH. We propose a new strategy for refining the cut that significantly reduces the workload of individual steps of BVH construction. Additionally, we propose a new method for integrating spatial splits into the BVH construction algorithm. The auxiliary BVH is constructed using a very fast method such as LBVH based on Morton codes. We show that the method provides a very good trade‐off between the build time and ray tracing performance. We evaluated the method within the Embree ray tracing framework and show that it compares favorably with the Embree BVH builders regarding build time while maintaining comparable ray tracing speed.
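The Morton-code ordering that LBVH-style builders start from can be sketched as follows. This is a minimal illustration, assuming primitive centroids already normalized to the unit cube and 10 bits per axis; the function names are hypothetical, and the progressive refinement of the cut described in the abstract is not shown.

```python
def morton3d(ix, iy, iz, bits=10):
    """Interleave the low `bits` bits of three integer coordinates (x highest)."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b + 2)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b)
    return code


def quantize(c, bits=10):
    """Map a coordinate in [0, 1] to an integer grid cell in [0, 2^bits - 1]."""
    cells = 1 << bits
    return min(max(int(c * cells), 0), cells - 1)


def sort_by_morton(centroids, bits=10):
    """Return primitive indices ordered along the Morton (Z-order) curve."""
    def key(i):
        x, y, z = centroids[i]
        return morton3d(quantize(x, bits), quantize(y, bits), quantize(z, bits), bits)
    return sorted(range(len(centroids)), key=key)
```

Once the primitives are sorted this way, spatially nearby primitives tend to be adjacent in the array, so a hierarchy can be formed by splitting the sorted range, which is what makes this kind of auxiliary build fast.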

4.

A framework to generate domain-specific manycore architectures from dataflow programs

《Microprocessors and Microsystems》2020

In the last 15 years we have seen, as a response to power and thermal limits for current chip technologies, an explosion in the use of multiple and even many computer cores on a single chip. But now, to further improve performance and energy efficiency, when there are potentially hundreds of computing cores on a chip, we see a need for a specialization of individual cores and the development of heterogeneous manycore computer architectures. However, developing such heterogeneous architectures is a significant challenge. Therefore, we propose a design method to generate domain-specific manycore architectures based on the RISC-V instruction set architecture and automate the main steps of this method with software tools. The design method allows generation of manycore architectures with different configurations, including core augmentation through instruction extensions and custom accelerators. The method starts from developing applications in a high-level dataflow language and ends by generating synthesizable Verilog code and a cycle-accurate emulator for the generated architecture. We evaluate the design method and the software tools by generating several architectures specialized for two different applications and measure their performance and hardware resource usage. Our results show that the design method can be used to generate specialized manycore architectures targeting applications from different domains. The specialized architectures show at least 3 to 4 times better performance than their general-purpose counterparts. In certain cases, replacing general-purpose components with specialized components saves hardware resources. Automating the method increases the speed of architecture development and facilitates the design space exploration of manycore architectures.

5.

Carrying on the legacy of imperative languages in the future parallel computing era

《Parallel Computing》2014,40(3-4):1-33

There has been a renewed interest in dataflow computing models in recent years of technology scaling. The main characteristic of a dataflow model is its potential to exploit massive parallelism at low power, with simpler circuits and less silicon area. Growing trends in housing large numbers of functional units on a single chip, making use of local clocks, reducing energy consumption, and avoiding global wires are the main reasons behind the resurgence of dataflow models. To program a dataflow machine, new architectures suggest imperative languages rather than functional-style dataflow languages or parallel languages, because this is the right way to make the new architectures popular among the general community. Although scientists have been working for several decades on how imperative languages can be used efficiently in dataflow models, there is no systematic review of those works. Existing reviews of the dataflow paradigm mainly focus on the architectures. Although a few papers review programming languages for dataflow architectures, their discussions are limited to dataflow languages and visual programming languages, which are fundamentally different from imperative languages. In this paper, we conduct a systematic review of those works that attempt to provide a way to use imperative languages in any type of dataflow architecture. Our survey of compilers and related architectures covers aspects such as translation mechanisms for program constructs, their optimization techniques, memory ordering methods, program allocation and scheduling, and special architectural features. We also present some of our observations and future research directions obtained by exploring the literature.

6.

HPCCD: Hybrid Parallel Continuous Collision Detection using CPUs and GPUs

Duksu Kim Jae‐Pil Heo Jaehyuk Huh John Kim Sung‐eui Yoon 《Computer Graphics Forum》2009,28(7):1791-1800

We present a novel, hybrid parallel continuous collision detection (HPCCD) method that exploits the availability of multi‐core CPU and GPU architectures. HPCCD is based on a bounding volume hierarchy (BVH) and selectively performs lazy reconstructions. Our method works with a wide variety of deforming models and supports self‐collision detection. HPCCD takes advantage of hybrid multi‐core architectures – using the general‐purpose CPUs to perform the BVH traversal and culling while GPUs are used to perform elementary tests that reduce to solving cubic equations. We propose a novel task decomposition method that leads to a lock‐free parallel algorithm in the main loop of our BVH‐based collision detection to create a highly scalable algorithm. By exploiting the availability of hybrid, multi‐core CPU and GPU architectures, our proposed method achieves more than an order of magnitude improvement in performance using four CPU‐cores and two GPUs, compared to using a single CPU‐core. This improvement results in an interactive performance, up to 148 fps, for various deforming benchmarks consisting of tens or hundreds of thousands of triangles.
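To make the "elementary tests that reduce to solving cubic equations" concrete, the sketch below uses the common coplanarity formulation of continuous collision detection: four points moving linearly over a time step are coplanar when a scalar triple product, which is a cubic in t, vanishes. That this matches the paper's exact formulation is an assumption, and the helper names and the sampling-plus-bisection root finder are illustrative choices only. Sampling can miss pairs of roots that fall inside one sub-interval, and a coplanarity time is only a candidate that still needs an inside-triangle or closest-point check.

```python
def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def coplanarity_cubic(p, v):
    """Coefficients (k0, k1, k2, k3) of f(t) = k0 + k1 t + k2 t^2 + k3 t^3.

    p and v hold four start positions and four per-step displacements;
    f(t) is the triple product of the edge vectors at time t.
    """
    a, b, c = (sub(p[i], p[0]) for i in (1, 2, 3))
    da, db, dc = (sub(v[i], v[0]) for i in (1, 2, 3))
    k0 = dot(a, cross(b, c))
    k1 = dot(da, cross(b, c)) + dot(a, cross(db, c)) + dot(a, cross(b, dc))
    k2 = dot(da, cross(db, c)) + dot(da, cross(b, dc)) + dot(a, cross(db, dc))
    k3 = dot(da, cross(db, dc))
    return k0, k1, k2, k3


def roots_in_unit_interval(k, samples=64, iters=60):
    """Approximate real roots of the cubic in [0, 1] by sampling and bisection."""
    def f(t):
        return ((k[3] * t + k[2]) * t + k[1]) * t + k[0]

    roots = []
    for i in range(samples):
        lo, hi = i / samples, (i + 1) / samples
        flo, fhi = f(lo), f(hi)
        if flo == 0.0:
            roots.append(lo)              # root exactly at a sample point
        elif flo * fhi < 0.0:             # sign change: bisect the bracket
            for _ in range(iters):
                mid = 0.5 * (lo + hi)
                fmid = f(mid)
                if flo * fmid <= 0.0:
                    hi = mid
                else:
                    lo, flo = mid, fmid
            roots.append(0.5 * (lo + hi))
    if f(1.0) == 0.0:
        roots.append(1.0)
    return roots
```

For example, a vertex that starts one unit above a static triangle and moves two units down during the step yields a single candidate time at the middle of the step.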

7.

HPC‐GAP: engineering a 21st‐century high‐performance computer algebra system

Reimer Behrends Kevin Hammond Vladimir Janjic Alexander Konovalov Steve Linton Hans‐Wolfgang Loidl Patrick Maier Phil Trinder 《Concurrency and Computation》2016,28(13):3606-3636

Symbolic computation has underpinned a number of key advances in Mathematics and Computer Science. Applications are typically large and potentially highly parallel, making them good candidates for parallel execution at a variety of scales from multi‐core to high‐performance computing systems. However, much existing work on parallel computing is based around numeric rather than symbolic computations. In particular, symbolic computing presents particular problems in terms of varying granularity and irregular task sizes that do not match conventional approaches to parallelisation. It also presents problems in terms of the structure of the algorithms and data. This paper describes a new implementation of the free open‐source GAP computational algebra system that places parallelism at the heart of the design, dealing with the key scalability and cross‐platform portability problems. We provide three system layers that deal with the three most important classes of hardware: individual shared memory multi‐core nodes, mid‐scale distributed clusters of (multi‐core) nodes and full‐blown high‐performance computing systems, comprising large‐scale tightly connected networks of multi‐core nodes. This requires us to develop new cross‐layer programming abstractions in the form of new domain‐specific skeletons that allow us to seamlessly target different hardware levels. Our results show that, using our approach, we can achieve good scalability and speedups for two realistic exemplars, on high‐performance systems comprising up to 32000 cores, as well as on ubiquitous multi‐core systems and distributed clusters. The work reported here paves the way towards full‐scale exploitation of symbolic computation by high‐performance computing systems, and we demonstrate the potential with two major case studies. © 2016 The Authors. Concurrency and Computation: Practice and Experience Published by John Wiley & Sons Ltd.

8.

A scalable architecture for concurrent online auctions

Bill Karakostas 《Concurrency and Computation》2014,26(3):841-850

Online auction systems are characterised by a number of functional and performance management requirements, caused by the potentially very large numbers of distributed concurrent bidders, as well as by the auction rules. Such systems are typically implemented as three‐tier, thread‐based architectures, whose performance does not scale up well with an increase in the number of concurrent bidders. Nor can such systems take advantage of new Cloud‐based environments. In this paper, we propose an architectural framework for online auctions developed on top of a soft real‐time platform (Open Telecom Platform) using a concurrent language (Erlang) and an embedded Web server (Yaws). The proposed framework can scale up to hundreds of thousands of concurrent users while its performance can benefit from multicore and symmetric multiprocessing computer architectures. We demonstrate the capabilities of the framework by developing prototypes for two auction types known as ‘unique bid’ and ‘penny’, analyse their performance characteristics and compare them with those of existing auction systems. Copyright © 2013 John Wiley & Sons, Ltd.

9.

Glyphmaker: creating customized visualizations of complex data

Ribarsky W. Ayers E. Eble J. Mukherjea S. 《Computer》1994,27(7):57-64

Glyphmaker allows nonexpert users to customize their own graphical representations using a simple glyph editor and a point-and-click binding mechanism. In particular, users can create and then alter bindings to visual representations, bring in new data or glyphs with associated bindings, change ranges for bound data, and do these operations interactively. They can also focus on data down to any level of detail, including individual elements, and then isolate or highlight the focused region. These features empower users, letting them employ their specialized domain knowledge to create customized visual representations for further exploration and analysis. For ease of design and use, we built Glyphmaker on top of Iris Explorer, the Silicon Graphics Inc. (SGI) dataflow visualization system. The current version of Glyphmaker has been successfully tested on a materials system simulation. We are planning a series of tests and evaluations by scientists and engineers using real data.

10.

OR-parallel evaluation of logic programs on a multi-ring dataflow machine

A. V. S. Sastry L. M. Patnaik 《New Generation Computing》1991,10(1):23-53

Logic programming languages have gained wide acceptance for two reasons. The first is their clear declarative semantics, and the second is the wide scope for parallelism they provide, which can be exploited by building suitable parallel architectures. In this paper, we propose a multi-ring dataflow machine to support the OR-parallelism and the Argument parallelism of logic programs. A new scheme is suggested for handling the deferred read mechanism of the dataflow architecture. The required data structures, the dataflow actors and the built-in dataflow procedures for OR-parallel execution are discussed. Multiple binding environments arising in the OR-parallel execution are handled by a new scheme called the tagged variable scheme. Schemes for constrained OR-parallel execution are also discussed.

11.

A Processor Front-End Design That Caches Dataflow Information

刘炳涛 王达 叶笑春 张浩 范东睿 张志敏 《计算机研究与发展》2016,53(6):1221-1237

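To exploit both the thread-level and the instruction-level parallelism of programs, dynamic multi-core techniques adapt to diverse program demands by reconfiguring several small cores into one stronger virtual core. Such a virtual core usually performs worse than a native core occupying the same amount of chip resources. One important reason is that the front-end pipeline stages, such as instruction fetch, decode, and rename, process instructions serially and are therefore hard to make cooperate after reconfiguration. To address this problem, a new front-end structure, the dataflow cache, is proposed together with a matching vector renaming mechanism. The dataflow cache exploits the dataflow locality of programs, storing and reusing information such as the data dependences within instruction basic blocks. A processor core can use the dataflow cache to better exploit instruction-level parallelism and to reduce the branch misprediction penalty, while a virtual core in a dynamic multi-core design uses the dataflow cache to bypass the conventional front-end pipeline stages, which resolves the difficulty of front-end cooperation. Experiments on programs from SPEC CPU2006 show that the dataflow cache can cover more than 90% of the dynamic instructions of most programs at limited cost; the impact of adding the dataflow cache on pipeline performance is then analyzed. The experiments show that, with a front-end width of 4 instructions and an instruction window capacity of 512, a virtual core using the dataflow cache improves performance by 9.4% on average, and by up to 28% for some programs.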

12.

A Force-Directed Scheduling based architecture generation algorithm and design tool for FPGAs

Matthew Areno Brandon Eames Joshua Templin 《Journal of Systems Architecture》2010,56(2-3):124-135

13.

A Survey of Real-Time Microprocessor Architectures

石伟 张明 郭御风 龚锐 《计算机工程与科学》2015,37(5):857-864

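Real-time applications have become a rapidly rising class of typical embedded applications. As the core component of real-time systems, real-time microprocessor architecture is an important research direction in the microprocessor field. Unlike general-purpose processors, which pursue maximum throughput, real-time processors require a tight and computable worst-case execution time. Traditional real-time processors tend to adopt relatively simple processor structures to avoid the execution-time uncertainty introduced by complex structures. As real-time applications demand ever higher processor performance, real-time processors are gradually moving towards multi-threaded and multi-core structures. In multi-threaded and multi-core processors, contention for shared resources degrades the determinism of real-time systems and poses greater challenges for real-time processor architecture. This paper surveys real-time microprocessor architectures: it first analyzes traditional real-time processors in terms of instruction set, microarchitecture, memory, I/O, and task scheduling; it then analyzes high-performance real-time processors that adopt multi-threaded and multi-core structures; finally, it compares several commercial real-time processor architectures and summarizes the current state and future trends of real-time processor development.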

14.

Super-Streaming: A New Object Delivery Paradigm for Continuous Media Servers

Shahabi Cyrus Alshayeji Mohammad H. 《Multimedia Tools and Applications》2000,11(1):129-155

A number of studies have focused on the design of continuous media (CM) servers, e.g., for video and audio, to support the real-time delivery of CM objects. These systems have been deployed in local environments such as hotels, hospitals and cruise ships to support media-on-demand applications. They typically stream CM objects to the clients with the objective of minimizing the buffer space required at the client site. This objective can now be relaxed due to the availability of inexpensive storage devices at the client side. Therefore, we propose a Super-streaming paradigm that can utilize the client side resources in order to improve the utilization of the CM server. To support super-streaming, we propose a technique to enable the CM servers to deliver CM objects at a rate higher than their display bandwidth requirement. We also propose alternative admission control policies to downgrade super-streams in favor of regular streams when the resources are scarce. We demonstrate the superiority of our paradigm over streaming with both analytical and simulation models. Moreover, new distributed applications such as distance learning, digital libraries, and home entertainment require the delivery of CM objects to geographically dispersed clients. For quality purposes, many recent studies have proposed dedicated distributed architectures to support these types of applications. We extend our super-streaming paradigm to be applicable in such distributed architectures. We propose a sophisticated resource management policy to support super-streaming in the presence of multiple servers, network links and clients. Due to the complexity involved in modeling these architectures, we only evaluate the performance of super-streaming by a simulation study.
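The trade-off behind delivering an object faster than its display bandwidth can be illustrated with simple arithmetic. The sketch below is a back-of-the-envelope model, not the paper's analytical model: it assumes a constant delivery rate, a constant display rate, and playback that starts as soon as delivery starts. The function name and units (megabits and megabits per second) are chosen for this example.

```python
def super_stream_profile(size_mb, display_rate, delivery_rate):
    """Return (server_busy_time, display_time, peak_client_buffer).

    size_mb is the object size in megabits; both rates are in megabits
    per second. Assumes playback begins when delivery begins.
    """
    if display_rate <= 0 or delivery_rate < display_rate:
        raise ValueError("delivery rate must be at least the display rate")
    server_busy_time = size_mb / delivery_rate   # seconds the server is tied up
    display_time = size_mb / display_rate        # seconds the client plays
    # While data arrives, the buffer grows at (delivery - display) per second
    # and peaks at the moment the transfer completes.
    peak_client_buffer = (delivery_rate - display_rate) * server_busy_time
    return server_busy_time, display_time, peak_client_buffer
```

Under these assumptions, doubling the delivery rate halves the time the server is occupied with the object, at the cost of the client buffering up to half of the object.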

15.

An Instruction Mapping Optimization Method for Dataflow Architectures

李易 常成娟 卢圣健 江道忠 范东睿 叶笑春 《计算机工程与科学》2019,41(1):9-13

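In high-performance computing, dataflow is an important class of computing architecture that has also shown good performance and applicability in many practical scenarios. In the dataflow computing model, a program is represented as a dataflow graph, and a key problem is how to map the dataflow graph onto multiple execution units. By analyzing existing instruction mapping methods for dataflow architectures and their shortcomings, a new instruction mapping optimization method for dataflow architectures is proposed. The method mainly optimizes instruction mapping according to the characteristics of multi-address shared data packets: it delays the splitting of multi-address shared routing packets to reduce network congestion.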

16.

Tracking Control for Stochastic Multi‐Agent Systems Based on Hybrid Event‐Triggered Mechanism

Mali Xing Feiqi Deng 《Asian journal of control》2019,21(5):2352-2363

This paper deals with the leader‐following consensus for nonlinear stochastic multi‐agent systems. To save communication resources, a new centralized/distributed hybrid event‐triggered mechanism (HETM) is proposed for nonlinear multi‐agent systems. HETMs can be regarded as a synthesis of the continuous event‐triggered mechanism and the time‐driven mechanism, which can effectively avoid Zeno behavior. To model the multi‐agent systems under the centralized HETM, the switched system method is applied. By utilizing the property of the communication topology, low‐dimensional consensus conditions are obtained. For the distributed hybrid event‐triggered mechanism, due to the asynchronous event‐triggered instants, the time‐varying system method is applied. Meanwhile, the effect of network‐induced time‐delay on the consensus is also considered. To further reduce the computational resources consumed by constantly testing whether the broadcast condition has been violated, self‐triggered implementations of the proposed event‐triggered communication protocols are also derived. A numerical example is given to show the effectiveness of the proposed method.

17.

A task-uncoordinated distributed dataflow model for scalable high performance parallel program execution

《Parallel Computing》2016

18.

New techniques for simulating high performance MPI applications on large storage networks

Alberto Núñez Javier Fernández Jose D. Garcia Félix Garcia Jesús Carretero 《The Journal of supercomputing》2010,51(1):40-57

In this work, we propose new techniques to analyze the behavior, the performance, and especially the scalability of High Performance Computing (HPC) applications on different computing architectures. Our final objective is to test applications using a wide range of architectures (real or merely designed) and to scale them to any number of nodes or components. This paper presents a new simulation framework, called SIMCAN, for HPC architectures. The main characteristic of the proposed simulation framework is the ability to be configured for simulating a wide range of possible architectures that involve any number of components. SIMCAN is developed to simulate complete HPC architectures, with special emphasis on the storage and network subsystems. The SIMCAN framework can handle complete components (nodes, racks, switches, routers, etc.), but also key elements of the storage and network subsystems (disks, caches, sockets, file systems, schedulers, etc.). We also propose several methods to implement the behavior of HPC applications. Each method has its own advantages and drawbacks. In order to evaluate the possibilities and the accuracy of the SIMCAN framework, we have tested it by executing an HPC application called BIPS3D on a hardware-based computing cluster and on a modeled environment that represents the real cluster. We also checked the scalability of the application using this kind of architecture by simulating the same application with an increased number of computing nodes.

19.

Towards evolvable software architectures based on systems theoretic stability

Herwig Mannaert Jan Verelst Kris Ven 《Software》2012,42(1):89-116

20.

Shared memory multiprocessor architectures for software IP routers

Luo Y. Laxmi Narayan Bhuyan Chen X. 《Parallel and Distributed Systems, IEEE Transactions on》2003,14(12):1240-1249

We propose new shared memory multiprocessor architectures and evaluate their performance for future Internet protocol (IP) routers based on symmetric multiprocessor (SMP) and cache coherent nonuniform memory access (CC-NUMA) paradigms. We also propose a benchmark application suite, RouterBench, which consists of four categories of applications representing key functions on the time-critical path of packet processing in routers. An execution-driven simulation environment is created to evaluate SMP and CC-NUMA router architectures using RouterBench. The execution-driven simulation can produce accurate cycle-level execution time predictions and reveal the impact of various architectural parameters on the performance of routers. We port the FUNET trace and its routing table for use in our experiments. We find that the CC-NUMA architecture provides excellent scalability for the design of high-performance IP routers. Results also show that the CC-NUMA architecture can sustain good lookup performance, even at a high frequency of route updates.