Similar literature
Found 20 similar documents; search took 31 ms
1.
Electron Repulsion Integrals (ERIs) are a common bottleneck in ab initio computational chemistry. It is known that sorted/reordered execution of ERIs results in efficient SIMD/vector processing. This paper shows that reconfigurable computing and heterogeneous processor architectures can also benefit from a deliberate ordering of ERI tasks. However, realizing these benefits as net speedup requires a very rapid sorting mechanism. This paper presents two such mechanisms. Included in this study are analytical, simulation-based, and experimental benchmarking approaches to consider five use cases for ERI sorting, i.e. SIMD processing, reconfigurable computing, limited address spaces, instruction cache exploitation, and data cache exploitation. Specific consideration is given to existing cache-based processors, FPGAs, and the Cell Broadband Engine processor. It is proposed that the analyses conducted in this work should be built upon to aid the development of software autotuners which will produce efficient ab initio computational chemistry codes for a variety of computer architectures.

2.
Louri, A. IEEE Micro, 1991, 11(2)
A 3-D optical architecture currently under investigation is described. This model, a single-instruction, multiple-data (SIMD) system, exploits spatial parallelism and processes 2-D binary images as fundamental computational entities using symbolic substitution logic. This system effectively implements highly structured data-parallel algorithms, such as signal and image processing, partial differential equations, multidimensional numerical transforms, and numerical supercomputing. The model includes a hierarchical mapping technique that helps design the algorithms and maps them onto the proposed optical architecture. The symbolic substitution logic and the mapping of data-parallel algorithms are discussed. The theoretical performance of the optical system was estimated and compared with that of electronic SIMD array processors. Preliminary results show that the system provides greater computational throughput and efficiency than its electronic counterparts.

3.
Hardware parallelism should be exploited to improve the performance of computing systems. Single instruction multiple data (SIMD) architecture has been widely used to maximize the throughput of computing systems by exploiting hardware parallelism. Unfortunately, branch divergence due to branch instructions causes underutilization of computational resources, resulting in performance degradation of SIMD architecture. The graphics processing unit (GPU) is a representative parallel architecture based on SIMD architecture. In recent computing systems, GPUs can process general-purpose applications as well as graphics applications with the help of convenient APIs. However, contrary to graphics applications, general-purpose applications include many branch instructions, resulting in serious performance degradation of the GPU due to branch divergence. In this paper, we propose a concurrent warp execution (CWE) technique to reduce the performance degradation of the GPU in executing general-purpose applications by increasing resource utilization. The proposed CWE enables selecting co-warps to activate more threads in the warp, leading to concurrent execution of combined warps. According to our simulation results, the proposed architecture provides a significant performance improvement (5.85% over PDOM, 91% over DWF) with little hardware overhead.
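The underutilization caused by branch divergence can be illustrated with a small model. The sketch below is not the CWE technique from the abstract; it only assumes a simplified SIMD machine in which the two sides of a branch are executed one after the other, with lanes on the inactive side idling. The function name `simd_utilization` and its parameters are hypothetical.

```python
def simd_utilization(taken_mask, then_cycles, else_cycles):
    """Fraction of lane-cycles doing useful work for one divergent branch.

    Assumed model: lanes whose mask entry is True run the 'then' path,
    the others run the 'else' path, and the two paths are serialized.
    """
    width = len(taken_mask)
    taken = sum(1 for t in taken_mask if t)
    not_taken = width - taken

    # A path costs cycles only if at least one lane needs it.
    total_cycles = (then_cycles if taken else 0) + (else_cycles if not_taken else 0)
    if width == 0 or total_cycles == 0:
        return 1.0  # nothing to execute, so nothing is wasted

    useful_lane_cycles = taken * then_cycles + not_taken * else_cycles
    return useful_lane_cycles / (width * total_cycles)
```

Under this model a 32-lane warp that splits evenly between two equally long paths keeps only half of its lanes busy, which is the kind of loss that warp-combining schemes try to recover.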

4.
Pixel-per-processing element (PPE) ratio, the amount of image data directly mapped to each processing element, has a significant impact on the area and energy efficiency of embedded SIMD architectures for image processing applications. This paper quantitatively evaluates the impact of PPE ratio on system performance and efficiency for focal-plane SIMD image processing architectures by comparing throughput, area efficiency, and energy efficiency for a range of common application kernels using architectural and workload simulation. While the impact of grain size is affected by the mix of executed instructions within an application program, the most efficient PPE ratio often does not occur at PE grain size extremes (i.e., one pixel per processor or one processor per image). In this study, a set of four image processing application tasks is implemented on eight different SIMD configurations. Each configuration has a different PPE ratio and a different amount of local memory. Cycle-accurate simulation and analytical technology modeling allow assessment of execution performance, area efficiency, and energy efficiency for each configuration. Results show the highest area and energy efficiency are achieved at PPE ratios between 16 and 256. Using these evaluation techniques (application grain size retargeting combined with area and energy technology modeling), a new class of efficient, embedded SIMD architectures for image processing can be designed.

5.
In the last decade, the volume of unstructured data that Internet and enterprise applications create and consume has been growing at impressive rates. The tools we use to process these data are search engines, business analytics suites, natural-language processors and XML processors. These tools rely on tokenization, a form of regular expression matching aimed at extracting words and keywords in a character stream. The further growth of unstructured data-processing paradigms depends critically on the availability of high-performance tokenizers. Despite the impressive amount of parallelism that the multi-core revolution has made available (in terms of multiple threads and wider SIMD units), most applications employ tokenizers that do not exploit this parallelism. I present a technique to design tokenizers that exploit multiple threads and wide SIMD units to process multiple independent streams of data at a high throughput. The technique benefits indefinitely from any future scaling in the number of threads or SIMD width. I show the approach’s viability by presenting a family of tokenizer kernels optimized for the Cell/B.E. processor that deliver a performance seen, so far, only on dedicated hardware. These kernels deliver a peak throughput of 14.30 Gbps per chip, and a typical throughput of 9.76 Gbps on Wikipedia input. Also, they achieve almost-ideal resource utilization (99.2%). The approach is applicable to any SIMD enabled processor and matches well the trend toward wider SIMD units in contemporary architecture design.

6.
The architectural landscape of high-performance computing stretches from superscalar uniprocessors to explicitly parallel systems, to dedicated hardware implementations of algorithms. Single-purpose hardware can achieve the highest performance and uniprocessors can be the most programmable. Between these extremes, programmable and reconfigurable architectures provide a wide range of choice in flexibility, programmability, computational density, and performance. The UCSC Kestrel parallel processor strives to attain single-purpose performance while maintaining user programmability. Kestrel is a single-instruction stream, multiple-data stream (SIMD) parallel processor with a 512-element linear array of 8-bit processing elements. The system design focuses on efficient high-throughput DNA and protein sequence analysis, but its programmability enables high performance on computational chemistry, image processing, machine learning, and other applications. The Kestrel system has had unexpected longevity in its utility due to a careful design and analysis process. Experience with the system leads to the conclusion that programmable SIMD architectures can excel in both programmability and performance. This work presents the architecture, implementation, applications, and observations of the Kestrel project at the University of California at Santa Cruz.

7.
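A two-dimensional SIMD structure is an array of N×N processing elements connected in a given topology, in which the processing elements of the same row or column operate in SIMD fashion. Two-dimensional SIMD structures are widely used as multimedia acceleration units in SoCs for multimedia processing, so their architectural design is an important factor in achieving high-performance multimedia computing. Based on the characteristics of multimedia applications, this work analyzes how different design parameters affect the performance of a two-dimensional SIMD structure, and designs and implements a performance simulator for such structures. Experimental results show that the two-dimensional SIMD structure achieves good speedups on multimedia programs and confirm the conclusions of the analysis.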

8.
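Digital image processing involves a large volume of data operations and requires very high data throughput from the system. Parallel processing architectures meet this requirement well. This paper introduces a SIMD parallel multi-DSP digital image processing system. Its advantages include conflict avoidance, the ability to process image data continuously, simple inter-processor communication and I/O, and modular hardware and software.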

9.
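To address how high-performance multi-core devices can be used to raise the processing capacity of network security products, a multi-core binding technique for the Linux platform on the x86 architecture is designed and implemented. The technique first establishes a DMA buffer queue mapping to reduce the number of network card accesses, designs and implements virtual data buckets based on the SIMD multi-core idea, and load-balances the data entering the buckets. It then makes the Netfilter main function multithreaded and uses kernel thread binding to bind the threads to designated cores. Experimental results show that the DMA buffer queue mapping improves the I/O throughput of network devices, that the virtual data buckets reduce the overhead of a second packet copy and save kernel-space memory, and that multi-core binding improves the multi-core utilization and packet processing capacity of network security devices.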

10.
A Partial Vector Register Reuse Method for SIMD Optimization   Cited by: 1 (self-citations: 0, citations by others: 1)
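SIMD architectures, used for multimedia acceleration, are now widespread in modern general-purpose processors. Their data parallelism can greatly increase a processor's computing power, but because the memory system is far too slow to keep pace, application performance is hard to improve further. Based on the memory access characteristics of SIMD architectures, this paper therefore proposes a partial vector register reuse method to improve memory access efficiency. It also gives a corresponding program transformation algorithm that uses data dependence analysis to generate optimized code with partial vector register reuse when an application is vectorized. Experimental results show that the algorithm significantly improves the performance of multimedia applications.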

11.
In this article, we present the prototypical implementation of the scalable GigaNetIC chip multiprocessor architecture. We use an FPGA-based rapid prototyping system to verify the functionality of our architecture in a network application scenario before fabricating the ASIC in a modern CMOS standard cell technology. The rapid prototyping environment gives us the opportunity to test our multiprocessor architecture with Ethernet-based data streams in a real network scenario. Our system concept is based on a massively parallel processor structure. Due to its regularity, our architecture can be easily scaled to accommodate a wide range of packet processing applications with various performance and throughput requirements at high reliability. Furthermore, the composition based on predefined building blocks guarantees fast design cycles and simplifies system verification. We present standard cell synthesis results as well as a performance analysis for a firewall application with various couplings of hardware accelerators. Finally, we compare implementations of our architecture with state-of-the-art desktop CPUs. We use simple, general-purpose applications as well as the introduced packet processing tasks to determine the performance capabilities and the resource efficiency of the GigaNetIC architecture. We show that, if supported by the application, parallelism offers more opportunities than increasing clock frequencies.

12.
In this paper, we present a system called KAFA (Kaist Fuzzy Accelerator) which provides various fuzzy inference methods and fuzzy set operations. The basic idea of this study is to develop a more general-purpose hardware system. The architecture has a SIMD structure, which consists of two parts: a system control unit (main controller) and an arithmetic unit (fuzzy processing element (FPE)). Microinstruction codes are defined and any fuzzy operation can be programmed by using these microinstructions. Each FPE has a maximum speed of 10 MFLOPS. As the KAFA contains 128 FPEs, if a fuzzy set consists of 128 elements, we achieve a peak performance of 10 M FSOPS (fuzzy set operations per second) at a 10 MHz clock frequency. This system also includes parallel algorithms for defuzzification on the SIMD-mode architecture using the KAFA network. The prototype of the proposed architecture was developed with FPGA chips. The speed of the KAFA holds promise for the development of new fuzzy application systems such as automatic control, fuzzy expert systems, real-time systems and fuzzy databases.
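The peak figure quoted above follows from simple arithmetic, sketched here under stated assumptions: each FPE completes one elementwise operation per clock cycle, and a fuzzy set larger than the array would need several passes. The function `peak_fsops` and the multi-pass rule are hypothetical illustrations, not part of the KAFA description.

```python
def peak_fsops(clock_hz, num_fpes, set_size, ops_per_cycle=1):
    """Estimate peak fuzzy set operations per second on a SIMD array.

    Assumption: one pass over the array handles up to num_fpes elements,
    so a set of set_size elements needs ceil(set_size / num_fpes) passes.
    """
    passes = -(-set_size // num_fpes)  # integer ceiling division
    return clock_hz * ops_per_cycle / passes


# 128 FPEs at 10 MHz with a 128-element fuzzy set: one set operation per cycle.
kafa_peak = peak_fsops(clock_hz=10e6, num_fpes=128, set_size=128)
```

With these inputs the estimate is 10 million set operations per second, matching the 10 M FSOPS stated in the abstract; a 256-element set would halve it under the same assumptions.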

13.
Control architectures based on emotions are becoming promising solutions for the implementation of future robotic systems. The basic controllers of this architecture are the emotional processes that decide which behaviors the robot must activate to fulfill the objectives. The number of emotional processes increases (hundreds of millions/s) with the complexity level of the application, limiting the capacity of a main processor to solve the complex problems. Fortunately, the potential parallelism of emotional processes permits their execution in parallel, hence enabling the computing power to tackle the complex dynamic problems. In this paper, graphics processing units (GPUs), multicore processors and single instruction multiple data (SIMD) instructions are used to provide parallelism for the emotional processes. Different GPUs, multicore processors and SIMD instruction sets are evaluated and compared to analyze their suitability to cope with robotic applications. The applications are set up taking into account different environmental conditions, robot dynamics and emotional states. Experimental results show that, despite the fact that GPUs have a bottleneck in the data transmission between the host and the device, the evaluated GTX 670 GPU provides a performance of more than one order of magnitude greater than the initial implementation of the architecture on a single core. Thus, all complex proposed application problems can be solved using the GPU technology in contrast to the first prototype where only 55% of them could be solved. Using AVX SIMD instructions, the performance of the architecture is increased 3.25 times relative to the first implementation. Thus, from the 27 proposed applications about 88.8% are solved. In the case of the SSE SIMD instructions, the performance is almost doubled and the robot could solve about 74% of the proposed application problems. The use of AVX and SSE SIMD instructions provides almost the same performance as a quad- and a dual-core, respectively, with the advantage that they do not add any additional hardware cost.

14.
Computer, 2007, 40(3): 93-95
Conventional wisdom holds that the service-oriented architecture approach is the silver bullet for all IT problems nowadays. According to this view, SOA leads to near-perfect applications in which every function is implemented as a service, and a service can call any other service to implement its functionality. This includes not only services that provide business functionality, but also nonfunctional services for logging, monitoring, data transformation, and so on. Services that exist as independent concepts at design time are implemented as independent execution entities at runtime. Assuming that the conceptual system structure is equally useful during execution is a naive and potentially dangerous mistake. Instead, applying high-performance transaction system design criteria that optimize for runtime properties like performance, throughput, and resiliency should be paramount.

15.
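In conventional graphics processors, the pixel blending unit is implemented with fixed-function circuitry. This work presents the design of a high-performance programmable pixel renderer for mobile devices. The processor uses fixed-point operations and implements a 4-way, 128-bit single-instruction, multiple-data arithmetic unit and an 8-stage pipeline with data bypassing. These architectural features effectively reduce circuit area and increase the renderer's computing speed. Experiments on an FPGA platform show that users can program custom pixel blending algorithms to render a variety of special effects.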

16.
High chip-level integration enables the implementation of large-scale parallel processing architectures with 64 or more processing nodes on a single chip or on an FPGA device. These parallel systems require a cost-effective yet high-performance interconnection scheme to provide the needed communications between processors. The massively parallel Network on Chip (mpNoC) was proposed to address the demand for parallel irregular communications for massively parallel processing System on Chip (mppSoC). Targeting FPGA-based design, an efficient mpNoC low-level RTL implementation is proposed taking into account design constraints. The proposed network is designed as an FPGA-based Intellectual Property (IP) block that can be configured in different communication modes. It can communicate between processors and also perform parallel I/O data transfer, which is clearly a key issue in a SIMD system. The mpNoC RTL implementation shows good performance in terms of area, throughput and power consumption, which are important metrics for an on-chip implementation. mpNoC is a flexible architecture that is suitable for use in FPGA-based parallel systems. This paper introduces the basic mppSoC architecture. It mainly focuses on the mpNoC flexible IP-based design and its implementation on FPGA. The integration of mpNoC in mppSoC is also described. Implementation results on a Stratix II FPGA device are given for three data-parallel applications run on mppSoC. The good performance obtained justifies the effectiveness of the proposed parallel network. It is shown that the mpNoC is a lightweight parallel network, making it suitable for both small and large FPGA-based parallel systems.

17.
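The multimedia intelligent database system (MIDS) is an object database management system (ODBMS). The choice of architecture has a major influence on its performance and functionality. In choosing an architecture we consistently followed one rule: performance matters more, because functionality can be added in applications layered on top of the ODBMS, whereas performance deficiencies cannot be compensated for at the application level. This paper discusses and compares several ODBMS architectures and their implementation approaches, and finally presents a MIDS architecture with relatively high performance.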

18.
SIMD arrays are likely to become increasingly important as coprocessors in domain specific systems as architects continue to leverage RAM technology in their design. The problem this work addresses is the efficient evaluation of SIMD arrays with respect to complex applications while accounting for operating frequency and chip area. The underlying issues include the size of the architecture space, the lack of portability of the test programs, and the inherent complexity of simulating up to hundreds of thousands of processing elements. The overall method we use is to combine architecture level and Electronic Design Automation (EDA) level modeling by using an EDA-based tool to calibrate architectural simulations. The resulting system retains much of the high throughput of the architecture level simulator but it also has accuracy similar to that of an early pass EDA synthesis and circuit simulation. The particular problem of computational cost of the architectural level simulation is addressed with a novel approach to trace-based simulation (we call it trace compilation), which we find to be one to two orders of magnitude faster than instruction level simulation while still retaining much of the accuracy of the model. Furthermore, traces must be generated for only a small fraction of the possible parameter combinations. Using trace compilation also addresses program portability by allowing the user to code in a single data parallel language with a single compiler, regardless of the target architecture. We have used our system to evaluate thousands of potential SIMD array designs with respect to real applications and present some sample results.

19.
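In applications such as computer graphics, integral computation, and neural networks, a high-performance implementation of the square root function plays a very important role in building a processor's basic software ecosystem. As ARM processors become widely used, research on fast function implementations for the ARM architecture becomes more critical. Many current processors adopt a SIMD architecture, so SIMD-based high-performance methods for computing functions have significant research value and good prospects. This work therefore implements and optimizes a high-performance square root function. By analyzing how IEEE 754 floating-point numbers are stored in memory, an efficient square root algorithm is designed; its accuracy is then improved by combining the reciprocal square root with a Taylor-formula algorithm; finally, SIMD optimization further improves its performance. Experimental results show that, while meeting the accuracy requirement, the implemented square root function improves performance by about 7 times compared with the libm library and by about 3 times compared with the square root instruction provided by ARMv8.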
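The general idea of deriving a square root from the IEEE 754 bit layout and a reciprocal square root can be sketched as follows. This is not the paper's algorithm: it swaps in the widely known single-precision bit-shift estimate with the constant 0x5F3759DF and refines it with Newton-Raphson steps instead of the Taylor-formula refinement the abstract mentions, and it is scalar rather than SIMD. All function names are hypothetical.

```python
import math
import struct


def float_to_bits(x):
    """Reinterpret x, rounded to IEEE 754 single precision, as a 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(b):
    """Reinterpret a 32-bit integer as an IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def approx_sqrt(x, iterations=3):
    """Approximate sqrt(x) as x * rsqrt(x) for normal, non-negative x."""
    if x < 0:
        raise ValueError("square root of a negative number")
    if x == 0:
        return 0.0
    # Halving the exponent field via a shift gives a rough 1/sqrt(x) estimate.
    y = bits_to_float(0x5F3759DF - (float_to_bits(x) >> 1))
    # Each Newton-Raphson step roughly squares the relative error.
    for _ in range(iterations):
        y = y * (1.5 - 0.5 * x * y * y)
    return x * y


def relative_error(x):
    """Relative error of approx_sqrt against math.sqrt for x > 0."""
    exact = math.sqrt(x)
    return abs(approx_sqrt(x) - exact) / exact
```

The sketch only handles inputs in the normal single-precision range; very large values would overflow the `struct.pack("<f", ...)` conversion, and subnormal inputs are not treated specially.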

20.
The ever-increasing need for high performance in scientific computation and engineering applications will push high-performance computing beyond the exascale. As an integral part of a supercomputing system, high-performance processors and their architecture designs are crucial in improving system performance. In this paper, three architecture design goals for high-performance processors beyond the exascale are introduced, including effective performance scaling, efficient resource utilization, and adaptation to diverse applications. Then a high-performance many-core processor architecture with scalar processing and application-specific acceleration (Massa) is proposed, which aims to achieve the above three goals by employing the techniques of distributed computational resources and application-customized hardware. Finally, some future research directions regarding the Massa architecture are discussed.
