Similar literature
 Found 20 similar documents; search took 15 ms
1.
Svend E. Knudsen 《Software》2011,41(4):393-402
A simple programming abstraction based on the notion of independence is introduced as a means for mapping the independence inherent in an algorithm explicitly into its programmed solution. This enables a compiler and runtime system to exploit the independence and achieve efficient parallelism of execution on multicore processors. The constructs needed to express mutual independence among statements are proposed and their implementation in iOberon, an extension of the Active Oberon programming language, is defined. The programming language extensions, runtime support, and performance measurements are described in detail. We believe that this concept of specifying local disjoint program fragments can be applied to other programming languages. Copyright © 2010 John Wiley & Sons, Ltd.

2.
We tackle the parallelization of Non-Negative Matrix Factorization (NNMF), using the Alternating Least Squares and Lee and Seung algorithms, motivated by its use in audio source separation. For the first algorithm, a very suitable technique is the use of active set algorithms for solving several least squares problems with non-negativity inequality constraints. We have addressed the NNMF for dense matrices on multicore architectures by organizing these optimization problems by independent columns. Although in the sequential case the method is not as efficient as the block pivoting variant used by other authors, it is very effective in the parallel case, producing satisfactory results for the type of applications where it is to be used. For the Lee and Seung method, we propose a reorganization of the algorithm steps that increases the convergence speed, and a parallelization of the solution. The article also includes a theoretical and experimental study of the performance obtained with matrices similar to those that arise in the applications that have motivated this work.
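As a point of reference for entry 2, the following is a minimal sketch of the textbook Lee and Seung multiplicative updates for the Frobenius-norm objective, written in plain Python. It is not the reorganized or parallel variant that the abstract describes; the function names (`nmf_multiplicative`, `frobenius_error`), the random initialization, and the small `eps` guard against division by zero are all assumptions made for illustration.

```python
import random

def matmul(a, b):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf_multiplicative(v, rank, iterations=200, seed=0, eps=1e-12):
    """Factor a non-negative V (rows x cols) into W (rows x rank), H (rank x cols)."""
    rng = random.Random(seed)
    rows, cols = len(v), len(v[0])
    # Strictly positive random start (an assumption; other initializations exist).
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(rows)]
    h = [[rng.random() + 0.1 for _ in range(cols)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(cols)]
             for a in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(rank)]
             for i in range(rows)]
    return w, h
```

In the H update, column j of H is computed from column j of V and the current W only, so the columns can in principle be processed independently of one another.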

3.
We present techniques for exploiting fine-grained parallelism extracted from sequential programs on a fine-grained MIMD system. The system exploits fine-grained parallelism through parallel execution of instructions on multiple processors as well as the pipelined nature of individual processors. The processors can communicate data values via globally shared registers as well as dedicated channel queues. Compilation techniques are presented to utilize these mechanisms. A scheduling algorithm has been developed to distribute operations among the processors in a manner that reduces communication among the processors. The compiler identifies data dependencies which require synchronization and enforces them using channel queues. Delays that may result from attempting write operations to a full channel queue are avoided by spilling values from channels to local registers. If an interprocessor data dependency does not require synchronization, then the data value is passed through a shared register or shared memory. Partially supported by National Science Foundation Presidential Young Investigator Award CCR-9157371 (CCR-9249143) to the University of Pittsburgh.

4.
5.
The MAJC architecture enhances application performance by exploiting parallelism at multiple levels: instruction, data, thread, and process. Supporting vertical multithreading, speculative multithreading, and chip multiprocessors, the scalable VLIW architecture is also capable of advanced speculation and predication and treats all data types similarly.

6.
Algorithmic and implementation principles for gainfully exploiting GPU accelerators in conjunction with multicore processors on high-end systems with large numbers of compute nodes are explored, and evaluated in an implementation of a scalable block tridiagonal solver. The accelerator of each compute node is exploited in combination with the multicore processors of that node in performing block-level linear algebra operations in the overall, distributed solver algorithm. Optimizations incorporated include: (1) an efficient memory mapping and synchronization interface to minimize data movement, (2) multi-process sharing of the accelerator within a node to obtain balanced load with multicore processors, and (3) an automatic memory management system to efficiently utilize accelerator memory when sub-matrices spill over the limits of device memory. Results are reported from our novel implementation that uses MAGMA and CUBLAS accelerator software systems simultaneously with ACML (2013) [2] for multithreaded execution on processors. Overall, using 940 nVidia Tesla X2090 accelerators and 15,040 cores, the best heterogeneous execution delivers a 10.9-fold reduction in run time relative to an already efficient parallel multicore-only baseline implementation that is highly optimized with intra-node and inter-node concurrency and computation–communication overlap. Detailed quantitative results are presented to explain all critical runtime components contributing to hybrid performance.

7.
Energy is increasingly becoming the major limitation in designing multicore chips. Power and performance, the two main components of energy, are inversely correlated in multicore architectures. This research focuses on optimizing the energy consumption of multicore chips running parallel workloads by optimizing either power or performance, using machine learning combined with dynamic programming. To this end, a novel dynamic machine learning-based heuristic energy optimization (DML-HEO) algorithm has been designed and developed for application-specific controllers to optimize energy consumption on multicore architectures. DML-HEO is implemented on the controller either to maximize performance within a fixed power budget or to minimize the power consumed to achieve the same baseline performance. The controller is also scalable, as it does not incur significant overhead as the number of cores increases. The strategy has been evaluated using controllers on a full-system simulator at lab scale. The experimental results demonstrate that the proposed DML-HEO system performs better than the traditional system.

8.
Resource integration using a large knowledge base in Carnot   Total citations: 2, self-citations: 0, citations by others: 2
Collet, C.; Huhns, M.N.; Shen, W.-M. 《Computer》1991,24(12):55-62
A method for integrating separately developed information resources that overcomes incompatibilities in syntax and semantics and permits the resources to be accessed and modified coherently is described. The method provides logical connectivity among the information resources via a semantic service layer that automates the maintenance of data integrity and provides an approximation of global data integration across systems. This layer is a fundamental part of the Carnot architecture, which provides tools for interoperability across global enterprises.

9.
10.
We present a novel architecture to develop Virtual Environments (VEs) for multicore CPU systems. An object-centric method provides a uniform representation of VEs. The representation enables VEs to be processed in parallel using a multistage, dual-frame pipeline. Dynamic work distribution and load balancing are accomplished using a thread migration strategy with minimal overhead. This paper describes our approach, and shows it is efficient and scalable with performance experiments. Near linear speed-ups have been observed in experiments involving up to 1,000 deformable objects on a six-core i7 CPU. This approach’s practicality is demonstrated with the development of a medical simulation trainer for a craniotomy procedure.

11.
Distributed control systems are currently evolving towards industrial Internet of Things (IoT) systems communicating fully using Internet protocols. This creates opportunities for streamlining costly commissioning processes, which today require substantial manual work for installing, configuring, and integrating thousands of actuators and sensors. The vision of “plug-and-produce” control systems has been pursued for more than 15 years, but existing approaches fell short regarding configuration tasks and vendor neutrality. This paper introduces the standards-based IoT reference architecture OpenPnP, which allows largely automating the configuration and integration tasks of industrial commissioning processes. The architecture includes a number of design and technology decisions, and the required implementation can be scaled down to resource-constrained industrial devices. This paper demonstrates how OpenPnP can reduce configuration and integration efforts by up to 90% in typical settings, while potentially scaling well up to tens of thousands of communicated signals. Practitioners can orient their implementations towards OpenPnP, therefore potentially enabling “plug-and-produce” in many thousands of control systems.

12.
In a typical ambulatory health monitoring system, wearable medical sensors are deployed on the human body to continuously collect and transmit physiological signals to a nearby gateway that forwards the measured data to a cloud-based healthcare platform. However, this model often fails to respect the strict requirements of healthcare systems. Wearable medical sensors are very limited in terms of battery lifetime; in addition, the system's reliance on a cloud makes it vulnerable to connectivity and latency issues. Compressive sensing (CS) theory has been widely deployed in electrocardiogram (ECG) monitoring applications to optimize the wearable sensors' power consumption. The solution proposed in this paper aims to tackle these limitations by empowering a gateway-centric connected health solution, where the most power-consuming tasks are performed locally on a multicore processor. This paper explores the efficiency of real-time CS-based recovery of ECG signals on an IoT gateway embedded with ARM’s big.LITTLE™ multicore for different signal dimensions and allocated computational resources. Experimental results show that the gateway is able to reconstruct ECG signals in real time. Moreover, they demonstrate that using a high number of cores speeds up execution and further optimizes energy consumption. The paper identifies the resource allocation configurations that provide the best performance. The paper concludes that multicore processors have the computational capacity and energy efficiency to promote gateway-centric solutions rather than cloud-centric platforms.

13.
Multicore computers are expected to be used to process a higher volume of data in the future. Current mesh-like multicore architecture is inadequate to increase memory-level parallelism because of its poor core-to-core interconnection topology. In some architectures, each node has communication and computation components; the switching component of such a node consumes power while the node is only computing, and vice versa. In this paper, we propose a folded-torus based topology to improve performance and energy saving. In this architecture, nodes are separated into network switches and computing cores. Using the folded-torus concept, we develop a scheme to connect the components (switches and cores) of a multicore architecture. Experimental results show that the proposed architecture outperforms Raw Architecture Workstation (RAW), Triplet Based Architecture (TriBA), and Logic-Based Distributed Routing (LBDR) architecture by reducing the number of switches by more than 53%, the power consumption by up to 71%, and the average delay by up to 58%.

14.
HPRC (High-Performance Reconfigurable Computing) systems include multicore processors and reconfigurable devices acting as custom coprocessors. Due to economic constraints, the number of reconfigurable devices is usually smaller than the number of processor cores, which prevents a 1:1 mapping between cores and coprocessors. This paper presents a solution to this problem, based on the virtualization of reconfigurable coprocessors. A Virtual Coprocessor Monitor (VCM) has been devised for the XtremeData XD2000i In-Socket Accelerator, and a thread-safe API is available for user applications to communicate with the VCM. Two reference applications, an IDEA cipher and an Euler CFD solver, have been implemented in order to validate the proposed architecture and execution model. Results show that the benefits arising from coprocessor virtualization outweigh its overhead, especially when code has a significant software weight.

15.
16.
Enterprise mashups leverage various sources of information to compose new situational applications. The architecture of such applications must address integration issues: it needs to deal with heterogeneous local and/or public data sources, and build value-added applications on existing corporate IT systems. In this paper, we leverage enterprise architecture integration patterns to compose reusable mashup components. We present a service-oriented architecture that addresses reusability and integration needs for building enterprise mashup applications. Key techniques to customize this architecture are developed for mashups with themed data on location maps. The usage of this architecture is illustrated by a property valuation application derived from a real-world scenario. We demonstrate and discuss how this state-of-the-art architecture design method can be applied to enhance the design and development of emerging enterprise mashups.

17.
The Parallel.FX Task Parallel Library is the latest tool developed for multicore parallelism optimization using the .NET technology. It is a managed concurrency library that provides optimized managed code for multicore processors using a new thread pool that supports cancellation, waiting and pool isolation, among many other features. The Task Parallel Library also uses dynamic work stealing techniques for superior scalability. This paper analyzes the performance improvement of using the Task Parallel Library of Parallel.FX when applying a Multi-Objective Evolutionary Algorithm to solve a timetabling problem. For comparative purposes, this algorithm has also been parallelized using threads. The results obtained show that both alternatives allow a reduction in the runtime necessary to solve this problem. However, parallelizing the code using the Task Parallel Library of Parallel.FX has the advantage of being easier, and the code size is much smaller than when directly programming threads.

18.
Embedded applications are becoming increasingly complex and processing ever-increasing datasets. In the context of data-intensive embedded applications, there have been two complementary approaches to enhancing application behavior, namely, data locality optimizations and improving loop-level parallelism. Data locality needs to be enhanced to maximize the number of data accesses satisfied from the higher levels of the memory hierarchy. On the other hand, compiler-based code parallelization schemes require a fresh look for chip multiprocessors as interprocessor communication is much cheaper than off-chip memory accesses. Therefore, a compiler needs to minimize the number of off-chip memory accesses. This can be achieved by considering multiple loop nests simultaneously. Although compilers address these two problems, there is an inherent difficulty in optimizing both data locality and parallelism simultaneously. Therefore, an integrated approach that combines these two can generate much better results than each individual approach. Based on these observations, this paper proposes a constraint network (CN)-based formulation for data locality optimization and code parallelization. The paper also presents experimental evidence, demonstrating the success of the proposed approach, and compares our results with those obtained through previously proposed approaches. The experiments from our implementation indicate that the proposed approach is very effective in enhancing data locality and parallelization.

19.
《电子技术应用》2017,(3):16-20
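FFT convolution is now widely used in digital signal processing, and the past few years have seen the development of heterogeneous multicore programmable systems (HMPS), which have become a mainstream trend in the DSP field. Studying efficient implementations of large-point FFT convolution on HMPS is therefore very important. Based on the overlap-add FFT convolution method, an efficient pipelined overlap-add filter for input data streams is designed. An HMPS-based implementation of large-point FFT convolution is presented that achieves high-precision filtering. In addition, the pipelined filter design improves the system's processing speed, data throughput, and task parallelism. Experiments on a Xilinx XC7V2000T FPGA development board show that the larger the number of sample points involved in the computation, the higher the system's task parallelism, processing speed, and data throughput. When the number of sample points reaches 1M, the system's average task parallelism reaches 5.33, 2.745×10^6 system clock cycles are consumed, and the absolute error precision reaches 10^(-4).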
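The overlap-add FFT convolution method named in entry 19 can be sketched in plain Python as follows. This is a minimal software illustration only, not the pipelined HMPS/FPGA design from the abstract: the function names (`fft`, `overlap_add`, `direct_convolve`), the default block length of 8, and the recursive radix-2 FFT are all assumptions chosen to keep the example self-contained.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def overlap_add(signal, kernel, block=8):
    """Linear convolution of a real signal with a real kernel, block by block."""
    m = len(kernel)
    # FFT size: smallest power of two that holds one block's full convolution.
    n_fft = 1
    while n_fft < block + m - 1:
        n_fft *= 2
    kernel_f = fft(list(kernel) + [0.0] * (n_fft - m))
    out = [0.0] * (len(signal) + m - 1)
    for start in range(0, len(signal), block):
        chunk = list(signal[start:start + block])
        chunk_f = fft(chunk + [0.0] * (n_fft - len(chunk)))
        prod = [a * b for a, b in zip(chunk_f, kernel_f)]
        y = fft(prod, inverse=True)
        # Each block's result overlaps the next by m - 1 samples; add it in place.
        for i in range(len(chunk) + m - 1):
            out[start + i] += y[i].real / n_fft
    return out

def direct_convolve(signal, kernel):
    """Reference O(N*M) convolution used to check the block-wise result."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(kernel):
            out[i + j] += s * h
    return out
```

Each block is transformed, multiplied, and inverse-transformed without reference to the other blocks; only the final accumulation into `out` touches shared samples.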

20.
The Journal of Supercomputing - The number of transmit and receive antennas is an important factor that affects the performance and complexity of a MIMO system. A MIMO system with very large...
