Similar Documents
20 similar documents found.
1.
The efficient design of computation-intensive multidimensional signal processing applications requires dealing with three kinds of constraints: those implied by the data dependencies, the non-functional requirements (real time, power consumption), and the resource availability of the execution platform. The Modeling and Analysis of Real-time and Embedded systems (MARTE) UML profile, through its repetitive structure modeling (RSM) package, is well suited to modeling the inherent parallelism within these applications, a compact representation of parallel execution platforms, and the distribution mapping of one onto the other. The execution of such a specification respects the whole set of constraints defined above, while the quality of the scheduling is directly linked to the quality of the mapping of the multidimensional structures (data arrays or parallel loop nests) into time and space. We propose here a strategy for using a refactoring tool dedicated to this kind of application that makes it possible to find good trade-offs in the usage of storage and computation resources and in the exploitation of parallelism (both task and data parallelism). This strategy is illustrated on an industrial radar application.

2.
It is expected that Chip Multiprocessor Systems (CMPs) will contain more and more cores with every new generation. However, applications for these systems do not scale at the same pace, so in order to obtain good CMP utilization several applications will need to coexist in the system, and in those cases virtualization of the CMP system becomes mandatory. In this paper we analyze two virtualization strategies at the NoC level that aim to isolate the traffic generated by each application, reducing or even eliminating interference among messages belonging to different applications. The first model handles most interference among messages with a virtual-channel (VC) implementation, reducing both execution time and network latency. However, using VCs incurs area and power overhead due to the cost of control and buffer implementation. In contrast, the second model is based on resource partitioning strategies that divide the CMP chip spatially into several regions. For this last model, Virtual Regions (VRs), we use a network reconfiguration algorithm that dynamically adapts the network partitions to satisfy the application requirements. The paper compares both models and identifies their main advantages and disadvantages. Our experimental results show that our proposal achieves average execution time improvements of 30% for parallel applications when compared to a baseline scenario. Moreover, when compared to a VC implementation, our proposal improves the average execution time by 9% for parallel applications.

3.
Opportunities and Challenges of Parallel Computing in Networked Control
Advanced control involves a large amount of time-consuming computation, which has limited its application. Modern control systems are evolving toward networked architectures, which provides the environment needed to apply parallel computing in control systems. The networking of control systems fundamentally changes the computation model of control applications. Applying parallel computing techniques in networked control systems can speed up advanced control algorithms, thereby improving their control quality and extending their range of application. This paper discusses the possibility of performing parallel computing on today's popular networked control system architectures to solve the computational problems of control applications, as well as the problems that remain to be solved.

4.
5.
Data-intensive applications process large volumes of data using a parallel processing method. MapReduce is a programming model designed for data-intensive applications on massive data sets and an execution framework for large-scale data processing on clusters of commodity servers. Fault tolerance, an easy programming structure, and high scalability are considered strong points of MapReduce; however, its configuration parameters must be fine-tuned to the specific deployment, which makes its configuration and performance tuning complex. This paper explains the tuning of the Hadoop configuration parameters that directly affect the performance of a MapReduce job workflow under various conditions, in order to achieve maximum performance. On the basis of the empirical data we collected, it became apparent that three main methodologies can affect the execution time of MapReduce running on cluster systems. Therefore, in this paper, we present a model that consists of three main modules: (1) extending a data redistribution technique in order to find the high-performance nodes; (2) utilizing the number of map/reduce slots in order to reduce execution time; and (3) developing a new hybrid routing schedule for the shuffle phase in order to define the scheduler task while reducing memory management overhead.
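
A minimal sketch of the slot-sizing arithmetic behind module (2); the helper name and the node parameters are illustrative assumptions, not values or code from the paper:

```python
# Minimal sketch: bound concurrent map/reduce slots by both the core
# count and the memory left after OS/daemon reservations. All numbers
# below are illustrative assumptions.

def suggest_slots(cores, ram_mb, task_heap_mb=1024, reserved_mb=2048):
    by_memory = (ram_mb - reserved_mb) // task_heap_mb
    slots = max(1, min(cores, by_memory))
    # A common rule of thumb: roughly 2/3 of slots for maps, 1/3 for reduces.
    return {"map_slots": max(1, slots * 2 // 3),
            "reduce_slots": max(1, slots // 3)}

print(suggest_slots(cores=8, ram_mb=16384))
# {'map_slots': 5, 'reduce_slots': 2}
```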

6.
In a parallel-system simulation environment, we collected execution-time data for an instance of an iterative parallel program. Based on these data, we analyzed the main factors that affect program execution time and built a model for extrapolating the execution time of parallel programs, which can predict execution time along three dimensions: the number of iterations, the input data size, and the configuration of the parallel system. Experimental data show that the model is quite accurate and can save a large amount of simulation time.
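
A minimal sketch of the kind of extrapolation model the abstract describes, fit by least squares over the three dimensions; the model form, feature terms, and sample measurements are illustrative assumptions, not the paper's:

```python
# Sketch: fit T ≈ a + b*iters*n/p + c*iters*log2(p) to measured runs,
# then extrapolate along iterations, data size, and processor count.
import numpy as np

def features(iters, n, p):
    return [1.0, iters * n / p, iters * np.log2(p)]

# Synthetic sample points (iters, n, p, time); in practice these come
# from runs in the simulation environment.
runs = [(10, 1_000, 2, 5.7), (10, 1_000, 4, 3.7),
        (20, 2_000, 4, 12.2), (20, 4_000, 8, 13.2)]

X = np.array([features(i, n, p) for i, n, p, _ in runs])
y = np.array([t for *_, t in runs])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict(iters, n, p):
    return float(np.array(features(iters, n, p)) @ coef)

print(round(predict(40, 4_000, 8), 2))  # ≈ 26.2 for these sample points
```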

7.
Distributed systems such as networks of workstations are becoming an increasingly viable alternative to traditional supercomputer systems for running complex scientific applications. A large number of these applications require solving sets of partial differential equations (PDEs). In this paper, we describe the implementation and performance of SPEED (Scalable Partial differential Equation Environment on Distributed systems), a parallel platform which provides an efficient solution for time-dependent PDEs. SPEED allows the inclusion of a wide range of parameters and programming aids. PVM is employed as the underlying message-passing system. The parallel implementation has been performed using two algorithms. The first algorithm is a two-phase scheme which uses the conventional technique of alternating phases of computation and communication. The second algorithm employs a pre-computation technique that allows overlapping of computation and communication. Both methods yield significant speedups. The pre-computation technique reduces the communication time between the workstations but incurs additional overhead in buffer management. Hence, if the saving in communication time is larger than the overhead, the pre-computation technique outperforms the two-phase algorithm. SPEED also provides a performance prediction methodology that can accurately predict the performance of a given application on the system before running the application. This methodology allows the user to tune various parameters in order to identify system bottlenecks and maximize the performance.

8.
This paper describes the design, architecture, and implementation of MTHPT, a multi-tenant, highly available parallel task scheduling framework. MTHPT consists of three parts: task definition and configuration, an asynchronous parallel task scheduling model, and message alerting and monitoring. The task scheduling engine and the task execution components are deployed separately and follow an asynchronous parallel scheduling and fast-callback model that quickly releases the thread resources held by the scheduling engine, solving problems that degrade business-system performance, such as long-running tasks and scheduled tasks failing to execute on time. The task scheduling configuration supports a multi-tenant application model. Experimental analysis and evaluation show that MTHPT improves the parallel scheduling efficiency and the stability of task scheduling in application systems.
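
A minimal sketch of the dispatch-and-fast-callback pattern described above, using Python's standard concurrent.futures; all names are illustrative, and a real MTHPT deployment separates the engine and execution components across services:

```python
# Sketch: the scheduler submits a task and returns immediately, so its
# thread is released; completion is handled in a callback.
from concurrent.futures import ThreadPoolExecutor
import time

executor = ThreadPoolExecutor(max_workers=4)   # stands in for the execution component

def long_running_task(tenant, job_id):
    time.sleep(2)                              # simulate a long task
    return f"{tenant}:{job_id} done"

def on_complete(future):                       # fast callback into the scheduler
    print("callback:", future.result())

def schedule(tenant, job_id):
    """Dispatch and return immediately, releasing the scheduler thread."""
    future = executor.submit(long_running_task, tenant, job_id)
    future.add_done_callback(on_complete)

schedule("tenant-a", 1)
schedule("tenant-b", 2)
print("scheduler thread is free")              # prints before the callbacks fire
executor.shutdown(wait=True)
```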

9.
The dynamic configuration and evolution of large-scale heterogeneous systems has made the enforcement of security requirements one of the most critical tasks throughout the system development lifecycle. In this paper, we propose a framework architecture that associates security policies with the specification and execution phases of applications defined for these systems. Our framework is based on an aspect-oriented programming approach and on the organization-based access control model to dynamically enforce and manage access and usage control. The deployment of the framework modules proposed in this paper takes into account the changes that may occur in the security policy during application execution. We also present the implementation and evaluation of our proposal.
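
A minimal sketch of the enforcement idea, with a decorator standing in for a woven aspect and a mutable set standing in for the organization-based policy store; the names and the (role, action) policy representation are hypothetical simplifications:

```python
# Sketch: a decorator plays the role of a woven aspect that consults the
# policy store on every call, so policy changes made during execution
# take effect immediately.
policy = {("physician", "read_record")}        # mutable at run time

def enforce(action):
    def wrap(fn):
        def guarded(role, *args, **kwargs):
            if (role, action) not in policy:
                raise PermissionError(f"{role} may not {action}")
            return fn(role, *args, **kwargs)
        return guarded
    return wrap

@enforce("read_record")
def read_record(role, patient_id):
    return f"record {patient_id}"

print(read_record("physician", 42))            # allowed
policy.discard(("physician", "read_record"))   # policy evolves mid-execution
# read_record("physician", 42) would now raise PermissionError
```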

10.
Metaschedulers co-allocate resources by requesting a fixed number of processors and a fixed usage time for each cluster. These static requests, defined by users, limit the initial scheduling and prevent rescheduling of applications to other resource sets. It is also difficult for users to estimate application execution times, especially in heterogeneous environments. To overcome these problems, metaschedulers can use performance predictions for automatic resource selection. This paper proposes a resource co-allocation technique with rescheduling support, based on performance predictions, for multi-cluster iterative parallel applications. Iterative applications have been used to solve a variety of problems in science and engineering, including, more recently, large-scale computations based on the asynchronous model. We performed experiments using an iterative parallel application consisting of benchmark multiobjective problems, with both synchronous and asynchronous communication models, on Grid’5000. The results show run time predictions with an average error of 7% and the prevention of up to 35% and 57% of run time overestimations, supporting rescheduling for the synchronous and asynchronous models, respectively. The performance predictions require no access to the application source code. One of the main findings is that, because the asynchronous model masks communication with computation, it requires no network information to predict execution times. With our co-allocation technique, metaschedulers become responsible for run time predictions, process mapping, and application rescheduling, releasing the user from these burdensome tasks.

11.
Histogram generation is a sequential loop computation with irregular data dependences that is widely used in many fields. Because of its irregular memory accesses, however, multithreaded accesses to shared memory produce many bank conflicts, which hinder parallel efficiency. Implementing an efficient histogram generation algorithm on parallel processors, and in particular on state-of-the-art graphics processing units (GPUs), is therefore of considerable research value. To reduce bank conflicts during histogram generation, a memory padding technique spreads the threads' shared-memory accesses evenly across the banks, which greatly reduces the memory access latency of histogram generation on the GPU. In addition, an effective and reliable near-optimal configuration search model is proposed to guide users in setting the GPU execution parameters for higher performance. Experiments verify that, in practical applications, the improved algorithm outperforms the original by 42% to 88%.
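
A back-of-the-envelope model of the padding idea (not the paper's GPU code): assuming 32 shared-memory banks and one 256-word sub-histogram per thread, padding each row by one word spreads a warp's simultaneous updates across all banks:

```python
# Sketch: count how many accesses land on the busiest bank when a whole
# warp updates the same bin of its per-thread sub-histogram.
from collections import Counter

BANKS, BINS = 32, 256              # 32 banks; 256-bin sub-histograms

def max_conflict(pad, warp=32, bin_idx=7):
    width = BINS + pad                         # row stride in words
    banks = Counter((t * width + bin_idx) % BANKS for t in range(warp))
    return max(banks.values())                 # accesses to the busiest bank

print(max_conflict(pad=0))   # 32: all threads hit one bank, fully serialized
print(max_conflict(pad=1))   # 1:  accesses spread over all banks
```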

12.
As computers continually improve in performance and decrease in manufacturing cost, distributed systems consisting of multiple computers used as parallel computation platforms have become viable for engineering applications that demand intensive computational power. This paper proposes an extended version of a previously developed low-cost parallel computation platform called para worker. The system is based on a cluster structure, which is a form of distributed system. The new system is termed para worker 2 to differentiate it from the earlier system. It adds enhanced features: improved dynamic object reallocation, adaptive consistency protocols, and location transparency. The performance of para worker 2 has proven superior to that of para worker. Testing was based on the execution of a genetic algorithm to solve the economic dispatch problem in power engineering. The proposal is particularly useful for the implementation and execution of computational intelligence techniques, such as evolutionary computing, for engineering applications.

13.
Optical flow is a fundamental algorithm in computer vision that is widely used in motion detection, motion estimation, video analysis, and other fields. The biggest problem with high-quality optical flow methods, however, is that they are computationally complex and slow, which limits their use in practical systems. For a high-quality optical flow method based on a combined brightness and gradient model, we designed an efficient and scalable parallel computation scheme. Validated on Tilera, a representative networked many-core architecture, the proposed parallel method processes a 640×480 image in 0.80 seconds on a 36-core Tilera processor, 2.56 times faster than a 3.40 GHz Intel Core i3-3240 CPU, at less than one sixth of its power consumption. In an embedded environment it is 33 times faster than an ARM9 processor at only half the power. Experiments show that the parallel algorithm scales well, so processors with different core counts can be chosen to meet a system's combined performance and power requirements.

14.
Synchronization in parallel programs is a major performance bottleneck in multiprocessor systems. Shared data is protected by locks, and a lot of time is spent on the contention arising at lock hand-off. In order to be serialized, requests to the same cache line can either be bounced (NACKed) or buffered in the coherence controller. In this paper, we focus mainly on systems whose coherence controllers buffer requests. At a lock hand-off, a burst of requests to the same line arrives at the coherence controller. During the hand-off, only the requests from the winning processor contribute to the progress of the computation, since the winning processor is the only one that will advance the work. This key observation leads us to propose a hardware mechanism we call request bypassing, which allows requests from the winning processor to bypass the requests buffered in the coherence controller that are keeping the lock line. We present an inexpensive implementation of request bypassing that reduces the time spent on all the execution phases of a critical section (acquiring the lock, accessing shared data, and releasing the lock) and which, as a consequence, speeds up the whole parallel computation. The mechanism requires neither compiler nor programmer support, nor any ISA or coherence protocol changes. By simulating a 32-processor system, we show that request bypassing does not degrade but rather improves performance in three applications with low synchronization rates, while in those having a large amount of synchronization activity (the remaining four) we see reductions in execution time and in lock stall time ranging from 14% to 39% and from 52% to 71%, respectively. We compare request bypassing with a previously proposed technique called read combining and with a system that bounces requests, observing a significantly lower execution time with the bypassing scheme. Finally, we analyze the sensitivity of our results to some key hardware and software parameters.
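
A toy software model of the bypass (illustrative only; the real mechanism is hardware inside the coherence controller): the winner's request jumps ahead of the requests buffered for the contended line:

```python
# Sketch: buffered requests to a contended lock line are held in FIFO
# order, but a request from the winning processor bypasses them.
from collections import deque

class CoherenceController:
    def __init__(self):
        self.buffer = deque()             # requests waiting on the lock line

    def enqueue(self, proc, is_winner=False):
        if is_winner:
            self.buffer.appendleft(proc)  # bypass the buffered losers
        else:
            self.buffer.append(proc)

    def service_order(self):
        return list(self.buffer)

cc = CoherenceController()
for p in ["P1", "P2", "P3"]:              # losers spinning on the lock line
    cc.enqueue(p)
cc.enqueue("P0", is_winner=True)          # winner's request bypasses
print(cc.service_order())                 # ['P0', 'P1', 'P2', 'P3']
```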

15.
The recent deluge of data to be processed represents one of the major challenges in the computational field. It has led to the growth of specially designed applications known as data-intensive applications. In general, to ease the parallel execution of data-intensive applications, the input data is divided into smaller data chunks that can be processed separately. However, in many cases these applications show severe performance problems, mainly due to load imbalance, inefficient use of available resources, and improper data partition policies. In addition, the impact of these performance problems can depend on the dynamic behavior of the application. This work proposes a methodology to dynamically improve the performance of data-intensive applications based on: (i) adapting the size and the number of data partitions to reduce the overall execution time; and (ii) adapting the number of processing nodes to achieve an efficient execution. We propose to monitor the application behavior for each exploration (query) and use the gathered data to dynamically tune the performance of the application. The methodology assumes that a single execution includes multiple related queries on the same partitioned workload. The adaptation of the workload partition factor is addressed through the definition of the initial size of the data chunks; the modification of the scheduling policy to send first the data chunks with large processing times; the division of the data chunks with the largest associated computation times; and the joining of data chunks with small computation times. The criteria for dividing or joining chunks are based on the chunks' associated execution times (average and standard deviation) and on the number of processing elements being used. Additionally, resource utilization is addressed through the dynamic evaluation of the application performance and the estimation and modification of the number of processing nodes that can be used efficiently. We have evaluated our strategy using a real and a synthetic data-intensive application as case studies. Analytical expressions have been analyzed through simulation. Applying our methodology, we obtained encouraging results, reducing total execution times and making efficient use of resources.
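
A minimal sketch of the divide/join step driven by the average and standard deviation of chunk times; the threshold factor k and the pairing policy for small chunks are illustrative choices, not the paper's exact rules:

```python
# Sketch: chunks well above the mean time get split; chunks well below
# it get paired up for merging.
from statistics import mean, stdev

def adapt_chunks(chunk_times, k=0.8):
    """chunk_times: {chunk_id: measured processing time}."""
    mu, sigma = mean(chunk_times.values()), stdev(chunk_times.values())
    split = [c for c, t in chunk_times.items() if t > mu + k * sigma]
    small = sorted(c for c, t in chunk_times.items() if t < mu - k * sigma)
    merge = list(zip(small[::2], small[1::2]))   # pair up small chunks
    return split, merge

times = {"c0": 4.1, "c1": 4.3, "c2": 9.8, "c3": 0.9, "c4": 1.0, "c5": 4.0}
print(adapt_chunks(times))   # (['c2'], [('c3', 'c4')])
```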

16.
Molecular dynamics simulation is a class of applications that requires reducing the execution time of fixed-size problems. This reduction in execution time is important to drug design and protein interaction studies. Many implementations of parallel molecular dynamics have been developed, but very little work has addressed the issues related to using machines with 50,000 processors for modest-sized problems in the range of 50,000 atoms. Current massively parallel machines present a major obstacle to achieving good performance: communication overhead. In this paper we quantify the communication latency and network bandwidth necessary to achieve 30–40% efficiency on future message-passing machines with sizes on the order of tens of thousands of processors, executing molecular dynamics problems with the same order of atoms. We derive an analytical model of a benchmark application that simulates a system of helium atoms executing on the Intel Touchstone Delta using an interaction decomposition method. This model is validated and used to extrapolate information on the startup time and network bandwidth. The results indicate that, for an MPP with a four-dimensional mesh topology using 400 MHz processors, the communication startup time must be at most 30 clock cycles and the network bandwidth at least 2.3 GB/s. This configuration results in 30–40% efficiency of the MPP for a problem with 50,000 atoms executing on 50,000 processors.
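
A back-of-the-envelope check of the quoted numbers; the per-step computation and message volume below are illustrative assumptions, not the paper's model:

```python
# Sketch: efficiency from per-step compute time vs. communication time
# built from the quoted startup latency and bandwidth.
proc_hz   = 400e6                  # 400 MHz processors
startup_s = 30 / proc_hz           # 30-cycle communication startup
bw        = 2.3e9                  # 2.3 GB/s network bandwidth

t_comp = 600 / proc_hz             # assumed compute cycles per atom per step
msgs, msg_bytes = 6, 1024          # assumed neighbor exchanges per step
t_comm = msgs * (startup_s + msg_bytes / bw)

eff = t_comp / (t_comp + t_comm)
print(f"efficiency ≈ {eff:.0%}")   # ≈ 32%, inside the quoted 30–40% band
```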

17.
As the size of large-scale computer systems increases, their mean time between failures is becoming significantly shorter than the execution time of many current scientific applications. To complete their execution, scientific applications must therefore tolerate hardware failures. Conventional rollback-recovery protocols redo the computation of the crashed process since the last checkpoint on a single processor. As a result, the recovery time of all such protocols is no less than the time between the last checkpoint and the crash. In this paper, we propose a new application-level fault-tolerant approach for parallel applications called the Fault-Tolerant Parallel Algorithm (FTPA), which provides fast self-recovery. When fail-stop failures occur and are detected, all surviving processes recompute the workload of the failed processes in parallel. FTPA, however, requires the user to be involved in fault tolerance. In order to ease FTPA implementation, we developed Get it Fault-Tolerant (GiFT), a source-to-source precompiler tool that automates it. We evaluate the performance of FTPA with parallel matrix multiplication and five kernels of the NAS Parallel Benchmarks on a cluster system with 1,024 CPUs. The experimental results show that FTPA outperforms the traditional checkpointing approach.
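
A toy sketch of the recovery step (illustrative, not GiFT output): when one rank fails, the survivors split its slice of the iteration space and recompute it in parallel:

```python
# Sketch: block decomposition of the iteration space, then redistribution
# of the failed rank's block among the survivors.
def split_range(lo, hi, parts):
    step, rem = divmod(hi - lo, parts)
    out, start = [], lo
    for i in range(parts):
        end = start + step + (1 if i < rem else 0)
        out.append((start, end))
        start = end
    return out

n, ranks, failed = 1_000_000, 8, 3
slices = split_range(0, n, ranks)             # original data decomposition
survivors = [r for r in range(ranks) if r != failed]
recovery = dict(zip(survivors, split_range(*slices[failed], len(survivors))))
print(recovery[0])   # the piece of rank 3's work that rank 0 recomputes
```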

18.
This research defines and analyzes a methodology for deriving a performance model for SPMD hybrid parallel applications. Hybrid parallelism combines the shared memory and message passing computing models. This work extends the current practice of application performance modelling by developing a methodology for hybrid applications with the following procedures:
   • Creation of a model based on complexity analysis of an application code and its data structures.
   • Enhancement of the static complexity model with dynamic factors to capture execution-time phenomena, such as memory hierarchy effects.
   • Quantitative analysis of the model's characteristics and of the effects of perturbations in measured parameters.
These research results are presented in the context of a hybrid parallel implementation of a sparse linear algebra kernel. A model for this kernel is derived and analyzed using the methodology. Applying the model on two large parallel computing platforms provides case studies for the methodology. Operating system issues, the machine balance factor, and memory hierarchy effects on model accuracy are examined. Copyright © 2007 John Wiley & Sons, Ltd.

19.
The particle swarm optimization (PSO) algorithm is a population-based algorithm for finding optimal solutions. Because of its simplicity of implementation and its fewer adjustable parameters compared to other global optimization algorithms, PSO is gaining attention for solving complex and large-scale problems. However, PSO often requires a long execution time to solve such problems. This paper proposes a parallel PSO algorithm, called delayed exchange parallelization, which improves the performance of PSO in distributed environments by efficiently hiding communication latency. By overlapping communication with computation, the proposed algorithm extracts the parallelism inherent in PSO. The performance of the proposed parallel PSO algorithm was evaluated using several applications. The evaluation results show that the proposed parallel algorithm drastically improves the performance of PSO, especially in high-latency network environments.
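
A shape-of-the-idea sketch using non-blocking MPI via mpi4py: the exchange of best values is posted, the next computation step proceeds to overlap it, and the remote best is consumed one iteration late; the ring pattern and placeholder update are assumptions, not the authors' algorithm:

```python
# Sketch: post isend/irecv, compute while the messages are in flight,
# and only then wait — a "delayed" exchange of bests.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
nxt, prv = (rank + 1) % size, (rank - 1) % size   # ring exchange of bests

local_best = float(rank)              # placeholder fitness value
for it in range(3):
    send_req = comm.isend(local_best, dest=nxt, tag=it)
    recv_req = comm.irecv(source=prv, tag=it)
    # ... update particles here, overlapping with the message exchange ...
    local_best -= 0.1                 # stand-in for the real PSO step
    remote_best = recv_req.wait()     # arrives "late": used next iteration
    send_req.wait()
    local_best = min(local_best, remote_best)
```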

20.
Executing traditional Message Passing Interface (MPI) applications on a multi-core cluster while balancing speed and computational efficiency is a difficult task that parallel programmers have to deal with. For this reason, communications on multi-core clusters ought to be handled carefully in order to improve performance metrics such as efficiency, speedup, execution time, and scalability. In this paper we focus our attention on SPMD (Single Program Multiple Data) applications with high communication volume and synchronicity that are also static, local, and regular. This work proposes a method for SPMD applications focused on managing the communication heterogeneity (different cache levels, RAM memory, network, etc.) of a homogeneous multi-core computing platform in order to improve application efficiency. In this sense, the main objective of this work is to find analytically the ideal number of cores that allows us to obtain the maximum speedup while the computational efficiency is maintained above a defined threshold (strong scalability). The method also allows us to determine how the problem size must be increased in order to keep the execution time constant as the number of cores is expanded (weak scalability), considering the trade-off between speed and efficiency. The methodology has been tested with different benchmarks and applications, and we achieved an average efficiency improvement of around 30.35% in the applications tested, using different problem sizes and multi-core clusters. In addition, the results show that the maximum speedup at a defined efficiency is located close to the values calculated with our analytical model, with an error rate lower than 5% for the applications tested.
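
A minimal sketch of the core-count selection rule for strong scalability; the speedup model below (serial fraction plus a per-core communication term) is an illustrative stand-in for the paper's analytical characterization:

```python
# Sketch: pick the largest core count whose efficiency stays above the
# threshold under an assumed speedup model.
def speedup(p, serial=0.02, comm=0.004):
    tp = serial + (1.0 - serial) / p + comm * p   # normalized parallel time
    return 1.0 / tp

def ideal_cores(threshold=0.85, max_p=512):
    best = 1
    for p in range(1, max_p + 1):
        if speedup(p) / p >= threshold:           # efficiency constraint
            best = p                              # keep the largest feasible p
    return best, speedup(best)

p, s = ideal_cores()
print(f"use {p} cores: speedup {s:.1f}, efficiency {s / p:.0%}")
# use 4 cores: speedup 3.6, efficiency 89%
```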
