Similar Literature
20 similar documents found.
1.
To address the question of how to exploit the strengths of heterogeneous multicore processors and thereby improve program execution efficiency, this paper proposes two optimization techniques on the Cell heterogeneous multicore processor: thread-synchronized pipeline parallelism and iteration-synchronized pipeline parallelism. These techniques effectively speed up code with irregular writes and irregular control structures. Tests on the Cell processor with IS, EP, and LU from the NAS benchmarks and MOLDYN from SPEC2001 show that the pipelining scheme improves the efficiency of critical sections and flush operations and significantly increases program execution speed.
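For illustration, here is a minimal, generic sketch of pipelined thread parallelism (not the paper's Cell SPE implementation): stages run in separate threads and hand work downstream through queues, so a serialized "irregular write" stage overlaps with upstream computation and needs no locks or flushes around its shared output. All names are illustrative.

```python
# Generic pipeline-parallelism sketch: stage threads linked by queues.
# This is not the paper's Cell-specific implementation.
import threading
import queue

def stage(in_q, out_q, fn):
    while True:
        item = in_q.get()
        if item is None:            # poison pill: shut the stage down
            if out_q is not None:
                out_q.put(None)
            break
        result = fn(item)
        if out_q is not None:
            out_q.put(result)

q1, q2 = queue.Queue(maxsize=8), queue.Queue(maxsize=8)
results = []

compute = threading.Thread(target=stage, args=(q1, q2, lambda x: x * x))
# The "irregular write" stage is serialized in one thread, so no locks
# or flush operations are needed around the shared results list.
write = threading.Thread(target=stage, args=(q2, None, results.append))
compute.start(); write.start()

for i in range(100):
    q1.put(i)
q1.put(None)                         # signal end of stream
compute.join(); write.join()
print(len(results))                  # -> 100
```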

2.
Recent research has highlighted the potential benefits of single-ISA heterogeneous multicore processors over cost-equivalent homogeneous ones, and it is likely that future processors will integrate cores that have the same instruction set architecture (ISA) but offer different performance and power characteristics. To fully tap into the potential of these processors, the operating system must be aware of the hardware asymmetry when making scheduling decisions and map applications to cores in consideration of their performance characteristics. We propose a Heterogeneity-Aware Signature-Supported (HASS) scheduling algorithm that performs this mapping using per-thread architectural signatures, which are compact summaries of threads' architectural properties. We implemented HASS in OpenSolaris, and demonstrated that it always outperforms a heterogeneity-agnostic scheduler (by as much as 12.5%) for workloads exhibiting sufficient diversity. Our evaluation also includes an extensive comparison with other heterogeneity-aware schedulers to provide a clearer understanding of the pros and cons of HASS.
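A toy sketch of the signature-driven mapping idea follows. The signature here is reduced to a single assumed metric (cache misses per 1K instructions); HASS's real signatures and its OpenSolaris integration are considerably richer.

```python
# Toy signature-supported mapping: threads whose (assumed) signature
# marks them as CPU-bound go to fast cores; memory-bound threads, which
# benefit less from a fast core, go to slow cores. All values are
# illustrative placeholders, not HASS's actual signature format.
FAST_CORES = [0, 1]
SLOW_CORES = [2, 3, 4, 5]

threads = {               # thread -> cache misses per 1K instructions
    "t_render":  2.1,     # CPU-bound: low miss rate
    "t_stream": 48.0,     # memory-bound: high miss rate
    "t_zip":     5.5,
    "t_scan":   39.0,
}

def map_threads(threads, miss_threshold=20.0):
    mapping, fast, slow = {}, list(FAST_CORES), list(SLOW_CORES)
    # Give fast cores to the most CPU-bound threads first.
    for name, mpki in sorted(threads.items(), key=lambda kv: kv[1]):
        pool = fast if mpki < miss_threshold and fast else slow
        mapping[name] = pool.pop(0)
    return mapping

print(map_threads(threads))
# e.g. {'t_render': 0, 't_zip': 1, 't_scan': 2, 't_stream': 3}
```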

3.
Simulation is indispensable in computer architecture research. Researchers increasingly resort to detailed architecture simulators to identify performance bottlenecks, analyze interactions among different hardware and software components, and measure the impact of new design ideas on system performance. However, the slow speed of conventional execution-driven architecture simulators is a serious impediment to obtaining desirable research productivity. This paper describes a novel fast multicore processor architecture simulation framework called Two-Phase Trace-driven Simulation (TPTS), which splits detailed timing simulation into a trace generation phase and a trace simulation phase. Much of the simulation overhead caused by uninteresting architectural events is incurred only once, during the cycle-accurate simulation-based trace generation phase, and can be omitted in the repeated trace-driven simulations. We report our experiences with tsim, an event-driven multicore processor architecture simulator that models detailed memory hierarchy, interconnect, and coherence protocol based on the TPTS framework. By applying aggressive event filtering, tsim achieves an impressive simulation speed of 146 million simulated instructions per second when running 16-thread parallel applications.
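A sketch of the two-phase idea: pay for detailed execution once to produce a filtered event trace, then replay that trace cheaply under different timing parameters. Event names, the filter, and the latencies below are illustrative stand-ins, not tsim's actual models.

```python
# Sketch of the TPTS idea: one expensive trace-generation pass, many
# cheap trace-driven replays. Events and latencies are illustrative.
import random

def generate_trace(n_events, interesting=("L2_MISS", "COHERENCE")):
    """Phase 1: run the (expensive) detailed simulation once and keep
    only the events the later timing studies care about."""
    random.seed(0)
    all_events = ["L1_HIT", "L2_MISS", "COHERENCE", "BRANCH"]
    trace = []
    for cycle in range(n_events):
        ev = random.choice(all_events)      # stand-in for real simulation
        if ev in interesting:               # aggressive event filtering
            trace.append((cycle, ev))
    return trace

def replay(trace, latency):
    """Phase 2: trace-driven timing simulation, repeatable at low cost."""
    return sum(latency[ev] for _, ev in trace)

trace = generate_trace(100_000)             # expensive, done once
for l2 in (20, 40, 80):                     # cheap design-space sweep
    cycles = replay(trace, {"L2_MISS": l2, "COHERENCE": 30})
    print(f"L2 latency {l2}: {cycles} stall cycles")
```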

4.
Performance heterogeneous multicore processors (HMPs for brevity), consisting of multiple cores with the same instruction set but different performance characteristics (e.g., clock speed, issue width), are of great interest because they can deliver higher performance per watt and per unit area for programs with diverse architectural requirements than comparable homogeneous designs. However, these power and area efficiencies of performance heterogeneous multicore systems can only be achieved when workloads are matched with cores according to both the properties of the workload and the features of the cores.

5.
In this paper, we conduct performance scaling analysis of multithreaded multicore processors (MMPs) for parallel computing. We propose a thread-level closed queuing network model covering a fairly large design space, accounting for hardware scaling models; coarse-grain, fine-grain, and simultaneous multithreading (SMT) cores; and shared resources, including cache, memory, and critical sections. We then derive a closed-form solution for this model in terms of the speedup performance measure. This solution makes it possible to analyze the scaling properties of MMPs along multiple dimensions. In particular, we show that for the parallelizable part of the workload, the speedup in the absence of resource contention is no longer just a linear function of the number of parallel processing units, as predicted by Amdahl's law, but also a strong function of workload characteristics, ranging from strongly memory-bound to strongly CPU-bound workloads. We also find that with core multithreading, superlinear speedup, higher than that predicted by Amdahl's law, may be achieved for the parallelizable part of the workload if core threads exhibit strong cache affinity and the workload is strongly memory-bound. We then derive a tight speedup upper bound in the presence of both memory resource contention and critical sections for multicore processors with single-threaded cores. This upper bound indicates that with resource contention among threads, whether due to shared memory or critical sections, a sequential term is guaranteed to emerge from the parallelizable part of the workload, fundamentally limiting the scalability of multicore processors for parallel computing, in addition to the sequential part of the workload as dictated by Amdahl's law. As a result, to improve speedup for MMPs, one should strive to enhance memory parallelism and confine critical sections as locally as possible, e.g., to the smallest possible number of threads in the same core.
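For reference, classical Amdahl's law for a parallelizable fraction f on n processing units is shown below, together with an illustrative contention-extended form of the kind the abstract argues for. The coefficient c is a placeholder standing in for a sequentializing contention term; it is not the paper's exact bound.

```latex
% Classical Amdahl's law: fraction f of the work is parallelizable.
S(n) = \frac{1}{(1 - f) + \frac{f}{n}}
% Illustrative contention-extended form (assumption: c > 0 models a
% sequentializing term due to shared memory or critical sections):
S_c(n) = \frac{1}{(1 - f) + \frac{f}{n} + c f},
\qquad \lim_{n \to \infty} S_c(n) = \frac{1}{(1 - f) + c f}
```

Even as n grows without bound, the contention term c f caps the speedup, which is the sense in which a sequential term "emerges" from the parallelizable part of the workload.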

6.
This paper explores the suitability of the emerging passive star-coupled optical interconnect using wavelength division multiplexing as the system interconnect to provide the high-bandwidth (Gbit/s) communication demanded by heterogeneous systems. Several different communication strategies (combinations of communication topologies and protocols) are investigated under a representative master-slave computational model. The interplay between system speed, network speed, task granularity, and degree of parallelism is studied using both analytical modeling and simulation. It is shown that a hierarchical ALOHA-based communication strategy between the master and the slaves, implemented on top of the passive star-coupled network, leads to a considerable reduction in channel contention and provides a 50–80% reduction in task completion time for applications with medium to high degrees of coarse-grain parallelism. A comparable reduction in channel contention can also be achieved by using tunable acousto-optic filters at master nodes.
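As background on why ALOHA-style channel contention matters, the classical single-channel ALOHA throughput results are given below; these are the textbook formulas, not the hierarchical variant analyzed in the paper.

```latex
% Classical ALOHA throughput S versus offered load G (attempts per frame time):
S_{\text{pure}}(G) = G\, e^{-2G}, \qquad \max_G S_{\text{pure}} = \tfrac{1}{2e} \approx 0.18
S_{\text{slotted}}(G) = G\, e^{-G}, \qquad \max_G S_{\text{slotted}} = \tfrac{1}{e} \approx 0.37
```

Because an unmodified ALOHA channel saturates well below full utilization, hierarchical or filtered variants that reduce the effective load seen by each channel can cut contention substantially.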

7.
In this study, we consider an environment composed of a heterogeneous cluster of multicore-based machines used to analyze satellite data. The workload involves large data sets and is subject to a deadline constraint. Multiple applications, each represented by a directed acyclic graph (DAG), are allocated to a dedicated heterogeneous distributed computing system. Each vertex in the DAG represents a task that needs to be executed, and task execution times vary substantially across machines. The goal of this research is to assign the tasks in applications to a heterogeneous multicore-based parallel system in such a way that all applications complete before a common deadline, and their completion times are robust against uncertainties in execution times. We define a measure that quantifies robustness in this environment. We design, compare, and evaluate five static resource allocation heuristics that attempt to maximize robustness. We consider six different scenarios with different ratios of computation versus communication, and loose and tight deadlines.
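A toy illustration of one such static heuristic and a simple robustness margin follows. The ETC (estimated time to compute) values, the deadline, and the metric itself are invented for illustration and are not the paper's definitions, and the paper's five heuristics are richer than this single greedy pass.

```python
# Toy static allocation: assign each task to the machine that minimizes
# its finish time (a min-min style greedy pass), then report a simple
# slack-based robustness margin against a common deadline.
etc = {                      # estimated time to compute: task -> per-machine
    "t1": [4.0, 9.0, 6.0],
    "t2": [8.0, 3.0, 7.0],
    "t3": [5.0, 6.0, 2.0],
    "t4": [7.0, 8.0, 9.0],
}
DEADLINE = 12.0

ready = [0.0, 0.0, 0.0]      # next-free time per machine
assignment = {}
for task, times in sorted(etc.items(), key=lambda kv: min(kv[1])):
    finish = [ready[m] + times[m] for m in range(len(ready))]
    m = finish.index(min(finish))
    assignment[task] = m
    ready[m] = finish[m]

makespan = max(ready)
robustness = DEADLINE - makespan     # slack against execution-time uncertainty
print(assignment, f"makespan={makespan}", f"robustness margin={robustness}")
```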

8.
A hierarchical representation for heterogeneous object modeling is presented in this paper. To model a heterogeneous object, boundary representation (B-rep) is used for the geometry, and a novel Heterogeneous Feature Tree (HFT) structure is proposed to represent the material distributions. The HFT structure hierarchically organizes material variation dependency relationships and is intuitive for modeling different types of material gradations. Based on the HFT structure, a recursive material evaluation algorithm is proposed to dynamically evaluate the material composition at a specific location. Such a hierarchical representation guarantees that complex material gradations and the user's design intent can be represented intuitively. Example heterogeneous objects modeled with this scheme are provided and potential applications are discussed.
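A minimal sketch of recursive material evaluation over a feature tree is shown below. The node types and the linear blend rule are invented for illustration; they are not the paper's HFT definition.

```python
# Minimal sketch of hierarchical material evaluation on a feature tree.
# Leaves hold uniform compositions; internal nodes blend the results of
# their children (here: linearly along the x axis). Illustrative only.
class Leaf:
    def __init__(self, composition):          # e.g. {"Ti": 1.0}
        self.composition = composition
    def evaluate(self, p):
        return dict(self.composition)

class Blend:
    """Grade linearly between two subtrees along x over [x0, x1]."""
    def __init__(self, child_a, child_b, x0, x1):
        self.a, self.b, self.x0, self.x1 = child_a, child_b, x0, x1
    def evaluate(self, p):
        t = min(1.0, max(0.0, (p[0] - self.x0) / (self.x1 - self.x0)))
        ca, cb = self.a.evaluate(p), self.b.evaluate(p)   # recurse into subtrees
        mats = set(ca) | set(cb)
        return {m: (1 - t) * ca.get(m, 0.0) + t * cb.get(m, 0.0) for m in mats}

tree = Blend(Leaf({"Ti": 1.0}),
             Blend(Leaf({"Al": 1.0}), Leaf({"V": 1.0}), 5, 10), 0, 10)
for x in (0.0, 5.0, 10.0):
    print(x, tree.evaluate((x, 0.0, 0.0)))
# x=0 -> pure Ti; x=5 -> 50/50 Ti/Al; x=10 -> pure V
```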

9.
In recent years, processor technology has evolved towards multicore processors, which include multiple processing units (cores) in a single package. Those cores, each having private caches, often share a higher-level cache memory dedicated to each processor die. This multi-level cache hierarchy in multicore processors raises the importance of the cache utilization problem. Assigning parallel-running software components with common data to processor cores that do not share a common cache increases the number of cache misses. In this paper we present a novel approach that uses model-based information to guide the OS scheduler in assigning appropriate core affinities to software objects at run time. We build graph models of the software and of the processors' cache hierarchies, and devise a graph matcher algorithm that provides a mapping between these two graphs. Using this mapping we obtain candidate core sets that each software object can be affiliated with at run time. These affiliations are determined based on the idea that software components with the potential to share common data at run time should run on cores that share a common cache. We also develop an object dispatcher algorithm that keeps track of object affiliations at run time and dispatches objects using the information from the compile-time graph matcher. We apply our approach to design pattern implementations and two different application programs running on servers that use CFS scheduling. Our results show that cache-aware dispatching based on information obtained from the software model significantly decreases the number of cache misses and improves CFS's scheduling performance.
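A simplified illustration of the core idea: objects that share data get grouped, and each group is pinned to a set of cores that share a cache. Here the "graph matching" is reduced to connected components over a data-sharing graph, and the cache topology is an assumed example; the paper's matcher handles full software and cache-hierarchy graphs.

```python
# Simplified cache-aware grouping: software objects that share data are
# clustered, and each cluster is mapped to cores sharing an L2 cache.
# The topology and object names below are assumed examples.
from collections import defaultdict

CACHE_DOMAINS = [{0, 1}, {2, 3}]      # cores sharing an L2, assumed topology

shares_data = [("parser", "indexer"), ("indexer", "ranker"),
               ("logger", "rotator")]  # edges of the data-sharing graph

def connected_components(edges):
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b); adj[b].add(a)
    seen, comps = set(), []
    for node in adj:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            n = stack.pop()
            if n not in comp:
                comp.add(n); stack.extend(adj[n] - comp)
        seen |= comp; comps.append(comp)
    return comps

for i, group in enumerate(connected_components(shares_data)):
    domain = CACHE_DOMAINS[i % len(CACHE_DOMAINS)]
    print(f"group {sorted(group)} -> cores {sorted(domain)}")
    # A real dispatcher would pin each object's worker, e.g. on Linux:
    # os.sched_setaffinity(worker_pid, domain)
```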

10.
Given the proliferation of layered, multicore- and SMT-based architectures, it is imperative to deploy and evaluate important, multi-level, scientific computing codes, such as meshing algorithms, on these systems. We focus on Parallel Constrained Delaunay Mesh (PCDM) generation. We exploit coarse-grain parallelism at the subdomain level, medium-grain at the cavity level and fine-grain at the element level. This multi-grain data parallel approach targets clusters built from commercially available SMTs and multicore processors. The exploitation of the coarser degree of granularity facilitates scalability both in terms of execution time and problem size on loosely-coupled clusters. The exploitation of medium-grain parallelism allows performance improvement at the single node level. Our experimental evaluation shows that the first generation of SMT cores is not capable of taking advantage of fine-grain parallelism in PCDM. Many of our experimental findings with PCDM extend to other adaptive and irregular multigrain parallel algorithms as well.

11.
Load sharing in large, heterogeneous distributed systems allows users to access vast amounts of computing resources scattered around the system and may provide substantial performance improvements to applications. We discuss the design and implementation issues in Utopia, a load sharing facility specifically built for large and heterogeneous systems. The system has no restriction on the types of tasks that can be remotely executed, involves few application changes and no operating system change, supports a high degree of transparency for remote task execution, and incurs low overhead. The algorithms for managing resource load information and task placement take advantage of the clustering nature of large-scale distributed systems; centralized algorithms are used within host clusters, and directed graph algorithms are used among the clusters to make Utopia scalable to thousands of hosts. Task placements in Utopia exploit the heterogeneous hosts and consider varying resource demands of the tasks. A range of mechanisms for remote execution is available in Utopia that provides varying degrees of transparency and efficiency. A number of applications have been developed for Utopia, ranging from a load sharing command interpreter, to parallel and distributed applications, to a distributed batch facility. For example, an enhanced Unix command interpreter allows arbitrary commands and user jobs to be executed remotely, and a parallel make facility achieves speed-ups of 15 or more by processing a collection of tasks in parallel on a number of hosts.

12.
Recent advances in space and computer technologies are revolutionizing the way remotely sensed data is collected, managed and interpreted. In particular, NASA is continuously gathering very high-dimensional imagery data from the surface of the Earth with hyperspectral sensors such as the Jet Propulsion Laboratory's airborne visible-infrared imaging spectrometer (AVIRIS) or the Hyperion imager aboard Earth Observing-1 (EO-1) satellite platform. The development of efficient techniques for extracting scientific understanding from the massive amount of collected data is critical for space-based Earth science and planetary exploration. In particular, many hyperspectral imaging applications demand real time or near real-time performance. Examples include homeland security/defense, environmental modeling and assessment, wild-land fire tracking, biological threat detection, and monitoring of oil spills and other types of chemical contamination. Only a few parallel processing strategies for hyperspectral imagery are currently available, and most of them assume homogeneity in the underlying computing platform. In turn, heterogeneous networks of workstations (NOWs) have rapidly become a very promising computing solution which is expected to play a major role in the design of high-performance systems for many on-going and planned remote sensing missions. In order to address the need for cost-effective parallel solutions in this fast growing and emerging research area, this paper develops several highly innovative parallel algorithms for unsupervised information extraction and mining from hyperspectral image data sets, which have been specifically designed to be run in heterogeneous NOWs. The considered approaches fall into three highly representative categories: clustering, classification and spectral mixture analysis. Analytical and experimental results are presented in the context of realistic applications (based on hyperspectral data sets from the AVIRIS data repository) using several homogeneous and heterogeneous parallel computing facilities available at NASA's Goddard Space Flight Center and the University of Maryland.
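The load-balancing idea underlying heterogeneity-aware parallel processing on NOWs, reduced to its simplest form, is to split the data across workers in proportion to measured node speed. The sketch below shows this for the rows of a hyperspectral image; the speeds and image size are illustrative placeholders.

```python
# Simplest heterogeneity-aware partitioning for a master-slave code:
# split the image's rows in proportion to worker speed, so faster
# nodes get proportionally more pixels. Values are illustrative.
def partition_rows(n_rows, speeds):
    total = sum(speeds)
    shares = [int(n_rows * s / total) for s in speeds]
    shares[-1] += n_rows - sum(shares)     # give rounding remainder to last
    bounds, start = [], 0
    for share in shares:
        bounds.append((start, start + share))
        start += share
    return bounds

# 1024 image rows over 4 workstations with relative speeds 1 : 1 : 2 : 4
print(partition_rows(1024, [1.0, 1.0, 2.0, 4.0]))
# -> [(0, 128), (128, 256), (256, 512), (512, 1024)]
```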

13.
This paper describes two dynamic core allocation techniques for video decoding on homogeneous and heterogeneous embedded multicore platforms, with the objective of reducing energy consumption while guaranteeing performance. While decoding a frame, the scheme measures "slack" and "overshoot" against the budgeted decode time and amortizes them across neighboring frames to achieve overall performance, compensating for overshoot with slack time. On a per-frame basis, it allocates an appropriate number and mix of core types for decoding to guarantee performance, while saving energy by clock-gating unused cores. Using the Sniper simulator to evaluate the scheme on a modern embedded processor, we obtain energy savings of 6%–61% while strictly adhering to the required performance of 75 fps on homogeneous multicore architectures, and savings of 2%–46% while meeting the performance of 25 fps on heterogeneous multicore architectures. Thus, substantial energy savings can be achieved in video decoding by dynamic core allocation, compared with the default strategy of allocating as many cores as are available.
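A toy control loop in the spirit of slack/overshoot amortization is sketched below: each frame has a time budget, leftover slack is banked and spent to absorb overshoots, and the core count for the next frame is raised or lowered accordingly. The decode-time model, thresholds, and core counts are all invented.

```python
# Toy per-frame controller: bank slack from fast frames, spend it on
# slow ones, and adjust the number of active cores. Illustrative only.
BUDGET_MS = 1000 / 75          # 75 fps budget per frame
MAX_CORES = 4

def next_core_count(cores, decode_ms, slack_bank):
    slack_bank += BUDGET_MS - decode_ms        # + slack, - overshoot
    if slack_bank < 0:                         # behind budget: add a core
        cores = min(MAX_CORES, cores + 1)
    elif slack_bank > 2 * BUDGET_MS:           # comfortably ahead: gate one off
        cores = max(1, cores - 1)
    return cores, slack_bank

cores, bank = 2, 0.0
for decode_ms in [10.0, 9.0, 18.0, 20.0, 11.0, 8.0]:   # measured per frame
    cores, bank = next_core_count(cores, decode_ms, bank)
    print(f"decoded in {decode_ms:5.1f} ms -> {cores} cores, bank {bank:6.2f} ms")
```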

14.
This work presents a novel parallel micro evolutionary algorithm for scheduling tasks in distributed heterogeneous computing and grid environments. The scheduling problem in heterogeneous environments is NP-hard, so significant effort has been made to develop efficient methods that provide good schedules in reduced execution times. The parallel micro evolutionary algorithm is implemented using MALLBA, a general-purpose library for combinatorial optimization. Efficient numerical results are reported from an experimental analysis performed on both well-known problem instances and large instances that model medium-sized grid environments. The comparative study of traditional methods and evolutionary algorithms shows that the parallel micro evolutionary algorithm achieves high problem-solving efficacy, outperforming results previously reported in the related literature, and shows good scalability when facing high-dimension problem instances.
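A compact micro-evolutionary loop for the task-to-machine mapping problem is sketched below: a tiny population evolves, and when it converges it restarts around the elite, which is the defining trait of micro evolutionary algorithms. The ETC matrix and all parameters are illustrative, and the paper's parallel MALLBA implementation is far more elaborate.

```python
# Compact micro-EA for mapping 20 tasks onto 4 heterogeneous machines,
# minimizing makespan. Tiny population + elitist restart on convergence.
import random

random.seed(1)
ETC = [[random.uniform(1, 10) for _ in range(4)] for _ in range(20)]

def makespan(sol):
    loads = [0.0] * 4
    for task, machine in enumerate(sol):
        loads[machine] += ETC[task][machine]
    return max(loads)

def mutate(sol):
    child = sol[:]
    child[random.randrange(len(child))] = random.randrange(4)
    return child

def micro_ea(generations=2000, pop_size=5):
    pop = [[random.randrange(4) for _ in range(20)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=makespan)
        elite = pop[0]
        if makespan(pop[-1]) - makespan(elite) < 1e-9:   # converged: restart
            pop = [elite] + [mutate(mutate(elite)) for _ in range(pop_size - 1)]
            continue
        pop = [elite] + [mutate(random.choice(pop[:3])) for _ in range(pop_size - 1)]
    return min(pop, key=makespan)

best = micro_ea()
print("best makespan:", round(makespan(best), 2))
```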

15.
Heterogeneous objects are objects composed of different constituent materials, in which multiple desirable properties from different constituent materials can be synthesized into one part. To achieve mass application of such heterogeneous objects, efficient and effective design methodologies for them are crucial. In this paper, we present a feature-based design methodology to facilitate heterogeneous object design. Under this methodology, designers design heterogeneous objects using high-level design components that have engineering significance: form features and material features. We first examine the relationships between form features and material features in heterogeneous objects, and then propose three synthesized material features in accordance with this examination. Based on these proposed features, we develop a feature-based design methodology for heterogeneous objects. Two enabling methods for this methodology, material heterogeneity specification within each feature and the combination of material features, are developed. A physics-based (diffusion) B-spline method is developed to (1) allow the design intent of material variation to be explicitly captured by boundary conditions, and (2) ensure smooth material variation across the feature volume. A novel method, direct face neighborhood alteration, is developed to increase the efficiency of combining heterogeneous material features. Examples of using this feature-based design methodology for heterogeneous object design, such as a prosthesis design, are presented.

16.
The computational complexity of a parallel algorithm depends critically on the model of computation. We describe a simple and elegant rule-based model of computation in which processors apply rules asynchronously to pairs of objects from a global object space. Application of a rule to a pair of objects results in the creation of a new object if the objects satisfy the guard of the rule. The model can be efficiently implemented as a novel MIMD array processor architecture, the Intersecting Broadcast Machine. For this model of computation, we describe an efficient parallel sorting algorithm based on mergesort. The computational complexity of the sorting algorithm is O(n log² n), comparable to that for specialized sorting networks and an improvement on the O(n^1.5) complexity of conventional mesh-connected array processors.
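A sequential emulation of the rule-based sorting idea is sketched below: the object space holds sorted runs, and a single rule merges any pair of runs into a new run until one sorted object remains. The asynchrony and the Intersecting Broadcast Machine's architecture are not modeled.

```python
# Sequential emulation of rule-based sorting: the object space holds
# sorted runs; one rule, applied to any pair of runs, replaces them
# with their merge. Repeated application sorts the input.
import heapq

def rule_merge(a, b):
    """The single rule: two sorted runs combine into one sorted run."""
    return list(heapq.merge(a, b))

def rule_based_sort(values):
    space = [[v] for v in values]          # initial object space: unit runs
    while len(space) > 1:
        a, b = space.pop(), space.pop()    # any pair satisfies the guard
        space.append(rule_merge(a, b))
    return space[0] if space else []

print(rule_based_sort([5, 3, 8, 1, 9, 2]))   # -> [1, 2, 3, 5, 8, 9]
```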

17.
Multiprocessor system-on-chip (MPSoC) designs offer a lot of computational power assembled in a compact design. The computing power of MPSoCs can be further augmented by adding massively parallel processor arrays (MPPAs) and specialized hardware with instruction-set extensions. On-chip MPPAs can be used to accelerate low-level image-processing algorithms with massive inherent parallelism. However, the presence of multiple processing elements (PEs) with different characteristics raises issues related to, among others, programming and application mapping. The conventional approach to programming heterogeneous MPSoCs results in a static mapping of the various parts of the application to different PE types, based on the nature of the algorithm and the structure of the PEs. Yet such a mapping scheme, being independent of the instantaneous load on the PEs, may lead to under-utilization of some types of PE while overloading others. In this work, we investigate the benefits of using a heterogeneous MPSoC for accelerating the stages of a real-world image-processing algorithm for object recognition. A case study demonstrates that a resource-aware programming model called Invasive Computing helps to improve the throughput and worst observed latency of the application program by dynamically mapping applications to the different types of PEs available on a heterogeneous MPSoC.

18.
Heterogeneous systems mix different technical domains such as signal processing, analog and digital electronics, software, and telecommunication protocols. They are composed of subsystems that are designed using different models of computation (MoCs), the laws that govern the interactions of the components of a subsystem. The design of heterogeneous systems includes the design of each part of the system according to its specific MoC, and the connection of the parts to build the model representing the system. This model allows the MoCs that govern different parts of the system to coexist and interact. To use a component that is specified according to a given MoC under other, different MoCs, we can use either a hierarchical or a non-hierarchical approach, or we can build domain-specific components (DSCs). However, these solutions present several disadvantages. This paper presents a new model of component, called the domain-polymorph component (DPC). Such a component is atomic and is able to execute its core behavior, specified under a given MoC, under different host MoCs. This approach does not compete with the approaches above but complements them.

19.
This work explores algorithmic and implementation principles for gainfully exploiting GPU accelerators in conjunction with multicore processors on high-end systems with large numbers of compute nodes, and evaluates them in an implementation of a scalable block tridiagonal solver. The accelerator of each compute node is exploited in combination with the node's multicore processors to perform block-level linear algebra operations in the overall distributed solver algorithm. The optimizations incorporated include: (1) an efficient memory mapping and synchronization interface to minimize data movement, (2) multi-process sharing of the accelerator within a node to balance load with the multicore processors, and (3) an automatic memory management system to efficiently utilize accelerator memory when sub-matrices spill over the limits of device memory. Results are reported from our novel implementation, which uses the MAGMA and CUBLAS accelerator software systems simultaneously with ACML (2013) [2] for multithreaded execution on the processors. Overall, using 940 NVIDIA Tesla X2090 accelerators and 15,040 cores, the best heterogeneous execution delivers a 10.9-fold reduction in run time relative to an already efficient parallel multicore-only baseline implementation that is highly optimized with intra-node and inter-node concurrency and computation–communication overlap. Detailed quantitative results are presented to explain all the critical runtime components contributing to hybrid performance.
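For context, the serial block Thomas recurrence is the numerical core that such a distributed solver partitions across nodes and whose block operations (solves, multiplies) it offloads to accelerators. The NumPy sketch below shows only that core, with none of the paper's MPI/GPU machinery.

```python
# Serial block Thomas algorithm for a block tridiagonal system:
# B[i] on the diagonal, A[i] below, C[i] above, right-hand sides d[i].
import numpy as np

def block_thomas(A, B, C, d):
    """Solve the block tridiagonal system; blocks are k x k arrays."""
    n = len(B)
    Cp, dp = [None] * n, [None] * n
    Cp[0] = np.linalg.solve(B[0], C[0])
    dp[0] = np.linalg.solve(B[0], d[0])
    for i in range(1, n):                      # forward elimination
        denom = B[i] - A[i] @ Cp[i - 1]
        if i < n - 1:
            Cp[i] = np.linalg.solve(denom, C[i])
        dp[i] = np.linalg.solve(denom, d[i] - A[i] @ dp[i - 1])
    x = [None] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = dp[i] - Cp[i] @ x[i + 1]
    return x

rng = np.random.default_rng(0)
k, n = 3, 5
B = [np.eye(k) * 10 + rng.random((k, k)) for _ in range(n)]  # diag. dominant
A = [None] + [rng.random((k, k)) for _ in range(n - 1)]      # sub-diagonal
C = [rng.random((k, k)) for _ in range(n - 1)] + [None]      # super-diagonal
d = [rng.random(k) for _ in range(n)]
x = block_thomas(A, B, C, d)
print(np.round(x[0], 4))
```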

20.
As cost-driven public cloud services emerge, the budget constraint is one of the primary design issues for large-scale scientific applications executed on heterogeneous cloud computing systems. Minimizing the schedule length while satisfying the budget constraint of an application is one of the most important quality-of-service requirements for cloud providers. A directed acyclic graph (DAG) can describe an application consisting of multiple tasks with precedence constraints. Previous DAG scheduling methods presupposed the minimum-cost assignment for each task to minimize the schedule length of budget-constrained applications on heterogeneous cloud computing systems. However, our analysis reveals that preassigning tasks at minimum cost does not necessarily minimize the schedule length. In this study, we propose an efficient algorithm for minimizing the schedule length using the budget level (MSLBL) to select processors that satisfy the budget constraint and minimize the schedule length of an application. The problem is decomposed into two sub-problems: satisfying the budget constraint and minimizing the schedule length. The first sub-problem is solved by transferring the budget constraint of the application to each task, and the second by heuristically scheduling each task with low time complexity. Experimental results on several real parallel applications validate that the proposed MSLBL algorithm obtains shorter schedule lengths than existing methods in various situations while satisfying the application's budget constraint.
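A toy version of the budget-transfer step is sketched below: a budget level b in [0, 1] interpolates each task's allowed spend between its cheapest and most expensive assignment, and each task then takes the fastest processor it can afford. The cost/time tables are invented, and the real MSLBL algorithm also schedules DAG precedence constraints.

```python
# Toy budget-level assignment: level b in [0, 1] sets each task's
# allowed spend between its cheapest and costliest option; each task
# then picks the fastest processor it can afford. Illustrative tables.
time = {"t1": [4, 7, 12], "t2": [3, 5, 9], "t3": [6, 8, 15]}   # per processor
cost = {"t1": [9, 5, 2],  "t2": [8, 4, 1], "t3": [10, 6, 3]}

def assign(budget_level):
    total_cost, plan = 0.0, {}
    for t in time:
        lo, hi = min(cost[t]), max(cost[t])
        allowed = lo + budget_level * (hi - lo)        # per-task budget
        affordable = [p for p in range(3) if cost[t][p] <= allowed]
        p = min(affordable, key=lambda p: time[t][p])  # fastest affordable
        plan[t] = p
        total_cost += cost[t][p]
    return plan, total_cost

for b in (0.0, 0.5, 1.0):
    plan, c = assign(b)
    print(f"budget level {b}: {plan}, total cost {c}")
# b=0 forces the cheapest (slowest) processors; b=1 affords the fastest.
```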
