期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Exploiting fine-grain parallelism on dataflow architectures

Guang R. Gao 《Parallel Computing》1990,13(3):309-320

Although dataflow computers have many attractive features, skepticism exists concerning their efficiency in handling arrays (vectors) in high performance scientific computation. This paper outlines an efficient implementation scheme for arrays in applicative languages (such as VAL and SISAL) based on the principles of dataflow software pipelining. It illustrates how the fine-grain parallelism of dataflow approach can effectively handle large amount of data structured in applicative array operations. This is done through dataflow software pipelining between pairs of code blocks which act as producer-consumer of array values. To make effective use of the pipelined code mapping scheme, a compiler needs information concerning the overall program structure as well as the structure of each code block. An applicative language provides a basis for such analysis.

The program transformation techniques described here are developed primarily for the computationally intensive part of a scientific numerical program, which is usually formed by one or a few clusters of acyclic connected code blocks. Each code block defines an array value from several input arrays. We outline how mapping decisions of arrays can be based on a global analysis of attributes of the code blocks. We emphasize the role of overall program structure and the strategy of global optimization of the machine code structure. The structure of a proposed dataflow compiler based on the scheme described in this paper is outlined. 相似文献

2.

Exploiting wavefront parallelism on large-scale shared-memorymultiprocessors

Manjikian N. Abdelrahman T.S. 《Parallel and Distributed Systems, IEEE Transactions on》2001,12(3):259-271

Wavefront parallelism, in which parallelism is limited to hyperplanes in an iteration space, can arise when compilers apply tiling to loop nests to enhance locality. Previous approaches for scheduling wavefront parallelism focused on maximizing parallelism; balancing workloads, and reducing synchronization. In this paper, we show that on large-scale shared-memory multiprocessors, locality is a crucial factor. We make the distinction between intratile and intertile locality and show that as the number of processors grows, intertile locality becomes more important. We consider and experimentally evaluate existing strategies for scheduling wavefront parallelism. We show that dynamic self-scheduling can be efficiently used on a small number of processors, but performs poorly at large scale because it does not enhance intertile locality. By contrast, static scheduling strategies enhance intertile locality for small tiles, maintaining parallelism and resulting in better performance at large scale. Results from a Convex SPP1000 multiprocessor demonstrate the importance of taking intertile locality into account. Static scheduling outperforms dynamic self-scheduling by a factor of up to 2.3 on 30 processors 相似文献

3.

Exploiting instruction- and data-level parallelism

Espasa R. Valero M. 《Micro, IEEE》1997,17(5):20-27

Simultaneous multithreaded vector architectures combine the best of data-level and instruction-level parallelism and perform better than either approach could separately. Our design achieves performance equivalent to executing 15 to 26 scalar instructions/cycle for numerical applications 相似文献

4.

Exploiting the parallelism available in loops 总被引：1，自引：0，他引：1

Lilja D.J. 《Computer》1994,27(2):13-26

Because a loop's body often executes many times, loops provide a rich opportunity for exploiting parallelism. This article explains scheduling techniques and compares results on different architectures. Since parallel architectures differ in synchronization overhead, instruction scheduling constraints, memory latencies, and implementation details, determining the best approach for exploiting parallelism can be difficult. To indicate their performance potential, this article surveys several architectures and compilation techniques using a common notation and consistent terminology. First we develop the critical dependence ratio to determine a loop's maximum possible parallelism, given infinite hardware. Then we look at specific architectures and techniques. Loops can provide a large portion of the parallelism available in an application program, since the iterations of a loop may be executed many times. To exploit this parallelism, however, one must look beyond a single basic block or a single iteration for independent operations. The choice of technique depends on the underlying architecture of the parallel machine and the characteristics of each individual loop 相似文献

5.

Exploiting parallelism in deterministic shared memory multiprocessing

Xu Zhou Kai Lu Xiaoping Wang Xu Li 《Journal of Parallel and Distributed Computing》2012

Multi-threaded programs on shared-memory hardware tend to be non-deterministic, which brings challenges to software debugging and testing. Current deterministic implementations eliminate nondeterminism of multi-threaded programs by trading much parallelism for determinism, which leads to low performance. Researchers typically improve parallelism by weakening determinism or introducing weak memory consistency models. However, weak determinism cannot deal with non-determinism caused by data races which are very common in multi-threaded programs. Weak memory consistency models impact the productivity of programming and may bring correctness problems of legacy programs. To address the problems, this paper presents a fully parallelized deterministic runtime, FPDet, which exploits parallelism of deterministic multi-threaded programs by preserving strong determinism and sequential memory consistency. FPDet creates a Working Set Memory (WSM) for each thread to make threads run independently for parallelism. FPDet guarantees determinism by redistributing memory blocks among threads’ WSMs in specified synchronization points. As a result, FPDet obtains parallelism and determinism simultaneously. To further exploit parallelism, we propose an Adaptive Budget Adjustment (ABA) mechanism to minimize wait time caused by thread synchronization. 相似文献

6.

Exploiting controlled-grained parallelism in message-driven stream programs

Yan Su Feng Shi Shahnawaz Talpur Jin Wei Hai Tan 《The Journal of supercomputing》2014,70(1):488-509

With the increasing amount of parallelism obtainable on multicore platforms, stream programming has been proposed as an effective solution for exposing distributed parallelization. Nonetheless, a pressing demand of scheduling task and data parallelism in stream programming exists that can accomplish robust multicore performance in the face of varying application characteristics. This paper addresses the problem of scheduling task and data parallelism in stream programming. We present StreamMDE, an asynchronous concurrency stream programming framework which offers a novel parallel programming model for scheduling task and data parallelism in the message-driven execution paradigm. A key property of this framework is exposing controlled-grained parallelism, which allows us to control the granularity of task and data parallelism in stream graph. Our empirical evaluation of StreamMDE shows that higher efficiency of mixed task and data parallelism in stream programming can be exploited with the appropriate granularity control. The framework bridges the gap between the parallel scale and the architecture of stream programs and facilitates in designing and coding stream features in different schedules. 相似文献

7.

Exploiting heterogeneous parallelism with the Heterogeneous Programming Library

Moisés Viñas Zeki Bozkus Basilio B. Fraguela 《Journal of Parallel and Distributed Computing》2013

While recognition of the advantages of heterogeneous computing is steadily growing, the issues of programmability and portability hinder its exploitation. The introduction of the OpenCL standard was a major step forward in that it provides code portability, but its interface is even more complex than that of other approaches. In this paper, we present the Heterogeneous Programming Library (HPL), which permits the development of heterogeneous applications addressing both portability and programmability while not sacrificing high performance. This is achieved by means of an embedded language and data types provided by the library with which generic computations to be run in heterogeneous devices can be expressed. A comparison in terms of programmability and performance with OpenCL shows that both approaches offer very similar performance, while outlining the programmability advantages of HPL. 相似文献

8.

Exploiting parallelism on a fine-grained MIMD architecture based upon channel queues

Rajiv Gupta Sunah Lee 《International journal of parallel programming》1992,21(3):169-192

We present techniques for exploiting fine-grained parallelism extracted from sequential programs on a fine-grained MIMD system. The system exploits fine-grained parallelism through parallel execution of instructions on multiple processors as well as pipelined nature of individual processors. The processors can communicate data values via globally shared registers as well as dedicated channel queues. Compilation techniques are presented to utilize these mechanisms. A scheduling algorithm has been developed to distribute operations among the processors in a manner that reduces communication among the processors. The compiler identifies data dependencies which require synchronization and enforces them using channel queues. Delays that may result by attempting write operations to a full channel queue are avoided by spilling values from channels to local registers. If an interprocessor data dependency does not require synchronization, then the data value is passed through a shared register or shared memory.Partially supported by National Science Foundation Presidential Young Investigator Award CCR-9157371 (CCR-9249143) to the University of Pittsburgh. 相似文献

9.

Exploiting hierarchy parallelism for molecular dynamics on a petascale heterogeneous system

Qiang Wu Canqun Yang Tao Tang Liquan Xiao 《Journal of Parallel and Distributed Computing》2013

Heterogeneous systems with nodes containing more than one type of computation units, e.g., central processing units (CPUs) and graphics processing units (GPUs), are becoming popular because of their low cost and high performance. In this paper, we have developed a Three-Level Parallelization Scheme (TLPS) for molecular dynamics (MD) simulation on heterogeneous systems. The scheme exploits multi-level parallelism combining (1) inter-node parallelism using spatial decomposition via message passing, (2) intra-node parallelism using spatial decomposition via dynamically scheduled multi-threading, and (3) intra-chip parallelism using multi-threading and short vector extension in CPUs, and employing multiple CUDA threads in GPUs. By using a hierarchy of parallelism with optimizations such as communication hiding intra-node, and memory optimizations in both CPUs and GPUs, we have implemented and evaluated a MD simulation on a petascale heterogeneous supercomputer TH-1A. The results show that MD simulations can be efficiently parallelized with our TLPS scheme and can benefit from the optimizations. 相似文献

10.

Exploiting parallelism: highly competitive semantic tree theorem prover

《国际计算机数学杂志》2012,89(9):1051-1067

Semantic trees have often been used as a theoretical tool for showing the unsatisfiability of clauses in first-order predicate logic. Their practicality has been overshadowed, however, by other strategies. In this paper, we introduce unit clauses derived from resolutions when necessary to construct a semantic tree, leading to a strategy that combines the construction of semantic trees with resolution–refutation. The parallel semantic tree theorem prover, called PrHERBY, combines semantic trees and resolution–refutation methods. The parallel system is scalable by strategically selecting atoms with the help of dedicated resolutions. In addition, a parallel grounding scheme allows each system to have its own instance of generated atoms, thereby increasing the possibility of success. The PrHERBY system presented performs significantly better and generally finds proof using fewer atoms than the semantic tree prover, HERBY and its parallel version, PHERBY. 相似文献

11.

Exploiting inter-operation parallelism for matrix chain multiplication using MapReduce

Jaeseok Myung Sang-goo Lee 《The Journal of supercomputing》2013,66(1):594-609

In this paper, we address the matrix chain multiplication problem, i.e., the multiplication of several matrices. Although several studies have investigated the problem, our approach has some different points. First, we propose MapReduce algorithms that allow us to provide scalable computation for large matrices. Second, we transform the matrix chain multiplication problem from sequential multiplications of two matrices into a single multiplication of several matrices. Since matrix multiplication is associative, this approach helps to improve the performance of the algorithms. To implement the idea, we adopt multi-way join algorithms in MapReduce that have been studied in recent years. In our experiments, we show that the proposed algorithms are fast and scalable, compared to several baseline algorithms. 相似文献

12.

Exploiting application parallelism in knowledge-based systems: an experimental method

J.W.H. Daniel M.R. Moulding 《Concurrency and Computation》1996,8(1):1-18

A method is presented which aims to enhance the run-time performance of real-time production systems of utilising natural concurrency in the application knowledge base. This exploiting application parallelism (EAP) method includes an automated analysis of the knowledge base and the use of this analysis information to partition and execute rules on a novel parallel production system (PPS) architecture. Prototype analysis tools and a PPS simulator have been developed for the Inference ART environment in order to apply the method to a naval data-fusion problem. The results of this experimental investigation revealed that an average maximum of 12.06 rule-firings/cycle was possible but, due to serial bottlenecks inherent in the data-fusion problem, up to only 2.14 rule-firings/cycle was achieved overall. Limitations of the EAP method are discussed within the context of the experimental results and an enhanced method is investigated. 相似文献

13.

Exploiting fine-grained parallelism in graph traversal algorithms via lock virtualization on multi-core architecture

Jie Yan Guangming Tan Ninghui Sun 《The Journal of supercomputing》2014,69(3):1462-1490

相似文献

14.

Exploiting parallelism across program execution: a unificationtechnique and its analysis

Rego V. Mathur A.P. 《Parallel and Distributed Systems, IEEE Transactions on》1990,1(4):399-414

A new technique for source-to-source transformation of sequential programs is described. It is shown that the transformed programs so generated provide significant speedups over the original program on vector processors and vector multiprocessors. The parallelism that arises when multiple instances of a program are executed on simultaneously available data sets is exploited. This is in contrast to the existing approaches that aim at detecting parallelism within a program. The analytic model is used to prove the optimality of the complete first policy for block selection for a class of program graphs known as nonregressive graphs. Analytic and simulation models of the technique clearly indicate the speedups that could be achieved when several data sets are available simultaneously, as is the case in many fields of interest 相似文献

15.

Exploiting thread-level parallelism in the iterative solution of sparse linear systems

José I. Aliaga Matthias Bollhöfer Alberto F. Mart?´n Enrique S. Quintana-Ort?´ 《Parallel Computing》2011,37(3):183-202

We investigate the efficient iterative solution of large-scale sparse linear systems on shared-memory multiprocessors. Our parallel approach is based on a multilevel ILU preconditioner which preserves the mathematical semantics of the sequential method in ILUPACK. We exploit the parallelism exposed by the task tree corresponding to the nested dissection hierarchy (task parallelism), employ dynamic scheduling of tasks to processors to improve load balance, and formulate all stages of the parallel PCG method conformal with the computation of the preconditioner to increase data reuse. Results on a CC-NUMA platform with 16 processors reveal the parallel efficiency of this solution. 相似文献

16.

Exploiting parallelism in 3-D object recognition using the connection machine

Suchendra M. Bhandarkar

Ravi V. Shankar

Minsoo Suk 《Robotics and Autonomous Systems》1991,8(4):291-309

This paper addresses the problem of parallelism in 3-D object recognition and localization using a fine-grained SIMD architecture such as the Connection Machine^TM. The input images of partially occluding three-dimensional objects. The objects are made up of piecewise compositions of curved surfaces. The surfaces of interest are cylindrical, conical and spherical surfaces since they cover a large portion of objects encountered in an industrial environment. Qualitative classification of surfaces based on the signs of the Mean and Gaussian curvature is used to come up with Dihedral feature junctions. Dihedral feature junctions are shown to be robust to occlusion and offer a viewpoint independent scheme for modeling the object models and are well suited for matching and pose determination. Hough clustering is chosen as the constraint propagation mechanism on account of the ease of parallelization. Issues regarding parallelism on the Connection Machine^TM for every stage of the recognition and localization process such as segmentations, feature extraction matching and pose determination are discussed. Experimental results on the Connection Machine^TM bring out the advantages of exploiting parallelism for 3-D object recognition and localization on a fine-grained SIMD architecture. 相似文献

17.

Exploiting multilevel parallelism on a many-core system for the application of hyperheuristics to a molecular docking problem

José M. Cecilia José-Matías Cutillas-Lozano Domingo Giménez Baldomero Imbernón 《The Journal of supercomputing》2018,74(5):1803-1814

The solution of Protein–Ligand Docking Problems can be approached through metaheuristics, and satisfactory metaheuristics can be obtained with hyperheuristics searching in the space of metaheuristics implemented inside a parameterized schema. These hyperheuristics apply several metaheuristics, resulting in high computational costs. To reduce execution times, a shared-memory schema of hyperheuristics is used with four levels of parallelism, two for the hyperheuristic and two for the metaheuristics. The parallel schema is executed in a many-core system in “native mode,” and the four-level parallelism allows us to take full advantage of the massive parallelism offered by this architecture and obtain satisfactory fitness and an important reduction in the execution time. 相似文献

18.

海上多平台多传感器数据位置对准方法

王成飞李文钦王航宇石章松《微型机与应用》2010,29(8)

根据不同的装备条件提出了基于高精度导航设备、舷角一距离、近似经纬度及舷角-距离的海上多平台多传感器数据位置对准方法,建立了每一种方法的对准流程,并对各个环节的坐标变换公式进行了推导,通过分析对准方法的适用环境、误差来源、对准精度和优缺点,给出了该系统对准方案. 相似文献

19.

Few-shot domain adaptation through compensation-guided progressive alignment and bias reduction

Shang Junyuan Niu Chang Huang Junchu Zhou Zhiheng Yang Junmei Xu Shiting Yang Liu 《Applied Intelligence》2022,52(10):10917-10933

Applied Intelligence - Most existing domain adaptation methods require large amounts of data in the target domain to train the model and a relatively long time to adapt different domains. Recently,... 相似文献

20.

Particle Swarm Clustering in clustering ensembles: Exploiting pruning and alignment free consensus

《Applied Soft Computing》2017

A clustering ensemble combines in a consensus function the partitions generated by a set of independent base clusterers. In this study both the employment of particle swarm clustering (PSC) and ensemble pruning (i.e., selective reduction of base partitions) using evolutionary techniques in the design of the consensus function is investigated. In the proposed ensemble, PSC plays two roles. First, it is used as a base clusterer. Second, it is employed in the consensus function; arguably the most challenging element of the ensemble. The proposed consensus function exploits a representation for the base partitions that makes cluster alignment unnecessary, allows for the combination of partitions with different number of clusters, and supports both disjoint and overlapping (fuzzy, probabilistic, and possibilistic) partitions. Results on both synthetic and real-world data sets show that the proposed ensemble can produce statistically significant better partitions, in terms of the validity indices used, than the best base partition available in the ensemble. In general, a small number of selected base partitions (below 20% of the total) yields the best results. Moreover, results produced by the proposed ensemble compare favorably to those of state-of-the-art clustering algorithms, and specially to swarm based clustering ensemble algorithms. 相似文献