20 similar documents found; search time: 0 ms
1.
Simultaneous multithreaded vector architectures combine the best of data-level and instruction-level parallelism and perform better than either approach could separately. Our design achieves performance equivalent to executing 15 to 26 scalar instructions per cycle for numerical applications.
2.
International Journal of Computer Mathematics, 2012, 89(9): 1051-1067
Semantic trees have often been used as a theoretical tool for showing the unsatisfiability of clauses in first-order predicate logic. Their practicality has been overshadowed, however, by other strategies. In this paper, we introduce unit clauses derived from resolutions when necessary to construct a semantic tree, leading to a strategy that combines the construction of semantic trees with resolution–refutation. The parallel semantic tree theorem prover, called PrHERBY, combines semantic trees and resolution–refutation methods. The parallel system is scalable by strategically selecting atoms with the help of dedicated resolutions. In addition, a parallel grounding scheme allows each system to have its own instance of generated atoms, thereby increasing the possibility of success. The PrHERBY system performs significantly better and generally finds proofs using fewer atoms than the semantic tree prover HERBY and its parallel version, PHERBY.
3.
Graefe G., Davison D.L. IEEE Transactions on Software Engineering, 1993, 19(8): 749-764
Emerging database application domains demand not only high functionality, but also high performance. To satisfy these two requirements, the Volcano query execution engine combines the efficient use of parallelism on a wide variety of computer architectures with an extensible set of query processing operators that can be nested into arbitrarily complex query evaluation plans. Volcano's novel exchange operator permits designing, developing, debugging, and tuning data manipulation operators in single-process environments but executing them in various forms of parallelism. The exchange operator shields the data manipulation operators from all parallelism issues. The design and implementation of the generalized exchange operator are examined. The authors justify their decision to support hierarchical architectures and argue that the exchange operator offers a significant advantage for development and maintenance of database query processing software. They discuss the integration of bit vector filtering into the exchange operator paradigm with only minor modifications.
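The encapsulation idea behind the exchange operator can be sketched in Python. This is a hypothetical, heavily simplified rendering (the class names, the in-memory `Scan`, and the queue capacity are all illustrative, not Volcano's actual interfaces): data operators are plain iterators with no knowledge of threads, and parallelism is introduced purely by inserting an `Exchange` node into the plan.

```python
import threading
import queue

class Scan:
    """A leaf operator: iterates over an in-memory table."""
    def __init__(self, rows):
        self.rows = rows
    def __iter__(self):
        yield from self.rows

class Filter:
    """A data-manipulation operator, written with no knowledge of parallelism."""
    def __init__(self, child, pred):
        self.child, self.pred = child, pred
    def __iter__(self):
        return (row for row in self.child if self.pred(row))

class Exchange:
    """Runs its child subtree in a separate producer thread; the parent
    consumes rows from a bounded queue, so all parallelism stays inside
    this one operator."""
    _DONE = object()
    def __init__(self, child, capacity=64):
        self.child, self.capacity = child, capacity
    def __iter__(self):
        q = queue.Queue(maxsize=self.capacity)
        def produce():
            for row in self.child:
                q.put(row)
            q.put(Exchange._DONE)
        threading.Thread(target=produce, daemon=True).start()
        while (row := q.get()) is not Exchange._DONE:
            yield row

# The Filter can be debugged single-threaded, then parallelized by
# inserting an Exchange below it without changing the Filter at all:
plan = Filter(Exchange(Scan(range(10))), lambda r: r % 2 == 0)
print(list(plan))  # [0, 2, 4, 6, 8]
```

The point of the sketch is structural: swapping `Exchange(Scan(...))` for `Scan(...)` changes where work runs but not what any operator computes.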
4.
A method is presented which aims to enhance the run-time performance of real-time production systems by utilising natural concurrency in the application knowledge base. This exploiting application parallelism (EAP) method includes an automated analysis of the knowledge base and the use of this analysis information to partition and execute rules on a novel parallel production system (PPS) architecture. Prototype analysis tools and a PPS simulator have been developed for the Inference ART environment in order to apply the method to a naval data-fusion problem. The results of this experimental investigation revealed that an average maximum of 12.06 rule-firings/cycle was possible but, owing to serial bottlenecks inherent in the data-fusion problem, only up to 2.14 rule-firings/cycle was achieved overall. Limitations of the EAP method are discussed within the context of the experimental results and an enhanced method is investigated.
5.
We present techniques for exploiting fine-grained parallelism extracted from sequential programs on a fine-grained MIMD system. The system exploits fine-grained parallelism through parallel execution of instructions on multiple processors as well as the pipelined nature of individual processors. The processors can communicate data values via globally shared registers as well as dedicated channel queues. Compilation techniques are presented to utilize these mechanisms. A scheduling algorithm has been developed to distribute operations among the processors in a manner that reduces communication among the processors. The compiler identifies data dependencies which require synchronization and enforces them using channel queues. Delays that may result from attempting write operations to a full channel queue are avoided by spilling values from channels to local registers. If an interprocessor data dependency does not require synchronization, then the data value is passed through a shared register or shared memory. Partially supported by National Science Foundation Presidential Young Investigator Award CCR-9157371 (CCR-9249143) to the University of Pittsburgh.
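The spill-on-full-channel idea can be illustrated with a toy sketch (an assumption-laden model, not the paper's compiler: the `Channel` class, its capacity, and the `spill` deque standing in for local registers are all invented here). A send that would block on a full queue instead parks the value locally, and spilled values re-enter the channel in FIFO order as space frees up:

```python
from collections import deque

class Channel:
    """A bounded FIFO channel between two processors. A send that would
    block on a full queue instead spills the value to a local register
    file (modeled as a deque) and flushes it back as space appears."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()
        self.spill = deque()   # stands in for the sender's local registers

    def _flush(self):
        # Move spilled values back into the channel, oldest first.
        while self.spill and len(self.queue) < self.capacity:
            self.queue.append(self.spill.popleft())

    def send(self, value):
        self._flush()
        if len(self.queue) < self.capacity:
            self.queue.append(value)
        else:
            self.spill.append(value)   # avoid stalling the sender

    def receive(self):
        self._flush()
        value = self.queue.popleft()
        self._flush()                  # spilled values keep FIFO order
        return value

ch = Channel(capacity=2)
for v in [1, 2, 3, 4]:
    ch.send(v)                         # 3 and 4 spill instead of blocking
print([ch.receive() for _ in range(4)])  # [1, 2, 3, 4]
```

The invariant worth noting is that spilling is invisible to the receiver: values arrive in send order whether or not they detoured through the register file.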
6.
Heterogeneous systems with nodes containing more than one type of computation unit, e.g., central processing units (CPUs) and graphics processing units (GPUs), are becoming popular because of their low cost and high performance. In this paper, we have developed a Three-Level Parallelization Scheme (TLPS) for molecular dynamics (MD) simulation on heterogeneous systems. The scheme exploits multi-level parallelism combining (1) inter-node parallelism using spatial decomposition via message passing, (2) intra-node parallelism using spatial decomposition via dynamically scheduled multi-threading, and (3) intra-chip parallelism using multi-threading and short vector extensions in CPUs, and employing multiple CUDA threads in GPUs. By using a hierarchy of parallelism with optimizations such as intra-node communication hiding, and memory optimizations in both CPUs and GPUs, we have implemented and evaluated an MD simulation on the petascale heterogeneous supercomputer TH-1A. The results show that MD simulations can be efficiently parallelized with our TLPS scheme and can benefit from the optimizations.
7.
R. L. Smelyanskiy. Programming and Computer Software, 2011, 37(3): 153-160
The paper discusses problems in the field of program profiling. It presents a retrospective review of approaches to the code profiling problem, which essentially reduces to frequency analysis of sequential program code execution (SPCE). A new approach to solving this problem is proposed. It is based on the Monte Carlo method, which makes it possible to assess the number of program runs required to estimate the frequency of execution of its commands with a given accuracy.
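The abstract does not give the paper's formulas, but the standard Monte Carlo bound for estimating a proportion suggests what such a run-count estimate looks like. The sketch below is a textbook normal-approximation calculation (the function names and the toy instrumented program are assumptions of this illustration, not the paper's method):

```python
import math
import random
import statistics

def runs_needed(epsilon, confidence=0.95):
    """Worst-case number of independent runs so that the observed execution
    frequency of a command is within +/- epsilon of the true frequency,
    using the normal approximation and the bound p(1-p) <= 1/4."""
    # two-sided z-value for the given confidence level
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil(z * z / (4 * epsilon * epsilon))

def program_run(rng):
    """Toy program: the instrumented command executes iff x < 0.3."""
    return rng.random() < 0.3

rng = random.Random(42)
n = runs_needed(epsilon=0.05)
freq = sum(program_run(rng) for _ in range(n)) / n
print(n)  # 385
```

With these numbers, 385 runs suffice for a +/- 0.05 estimate at 95% confidence regardless of the command's true frequency; tightening epsilon to 0.01 pushes the count above 9,600, which is why run-count bounds matter in practice.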
8.
The functional or applicative approach to the development of software leads to the natural exposure of parallelism. Data flow architectures are designed to exploit this parallelism. This paper describes a method for modeling and analyzing program execution on a stream-oriented, recursive data flow architecture. Based on a maximum path length technique, a recursive analysis of nodes (corresponding to high-level constructs) results in an approximation of the time required to execute the program given adequate resources. By relating the structure of a program to the degree of exhibited parallelism, this method provides a useful tool for studying parallelism and evaluating certain problem classes to be run on similar architectures.
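A maximum-path-length analysis of this kind can be sketched recursively: with unlimited resources, children of a parallel node overlap (take the max), while stages of a sequential node accumulate (take the sum). The node encoding and costs below are invented for illustration, not the paper's notation:

```python
def exec_time(node):
    """Approximate execution time of a dataflow program with adequate
    resources: parallel children run concurrently (max of their times),
    sequential children run one after another (sum of their times)."""
    kind, cost, children = node
    if kind == "op":            # primitive actor with a fixed firing time
        return cost
    child_times = [exec_time(c) for c in children]
    if kind == "par":
        return cost + max(child_times)
    if kind == "seq":
        return cost + sum(child_times)
    raise ValueError(f"unknown node kind: {kind}")

# (a;b) in parallel with c, followed by d -- hypothetical node costs
prog = ("seq", 0, [
    ("par", 0, [("seq", 0, [("op", 2, []), ("op", 3, [])]),
                ("op", 4, [])]),
    ("op", 1, []),
])
print(exec_time(prog))  # 6, i.e. max(2 + 3, 4) + 1
```

Comparing `exec_time` against the plain sequential sum of all node costs (here 10) gives exactly the structure-to-parallelism ratio the abstract alludes to.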
9.
We suggest a framework for UML diagram validation and execution that takes advantage of some of the practical restrictions induced by diagrammatic representations (as compared to Turing-equivalent programming languages) by exploiting possible gains in decidability. In particular, within our framework we can prove that an object interaction comes to an end, or that one action is always performed before another. Even more appealingly, we can compute efficiently whether two models are equivalent (aiding in the redesign or refactoring of a model), and what the differences between two models are. The framework employs a simple modelling object language (called MOL) for which we present formal syntax and semantics. A first generation of tools has been implemented that allows us to collect experience with our approach, guiding its further development.
10.
Chang Yun Park. Real-Time Systems, 1993, 5(1): 31-62
This paper describes a method to predict guaranteed and tight deterministic execution time bounds of a sequential program. The basic prediction technique is a static analysis based on a simple timing schema for source-level language constructs, which gives accurate predictions in many cases. Using powerful user-provided information, dynamic path analysis refines looser predictions by eliminating infeasible paths and decomposing the possible execution behaviors in a pathwise manner. Overall prediction cost is scalable with respect to desired precision, by controlling the amount of information provided. We introduce a formal path model for dynamic path analysis, where user execution information is represented by a set of program paths. With a well-defined practical high-level interface language, user information can be used in an easy and efficient way. We also introduce a method to verify given user information with known program verification techniques. Initial experiments with a timing tool show that safe and tight predictions are possible for a wide range of programs. The tool can also provide predictions for interesting subsets of program executions. This research was supported in part by the Office of Naval Research under grant number N00014-89-J-1040.
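A timing schema of this style composes bounds structurally: sequences add, conditionals take the worse branch, and loops multiply a known iteration bound by the body's bound. The encoding and the costs below are illustrative assumptions, not the paper's schema:

```python
def wcet(stmt):
    """Upper bound on execution time from a simple timing schema:
    sequence -> sum, conditional -> max of branches, loop -> bound * body."""
    kind = stmt[0]
    if kind == "basic":                      # ("basic", cost)
        return stmt[1]
    if kind == "seq":                        # ("seq", [s1, s2, ...])
        return sum(wcet(s) for s in stmt[1])
    if kind == "if":                         # ("if", cond_cost, then, else)
        return stmt[1] + max(wcet(stmt[2]), wcet(stmt[3]))
    if kind == "loop":                       # ("loop", bound, cond_cost, body)
        _, bound, cond, body = stmt
        return bound * (cond + wcet(body)) + cond   # plus the final test
    raise ValueError(f"unknown statement kind: {kind}")

# hypothetical costs for: for i in range(10): if c: A else: B
prog = ("loop", 10, 1, ("if", 1, ("basic", 5), ("basic", 3)))
print(wcet(prog))  # 10 * (1 + 1 + 5) + 1 = 71
```

The "looser predictions" the abstract mentions come from exactly this `max` over branches: if user path information rules out the expensive branch on some iterations, the bound tightens, which is what the paper's dynamic path analysis exploits.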
11.
José M. Cecilia, José-Matías Cutillas-Lozano, Domingo Giménez, Baldomero Imbernón. The Journal of Supercomputing, 2018, 74(5): 1803-1814
The solution of Protein–Ligand Docking Problems can be approached through metaheuristics, and satisfactory metaheuristics can be obtained with hyperheuristics searching in the space of metaheuristics implemented inside a parameterized schema. These hyperheuristics apply several metaheuristics, resulting in high computational costs. To reduce execution times, a shared-memory schema of hyperheuristics is used with four levels of parallelism, two for the hyperheuristic and two for the metaheuristics. The parallel schema is executed on a many-core system in "native mode," and the four-level parallelism allows us to take full advantage of the massive parallelism offered by this architecture, obtaining satisfactory fitness and a substantial reduction in execution time.
12.
13.
The most computationally demanding scientific problems are solved with large parallel systems. In some cases these systems are Non-Uniform Memory Access (NUMA) multiprocessors made up of a large number of cores which share a hierarchically organized memory. The basic building block of these scientific codes is often matrix multiplication, and the efficient development of other linear algebra packages is based directly on the matrix multiplication routine implemented in the BLAS library. The BLAS library is used in the form of packages implemented by vendors or as free implementations. The latest versions of this library are multithreaded and can be used efficiently in multicore systems, but when they are used inside parallel codes, the two parallelism levels can interfere and degrade performance. In this work, an auto-tuning method is proposed to select automatically the optimum number of threads to use at each parallel level when multithreaded linear algebra routines are called from OpenMP parallel codes. The method is based on a simple but effective theoretical model of the execution time of the two-level routines. The methodology is applied to a two-level matrix–matrix multiplication and to different matrix factorizations (LU, QR and Cholesky) by blocks. Traditional schemes which directly use the multithreaded BLAS routine, dgemm, are compared with schemes combining the multithreaded dgemm with OpenMP.
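The shape of such an auto-tuner can be sketched abstractly: enumerate feasible (outer, inner) thread splits and pick the one minimizing a modeled execution time. The model below, with its made-up work size and per-thread overhead coefficients, is purely illustrative; the paper fits its own model empirically:

```python
def best_thread_split(total_cores, model):
    """Exhaustively pick (omp_threads, blas_threads) minimising a modelled
    two-level execution time, without oversubscribing the cores."""
    candidates = [(p1, p2)
                  for p1 in range(1, total_cores + 1)
                  for p2 in range(1, total_cores + 1)
                  if p1 * p2 <= total_cores]
    return min(candidates, key=lambda c: model(*c))

# Hypothetical model: perfectly divisible work plus a per-thread start-up
# overhead at each level (coefficients would be fitted empirically; these
# values are invented for the example).
def model(p1, p2, work=1e9, t_flop=1e-9, k1=2e-3, k2=5e-4):
    return work * t_flop / (p1 * p2) + k1 * p1 + k2 * p2

print(best_thread_split(16, model))  # (2, 8)
```

Because the modeled outer-level (OpenMP) threads cost more to spin up than inner-level (BLAS) threads here, the tuner prefers few OpenMP threads over many, rather than the naive all-outer or all-inner splits; with a different fitted model the chosen split changes accordingly.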
14.
Alex Groce, Rajeev Joshi. International Journal on Software Tools for Technology Transfer (STTT), 2008, 10(2): 131-144
From operating systems and web browsers to spacecraft, many software systems maintain a log of events that provides a partial history of execution, supporting post-mortem (or post-reboot) analysis. Unfortunately, bandwidth, storage limitations, and privacy concerns limit the information content of logs, making it difficult to fully reconstruct execution from these traces. This paper presents a technique for modifying a program such that it can produce exactly those executions consistent with a given (partial) trace of events, enabling efficient analysis of the reduced program. Our method requires no additional history variables to track log events, and it can slice away code that does not execute in a given trace. We describe initial experiences with implementing our ideas by extending the CBMC bounded model checker for C programs. Applying our technique to a small, 400-line file system written in C, we get more than three orders of magnitude improvement in running time over a naïve approach based on adding history variables, along with fifty- to eighty-fold reductions in the sizes of the SAT problems solved.
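The core notion, restricting a non-deterministic program to executions consistent with a partial log, can be made concrete with a toy enumeration (this is a didactic sketch of the idea only; the paper's actual technique transforms C programs for bounded model checking, and the toy program, event names, and subsequence reading of "partial" below are all assumptions):

```python
import itertools

def run(choices):
    """Toy program with two binary non-deterministic choices; returns the
    events it would log and its final state."""
    events, state = [], 0
    a, b = choices
    events.append("open")
    if a:
        events.append("write"); state += 1
    if b:
        events.append("write"); state += 1
    events.append("close")
    return events, state

def consistent(events, log):
    """Treat the partial log as a subsequence of the full event stream."""
    it = iter(events)
    return all(e in it for e in log)

log = ["open", "write", "close"]   # partial: at least one 'write' recorded
feasible = [run(c) for c in itertools.product([False, True], repeat=2)
            if consistent(run(c)[0], log)]
print(sorted(state for _, state in feasible))  # [1, 1, 2]
```

Of the four possible executions, the no-write run is pruned by the log; analysis (here, inspecting final states) then only has to consider the three log-consistent runs, which is the source of the reductions the paper reports at much larger scale.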
15.
A formal algebraic model for divide-and-conquer algorithms is presented. The model reveals the internal structure of divide-and-conquer functions, leads to high-level and functional-styled algorithm specifications, and simplifies complexity analysis. Algorithms developed under the model contain vast amounts of parallelism and can be mapped fairly easily to parallel computers. This research was supported in part by DOE grant DOE FG02-86ER25012.
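The generic structure such a model exposes can be sketched as a higher-order combinator: a divide-and-conquer function is determined by a base-case predicate, a base solver, a divide step, and a combine step. This is a standard functional rendering of the schema, not the paper's specific algebra; the component names are this sketch's own:

```python
def dac(is_base, base, divide, combine):
    """Build a divide-and-conquer function from its four components.
    The recursive calls on the sub-problems are independent of each
    other, which is where the parallelism lives."""
    def f(x):
        if is_base(x):
            return base(x)
        return combine(*[f(part) for part in divide(x)])
    return f

def merge(left, right):
    """Combine step for mergesort: merge two sorted lists."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

# Mergesort as an instance of the schema:
mergesort = dac(
    is_base=lambda xs: len(xs) <= 1,
    base=lambda xs: xs,
    divide=lambda xs: (xs[:len(xs) // 2], xs[len(xs) // 2:]),
    combine=merge,
)
print(mergesort([5, 3, 8, 1, 9, 2]))  # [1, 2, 3, 5, 8, 9]
```

Because the sub-calls in the list comprehension share no state, a parallel mapping of the same schema only has to replace that comprehension with a parallel map, which is the easy mapping to parallel machines the abstract claims.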
16.
Through geometry, program visualization can yield performance properties. We derive all possible synchronization sequences, and the durations of blocking and concurrent execution, for two-process programs from a visualization that maps processes, synchronization, and program execution to Cartesian graph axes, line segments, and paths, respectively. Relationships to Petri nets are drawn.
17.
18.
Important insights into program operation can be gained by observing dynamic execution behavior. Unfortunately, many high-performance machines provide execution profile summaries as the only tool for performance investigation. We have developed a tracing library for the CRAY X-MP and CRAY-2 supercomputers that supports the low-overhead capture of execution events for sequential and multitasked programs. This library has been extended to use the automatic instrumentation facilities on these machines, allowing trace data from routine entry and exit, and other program segments, to be captured. To assess the utility of the trace-based tools, three of the Perfect Benchmark codes have been tested in scalar and vector modes with the tracing instrumentation. In addition to computing summary execution statistics from the traces, interesting execution dynamics appear when studying the trace histories. It is also possible to model application performance based on properties identified from traces. Our conclusion is that adding tracing support in Cray supercomputers can have significant returns in improved performance characterization and evaluation. An earlier version of this paper was presented at Supercomputing '90. Supported in part by the National Science Foundation under Grants No. NSF MIP-88-07775 and No. NSF ASC-84-04556, and the NASA Ames Research Center Grant No. NCC-2-559. Supported in part by the National Science Foundation under grant NSF ASC-84-04556. Supported in part by the National Science Foundation under grants NSF CCR-86-57696, NSF CCR-87-06653 and NSF CDA-87-22836 and by the National Aeronautics and Space Administration under NASA Contract Number NAG-1-613.
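The difference between a profile summary and an event trace is easy to show in miniature. The decorator below (a generic sketch, nothing to do with the actual Cray library's interface) records timestamped routine entry/exit events, from which both summaries and full call histories can be recovered:

```python
import functools
import time

TRACE = []   # global event buffer: (kind, routine, timestamp)

def traced(fn):
    """Record entry/exit events for a routine, in the spirit of
    routine-level automatic instrumentation."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        TRACE.append(("enter", fn.__name__, time.perf_counter()))
        try:
            return fn(*args, **kwargs)
        finally:
            TRACE.append(("exit", fn.__name__, time.perf_counter()))
    return wrapper

@traced
def inner():
    return sum(range(100))

@traced
def outer():
    return inner() + inner()

outer()
calls = [name for kind, name, _ in TRACE if kind == "enter"]
print(calls)  # ['outer', 'inner', 'inner']
```

A profile would only report call counts and total times per routine; the trace additionally preserves ordering and nesting ('outer' enters before both 'inner' calls), which is exactly the "execution dynamics" a summary throws away.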
19.
Jiang Yong (蒋勇). 网络安全技术与应用 (Network Security Technology & Application), 2010, (12): 18-21
JAXM is an API defined for applications on the Java platform; it provides a standard way to send XML documents over the Internet from the Java platform. The XML Key Management Specification (XKMS) builds on the existing XML Encryption and XML Digital Signature standards, defining protocols for distributing and registering public keys. The key locate service is an important service specification within XKMS. Building on a discussion of the theory behind JAXM and XKMS, this paper presents an implementation scheme for an XKMS key locate service based on JAXM.
20.
The MAJC architecture enhances application performance by exploiting parallelism at multiple levels: instruction, data, thread, and process. Supporting vertical multithreading, speculative multithreading, and chip multiprocessors, the scalable VLIW architecture is also capable of advanced speculation and predication and treats all data types uniformly.