Similar literature
Found 20 similar articles (search time: 125 ms)
1.
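To address the problem of how to weigh strategies and choose a suitable execution model for obtaining the desired performance in speculative parallelization, a speculative parallelization method based on interaction information is proposed. Interaction information is used to determine the execution model for speculative parallelization, a corresponding evaluation model is built, and strategies with their performance evaluation are presented, with emphasis on thread extraction and creation. Experiments show that "on-demand" parallelization based on interaction information can meet the required performance.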

2.
In molecular dynamics (MD) simulations, pairwise additive calculations of potentials and their coordinate derivatives (i.e., forces), such as the Lennard–Jones interactions and the short-range part of the Coulombic interactions, account for the main part of the arithmetic operations. High thread-level parallelization efficiency in these pairwise additive calculations is essential for using current supercomputers with many-core architectures effectively. In this paper, we propose four new thread-level parallelization algorithms for the pairwise additive potential and force calculations. We implement the four algorithms in an MD code based on the fast multipole method. Performance benchmarks were taken on the FX100 supercomputer and the Intel Xeon Phi coprocessor. The code achieves high thread-level parallelization efficiency with 32 threads on the FX100 and up to 60 threads on the Xeon Phi.

3.
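To address the slow encryption speed of the 3DES algorithm in smart IC card security authentication, this paper replaces 3DES with AES for smart IC card authentication and improves the key diversification mechanism by using the logistic map from chaotic systems to diversify keys. Experimental results show that the improved authentication algorithm effectively increases the authentication speed of smart IC cards while preserving encryption strength.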

4.
We construct a parallel algorithm, suitable for distributed memory architectures, of an explicit shock-capturing finite volume method for solving the two-dimensional shallow water equations. The finite volume method is based on the very popular approximate Riemann solver of Roe and is extended to second order spatial accuracy by an appropriate TVD technique. The parallel code is applied to distributed memory architectures using domain decomposition techniques and we investigate its performance on a grid computer and on a Distributed Shared Memory supercomputer. The effectiveness of the parallel algorithm is considered for specific benchmark test cases. The performance of the realization measured in terms of execution time and speedup factors reveals the efficiency of the implementation.

5.
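For the DSA digital signature algorithm, a method is proposed that uses precomputed lookup tables to speed up DSA signature verification. The concrete algorithm is given together with an analysis of its complexity; it also applies to certain double-exponentiation modular computations with fixed bases. Compared with the methods in common use, verification is more than twice as fast, and the method strikes a balance between computational efficiency and storage, which makes it well suited to certain applications. A corresponding example is given and analyzed.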
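The idea of trading storage for speed in a fixed-base double exponentiation can be sketched as follows. This is a generic fixed-base windowing scheme, not necessarily the table layout used in the paper; the function names (`build_table`, `fixed_base_pow`, `double_pow`) and the window width of 4 bits are assumptions made for illustration.

```python
def build_table(base, modulus, exp_bits, window=4):
    """Precompute base**(d * 2**(window*i)) mod modulus for every window i and digit d."""
    table = []
    step = base % modulus
    for _ in range((exp_bits + window - 1) // window):
        row = [1]
        for _ in range((1 << window) - 1):
            row.append(row[-1] * step % modulus)
        table.append(row)
        # row[-1] is step**(2**window - 1); one more multiplication gives
        # step**(2**window), the base for the next window.
        step = row[-1] * step % modulus
    return table


def fixed_base_pow(table, exponent, modulus, window=4):
    """Compute base**exponent mod modulus with one table lookup per window."""
    if exponent >> (len(table) * window):
        raise ValueError("exponent is wider than the precomputed table")
    mask = (1 << window) - 1
    result = 1
    for row in table:
        result = result * row[exponent & mask] % modulus
        exponent >>= window
    return result


def double_pow(table_g, table_y, u1, u2, modulus, window=4):
    """Compute g**u1 * y**u2 mod modulus from two precomputed tables."""
    return (fixed_base_pow(table_g, u1, modulus, window)
            * fixed_base_pow(table_y, u2, modulus, window)) % modulus
```

A larger window means fewer multiplications per exponentiation but a larger table, which is the efficiency/storage balance the abstract refers to.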

6.
Real, 2004, 10(2): 95-102
Finding the closest codeword in the encoding phase of vector quantization (VQ) has high computational complexity and is time-consuming, especially when the codewords are high-dimensional vectors. In this paper, we propose three newly developed schemes for speeding up the encoding phase of VQ. The proposed schemes can easily filter out many impossible codewords so that the search domain is reduced. The experimental results show that the proposed schemes save 41–52% of the computational time of a full search scheme. Furthermore, our schemes require less than 84% of the computational time of a recently proposed alternative.

7.
Current parallelizing compilers do a reasonable job of extracting parallelism from programs with regular, well-behaved, statically analyzable access patterns. However, they cannot extract a significant fraction of the available parallelism if the program has a complex and/or statically insufficiently defined access pattern, e.g., simulation programs with irregular domains and/or dynamically changing interactions. Since such programs represent a large fraction of all applications, techniques are needed for extracting their inherent parallelism at run-time. In this paper we give a new run-time technique for finding an optimal parallel execution schedule for a partially parallel loop, i.e., a loop whose parallelization requires synchronization to ensure that the iterations are executed in the correct order. Given the original loop, the compiler generates inspector code that performs run-time preprocessing of the loop's access pattern, and scheduler code that schedules (and executes) the loop iterations. The inspector is fully parallel, uses no synchronization, and can be applied to any loop (from which an inspector can be extracted). In addition, it can implement at run-time the two most effective transformations for increasing the amount of parallelism in a loop: array privatization and reduction parallelization (elementwise). The ability to identify privatizable and reduction variables is very powerful since it eliminates the data dependences involving these variables and ... An abstract of this paper has been published in Ref. 1. Research supported in part by Army contract #DABT63-92-C-0033. This work is not necessarily representative of the positions or policies of the Army or the Government. Research supported in part by Intel and NASA Graduate Fellowships. Research supported in part by an AT&T Bell Laboratories Graduate Fellowship and by the International Computer Science Institute, Berkeley, California.
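The inspector/scheduler split can be illustrated with a small sketch for a loop of the form `a[writes[i]] = f(a[reads[i]])`. This is a simplified, sequential inspector written for illustration only (the paper describes a fully parallel one), and the names `inspector` and `executor` as well as the loop shape are assumptions. The inspector assigns each iteration a wavefront number from the run-time access pattern; the executor then runs the wavefronts in order, and iterations inside one wavefront are mutually independent.

```python
from collections import defaultdict


def inspector(reads, writes):
    """Assign each iteration a wavefront number from its run-time access pattern."""
    last_write = {}  # element -> wavefront of the latest iteration that wrote it
    last_read = {}   # element -> highest wavefront of any iteration that read it
    wave = []
    for r, w in zip(reads, writes):
        level = 1 + max(
            last_write.get(r, -1),  # flow dependence: read after earlier write
            last_write.get(w, -1),  # output dependence: write after earlier write
            last_read.get(w, -1),   # anti dependence: write after earlier read
        )
        wave.append(level)
        last_write[w] = level
        last_read[r] = max(last_read.get(r, -1), level)
    return wave


def executor(a, reads, writes, wave, f):
    """Run the loop wavefront by wavefront; each wavefront could run in parallel."""
    schedule = defaultdict(list)
    for i, level in enumerate(wave):
        schedule[level].append(i)
    for level in sorted(schedule):
        # Reversed order on purpose: within a wavefront the order does not matter.
        for i in reversed(schedule[level]):
            a[writes[i]] = f(a[reads[i]])
    return a
```

The number of distinct wavefronts is the length of the critical path through the loop's dependence graph, so it bounds the achievable speedup for this access pattern.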

8.
Pattern Analysis and Applications - The accuracy of multi-class classification problems is improving at a good pace. However, improving the accuracy often leads to slowing down the processing...

9.
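To accelerate the deformation of three-dimensional objects, a new component-based method is proposed for computing the bounding volumes of metaballs. The bounding volume of each metaball is stored in an array; by querying this array, the components that influence any point in the deformation space can be found quickly. The distance from the skeleton center of each influencing component to the point is then computed, along with the sum of the potential functions of all influencing components at that point. The marching cubes algorithm then compares the potential value at the point with a threshold to obtain the spatial coordinates and unit normal vectors of the vertices on the surface of the deformed object. The deformed object is rendered with OpenGL.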

10.
Modular exponentiation is a common operation for scrambling secret data and is used by several public-key cryptosystems, such as the RSA scheme and the DSS digital signature scheme. However, the calculations involved in modular exponentiation are time-consuming, especially when performed in software. In this paper, we propose an efficient CMM-MSD Montgomery algorithm that combines the Montgomery modular reduction method, the common-multiplicand-multiplication (CMM) method, and the minimal-signed-digit (MSD) recoding technique for fast modular exponentiation. By recording the common signed-digit representations in the grouped substrings of the exponent, our algorithm improves the efficiency of both the original CMM exponentiation algorithm and the Montgomery multiplication algorithm. The fast modular exponentiation algorithm developed in this paper can be easily implemented on a general signed-digit computing machine, and is therefore well suited to parallel implementation for fast evaluation of modular exponentiation. Moreover, with the proposed CMM-MSD Montgomery algorithm, the total number of single-precision multiplications is reduced on average by about 38.9% and 26.68% compared with Dusse-Kaliski’s Montgomery algorithm and Ha-Moon’s Montgomery algorithm, respectively.
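As background for the Montgomery building block mentioned above, here is a minimal sketch of plain left-to-right binary exponentiation using Montgomery reduction. It does not implement the CMM or MSD recoding parts of the proposed algorithm; the function names (`redc`, `mont_pow`) and the choice R = 2^k with k the bit length of the modulus are assumptions made for illustration. The three-argument `pow` with exponent -1 requires Python 3.8 or later.

```python
def redc(t, n, k, n_prime):
    """Montgomery reduction: return t * R**-1 mod n for R = 2**k and 0 <= t < n * R."""
    mask = (1 << k) - 1
    m = ((t & mask) * n_prime) & mask   # m = (t mod R) * n' mod R
    u = (t + m * n) >> k                # exact division by R
    return u - n if u >= n else u


def mont_pow(base, exp, n):
    """Compute base**exp mod n for odd n > 1 with Montgomery multiplications."""
    if n < 3 or n % 2 == 0:
        raise ValueError("modulus must be odd and greater than 1")
    k = n.bit_length()                  # R = 2**k > n
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r      # n * n_prime == -1 (mod R)
    x = (base % n) * r % n              # base in Montgomery form
    acc = r % n                         # 1 in Montgomery form
    for bit in bin(exp)[2:]:
        acc = redc(acc * acc, n, k, n_prime)      # square
        if bit == "1":
            acc = redc(acc * x, n, k, n_prime)    # multiply
    return redc(acc, n, k, n_prime)     # convert back from Montgomery form
```

The reduction replaces division by `n` with shifts and masks, which is why Montgomery-based methods are attractive for long exponentiations; recoding the exponent, as the abstract describes, reduces the number of multiply steps further.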

11.
This paper proposes a scalable two-level parallelization method for distributed hydrological models that can use parallelizability at both the sub-basin level and the basic simulation-unit level (e.g., grid cell) simultaneously. This approach first uses the message-passing programming model to dispatch parallel tasks at the sub-basin level to different nodes with multi-core CPUs in the cluster. Each node is responsible for some of the sub-basins. Parallel tasks for each sub-basin at the basic simulation-unit level are then dispatched to multiple cores within each node using the shared-memory programming model. A grid-based distributed hydrological model was parallelized to demonstrate the performance of the proposed method, which was tested in different scenarios (e.g., different data volume, different numbers of sub-basins). Results show that the proposed two-level parallelization method had better scalability than the parallel computation at sub-basin level alone, and the parallel performance increased with data volume and the number of sub-basins.

12.
To speed up query processing on Big Data, frequent sub-queries or views may be materialized so that the query processing cost is minimized at an optimum cost of maintaining the materialized views and/or queries. Materializing frequent sub-queries and views means that the resultant data sets of the views reside in the memory of one or more nodes in the cluster, which reduces the MapReduce cost and the submission and scheduling cost of Distributed File System jobs for query processing. We define materialized views as the resultant data of frequent sub-queries and aggregation functions of a set of Big Data warehousing queries that are saved to enhance query performance. The problem is defined as a multi-objective optimization problem that minimizes the total query processing MapReduce cost, the MapReduce cost of maintaining the materialized views, and the number of views selected for materializing, while maximizing the total size of the views selected. We applied the Differential Evolution algorithm and NSGA-II and studied their performance for developing a recommendation system that selects views for materializing in Big Data warehousing.

13.
14.
We report a novel micro magnetic gyromixer designed for accelerating mixing, and hence reactions, in droplets. The gyromixer is fabricated with magnetite-PDMS composite using soft lithography. The mixer spins and balances itself on the droplet surface through the gyroscopic effect, rapidly homogenizing the enclosed reagents by stretching and folding internal fluid streamlines to enhance mixing. We examined the capability of the gyromixer for improving biochemical reactions in droplets by monitoring the biotin-streptavidin binding as a linker in a quantum dot fluorescence resonant energy transfer (QD-FRET) sensing system. The remotely controlled gyromixer exhibits high flexibility and potential for integration in a variety of droplet-based miniaturized total analysis systems (μTAS) to reduce turnaround times.

15.
16.
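The rapid development of the cloud computing industry has made virtualization technology central for major cloud service providers. To increase profit, providers need to exploit device performance as fully as possible while safeguarding user experience. By using information such as the priority and importance of I/O requests, researchers have implemented many methods in the Linux kernel that improve program performance. However, this information is lost as it passes from the virtual machine to the host, so an I/O guarantee framework based on service level objectives (SLOs) is proposed. The paper first analyzes why information such as I/O request priority is lost and identifies the key problems that must be solved to transmit it. On this basis, the proposed framework extends the Linux kernel, the virtio protocol, and QEMU, the I/O virtualization program of KVM, to deliver the SLO information of virtual machine threads to the host, and implements an SLO-aware scheduler on top of it. Finally, experiments verify the feasibility of the framework: the highest-priority thread reaches a throughput of 260 KB/s while the lowest-priority thread reaches only 10 KB/s, demonstrating that the SLO information passed down by the framework has a positive effect on scheduling in the host.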

17.
Most methods for programming loosely coupled systems are based on message-passing. Recently, however, methods have emerged based on ‘virtually’ sharing data. These methods simplify distributed programming, but are hard to implement efficiently, as loosely coupled systems do not contain physical shared memory. We introduce a new model, the shared data-object model, that eases the implementation of parallel applications on loosely coupled systems, but can still be implemented efficiently. In our model, shared data are encapsulated in passive data-objects, which are variables of user-defined abstract data types. To speed up access to shared data, data-objects are replicated. This ability to replicate objects is a significant difference from other object-based models (e.g. Emerald and Amber). Also, by replicating logical objects rather than physical pages, our model has many advantages over shared virtual memory systems. This paper discusses the design choices involved in replicating objects and their effect on performance. Important issues are: how to maintain consistency among different copies of an object; how to implement changes to objects; which strategy for object replication to use. We have implemented several options to determine which ones are the most efficient.

18.
The fast Fourier transform (FFT) is undoubtedly an essential primitive that has been applied in various fields of science and engineering. In this paper, we present a decomposition method for the parallelization of multi-dimensional FFTs with the smallest communication amounts for all ranges of the number of processes compared to previously proposed methods. This is achieved by two distinguishing features: adaptive decomposition and transpose order awareness. In the proposed method, the FFT data is decomposed on a row-wise basis that maps the multi-dimensional data into one-dimensional data and translates the corresponding coordinates from multiple dimensions into one dimension, so that the one-dimensional data can be divided and allocated equally to the processes using a block distribution. As a result, and unlike previous works that predefine the dimensions of decomposition, our method can adaptively decompose the FFT data on the lowest possible dimensions depending on the number of processes. In addition, this row-wise decomposition provides plenty of alternatives in data transpose, and different transpose orders result in different amounts of communication. We identify the best transpose orders with the smallest communication amounts for the 3-D, 4-D, and 5-D FFTs by analyzing all possible cases. We also develop a general parallel software package for the most popular 3-D FFT based on our method using the 2-D domain decomposition. Numerical results show good performance and scaling properties of our implementation in comparison with other parallel packages. Given both communication efficiency and scalability, our method is promising in the development of highly efficient parallel packages for the FFT.
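The row-wise mapping and block distribution described above might look like the following sketch. It is one reading of the abstract rather than the paper's implementation: the rows along the last axis are assumed to stay local to a process, the remaining axes are flattened in row-major order, and the function names (`flatten`, `block_range`, `owner`) are hypothetical. `math.prod` requires Python 3.8 or later.

```python
from math import prod


def flatten(coords, shape):
    """Translate multi-dimensional coordinates into a row-major 1-D index."""
    idx = 0
    for c, n in zip(coords, shape):
        idx = idx * n + c
    return idx


def block_range(total, nprocs, rank):
    """Half-open range [start, stop) of items owned by `rank` in a balanced block distribution."""
    base, extra = divmod(total, nprocs)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def owner(coords, shape, nprocs):
    """Rank that owns the row containing `coords`; the last axis is kept local."""
    row_shape = shape[:-1]
    row = flatten(coords[:-1], row_shape)
    base, extra = divmod(prod(row_shape), nprocs)
    boundary = extra * (base + 1)       # the first `extra` ranks hold base + 1 rows
    if row < boundary:
        return row // (base + 1)
    return extra + (row - boundary) // base
```

Because the split is made on the flattened row index rather than on a fixed set of axes, the same code covers slab-like and pencil-like layouts: with few processes whole planes end up on one rank, and with many processes a plane is shared among several ranks.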

19.
Structural and Multidisciplinary Optimization - Surrogate modeling is commonly used to replace expensive simulations of engineering problems. Kriging is a popular surrogate for deterministic...

20.
In this paper, a new methodology for speeding up edge and line detection algorithms is presented. It achieves improved performance over the state-of-the-art software library OpenCV (speedup from 1.35 up to 2.22) and other conventional implementations, on both general-purpose and embedded processors, by reducing the number of load/store and arithmetic instructions, the number of data cache accesses and data cache misses in the memory hierarchy, and the algorithm memory size. This is achieved by fully exploiting the combination of the software and hardware parameters, which are considered simultaneously as one problem and not separately. Furthermore, the edge and line detection algorithms have been simplified for a computer vision application (detection and measurement of flow fronts in a microfluidic device) on a Virtex-5 Xilinx FPGA using the Microblaze soft processor, where the method achieves a speedup of up to 660 times in comparison with conventional software implementations.
