Similar articles
 20 similar articles found; search time: 31 ms
1.
A hardware architecture for GF(2^m) multiplication and its evaluation in a hardware architecture for elliptic curve scalar multiplication is presented. The architecture is a parameterizable digit-serial implementation for any field order m. Area/performance trade-off results of the hardware implementation of the multiplier in an FPGA are presented and discussed.
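
The digit-serial idea in entry 1 can be illustrated in software. The following is only a minimal sketch, not the hardware design from the paper: the function name `gf2m_mul`, its parameters, and the most-significant-digit-first ordering are assumptions made for illustration. Field elements are held as Python integers whose bits are polynomial coefficients, and `poly` is an irreducible polynomial that includes the x^m term.

```python
def gf2m_mul(a: int, b: int, m: int, poly: int, digit: int = 4) -> int:
    """Multiply a*b in GF(2^m), consuming `digit` bits of b per step.

    Hypothetical helper for illustration; assumes a, b < 2**m and that
    `poly` is irreducible with its bit m set.
    """
    n_digits = (m + digit - 1) // digit
    digit_mask = (1 << digit) - 1
    result = 0
    # Horner-style evaluation, most significant digit of b first.
    for i in reversed(range(n_digits)):
        # Multiply the accumulator by x^digit, reducing after every shift.
        for _ in range(digit):
            result <<= 1
            if (result >> m) & 1:
                result ^= poly
        # Carry-less product of a with the current digit of b.
        d = (b >> (i * digit)) & digit_mask
        partial = 0
        for j in range(digit):
            if (d >> j) & 1:
                partial ^= a << j
        # Reduce the partial product (degree at most m + digit - 2).
        for k in range(m + digit - 2, m - 1, -1):
            if (partial >> k) & 1:
                partial ^= poly << (k - m)
        result ^= partial
    return result


if __name__ == "__main__":
    # AES field GF(2^8) with x^8 + x^4 + x^3 + x + 1 (0x11B).
    print(hex(gf2m_mul(0x57, 0x83, 8, 0x11B)))
```

A larger `digit` means fewer iterations of the outer loop but more work per iteration, which is the software analogue of the area/performance trade-off the abstract mentions. The choice of digit size does not change the product.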

2.
We discuss the design and implementation of HYDRA_OMP, a parallel implementation of the Smoothed Particle Hydrodynamics-Adaptive P3M (SPH-AP3M) code HYDRA. The code is designed primarily for conducting cosmological hydrodynamic simulations and is written in Fortran77+OpenMP. A number of optimizations for RISC processors and SMP-NUMA architectures have been implemented, the most important optimization being hierarchical reordering of particles within chaining cells, which greatly improves data locality thereby removing the cache misses typically associated with linked lists. Parallel scaling is good, with a minimum parallel scaling of 73% achieved on 32 nodes for a variety of modern SMP architectures. We give performance data in terms of the number of particle updates per second, which is a more useful performance metric than raw MFlops. A basic version of the code will be made available to the community in the near future.

3.
This paper describes the development of a new model predictive control technology INCA® that enables a high performance demand driven operation in the chemical process industry. The technology sustains optimal grade changes, maintains tight quality control and leads to low application development and implementation costs. An application on a polyethylene gas-phase reactor is discussed.

4.
The design, implementation and performance of TwoGroups, a deductive database for the 58,761 groups of order 2^n (n ≤ 8), is described. The system is implemented in NU-Prolog, a Prolog system with built-in functions for creating and using deductive databases. TwoGroups has a set-theoretic query language, which provides users with a familiar notation to access the data. The paper describes the data and its representation, the set-theoretic query language, its translator and optimiser, and the experiments on the performance of the database.

5.
In this paper we study various implementations of Cholesky factorization on SIMD architectures. A submatrix algorithm is implemented on the MasPar MP-2 using both block and torus-wrap data mappings. Both LL^T and LDL^T (square root free) implementations of the algorithm are investigated. The execution times and performance results for the MP-2 are presented. The performance of these algorithms is characterized in terms of the problem size, machine size and other machine dependent communication and computation parameters. Analysis for the communication and computation complexities for these algorithms is also presented, and models to predict the performance are derived. The torus-wrap mapped implementations outperformed the block approach for all problem sizes. The LDL^T implementation outperformed LL^T for small to medium problem sizes. © 1997 John Wiley & Sons, Ltd.
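
To make the difference between the two factorizations in entry 5 concrete, here is a small sequential sketch. It is not the SIMD submatrix algorithm or either data mapping from the paper; the function names `cholesky_llt` and `cholesky_ldlt` are assumptions, and the input is assumed to be a symmetric positive definite matrix given as a list of lists.

```python
import math


def cholesky_llt(A):
    """Return lower-triangular L with A = L L^T (uses one sqrt per column)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        diag = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = math.sqrt(diag)
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def cholesky_ldlt(A):
    """Return (L, D) with unit lower-triangular L and A = L diag(D) L^T.

    No square roots are needed, which is why this variant is called
    square root free.
    """
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    D = [0.0] * n
    for j in range(n):
        D[j] = A[j][j] - sum(L[j][k] ** 2 * D[k] for k in range(j))
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k] * D[k] for k in range(j))
            L[i][j] = (A[i][j] - s) / D[j]
    return L, D


if __name__ == "__main__":
    A = [[4.0, 12.0, -16.0], [12.0, 37.0, -43.0], [-16.0, -43.0, 98.0]]
    print(cholesky_llt(A))
    print(cholesky_ldlt(A))
```

The two results are related by scaling: each column of the LL^T factor equals the corresponding column of the unit LDL^T factor multiplied by the square root of the matching entry of D.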

6.
The increasing amount of real-time traffic carried over the Internet requires end-to-end quality of service (QoS) support. To this end, QoS schedulers, which are implemented in routers, assign the available bandwidth resources to packet flows according to their respective allocated rates. Packet Fair Queuing (PFQ) schedulers can provide fair service and a low end-to-end delay bound to the traffic flows. However, they have higher implementation complexity than other algorithms, because they must track the system state and search, among all flows queued at the outgoing interface, for the packet to be served next. QoS scheduling is a data plane functionality, which requires hardware implementation for high speed router interfaces. Previous work on hardware implementation of PFQ schedulers is specific to certain algorithms and does not provide any results on real hardware platforms. In this paper, we present a general hardware design framework for PFQ schedulers, and apply this framework to the WF2Q+ PFQ algorithm to demonstrate its properties. We carry out the entire implementation of the WF2Q+ algorithm on an FPGA, and evaluate its performance with real traffic flows. In addition, we implement WFQ as a second PFQ algorithm to demonstrate the generality of the framework.

7.
We present several algorithms to compute the solution of a linear system of equations on a graphics processor (GPU), as well as general techniques to improve their performance, such as padding and hybrid GPU‐CPU computation. We compare single and double precision performance of a modern GPU with unified architecture, and show how iterative refinement with mixed precision can be used to regain full accuracy in the solution of linear systems, exploiting the potential of the processor for single precision arithmetic. Experimental results on a GTX280 using CUBLAS 2.0, the implementation of BLAS for NVIDIA® GPUs with unified architecture, illustrate the performance of the different algorithms and techniques proposed. Copyright © 2009 John Wiley & Sons, Ltd.
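
The mixed-precision iterative refinement mentioned in entry 7 can be sketched on the CPU as follows. This is an illustration under stated assumptions, not the GPU code from the paper: single precision is simulated by rounding through `struct`, the helper names `f32`, `solve_low_precision` and `refine` are hypothetical, and the low-precision solver refactors the matrix on every call, whereas a practical implementation would reuse the factorization.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def solve_low_precision(A, b):
    """Gaussian elimination with partial pivoting, rounded to float32."""
    n = len(A)
    M = [[f32(v) for v in row] + [f32(bi)] for row, bi in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            factor = f32(M[i][k] / M[k][k])
            for j in range(k, n + 1):
                M[i][j] = f32(M[i][j] - f32(factor * M[k][j]))
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n]
        for j in range(i + 1, n):
            s = f32(s - f32(M[i][j] * x[j]))
        x[i] = f32(s / M[i][i])
    return x


def refine(A, b, iters=5):
    """Solve A x = b: cheap low-precision solves, residuals in double."""
    n = len(A)
    x = solve_low_precision(A, b)
    for _ in range(iters):
        # Residual r = b - A x is computed in full double precision.
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        # The correction only needs low accuracy, so solve it cheaply.
        d = solve_low_precision(A, r)
        x = [x[i] + d[i] for i in range(n)]
    return x


if __name__ == "__main__":
    print(refine([[4.0, 1.0], [2.0, 3.0]], [1.0, 2.0]))
```

The expensive step (the solve) runs in the fast precision, while only the residual and the update are kept in double precision. For a well-conditioned system, a few refinement steps are typically enough to recover double-precision accuracy.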

8.
This study aims at suggesting an integrative methodology for planning knowledge management initiatives. First, four major underpinning assumptions which should be addressed in knowledge management are identified through literature reviews on strategic information systems planning and knowledge management. Based on these assumptions, we introduce a knowledge strategy planning methodology, called P2-KSP methodology. The P2-KSP methodology places its emphasis on improving organizational performance by identifying and leveraging knowledge directly related to business processes and performance. The methodology consists of five phases: business environment analysis, knowledge requirements analysis, knowledge management strategy establishment, knowledge management architecture design, and knowledge management implementation planning. After its detailed procedures and related features are explained, results of applying it to a large semiconductor manufacturer's knowledge management project are discussed.

9.
This paper considers the use of online frequency response estimates for change detection, leading to fault detection and diagnosis. Spectral changes are detected as deviations in the frequency response, through the choice of a suitable observation variable. Taking into consideration the statistical nature of a finite-time estimator, a χ^2 fixed sample size detector is then designed. The detector is shown to be capable of detecting spectral changes for many classes of systems. The performance of the detector and some practical considerations involving robustness are discussed. Simulations and an implementation on a DC motor are used to illustrate the performance and properties of the detector.

10.
A SuperH™ embedded processor core, SH-X2, implemented in a 90-nm CMOS process running at 800 MHz achieved 1440 Dhrystone MIPS, 5.6 GFLOPS, and 73M polygons/s. It has a dual-issue eight-stage pipeline architecture, but maintains the 1.8 MIPS/MHz of the previous seven-stage processor core SH-X. The processor meets the requirements of a wide range of applications, and is suitable for digital appliances aimed at the consumer market, such as cellular phones, digital still/video cameras, and car navigation systems. This paper focuses on the implementation of floating-point units in the SH-X2 and its resulting performance, and considers ways of enhancing this performance in the future.

11.
Given that the concurrent L1-minimization (L1-min) problem is often required in some real applications, we investigate how to solve it in parallel on GPUs in this paper. First, we propose a novel self-adaptive warp implementation of the matrix-vector multiplication (Ax) and a novel self-adaptive thread implementation of the matrix-vector multiplication (A^T x), respectively, on the GPU. The vector-operation and inner-product decision trees are adopted to choose the optimal vector-operation and inner-product kernels for vectors of any size. Second, based on the above proposed kernels, the iterative shrinkage-thresholding algorithm is utilized to present two concurrent L1-min solvers from the perspective of the streams and the thread blocks on a GPU, and optimize their performance by using the new features of GPU such as the shuffle instruction and the read-only data cache. Finally, we design a concurrent L1-min solver on multiple GPUs. The experimental results have validated the high effectiveness and good performance of our proposed methods.
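
The iterative shrinkage-thresholding algorithm named in entry 11 is built from exactly the two products the abstract highlights, Ax and A^T x. The sketch below is a plain sequential version for a single problem, not the concurrent GPU solvers from the paper. It assumes the common formulation min 0.5·||Ax − b||² + λ·||x||₁; the names `ista` and `soft_threshold` and the conservative step size (one over the squared Frobenius norm of A) are illustrative choices.

```python
def soft_threshold(v: float, tau: float) -> float:
    """Shrink v toward zero by tau (the proximal map of tau * |v|)."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


def ista(A, b, lam, iters=200):
    """Minimise 0.5 * ||A x - b||^2 + lam * ||x||_1 by ISTA.

    A is a non-zero matrix given as a list of rows. The step size uses the
    squared Frobenius norm, an upper bound on the largest eigenvalue of
    A^T A, so the iteration is safe but not the fastest possible.
    """
    m, n = len(A), len(A[0])
    step = 1.0 / sum(a * a for row in A for a in row)
    x = [0.0] * n
    for _ in range(iters):
        # Residual r = A x - b (the "Ax" kernel).
        r = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(m)]
        # Gradient g = A^T r (the "A^T x" kernel).
        g = [sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        # Gradient step followed by shrinkage.
        x = [soft_threshold(x[j] - step * g[j], step * lam) for j in range(n)]
    return x


if __name__ == "__main__":
    # With A = I the minimiser is the soft-thresholded right-hand side.
    print(ista([[1.0, 0.0], [0.0, 1.0]], [3.0, -0.5], lam=1.0))
```

Each iteration costs one product with A and one with A^T, which is why those two kernels dominate the running time and are the natural targets for GPU tuning.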

12.
As today’s standard screening methods frequently fail to detect breast cancer before metastases have developed, early diagnosis is still a major challenge. With the promise of high-quality volume images, three-dimensional ultrasound computer tomography is likely to improve this situation, but has high computational needs. In this work, we investigate the acceleration of the ray-based transmission reconstruction by a GPU-based implementation of the iterative numerical optimization algorithm TVAL3. We identified the regular and transposed sparse-matrix–vector multiply as the performance limiting operations. For accelerated reconstruction we propose two different concepts and devise a hybrid scheme as optimal configuration. In addition we investigate multi-GPU scalability and derive the optimal number of devices for our two primary use-cases: a fast preview mode and a high-resolution mode. In order to achieve a fair estimation of the speedup, we compare our implementation to an optimized CPU version of the algorithm. Using our accelerated implementation we reconstructed a preview 3D volume with 24,576 unknowns, a voxel size of (8 mm)^3 and approximately 200,000 equations in 0.5 s. A high-resolution volume with 1,572,864 unknowns, a voxel size of (2 mm)^3 and approximately 1.6 million equations was reconstructed in 23 s. This constitutes an acceleration of over one order of magnitude in comparison to the optimized CPU version.

13.
This paper details a strategy for modifying the source code of a complex model so that the model may be used in a data assimilation context, and gives the standards for implementing a data assimilation code to use such a model. The strategy relies on keeping the model separate from any data assimilation code, and coupling the two through the use of Message Passing Interface (MPI) functionality. This strategy limits the changes necessary to the model and as such is rapid to program, at the expense of ultimate performance. The implementation technique is applied in different models with state dimension up to 2.7 × 10^8. The overheads added by using this implementation strategy in a coupled ocean-atmosphere climate model are shown to be an order of magnitude smaller than the addition of correlated stochastic random errors necessary for some nonlinear data assimilation techniques.

14.
Important components of molecular modeling applications are estimation and minimization of the internal energy of a molecule. For macromolecules such as proteins and amino acids, energy estimation is performed using empirical equations known as force fields. Over the past several decades, much effort has been directed towards improving the accuracy of these equations, and the resulting increased accuracy has come at the expense of greater computational complexity. For example, the interactions between a protein and surrounding water molecules have been modeled with improved accuracy using the generalized Born solvation model, which increases the computational complexity to O(n^3). Fortunately, many force-field calculations are amenable to parallel execution. This paper describes the steps that were required to transform the Born calculation from a serial program into a parallel program suitable for parallel execution in both the OpenMP and MPI environments. Measurements of the parallel performance on a symmetric multiprocessor reveal that the Born calculation scales well for up to 144 processors. In some cases the OpenMP implementation scales better than the MPI implementation, but in other cases the MPI implementation scales better than the OpenMP implementation. However, in all cases the OpenMP implementation performs better than the MPI implementation, and requires less programming effort as well. Trademark Legend: Sun, Sun Microsystems, SPARC, UltraSPARC, Sun Fire, Sun Performance Library and Sun HPC Cluster Tools are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.

15.
In this paper, we evaluate the impact on performance of various implementation techniques for collective I/O operations, and we do so across four important parallel architectures. We show that a naive implementation of collective I/O does not result in significant performance gains for any of the architectures, but that an optimized implementation does provide excellent performance across all of the platforms under study. Furthermore, we demonstrate that there exists a single implementation strategy that provides the best performance for all four computational platforms. Next, we evaluate implementation techniques for thread-based collective I/O operations. We show that the most obvious implementation technique, which is to spawn a thread to execute the whole collective I/O operation in the background, frequently provides the worst performance, often performing much worse than just executing the collective I/O routine entirely in the foreground. To improve performance, we explore an alternate approach where part of the collective I/O operation is performed in the background, and part is performed in the foreground. We demonstrate that this implementation technique can provide significant performance gains, offering up to a 50% improvement over implementations that do not attempt to overlap collective I/O and computation.

16.
Visual and interactive data exploration requires fast and reliable tools for embedding of an original data space in 3(2)‐dimensional Euclidean space. Multidimensional scaling (MDS) is a good candidate. However, owing to at least O(M^2) memory and time complexity, MDS is computationally demanding for interactive visualization of data sets consisting of on the order of 10^4 objects on computer systems ranging from a PC with a multicore CPU and a graphics processing unit (GPU) board to midrange MPI clusters. To explore interactively data sets of that size, we have developed novel efficient parallel algorithms for MDS mapping based on virtual particle dynamics. We demonstrate that our MDS algorithms implemented in the compute unified device architecture (CUDA) environment on a PC equipped with a modern GPU board (Tesla M2090, GeForce GTX 480) are considerably faster than their MPI/OpenMP parallel implementation on a modern midrange professional cluster (10 nodes, each equipped with 2x Intel Xeon X5670 CPUs). We also show that the hybridized two‐level MPI/CUDA implementation, run on a cluster of GPU nodes, can additionally provide a linear speedup. Copyright 2013 John Wiley & Sons, Ltd.

17.
《Computer Networks》2007,51(6):1403-1420
While existing weighted fair scheduling schemes guarantee minimum bandwidths/resources for the classes/processes of a shared channel, the maximum rate control, which is critical to service providers, carriers, and network managers for resource management and business strategies in many applications, is generally enforced by employing traffic policing mechanisms. These approaches use either a concatenation of the rate controller and scheduler, or a policer in front of the scheduler. The concatenation method uses two sets of queues and a management apparatus that incurs overhead. The latter method may allow bursty traffic to pass through the controller, which violates the maximum rate constraint, or results in packet loss. In this paper, we present a new weighted fair scheduling scheme, WF2Q-M, which can simultaneously support maximum rate control and minimum service rate guarantees. WF2Q-M uses the virtual clock adjustment method to enforce maximum rate control and distribute the excess bandwidths of saturated sessions to other sessions without recalculating the virtual starting and finishing times of sessions. In terms of performance, we prove that WF2Q-M is theoretically bounded by a corresponding fluid reference model. A procedural scheduling implementation of WF2Q-M is proposed, and a proof of correctness is given. Finally, we present the results of extensive experiments to show that the performance of WF2Q-M is just as claimed.

18.
A Q4/Q4 continuum structural topology optimization implementation   Total citations: 4, self-citations: 0, citations by others: 4
A node-based design variable implementation for continuum structural topology optimization in a finite element framework is presented and its properties are explored in the context of solving a number of different design examples. Since the implementation ensures C^0 continuity of design variables, it is immune to element-wise checkerboarding instabilities that are a concern with element-based design variables. Nevertheless, in a subset of design examples considered, especially those involving compliance minimization with coarse meshes, the implementation is found to introduce a new phenomenon that takes the form of layering or islanding in the material layout design. In the examples studied, this phenomenon disappears with mesh refinement or the enforcement of sufficiently restrictive design perimeter constraints, the latter sometimes being necessary in design problems involving bending to ensure convergence with mesh refinement. Based on its demonstrated performance characteristics, the authors conclude that the proposed node-based implementation is viable for continued usage in continuum topology optimization.

19.
This article considers the application of exact multiobjective techniques to search in large size realistic road maps. In particular, the NAMOA* algorithm is successfully applied to several road networks from the DIMACS shortest path implementation challenge with two objectives. An efficient heuristic function previously proposed by Tung and Chew is evaluated. Heuristic values are precalculated with search. The precalculation effort is shown to pay off during the multiobjective search stage. An improvement to the calculation procedure is also proposed, resulting in further improved time performance in many problem instances.

20.
A distributed version of a homotopy algorithm for solving the H2/H∞ mixed-norm controller synthesis problem is presented. The main purpose of the study is to explore the possibility of achieving high performance with low cost. Existing UNIX workstations running PVM (Parallel Virtual Machine) are utilized. Only the Jacobian matrix computation is distributed and therefore the modification to the original sequential code is minimal. The same algorithm has also been implemented on an Intel Paragon parallel machine. Our implementation shows that acceptable speed-up is achieved and the larger the problem sizes, the higher the speed-up. Compared with the results from the Intel Paragon, the study concludes that utilizing the existing UNIX workstations can be a very cost-effective approach to shorten computation time. Furthermore, this economical way to achieve high performance computation can easily be realized and incorporated in a practical industrial design environment.

Copyright © Beijing Qinyun Technology Development Co., Ltd.  ICP license: 京ICP备09084417号