
Applications of Level 2 BLAS in the NAG library

J. J. Du Croz P. J. D. Mayes 《Parallel Computing》1988,8(1-3):345-350

We describe work done to improve the performance of NAG Library routines for nonlinear equations and nonlinear least squares problems on vector-processing machines. Calls to the Level 2 BLAS routines for matrix-vector operations were introduced wherever possible, so that further efforts to tune the code could be concentrated within the Level 2 BLAS, and so that advantage could be taken of optimized implementations of the Level 2 BLAS as they became available. Performance measurements from a CRAY-1S, an AMDAHL VP1100, and a CDC CYBER 205 are presented to illustrate the effectiveness of this strategy.

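Optimization of Level 1 BLAS functions on the Sunway (申威) 1621 processor

李浩然 王磊 《计算机系统应用》2021,30(7):246-252

BLAS (Basic Linear Algebra Subprograms) is a standard set of mathematical functions for basic linear algebra operations. The library is divided into three levels, which provide basic vector-vector (Level 1), matrix-vector (Level 2), and matrix-matrix (Level 3) operations. This paper studies optimization schemes for Level 1 BLAS functions on the Sunway 1621 processor. Taking the AXPY function as an example, it exploits the architectural features of the platform to tune performance and designs an automatic thread allocation scheme. Experimental results show that the optimized AXPY achieves single-core and multi-core speedups of up to 4.36 and 9.50, respectively, over the GotoBLAS reference implementation, and that each optimization technique contributes some performance gain.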
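
For readers unfamiliar with AXPY, the operation is y := alpha * x + y, and a multi-threaded version amounts to giving each thread a contiguous slice of the vectors. The following is a minimal pure-Python sketch of that idea only. The helper names `chunk_bounds` and `axpy_threaded` and the even-split policy are assumptions made for illustration; they are not the paper's thread allocation scheme, and the sketch makes no performance claim.

```python
from concurrent.futures import ThreadPoolExecutor


def axpy(alpha, x, y, start=0, stop=None):
    """In place: y[i] = alpha * x[i] + y[i] for start <= i < stop."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if stop is None:
        stop = len(y)
    for i in range(start, stop):
        y[i] = alpha * x[i] + y[i]
    return y


def chunk_bounds(n, parts):
    """Split range(n) into at most `parts` near-equal contiguous (start, stop) pairs.

    Hypothetical helper: a plain even split, not a tuned allocation policy.
    """
    base, extra = divmod(n, parts)
    bounds, start = [], 0
    for p in range(parts):
        stop = start + base + (1 if p < extra else 0)
        if stop > start:  # skip empty chunks when n < parts
            bounds.append((start, stop))
        start = stop
    return bounds


def axpy_threaded(alpha, x, y, num_threads=4):
    """Apply AXPY with each worker updating its own disjoint slice of y."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [
            pool.submit(axpy, alpha, x, y, start, stop)
            for start, stop in chunk_bounds(len(y), num_threads)
        ]
        for future in futures:
            future.result()  # re-raise any worker exception
    return y
```

Because the slices are disjoint, the workers never write to the same element, which is why AXPY needs no synchronization beyond the final join.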

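Implementation of high-performance BLAS on a Beowulf-class cluster system

吴少刚 许解峰 杨耀忠 任钢 《小型微型计算机系统》2001,22(8):897-900

The Beowulf project's idea of "building on COTS technology to meet special computing needs" has made cluster computing an important school of high-performance computing. Targeting the characteristics of the Intel microprocessors used in a Beowulf-class cluster, this paper discusses BLAS optimization techniques and presents a performance evaluation on a Beowulf-class cluster system that uses a software DSM system as its parallel programming environment.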

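BLAS library optimization for the Loongson 3A architecture

何颂颂 顾乃杰 朱海涛 刘燕君 《小型微型计算机系统》2012,33(3):571-575

Double-precision general matrix multiplication (DGEMM) is one of the core functions of the BLAS library; the core computation of most Level 3 BLAS functions is implemented by calling DGEMM. Exploiting the Loongson 3A's 128-bit memory access instructions, this paper finds the best loop unrolling scheme through theoretical analysis. To suit the Loongson 3A's cache replacement policy (random replacement), it uses address interleaving to reduce cache conflict misses. To cope with the Loongson 3A's limited memory bandwidth, it uses a task partitioning scheme that shares data, which reduces the volume of memory traffic. The optimized DGEMM runs more than twice as fast as the highest-performing open-source BLAS library (GotoBLAS) on both a single core and multiple cores.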
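
To make the loop unrolling idea concrete, the sketch below computes C := alpha * A * B + beta * C twice: once as a plain reference, and once with the column loop unrolled by two so that each element of A is loaded once and used for two outputs. The factor of two is an assumption chosen to mirror a 128-bit access holding two doubles; it is not the unrolling scheme derived in the paper, and the function names are hypothetical. Matrices are non-empty lists of rows.

```python
def dgemm_reference(alpha, A, B, beta, C):
    """Return alpha * A @ B + beta * C using the textbook triple loop."""
    m, k, n = len(A), len(B), len(B[0])
    return [
        [alpha * sum(A[i][p] * B[p][j] for p in range(k)) + beta * C[i][j]
         for j in range(n)]
        for i in range(m)
    ]


def dgemm_unrolled(alpha, A, B, beta, C):
    """Same result, with the j loop unrolled by 2 (illustrative factor only)."""
    m, k, n = len(A), len(B), len(B[0])
    out = [[beta * C[i][j] for j in range(n)] for i in range(m)]
    for i in range(m):
        row_a = A[i]
        j = 0
        while j + 1 < n:
            acc0 = 0.0
            acc1 = 0.0
            for p in range(k):
                a = row_a[p]  # loaded once, reused for two columns of B
                acc0 += a * B[p][j]
                acc1 += a * B[p][j + 1]
            out[i][j] += alpha * acc0
            out[i][j + 1] += alpha * acc1
            j += 2
        if j < n:  # remainder column when n is odd
            acc = 0.0
            for p in range(k):
                acc += row_a[p] * B[p][j]
            out[i][j] += alpha * acc
    return out
```

Unrolling raises the ratio of arithmetic to loads, which is the property a bandwidth-limited processor benefits from; real kernels unroll in both dimensions and keep the accumulators in registers.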

A dissection solver with kernel detection for symmetric finite element matrices on shared memory computers

A. Suzuki F.‐X. Roux 《International journal for numerical methods in engineering》2014,100(2):136-164

A direct solver for symmetric sparse matrices from finite element problems is presented. The solver is intended to work as a local solver of domain decomposition methods for hybrid parallelization on cluster systems of multi-core CPUs, so it must run on shared memory computers and be capable of kernel detection. Symmetric pivoting with a given threshold factorizes a matrix with a decomposition introduced by a nested bisection and selects suspicious null pivots from the threshold. The Schur complement constructed from the suspicious null pivots is examined by a factorization with 1 × 1 and 2 × 2 pivoting and by a robust kernel detection algorithm based on measurement of residuals with orthogonal projections onto supposed image spaces. A static data structure from the nested bisection and a block sub-structure for Schur complements at all bisection levels can use level 3 BLAS routines efficiently. Asynchronous task execution for each block can drastically reduce processor idle time, and as a result, the solver has high parallel efficiency. Numerical experiments show that the solver's performance is competitive with Intel Pardiso on shared memory computers. Copyright © 2014 John Wiley & Sons, Ltd.

UPCBLAS: a library for parallel matrix computations in Unified Parallel C

Jorge González-Domínguez María J. Martín Guillermo L. Taboada Juan Touriño Ramón Doallo Damián A. Mallón Brian Wibecan 《Concurrency and Computation》2012,24(14):1645-1667

The popularity of Partitioned Global Address Space (PGAS) languages has increased in recent years thanks to their high programmability and performance through an efficient exploitation of data locality, especially on hierarchical architectures such as multicore clusters. This paper describes UPCBLAS, a parallel numerical library for dense matrix computations using the PGAS Unified Parallel C language. The routines developed in UPCBLAS are built on top of sequential Basic Linear Algebra Subprograms (BLAS) functions and exploit the particularities of the PGAS paradigm, taking data locality into account in order to achieve good performance. Furthermore, the routines implement other optimization techniques, several of which automatically take into account the hardware characteristics of the underlying systems on which they are executed. The library has been experimentally evaluated on a multicore supercomputer and compared with a message-passing-based parallel numerical library, demonstrating good scalability and efficiency. Copyright © 2012 John Wiley & Sons, Ltd.

Minimizing development and maintenance costs in supporting persistently optimized BLAS

R. Clint Whaley Antoine Petitet 《Software》2005,35(2):101-121

The Basic Linear Algebra Subprograms (BLAS) define one of the most heavily used performance-critical APIs in scientific computing today. It has long been understood that the most important of these routines, the dense Level 3 BLAS, may be written efficiently given a highly optimized general matrix multiply routine. In this paper, however, we show that an even larger set of operations can be efficiently maintained using a much simpler matrix multiply kernel. Indeed, this is how our own project, ATLAS (which provides one of the most widely used BLAS implementations today), supports a large variety of performance-critical routines. Copyright © 2004 John Wiley & Sons, Ltd.

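Optimization of a high-performance CPU-side BLAS library in the heterogeneous HPL algorithm

蔡雨 孙成国 杜朝晖 刘子行 康梦博 李双双 《软件学报》2021,32(8):2289-2306

Improving the efficiency of heterogeneous HPL (high-performance Linpack) requires fully exploiting the computing power of both the accelerators and the general-purpose CPU. The accelerators integrate more compute cores and handle the bulk of the computation, while the general-purpose CPU handles task scheduling and also takes part in the computation. Given sound task partitioning and load balancing, optimizing CPU-side compute performance is especially important for improving overall efficiency. Targeting the architectural characteristics of the specific platform, BLAS (ba...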

A C++11 implementation of arbitrary-rank tensors for high-performance computing

Alejandro M. Aragón 《Computer Physics Communications》2014

This article discusses an efficient implementation of tensors of arbitrary rank by using some of the idioms introduced by the recently published C++ ISO Standard (C++11). With the aim of providing a basic building block for high-performance computing, a single Array class template is carefully crafted, from which vectors, matrices, and even higher-order tensors can be created. An expression template facility is also built around the array class template to provide convenient mathematical syntax. As a result, by using templates, an extra high-level layer is added to the C++ language when dealing with algebraic objects and their operations, without compromising performance. The implementation is tested running on both CPU and GPU.

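Many-core parallel optimization techniques for Level 1 and Level 2 BLAS functions on SW26010-Pro

胡怡 陈道琨 杨超 刘芳芳 马文静 尹万旺 袁欣辉 林蓉芬 《软件学报》2023,34(9):4421-4436

BLAS (basic linear algebra subprograms) is an important module of high-performance extended math libraries and is widely used in scientific and engineering computing. Level 1 BLAS provides vector-vector operations, and Level 2 BLAS provides matrix-vector operations. This work designs and implements high-performance Level 1 and Level 2 BLAS functions for the domestically developed SW26010-Pro many-core processor. A reduction strategy among the slave cores, based on the RMA communication mechanism, improves the reduction efficiency of several Level 1 and Level 2 BLAS functions. For functions with data dependencies, such as TRSV and TPSV, an efficient parallel algorithm is proposed: it maintains the data dependencies through point-to-point synchronization and uses an efficient task mapping mechanism suited to triangular matrices, which effectively reduces the number of point-to-point synchronizations among slave cores and improves execution efficiency. Adaptive optimization, vector compression, data reuse, and related techniques further improve the memory bandwidth utilization of the Level 1 and Level 2 BLAS functions. Experimental results show that the memory bandwidth utilization of the Level 1 functions reaches up to 95% and exceeds 90% on average, while that of the Level 2 functions reaches up to 98% and exceeds 80% on average. Compared with the widely used open-source math library GotoBLAS, the Level 1 and Level 2 BLAS functions achieve average speedups of 18.78x and 25.96x, respectively. LU factorization, QR factorization, and the symmetric eigenvalue problem, by calling...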
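
The data dependency in TRSV is easiest to see in forward substitution for a lower-triangular system L x = b, where each x[i] needs every earlier x[j]. The sketch below shows the plain algorithm and a blocked variant that separates the inherently sequential diagonal-block solve from the trailing row updates, which are independent of each other. This is a generic textbook formulation written as an assumption-laden illustration; it does not reproduce the paper's point-to-point synchronization or its task mapping, and the function names are hypothetical.

```python
def trsv_lower(L, b):
    """Solve L x = b for lower-triangular L by forward substitution."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        s = b[i]
        for j in range(i):
            s -= L[i][j] * x[j]  # depends on x[0..i-1] already being final
        x[i] = s / L[i][i]
    return x


def trsv_lower_blocked(L, b, block=2):
    """Blocked forward substitution; returns the same solution as trsv_lower."""
    n = len(b)
    x = [float(v) for v in b]  # working copy that becomes the solution
    for k0 in range(0, n, block):
        k1 = min(k0 + block, n)
        # Step 1: solve the small diagonal block sequentially.
        for i in range(k0, k1):
            for j in range(k0, i):
                x[i] -= L[i][j] * x[j]
            x[i] /= L[i][i]
        # Step 2: subtract this block's contribution from all later rows.
        # Each row i is independent here, so this loop is where parallel
        # workers could share the work without synchronizing with each other.
        for i in range(k1, n):
            for j in range(k0, k1):
                x[i] -= L[i][j] * x[j]
    return x
```

The block size trades off the sequential portion (step 1) against the parallel portion (step 2); a parallel implementation only has to synchronize once per block rather than once per row.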
