1.

Magas: matrix-based asynchronous graph analytics on shared memory systems

Luo Le Liu Yi Yang Hailong Qian Depei 《The Journal of supercomputing》2022,78(4):5650-5680

Graph analytics plays an important role in many areas such as big data and artificial intelligence. The vertex-centric programming model provides friendly interfaces to programmers and is extensively used in graph processing frameworks. However, it is prone to generating many irregular memory accesses and considerable scheduling overhead, because programs are executed and scheduled per vertex in the backend. The matrix-based model takes a different approach, using high-performance matrix operations in the backend to improve the efficiency of graph processing. Unfortunately, current matrix-based frameworks support only the synchronous parallel model, which limits the range of graph algorithms they can serve. To address these problems, this paper proposes a graph processing framework that combines matrix operations with the asynchronous model while providing friendly programming interfaces similar to those of the vertex-centric programming model. First, we propose an approach to map vertex-based graph processing to matrix operations in the asynchronous model. Then, we propose two asynchronous scheduling policies, the Gauss–Seidel policy and the relaxed Gauss–Seidel policy, for different graph algorithms. Finally, our framework applies batch scheduling and an optimized in-memory data structure to reduce the scheduling overhead introduced by the asynchronous model. Experimental results show that our framework outperforms popular vertex programming frameworks such as GraphLab and GRACE in both execution performance and speedup, and achieves performance similar to that of BSP-based matrix frameworks such as GraphMat.
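
The contrast between a synchronous (BSP-style) sweep and a Gauss–Seidel-style asynchronous sweep can be illustrated with single-source shortest paths written as a min-plus matrix-vector product. The following is only a minimal sketch under stated assumptions: the function names, the dense adjacency-matrix layout, and the choice of algorithm are hypothetical and are not taken from the Magas framework.

```python
import math

INF = math.inf  # marks "no edge" in the dense adjacency matrix (assumed layout)


def sssp_synchronous(adj, source):
    """Synchronous sweeps: every vertex reads only values from the previous sweep."""
    n = len(adj)
    dist = [INF] * n
    dist[source] = 0.0
    sweeps = 0
    while True:
        sweeps += 1
        # One min-plus matrix-vector product computed from the old vector only.
        new = [min(dist[v], min(dist[u] + adj[u][v] for u in range(n)))
               for v in range(n)]
        if new == dist:
            return dist, sweeps
        dist = new


def sssp_gauss_seidel(adj, source):
    """Gauss-Seidel sweeps: updates are written in place, so vertices processed
    later in the same sweep already see the new values."""
    n = len(adj)
    dist = [INF] * n
    dist[source] = 0.0
    sweeps = 0
    changed = True
    while changed:
        sweeps += 1
        changed = False
        for v in range(n):
            best = min(dist[u] + adj[u][v] for u in range(n))
            if best < dist[v]:
                dist[v] = best
                changed = True
    return dist, sweeps


# Chain graph 0 -> 1 -> 2 -> 3 with unit weights (illustrative input).
chain = [[INF] * 4 for _ in range(4)]
for u in range(3):
    chain[u][u + 1] = 1.0

sync_dist, sync_sweeps = sssp_synchronous(chain, 0)
gs_dist, gs_sweeps = sssp_gauss_seidel(chain, 0)
```

On this chain both variants reach the same distances, but the in-place variant needs fewer sweeps because new values propagate within a sweep. That is the general motivation for asynchronous scheduling, although how much it helps depends on the graph and the vertex order.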

2.

Toward High-Performance Delta-Based Iterative Processing with a Group-Based Approach

Yu Hui Jiang Xin-Yu Zhao Jin Qi Hao Zhang Yu Liao Xiao-Fei Liu Hai-Kun Mao Fu-Bing Jin Hai 《计算机科学技术学报》2022,37(4):797-813

Many systems have been built to employ the delta-based iterative execution model to support iterative algorithms on distributed platforms by exploiting the sparse computational dependencies between the data items of these algorithms, in either a synchronous or an asynchronous manner. However, for large-scale iterative algorithms, existing synchronous solutions suffer from slow convergence and load imbalance because of the strict barrier between iterations, while existing asynchronous approaches incur excessive redundant communication and computation as a result of being barrier-free. In view of the performance trade-off between these two approaches, this paper designs an efficient execution manager, called Aiter-R, which can be integrated into existing delta-based iterative processing systems to efficiently support the execution of delta-based iterative algorithms by using the proposed group-based iterative execution approach. It can efficiently and correctly explore the middle ground between the two extremes. A heuristic scheduling algorithm is further proposed to allow an iterative algorithm to adaptively choose its trade-off point so as to achieve maximum efficiency. Experimental results show that Aiter-R strikes a good balance between the synchronous and asynchronous policies and outperforms state-of-the-art solutions. It reduces the execution time by up to 54.1% and 84.6% in comparison with existing asynchronous and synchronous models, respectively.
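
To make the term "delta-based iterative execution" concrete, here is a minimal sketch of an accumulative PageRank in which each vertex stores a value plus a pending delta and forwards only that delta to its out-neighbours. It illustrates the general model only and is not the Aiter-R manager or its group-based scheduling; the function name, the adjacency-list input, and the unnormalized PageRank formulation are assumptions made for the example.

```python
def delta_pagerank(out_edges, damping=0.85, tol=1e-12):
    """Accumulative PageRank: value[v] converges to the solution of
    value[v] = (1 - damping) + damping * sum(value[u] / outdeg(u) for u -> v).

    out_edges[u] is the list of vertices that u links to (assumed input format).
    """
    n = len(out_edges)
    value = [0.0] * n
    delta = [1.0 - damping] * n          # pending change not yet applied
    while sum(delta) > tol:              # stop once little pending change remains
        for u in range(n):
            d_u = delta[u]
            if d_u == 0.0:
                continue                 # nothing to propagate from this vertex
            value[u] += d_u              # fold the pending delta into the value
            delta[u] = 0.0
            if out_edges[u]:
                share = damping * d_u / len(out_edges[u])
                for v in out_edges[u]:
                    delta[v] += share    # neighbours receive only the change
    return value


# Small illustrative graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0, 3 -> 2.
graph = [[1, 2], [2], [0], [2]]
ranks = delta_pagerank(graph)
```

Because only vertices with a non-zero pending delta do any work, the cost of a sweep follows the sparse dependencies between data items. The loop above applies deltas in place with no barrier. Inserting a barrier after every sweep would give the synchronous variant, and the abstract describes a group-based scheme between these two extremes.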

3.

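Research and Implementation of a Matrix Transposition Method for Real-Time SAR Imaging Systems (total citations: 1; self-citations: 0; citations by others: 1)

边明明 毕福昆 汪精华 《计算机工程与应用》2011,47(22):117-119

Synthetic aperture radar (SAR) is a high-resolution imaging radar, and matrix transposition is a key operation in real-time SAR imaging signal processing: its efficiency directly determines the performance of the whole SAR imaging signal processing system. Matrix transposition can be performed with methods such as row-in/column-out or column-in/row-out access and two-page or three-page transposition, but these methods have long processing times and low transposition efficiency. Building on existing matrix transposition methods, this paper proposes a new one. Real-time SAR imaging was carried out on an actual hardware platform using the proposed method; the measured matrix transposition efficiency was 78%, and the total SAR imaging processing time was 10 seconds. The test results show that the method is effective for the matrix transposition problem.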

4.

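Parallel Matrix Multiplication Based on an MPI+CUDA Asynchronous Model

刘青昆 马名威 阎慰椿 《计算机应用》2011,31(12):3327-3330

Matrix multiplication plays an important role in scientific computing, and different architectural models can improve the performance of parallel matrix multiplication. In the existing MPI+CUDA synchronous model, the host must wait until the device finishes its task before it can continue working, which wastes time. To address this problem, a parallel matrix multiplication method based on an MPI+CUDA asynchronous model is proposed. The model keeps the host from entering a waiting state and uses CUDA streams to handle data sets that exceed GPU memory. The speedup and efficiency of the asynchronous model are analyzed, and the experimental results show that the method significantly improves parallel efficiency and the speed of large matrix multiplication, fully exploits the advantages of distributed memory across nodes and shared memory within a node, and is an effective and feasible parallel strategy.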

5.

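Research on a Matrix Multiplication Acceleration Algorithm Based on MPSoC Parallel Scheduling

杨飞 马昱春 侯金 徐宁 《计算机科学》2017,44(8):36-41

Matrix multiplication is the foundation of numerical analysis and of graphics and image processing algorithms, and the design of general-purpose matrix multiplication accelerators has long been a research focus in embedded system design. However, because of its high computational complexity and low processing efficiency, matrix multiplication often becomes the bottleneck for the computing speed of embedded systems. To make better use of matrix multiplication in the embedded domain, a hardware/software co-designed acceleration architecture based on MPSoC (MultiProcessor System-on-Chip) is proposed. Within this architecture, on the one hand, a matrix blocking method oriented to hardware constraints is designed, yielding a general-purpose matrix multiplication accelerator system; on the other hand, by exploiting the multi-core architecture of the MPSoC, corresponding task partitioning and load-balancing scheduling algorithms are proposed, improving parallel efficiency and overall system speedup. Experimental results show that the proposed architecture and algorithms achieve general-purpose matrix multiplication, and that the multi-core parallel scheduling algorithm realized through hardware/software co-design significantly improves computational efficiency compared with a traditional single-core design.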

6.

Parametric Solutions to the Generalized Discrete Yakubovich‐Transpose Matrix Equation

Caiqin Song Jun‐e Feng Xiaodong Wang Jianli Zhao 《Asian journal of control》2014,16(4):1133-1140

This paper is concerned with the complete parametric solutions to the generalized discrete Yakubovich‐transpose matrix equation X − A X^T B = CY, which is related to several types of matrix equations in control theory. One of the parametric solutions has a neat and elegant form in terms of a Krylov matrix, a block Hankel matrix and an observability matrix. In addition, a special case of the generalized discrete Yakubovich‐transpose matrix equation, called the Karm‐Yakubovich‐transpose matrix equation, is considered. Explicit solutions to the Karm‐Yakubovich‐transpose matrix equation are also obtained with the so‐called generalized Leverrier algorithm. At the end of the paper, two examples are given to show the efficiency of the proposed algorithm.

7.

Introducing and Implementing the Allpairs Skeleton for Programming Multi-GPU Systems

Michel Steuwer Malte Friese Sebastian Albers Sergei Gorlatch 《International journal of parallel programming》2014,42(4):601-618

Algorithmic skeletons simplify software development: they abstract typical patterns of parallelism and provide efficient implementations of them, allowing the application developer to focus on the structure of algorithms rather than on implementation details. This becomes especially important for modern parallel systems with multiple graphics processing units (GPUs), whose programming is complex and error-prone because state-of-the-art programming approaches like CUDA and OpenCL lack high-level abstractions. We define a new algorithmic skeleton for allpairs computations, which occur in real-world applications ranging from bioinformatics to physics. We develop the skeleton's generic parallel implementation for multi-GPU systems in OpenCL. To enable the automatic use of the fast GPU memory, we identify and implement an optimized version of the allpairs skeleton with a customizing function that follows a certain memory access pattern. We use matrix multiplication as an application study for the allpairs skeleton and its two implementations and demonstrate that the skeleton greatly simplifies programming, saving up to 90% of lines of code as compared to OpenCL. The performance of our optimized implementation is up to 6.8 times higher than that of the generic implementation and is competitive with the performance of manually written, optimized OpenCL code.

8.

The Square of the Minor of the Constraint Matrix of the Axial Transport Problem: Its Mean

E. B. Titova V. N. Shevchenko 《Automation and Remote Control》2004,65(2):258-262

For the matrix TT^T, where T is the constraint matrix of the axial transport problem and T^T is its transpose, the spectrum, the characteristic polynomial, a basis of eigenvectors, and the asymptotic behavior of the mean of the square of the minor of the matrix T are determined.

9.

A Pipelining Loop Optimization Method for Dataflow Architecture

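Xu Tan Xiao-Chun Ye Xiao-Wei Shen Yuan-Chao Xu Da Wang Lunkai Zhang Wen-Ming Li Dong-Rui Fan Zhi-Min Tang 《计算机科学技术学报》2018,33(1):116-130

With the coming of the exascale supercomputing era, power efficiency has become the most important obstacle to building an exascale system. Dataflow architecture has a native advantage in achieving high power efficiency for scientific applications. However, state-of-the-art dataflow architectures fail to exploit high parallelism for loop processing. To address this issue, we propose a pipelining loop optimization method (PLO), which makes the iterations of a loop flow through the processing element (PE) array of a dataflow accelerator. The method consists of two techniques: architecture-assisted hardware iteration and instruction-assisted software iteration. In the hardware iteration execution model, an on-chip loop controller is designed to generate loop indexes, reducing the complexity of the computing kernel and laying a good foundation for pipelined execution. In the software iteration execution model, additional loop instructions are introduced to solve the iteration dependency problem. With these two techniques, the average number of instructions ready to execute per cycle is increased, keeping the floating-point units busy. Simulation results show that the proposed method outperforms the static and dynamic loop execution models in floating-point efficiency by 2.45x and 1.1x on average, respectively, while the hardware cost of the two techniques is acceptable.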

10.

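Efficient Solution of Large-Scale Triangular Linear Equations

贾迅 邬贵明 钱磊 谢向辉 吴东 《计算机工程与科学》2019,41(2):240-245

Solving large-scale triangular linear equations is an important computational kernel in scientific and engineering applications, but, limited by processor cache capacity and architectural design, its computational efficiency on platforms such as CPUs and GPUs is low. In the blocked solution of large-scale triangular linear equations, matrix multiplication is the dominant operation, and its efficiency is critical to the overall efficiency of the solve. Using a matrix multiplication coprocessor, which achieves high matrix multiplication efficiency, as the computing platform, this paper proposes an implementation method and a performance analysis model for the blocked solution of large-scale triangular linear equations on the coprocessor, tailored to its architecture. Experimental results show that the computational efficiency of solving large-scale triangular linear equations on the matrix multiplication coprocessor reaches up to 85.9%, and that its actual performance and resource utilization are 2.42 times and 10.72 times those of a GPU in the same process technology, respectively.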

11.

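3D-MMA: A Matrix Multiplication Acceleration Architecture Based on 3D Integrated Circuits

王吉军 郝子宇 李宏亮 《计算机工程与科学》2019,41(12):2110-2118

Systolic arrays have a regular structure and high throughput, suit matrix multiplication algorithms, and are widely used in the design of high-performance convolution and matrix multiplication accelerators. In deep-submicron processes, raising chip computing performance by enlarging the array leads to problems such as lower frequency and sharply higher power consumption. Therefore, drawing on 3D integrated circuit technology, this paper proposes 3D-MMA, a double-precision floating-point matrix multiplication acceleration architecture that maps a planar systolic array onto a 3D integrated circuit. First, a blocked mapping and scheduling algorithm is designed for the architecture to improve matrix multiplication efficiency. Second, an acceleration system based on 3D-MMA is proposed, a performance model of 3D-MMA is built, and its design space is explored. Finally, the implementation cost of the architecture is evaluated and compared with existing advanced accelerators. Experimental results show that with a memory bandwidth of 160 GB/s and a stacked structure of four layers of 16×16 systolic arrays, 3D-MMA reaches a peak performance of 3 TFLOPS with 99% efficiency, at a lower implementation cost than a 2D implementation. In the same process technology, 3D-MMA delivers 1.36 times the performance of a linear-array accelerator and 1.92 times that of a K40 GPU, with a much smaller area. The work explores the advantages of 3D integrated circuits in the design of high-performance matrix multiplication accelerators and offers a reference for further improving the performance of high-performance computing platforms.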

12.

Faster remainder by direct computation: Applications to compilers and software libraries

Daniel Lemire Owen Kaser Nathan Kurz 《Software》2019,49(6):953-970

On common processors, integer multiplication is many times faster than integer division. Dividing a numerator n by a divisor d is mathematically equivalent to multiplication by the inverse of the divisor (n/d = n × 1/d). If the divisor is known in advance, or if repeated integer divisions will be performed with the same divisor, it can be beneficial to substitute a less costly multiplication for an expensive division. Currently, the remainder of the division by a constant is computed from the quotient by a multiplication and a subtraction. However, if just the remainder is desired and the quotient is unneeded, this may be suboptimal. We present a generally applicable algorithm to compute the remainder more directly. Specifically, we use the fractional portion of the product of the numerator and the inverse of the divisor. On this basis, we also present a new and simpler divisibility algorithm to detect nonzero remainders. We also derive new tight bounds on the precision required when representing the inverse of the divisor. Furthermore, we present simple C implementations that beat the optimized code produced by state-of-the-art C compilers on recent x64 processors (e.g., Intel Skylake and AMD Ryzen), sometimes by more than 25%. On all tested platforms, including 64-bit ARM and POWER8, our divisibility test functions are faster than state-of-the-art Granlund-Montgomery divisibility test functions, sometimes by more than 50%.
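
The idea of reading the remainder off the fractional part of n × (1/d) can be sketched with fixed-width arithmetic. The Python below emulates 64-bit wraparound with a mask; it is a minimal sketch assuming 32-bit unsigned numerators and divisors and a 64-bit fixed-point inverse, and the function names are hypothetical rather than the paper's C code.

```python
MASK64 = (1 << 64) - 1  # emulates unsigned 64-bit wraparound


def compute_magic(d):
    """Precompute M = ceil(2**64 / d), truncated to 64 bits, for 1 <= d < 2**32."""
    assert 1 <= d < (1 << 32)
    return (MASK64 // d + 1) & MASK64


def fastmod_u32(n, magic, d):
    """Remainder of n / d for 0 <= n < 2**32 without a division instruction."""
    lowbits = (magic * n) & MASK64   # fractional part of n/d, scaled by 2**64
    return (lowbits * d) >> 64       # scale the fraction back up by d


def is_divisible_u32(n, magic):
    """True when d divides n: the scaled fractional part is then below magic."""
    return ((magic * n) & MASK64) <= ((magic - 1) & MASK64)


# The magic constant is computed once per divisor and reused for every numerator.
magic_7 = compute_magic(7)
remainders = [fastmod_u32(n, magic_7, 7) for n in range(20)]
```

The precomputation is the only step that divides. Every later remainder or divisibility check costs one or two multiplications, which is why the technique pays off when the same divisor is reused many times.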

13.

A new PSO-based approach to fire flame detection using K-Medoids clustering

《Expert systems with applications》2017

Automated computer vision-based fire detection has gained popularity in recent years, as fire detection needs to be fast and accurate. In this paper, a new fire detection method using image processing techniques is proposed. We explore how to create a fire flame-based colour space via a linear multiplication of a conversion matrix and the colour features of a sample image. We show how the matrix multiplication can result in a differentiating colour space, in which the fire part is highlighted and the non-fire part is dimmed. Particle Swarm Optimization (PSO) and sample pixels from an image are used to obtain the weights of the colour-differentiating conversion matrix, and K-medoids provides a fitness metric for the PSO procedure. The obtained conversion matrix can be used for fire detection on different fire images without performing the PSO procedure again. This allows a fast and easily implementable fire detection system. The empirical results indicate that the proposed method provides both qualitatively and quantitatively better results when compared to some of the conventional and state-of-the-art algorithms.

14.

MapReduce based parallel fuzzy-rough attribute reduction using discernibility matrix

Sowkuntla Pandu Prasad P. S. V. S. Sai 《Applied Intelligence》2022,52(1):154-173

Fuzzy-rough set theory is an efficient method for attribute reduction. It can effectively handle the imprecision and uncertainty of the data in attribute reduction. Despite its efficacy, current approaches to fuzzy-rough attribute reduction are not efficient for processing large data sets because of their high space complexity. A limited number of accelerators and parallel/distributed approaches have been proposed for fuzzy-rough attribute reduction in large data sets. However, all of these approaches are dependency measure based methods in which fuzzy similarity matrices are used for performing attribute reduction. Alternative discernibility matrix based attribute reduction methods are found to have lower space requirements and to be more amenable to parallelization when building parallel/distributed algorithms. This paper therefore introduces a fuzzy discernibility matrix-based attribute reduction accelerator (DARA) to accelerate attribute reduction. DARA is used to build a sequential approach and the corresponding parallel/distributed approach for attribute reduction in large data sets. The proposed approaches are compared to the existing state-of-the-art approaches with a systematic experimental analysis to assess computational efficiency. The experimental study, along with theoretical validation, shows that the proposed approaches are effective and perform better than the current approaches.

15.

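Distributed Heterogeneous Parallel Optimization of Boolean Matrix Multiplication

朱敏 唐波 赵娟 邹丹 李金才 《计算机工程与科学》2017,39(4):634-640

Solving Boolean polynomial systems is a key step in modern algebraic cryptanalysis, and the F4 algorithm is an efficient algorithm for this task. This paper analyzes the Gaussian elimination algorithm that Lachartre designed specifically for F4 matrices and, targeting its time-consuming Boolean matrix multiplication step, designs and implements a distributed heterogeneous (CPU+MIC) parallel algorithm. Boolean matrices differ from ordinary matrices mainly in the value range of their elements; because the elements take only the values 0 and 1, matrix multiplication has special characteristics, and optimization methods for ordinary matrix multiplication do not serve Boolean matrix multiplication well. Performance is optimized in terms of Boolean matrix storage, OpenMP multithreading organization, memory access, and task partitioning and scheduling, yielding a distributed heterogeneous parallel algorithm for Boolean matrix multiplication. In tests on randomly generated Boolean matrices, the optimized distributed heterogeneous parallel program achieves a speedup of 2.45 over the distributed homogeneous parallel program, a good performance improvement.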

16.

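Parallel Computation of Matrix Multiplication and Its DSP Implementation

雷晶 金心宇 王锐 《传感技术学报》2006,19(3):737-740

The speed of matrix multiplication is of great importance in array signal processing, and parallel processing is the most effective way to increase a system's computing capability. Based on the characteristics of matrix multiplication, this paper proposes a parallel algorithm for it. A systolic-array method for matrix multiplication is also derived through analysis, and its mapping onto the hypercube and its planar array is obtained, which speeds up the matrix computation. Finally, a system design for implementing the systolic array on a DSP is given.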

17.

A comparative review of dynamic neural networks and hidden Markov model methods for mobile on-device speech recognition

Mustafa Mohammed Kyari Allen Tony Appiah Kofi 《Neural computing & applications》2017,31(2):891-899

The adoption of high-accuracy speech recognition algorithms without an effective evaluation of their impact on the target computational resource is impractical for mobile and embedded systems. In this paper, techniques are adopted to minimise the required computational resource for an effective mobile-based speech recognition system. A Dynamic Multi-Layer Perceptron speech recognition technique, capable of running in real time on a state-of-the-art mobile device, has been introduced. Even though a conventional hidden Markov model when applied to the same dataset slightly outperformed our approach, its processing time is much higher. The Dynamic Multi-layer Perceptron presented here has an accuracy level of 96.94% and runs significantly faster than similar techniques.

18.

Uniform approach for solving some classical problems on a linear array

O'Hallaron D.R. 《Parallel and Distributed Systems, IEEE Transactions on》1991,2(2):236-241

It is shown that a number of classical problems from linear algebra and graph theory, including instances of the algebraic path problem, matrix multiplication, matrix triangularization, and matrix transpose, can be solved using the same basic recurrence. A simple mapping of the recurrence onto a unidirectional linear array is discussed. Qualitative advantages to programming linear arrays using this approach include uniformity of design, simplicity of programming, and scalability to larger problems. The major disadvantage is that the resulting algorithms are not necessarily optimal.
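
How one recurrence can serve several problems is easiest to see for the algebraic path problem, where the same triple loop is instantiated with different "add" and "multiply" operations. The sketch below shows the textbook Floyd–Warshall-style recurrence run sequentially. It is an assumption-laden illustration, not necessarily the recurrence or the linear-array mapping used in the paper, and the function name is hypothetical.

```python
def algebraic_path(matrix, add, mul):
    """Apply d[i][j] = add(d[i][j], mul(d[i][k], d[k][j])) for every pivot k.

    Assumes an idempotent 'add' (such as min or logical or), for which the
    in-place update is safe.
    """
    n = len(matrix)
    d = [row[:] for row in matrix]       # work on a copy of the input
    for k in range(n):
        for i in range(n):
            for j in range(n):
                d[i][j] = add(d[i][j], mul(d[i][k], d[k][j]))
    return d


INF = float("inf")

# Instance 1: all-pairs shortest paths with (min, +).
weights = [[0, 3, INF],
           [INF, 0, 1],
           [2, INF, 0]]
shortest = algebraic_path(weights, min, lambda x, y: x + y)

# Instance 2: transitive closure with (or, and).
edges = [[False, True, False],
         [False, False, True],
         [False, False, False]]
reachable = algebraic_path(edges, lambda x, y: x or y, lambda x, y: x and y)
```

Only the two operations change between the instances, while the loop structure stays fixed. That fixed structure is what makes a single hardware mapping reusable across problems.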

19.

A Fast Algorithm for Matrix Multiplication and Its Efficient Realization on Systolic Arrays

L. D. Elfimova Yu. V. Kapitonova 《Cybernetics and Systems Analysis》2001,37(1):109-121

A new fast matrix multiplication algorithm is proposed, which, as compared to the Winograd algorithm, has a lower multiplicative complexity equal to W_M ≈ 0.437n³ multiplication operations. Based on a goal-directed transformation of its basic graph, new optimized architectures of systolic arrays are synthesized. A systolic variant of the Strassen algorithm is presented for the first time.

20.

Sparse sample self-representation for subspace clustering

Deng Zhenyun Zhang Shichao Yang Lifeng Zong Ming Cheng Debo 《Neural computing & applications》2018,29(1):43-49

This paper proposes a new subspace clustering method based on sparse sample self-representation (SSR). The proposed method uses SSR to address the problem that the affinity matrix does not strictly follow the structure of the subspaces, and also uses a sparsity constraint to ensure robustness to noise and outliers in subspace clustering. Specifically, we first construct a self-representation matrix for all samples and combine an ℓ₁-norm regularizer with an ℓ₂,₁-norm regularizer to guarantee that each sample can be represented as a sparse linear combination of its related samples. Then, we use the resulting matrix to build an affinity matrix. Finally, we apply spectral clustering to the affinity matrix to obtain the clusters. To validate the effectiveness of the proposed method, we conducted experiments on UCI datasets, and the experimental results showed that the proposed method reduced the minimal clustering error, outperforming the state-of-the-art methods.
