Similar Literature
20 similar documents found.
1.
The complexity of matrix multiplication has attracted a lot of attention over the last forty years. In this paper, instead of considering asymptotic aspects of this problem, we are interested in reducing the cost of multiplication for matrices of small size, say up to 30. Following previous work in a similar vein by Probert & Fischer, Smith, and Mezzarobba, we base our approach on the classical algorithms for small matrices due to Strassen, Winograd, Pan, Laderman, and others, and show how to exploit these standard algorithms in an improved way. We illustrate the use of our results by generating multiplication code over various rings, such as integers, polynomials, differential operators, and linear recurrence operators.
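As a point of reference, here is a minimal sketch of Strassen's recursive scheme, one of the classical small-matrix algorithms cited above (a NumPy sketch assuming square inputs whose dimension halves evenly at each level; the cutoff value is an illustrative choice, not taken from the paper):

```python
import numpy as np

def strassen(A, B, cutoff=16):
    """Strassen's 2x2 recursion: seven recursive products instead of
    eight; falls back to the naive product below `cutoff`."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```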

2.
As data volumes increase at high speed in more and more application fields of science, engineering, information services, etc., the challenges posed by data-intensive computing gain increasing importance. The emergence of highly scalable infrastructures, e.g. for cloud computing and for petascale computing and beyond, introduces additional issues for which scalable data management becomes an immediate need. This paper makes several contributions. First, it proposes a set of principles for designing highly scalable distributed storage systems that are optimized for heavy data access concurrency. In particular, we highlight the potentially large benefits of using versioning in this context. Second, based on these principles, we propose a set of versioning algorithms, both for data and metadata, that enable high throughput under concurrency. Finally, we implement and evaluate these algorithms in the BlobSeer prototype, which we integrate as a storage backend in the Hadoop MapReduce framework. We perform extensive microbenchmarks as well as experiments with real MapReduce applications; they demonstrate that applying the principles defended in our approach brings substantial benefits to data-intensive applications.
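To illustrate why versioning helps under heavy access concurrency, here is a deliberately simplified toy (a hypothetical API, not BlobSeer's actual data or metadata algorithms): each write publishes a new immutable snapshot, so readers never block and always see a consistent version:

```python
import threading

class VersionedStore:
    """Toy versioned key-value store: writes publish immutable
    snapshots; reads are lock-free on any published version."""
    def __init__(self):
        self._versions = [{}]          # version 0: empty snapshot
        self._lock = threading.Lock()  # serializes publication only

    def write(self, key, value):
        with self._lock:
            snap = dict(self._versions[-1])  # copy-on-write snapshot
            snap[key] = value
            self._versions.append(snap)      # atomic publish
            return len(self._versions) - 1   # new version number

    def read(self, key, version=-1):
        return self._versions[version].get(key)  # no lock needed
```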

3.
In this paper, we describe scalable parallel algorithms for symmetric sparse matrix factorization, analyze their performance and scalability, and present experimental results for up to 1,024 processors on a Cray T3D parallel computer. Through our analysis and experimental results, we demonstrate that our algorithms substantially improve the state of the art in the parallel direct solution of sparse linear systems, both in terms of scalability and overall performance. It is well known that dense matrix factorization scales well and can be implemented efficiently on parallel computers. In this paper, we present the first algorithms to factor a wide class of sparse matrices (including those arising from two- and three-dimensional finite element problems) that are asymptotically as scalable as dense matrix factorization algorithms on a variety of parallel architectures. Our algorithms incur less communication overhead and are more scalable than any previously known parallel formulation of sparse matrix factorization. Although in this paper we discuss Cholesky factorization of symmetric positive definite matrices, the algorithms can be adapted for solving sparse linear least squares problems and for Gaussian elimination of diagonally dominant matrices that are almost symmetric in structure. An implementation of one of our sparse Cholesky factorization algorithms delivers up to 20 GFlops on a Cray T3D for medium-size structural engineering and linear programming problems. To the best of our knowledge, this is the highest performance ever obtained for sparse Cholesky factorization on any supercomputer.
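For reference, the computation being parallelized is the factorization A = L·Lᵀ; a minimal serial dense version is sketched below (the paper's contribution, the scalable parallel formulation for sparse matrices, is far more involved and is not reproduced here):

```python
import numpy as np

def cholesky(A):
    """Serial dense Cholesky, A = L @ L.T, for a symmetric positive
    definite matrix (reference version only)."""
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for j in range(n):
        # diagonal entry: subtract contributions of earlier columns
        L[j, j] = np.sqrt(A[j, j] - L[j, :j] @ L[j, :j])
        # column below the diagonal
        L[j+1:, j] = (A[j+1:, j] - L[j+1:, :j] @ L[j, :j]) / L[j, j]
    return L
```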

4.
Existing matrix multiplication algorithms cannot handle large-scale and ultra-large-scale matrices. With the advent of the MapReduce programming framework, parallel processing has become the main means of multiplying large matrices. This paper surveys parallel implementations of matrix multiplication on the MapReduce programming model and proposes a strategy for high-performance large matrix multiplication: trading off the computational load of each worker node against the amount of data that must be transferred over the network. Experiments show that the parallel algorithms clearly outperform traditional single-machine algorithms on large matrices, and that performance improves further as the number of nodes in the cluster grows.
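One common MapReduce formulation of C = A·B is the join-style two-phase scheme sketched below (a plain-Python simulation of the map, shuffle, and reduce phases; the paper's blocking strategy for balancing per-node computation against network traffic is not reproduced):

```python
from collections import defaultdict

def map_phase(records, m, n):
    """records: ('A', i, k, v) or ('B', k, j, v) for C = A @ B with
    A of shape m x p and B of shape p x n."""
    for tag, r, c, v in records:
        if tag == 'A':                 # A[i,k] is needed by every column j
            for j in range(n):
                yield (r, j), ('A', c, v)
        else:                          # B[k,j] is needed by every row i
            for i in range(m):
                yield (i, c), ('B', r, v)

def reduce_phase(key, values):
    """Join A and B entries sharing the inner index k, then sum."""
    a, b = {}, {}
    for tag, k, v in values:
        (a if tag == 'A' else b)[k] = v
    return key, sum(a[k] * b[k] for k in a if k in b)

A = [('A', 0, 0, 1.0), ('A', 0, 1, 2.0), ('A', 1, 0, 3.0), ('A', 1, 1, 4.0)]
B = [('B', 0, 0, 5.0), ('B', 0, 1, 6.0), ('B', 1, 0, 7.0), ('B', 1, 1, 8.0)]
groups = defaultdict(list)             # the shuffle, simulated
for key, val in map_phase(A + B, 2, 2):
    groups[key].append(val)
C = dict(reduce_phase(k, vs) for k, vs in groups.items())
print(C)  # {(0, 0): 19.0, (0, 1): 22.0, (1, 0): 43.0, (1, 1): 50.0}
```

Note the trade-off the strategy above targets: this element-wise scheme replicates each input entry m or n times across the network, which blocked variants reduce at the cost of more computation per node.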

5.
Matrix multiplication is a core computation in many applications in which only a small number of matrix elements change between runs; recomputing everything from scratch is then very costly, so incremental computation is an effective remedy. This paper proposes an incremental matrix multiplication method based on the MapReduce model, together with an efficient technique for identifying the changed elements: digests of the matrix elements are used to detect the changes quickly, and the multiplication is then transformed into a series of equivalent join problems, yielding an effective incremental computation. For matrices with a low element change rate, experiments show that the proposed method is clearly faster than full recomputation.
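The algebra behind the incremental approach can be sketched directly (only the standard linear-algebra identity below is shown; the paper's digest-based change detection and join-based MapReduce realization are omitted): if A changes to A + ΔA, then (A + ΔA)·B = A·B + ΔA·B, so only the changed rows require new work:

```python
import numpy as np

def incremental_update(C, deltaA, B):
    """Update C = A @ B after a sparse change A -> A + deltaA:
    C' = C + deltaA @ B, touching only rows where deltaA is nonzero."""
    changed_rows = np.flatnonzero(np.any(deltaA != 0, axis=1))
    C_new = C.copy()
    C_new[changed_rows] += deltaA[changed_rows] @ B
    return C_new
```

When the change rate is low, `changed_rows` is short and the update costs a small fraction of a full recomputation, which is exactly the regime the experiments above target.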

6.
Web-scale digital assets comprise millions or billions of documents. With such growth, sequential algorithms cannot cope with the data, and parallel and distributed computing becomes the solution of choice. MapReduce is a programming model proposed by Google for scalable data processing, and it is mainly applicable to data-intensive algorithms. In contrast, the message passing interface (MPI) is suitable for high-performance algorithms. This paper proposes an adapted structure of the MapReduce programming model using MPI for multimedia indexing. Experiments on various multimedia applications validate our model and indicate that it achieves good speedup compared to the original sequential versions, Hadoop, and earlier versions of MapReduce using MPI.
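A minimal sketch of the general pattern (map locally on each rank, then merge the partial results with an MPI collective), using mpi4py and word counting as a stand-in for the multimedia indexing workload; the paper's adapted MapReduce structure is more elaborate than this:

```python
from mpi4py import MPI

def mapreduce_wordcount(local_docs):
    """Each rank maps over its own documents, then all partial counts
    are gathered and merged -- the reduce step -- on every rank."""
    comm = MPI.COMM_WORLD
    # map: local word counts on this rank
    partial = {}
    for doc in local_docs:
        for word in doc.split():
            partial[word] = partial.get(word, 0) + 1
    # shuffle/reduce: gather every rank's partial result and merge
    merged = {}
    for part in comm.allgather(partial):
        for word, count in part.items():
            merged[word] = merged.get(word, 0) + count
    return merged
```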

7.
Matrix factorization has been widely utilized as a latent factor model for solving the recommender system problem using collaborative filtering. For a recommender system, all the ratings in the rating matrix are bounded within a pre-determined range. In this paper, we propose a new improved matrix factorization approach for such a rating matrix, called Bounded Matrix Factorization (BMF), which imposes a lower and an upper bound on every estimated missing element of the rating matrix. We present an efficient algorithm to solve BMF based on the block coordinate descent method. We show that our algorithm is scalable for large matrices with missing elements on multicore systems with low memory. We present substantial experimental results illustrating that the proposed method outperforms state-of-the-art algorithms for recommender systems, such as stochastic gradient descent, alternating least squares with regularization, SVD++, and Bias-SVD, on real-world datasets such as Jester, MovieLens, Book-Crossing, Online Dating, and Netflix.
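To make the bounding idea concrete, here is a toy projected-gradient sketch that keeps every estimate inside [lo, hi] by clipping (illustrative only: BMF's actual solver is the block coordinate descent method described above, and the hyperparameters here are arbitrary):

```python
import numpy as np

def bounded_mf(R, mask, k=10, lo=1.0, hi=5.0, lr=0.01, reg=0.1, iters=200):
    """Toy bounded matrix factorization: fit R ~ clip(W @ H, lo, hi)
    on the observed entries (mask == 1) with gradient steps."""
    m, n = R.shape
    rng = np.random.default_rng(0)
    W, H = rng.random((m, k)), rng.random((k, n))
    for _ in range(iters):
        P = np.clip(W @ H, lo, hi)      # bounded estimate
        E = mask * (R - P)              # error on observed entries only
        W += lr * (E @ H.T - reg * W)
        H += lr * (W.T @ E - reg * H)
    return np.clip(W @ H, lo, hi)       # every prediction stays in range
```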

8.
Exascale computers are expected to have highly hierarchical architectures with nodes composed of multi-core processors (CPUs) and accelerators (GPUs). The different programming levels generate difficult new algorithmic issues. In particular, when solving extremely large linear systems, new programming paradigms for Krylov methods should be defined and evaluated against the modern state of the art of scientific methods. Iterative Krylov methods involve linear algebra operations such as dot products, norms, vector additions, and sparse matrix-vector multiplication. These operations are computationally expensive for large matrices. In this paper, we focus on how best to perform these operations effectively, in double precision, on GPUs, in order to make iterative Krylov methods more robust and thereby reduce the computing time. The performance of our algorithms is evaluated on several matrices arising from engineering problems, and numerical experiments illustrate the robustness and accuracy of our implementation compared to existing libraries. We deal with different preconditioned Krylov methods: Conjugate Gradient for symmetric positive-definite matrices, and Generalized Conjugate Residual, Bi-Conjugate Gradient Conjugate Residual, transpose-free Quasi-Minimal Residual, Stabilized BiConjugate Gradient, and Stabilized BiConjugate Gradient (L) for the solution of sparse linear systems with nonsymmetric matrices. We consider and compare several sparse compressed formats, and propose a way to implement Krylov methods effectively on GPUs and on multicore CPUs. Finally, we give strategies for accelerating the algorithms by auto-tuning the threading design according to the problem characteristics and the hardware. In conclusion, we propose and analyse hybrid sub-structuring methods that should pave the way to exascale hybrid methods.
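For reference, below is the plain (unpreconditioned) Conjugate Gradient iteration, whose per-iteration kernels are exactly the dot products and sparse matrix-vector products discussed above; this NumPy/SciPy sketch omits preconditioning, GPU kernels, and the format comparisons:

```python
import numpy as np
import scipy.sparse as sp

def conjugate_gradient(A, b, tol=1e-10, maxiter=1000):
    """Plain CG for symmetric positive definite A: one sparse mat-vec
    and a few dot products per iteration."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(maxiter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# small SPD test case: a 1D Poisson matrix
n = 100
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')
b = np.ones(n)
x = conjugate_gradient(A, b)
print(np.linalg.norm(A @ x - b))  # residual norm, below the 1e-10 tolerance
```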

9.
Large-scale floating-point matrix multiplication is often one of the most time-consuming computational kernels in many application domains. Emerging applications frequently involve large matrices in which at least one dimension is very small; we call matrices with this property non-uniform matrices. Because the on-chip memory available on an FPGA for storing intermediate results is very limited, large matrix multiplications usually have to be partitioned into fine-grained sub-block tasks. When accelerating non-uniform matrix multiplication, most existing linear-array hardware matrix multipliers suffer a large performance loss because they support only a fixed block size. To address this problem, we propose an effective optimized blocking strategy and, based on it, implement a matrix multiplier supporting variable block sizes on a Xilinx Zynq XC7Z045 FPGA. With 224 integrated processing elements running at 150 MHz, the multiplier achieves a measured 48 GFLOPS on non-uniform matrix multiplications from real applications while requiring only 4.8 GB/s of bandwidth. Experimental results show that the proposed blocking strategy improves performance by up to 12% over traditional blocking algorithms.
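In software terms, variable blocking corresponds to choosing the tile size independently in each dimension, so that thin, non-uniform matrices still fill the compute units; a NumPy sketch of the loop structure (the FPGA datapath and the paper's tile-selection heuristic are not modeled):

```python
import numpy as np

def blocked_matmul(A, B, tile_m, tile_n, tile_k):
    """Blocked matrix multiply with independently chosen tile sizes;
    each (i0, j0, k0) triple is one fine-grained sub-block task."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n))
    for i0 in range(0, m, tile_m):
        for j0 in range(0, n, tile_n):
            for k0 in range(0, k, tile_k):
                C[i0:i0+tile_m, j0:j0+tile_n] += (
                    A[i0:i0+tile_m, k0:k0+tile_k] @
                    B[k0:k0+tile_k, j0:j0+tile_n])
    return C
```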

10.
Clustering is a useful data mining technique which groups data points such that the points within a single group have similar characteristics, while the points in different groups are dissimilar. Density-based clustering algorithms such as DBSCAN and OPTICS are one kind of widely used clustering algorithm. As there is an increasing trend of applications dealing with vast amounts of data, clustering such big data is a challenging problem. Recently, parallelizing clustering algorithms on large clusters of commodity machines using the MapReduce framework has received a lot of attention. In this paper, we first propose a new density-based clustering algorithm, called DBCURE, which is robust in finding clusters with varying densities and suitable for parallelization with MapReduce. We next develop DBCURE-MR, a parallelized version of DBCURE using MapReduce. While traditional density-based algorithms find each cluster one by one, our DBCURE-MR finds several clusters together in parallel. We prove that both DBCURE and DBCURE-MR find the clusters correctly based on the definition of density-based clusters. Our experimental results with various data sets confirm that DBCURE-MR finds clusters efficiently without being sensitive to clusters with varying densities, and scales up well with the MapReduce framework.
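For orientation, here is a minimal serial DBSCAN, the baseline family DBCURE belongs to (DBCURE itself differs, targeting clusters of varying densities and MapReduce parallelism; `X` is assumed to be a NumPy array of points):

```python
import numpy as np

def region_query(X, p, eps):
    """Indices of all points within distance eps of X[p]."""
    return np.flatnonzero(np.linalg.norm(X - X[p], axis=1) <= eps)

def dbscan(X, eps, min_pts):
    """Minimal serial DBSCAN; labels[i] == -1 marks noise."""
    n = len(X)
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        neighbors = region_query(X, i, eps)
        if len(neighbors) < min_pts:
            continue                  # noise for now (may become border)
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:                  # grow the cluster point by point
            q = seeds.pop()
            if labels[q] != -1:
                continue
            labels[q] = cluster       # border or core point
            q_neighbors = region_query(X, q, eps)
            if len(q_neighbors) >= min_pts:
                seeds.extend(q_neighbors)   # q is core: keep expanding
        cluster += 1
    return labels
```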

11.
We present fast and highly scalable parallel computations for a number of important and fundamental matrix problems on distributed memory systems (DMS). These problems include matrix multiplication, matrix chain product, and computing the powers, the inverse, the characteristic polynomial, the determinant, the rank, the Krylov matrix, an LU- and a QR-factorization of a matrix, and solving linear systems of equations. Our highly scalable parallel computations for these problems are based on a highly scalable implementation of the fastest sequential matrix multiplication algorithm on DMS. We show that, compared with the best known parallel time complexities on parallel random access machines (PRAM), the most powerful but unrealistic shared memory model of parallel computing, our parallel matrix computations achieve the same speeds on distributed memory parallel computers (DMPC) and carry only an extra polylog factor in the time complexities on DMS with hypercubic networks. Furthermore, our parallel matrix computations are fully scalable on DMPC and highly scalable over a wide range of system sizes on DMS with hypercubic networks. Such fast (in terms of parallel time complexity) and highly scalable (in terms of our definition of scalability) parallel matrix computations were rarely seen before on any distributed memory system.
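One of the listed problems, computing matrix powers, shows how everything reduces to the multiplication primitive; a minimal repeated-squaring sketch in which any fast (e.g. parallel) multiplication routine can replace the stand-in:

```python
import numpy as np

def matrix_power(A, e):
    """Compute A**e with O(log e) multiplications by repeated squaring."""
    matmul = np.matmul               # stand-in for a fast parallel matmul
    result = np.eye(A.shape[0], dtype=A.dtype)
    base = A.copy()
    while e > 0:
        if e & 1:
            result = matmul(result, base)
        base = matmul(base, base)
        e >>= 1
    return result
```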

12.
13.
With the increasing size and complexity of available databases, existing machine learning and data mining algorithms are facing a scalability challenge. In many applications, the number of features describing the data can be extremely high, which hinders, or can even make infeasible, any further exploration. In fact, many of these features are redundant or simply irrelevant, so feature selection plays a key role in helping to overcome the problem of information overload, especially in big data applications. Since many complex datasets can be modeled by graphs of interconnected labeled elements, in this work we are particularly interested in feature selection for subgraph patterns. In this paper, we propose MR-SimLab, a MapReduce-based approach for subgraph selection from large input subgraph sets. In many applications, it is easy to compute pairwise similarities between the labels of graph nodes. Our approach leverages such rich information to measure an approximate subgraph matching by aggregating the elementary label similarities between the matched nodes. Based on the aggregated similarity scores, our approach selects a small subset of informative representative subgraphs. We provide a distributed implementation of our algorithm on top of the MapReduce framework that optimizes the computational efficiency of our approach for big data applications. We experimentally evaluate MR-SimLab on real datasets; the results show that our approach is scalable and that the selected subgraphs are informative.
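The scoring idea can be sketched in a few lines (hypothetical names throughout; the matching construction, the selection step, and the MapReduce distribution are all omitted): given a node correspondence between two subgraphs, aggregate the precomputed label similarities over the matched pairs:

```python
def match_score(pairs, label_sim, g1_labels, g2_labels):
    """Average the elementary label similarities over matched node
    pairs; `pairs` maps nodes of subgraph 1 to nodes of subgraph 2,
    and `label_sim[(a, b)]` is a precomputed label similarity."""
    if not pairs:
        return 0.0
    total = sum(label_sim.get((g1_labels[u], g2_labels[v]), 0.0)
                for u, v in pairs.items())
    return total / len(pairs)
```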

14.
In this paper, we develop algorithms, in increasing order of efficiency, for the all-to-all broadcast problem in an N = 2^n-node n-dimensional faulty SIMD hypercube, Qn, with up to n-1 node faults. The algorithms use a property of a certain ordering of dimensions. Our analysis includes startup time (α) and transfer time (β). We establish the lower bound for such an algorithm to be nα + (2N-3)Lβ in a faulty hypercube with at most n-1 faults (each node holding a value of L bytes). Our best algorithm requires 2nα + 2NLβ and is near-optimal. We then develop an optimal algorithm for matrix multiplication in a faulty hypercube using all-to-all broadcast, and compare the efficiency of the all-to-all broadcast approach with the broadcast approach and the global sum approach for matrix multiplication. The algorithms are congestion-free and applicable in the context of available hypercube machines.
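In the fault-free case, the dimension-ordered exchange pattern underlying such algorithms looks as follows (a sequential simulation sketch; node faults, SIMD constraints, and the paper's specific dimension-ordering property are not modeled). After n exchange steps every node holds all N values, matching the nα startup term of the bound:

```python
def all_to_all_broadcast(values, n):
    """Dimension-ordered all-to-all broadcast on a 2^n-node hypercube:
    in step d, each node swaps everything it has gathered so far with
    its neighbor across dimension d."""
    N = 1 << n
    gathered = [{p: values[p]} for p in range(N)]
    for d in range(n):                 # one exchange step per dimension
        bit = 1 << d
        for p in range(N):
            q = p ^ bit
            if p < q:                  # handle each neighbor pair once
                merged = {**gathered[p], **gathered[q]}
                gathered[p], gathered[q] = merged, dict(merged)
    return gathered

print(len(all_to_all_broadcast(list(range(8)), 3)[0]))  # 8: node 0 has all
```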

15.
The success of bilinear subspace learning heavily depends on reducing correlations among features along the rows and columns of the data matrices. In this work, we study the problem of rearranging elements within a matrix in order to maximize these correlations, so that information redundancy in matrix data can be more extensively removed by existing bilinear subspace learning algorithms. An efficient iterative algorithm is proposed to tackle this essentially integer programming problem. In each step, the matrix structure is refined with a constrained Earth Mover's Distance procedure that incrementally rearranges matrices to become more similar to their low-rank approximations, which have high correlation among features along rows and columns. In addition, we present two extensions of the algorithm for conducting supervised bilinear subspace learning. Experiments in both unsupervised and supervised bilinear subspace learning demonstrate the effectiveness of our proposed algorithms in improving data compression performance and classification accuracy.

16.
We develop a simple direct method, and several variants, based on matrix multiplications, for inverting square non-singular matrices (which need not be positive definite) and for solving systems of linear equations. Since the method and its variants involve only matrix multiplications, they are straightforward to parallelise and hence have an advantage over some existing well-known direct methods. Theoretical background, an analysis of the proposed method, and the complexity of the algorithm for sparse and dense matrices are given. Two significantly different parallel algorithms, for dense and for sparse matrices, are developed. In the dense case, standard parallelisation of matrix multiplications was undertaken. For sparse matrices, however, standard parallel matrix multiplication is not effective, which led us to develop a parallel algorithm different from that for the dense case. The performance of our algorithms is tested via numerical examples.
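The abstract does not spell out the method itself; for illustration, the classical Newton-Schulz iteration below is a well-known scheme that inverts a matrix using nothing but matrix multiplications, which is what makes this family of methods easy to parallelise:

```python
import numpy as np

def newton_schulz_inverse(A, iters=50):
    """Newton-Schulz iteration X <- X (2I - A X): converges
    quadratically to inv(A) using only matrix multiplications.
    (A classical scheme shown for illustration -- not necessarily
    the paper's own method.)"""
    n = A.shape[0]
    # standard starting guess guaranteeing convergence
    X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    I2 = 2.0 * np.eye(n)
    for _ in range(iters):
        X = X @ (I2 - A @ X)
    return X
```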

17.
Scheduling divisible MapReduce computations
In this paper we analyze MapReduce distributed computations as a divisible load scheduling problem. The two operations of mapping and reducing can be understood as two divisible applications with precedence constraints. A divisible load model of the computation and two load partitioning algorithms are proposed, and the performance limits of MapReduce computations are investigated. To the best of our knowledge, this is the first time that applications with precedence constraints have been considered on the grounds of divisible load theory.
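The flavor of a divisible load partition is easy to sketch under simplifying assumptions (linear processing times and no communication cost, unlike the full model with precedence constraints analyzed in the paper): splitting the load so that all workers finish simultaneously gives each worker a share inversely proportional to its per-unit processing time:

```python
def partition_load(total_load, unit_times):
    """Divisible-load split with no communication overhead: worker i
    with per-unit time t_i gets a share proportional to 1/t_i, so all
    workers finish at the same moment."""
    inv = [1.0 / t for t in unit_times]
    s = sum(inv)
    return [total_load * w / s for w in inv]

# example: three workers, the first twice as fast as the other two
print(partition_load(100.0, [1.0, 2.0, 2.0]))  # [50.0, 25.0, 25.0]
# equal finish times: 50*1.0 == 25*2.0 == 25*2.0
```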

18.
We present a new static analysis by abstract interpretation to automatically prove the functional correctness of algorithms implementing matrix operations, such as matrix addition, multiplication, general matrix multiplication, inversion, or, more generally, Basic Linear Algebra Subprograms. To do so, we introduce a family of abstract domains parameterized by a set of matrix predicates as well as a numerical domain. We show that our analysis is robust enough to prove the functional correctness of several versions of the same matrix operation, resulting from loop reordering, loop tiling, inverting the iteration order, line swapping, and expression decomposition. We extend our method to enable modular analysis of code fragments manipulating matrices by reference, and show that it results in a significant analysis speedup.

19.
Resultants characterize the existence of roots of systems of multivariate nonlinear polynomial equations, while their matrices reduce the computation of all common zeros to a problem in linear algebra. Sparse elimination theory has introduced the sparse (or toric) resultant, which takes into account the sparse structure of the polynomials. The construction of sparse resultant, or Newton, matrices is the critical step in the computation of the multivariate resultant and the solution of a nonlinear system. We reveal and exploit the quasi-Toeplitz structure of the Newton matrix, thus decreasing the time complexity of constructing such matrices by roughly one order of magnitude, to quasi-quadratic in the matrix dimension; the space complexity is decreased analogously. These results imply similar improvements in the complexity of computing the resultant polynomial itself and of solving zero-dimensional systems. Our approach relies on fast vector-by-matrix multiplication and uses the following two methods as building blocks: first, a fast and numerically stable method for determining the rank of rectangular matrices, which works exclusively over floating point arithmetic; second, exact polynomial arithmetic algorithms that improve upon the complexity of polynomial multiplication under our model of sparseness, offering bounds linear in the number of variables and the number of non-zero terms.
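The key primitive here is fast multiplication of a vector by a structured matrix; for a plain Toeplitz matrix this is the classical O(n log n) FFT trick sketched below (the quasi-Toeplitz bookkeeping needed for actual Newton matrices is omitted):

```python
import numpy as np

def toeplitz_matvec(col, row, x):
    """Multiply an n x n Toeplitz matrix T by x in O(n log n) by
    embedding T in a 2n x 2n circulant and using the FFT.
    `col` is T's first column, `row` its first row (col[0] == row[0])."""
    n = len(col)
    # first column of the circulant embedding
    c = np.concatenate([col, [0.0], row[:0:-1]])
    xp = np.concatenate([x, np.zeros(n)])
    y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(xp))
    return y[:n].real
```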

20.
MapReduce in MPI for Large-scale graph algorithms
