1.

Parallel approach to NNMF on multicore architecture

P. Alonso V. M. García F. J. Martínez-Zaldívar A. Salazar L. Vergara A. M. Vidal 《The Journal of supercomputing》2014,70(2):564-576

We tackle the parallelization of Non-Negative Matrix Factorization (NNMF), using the Alternating Least Squares and Lee and Seung algorithms, motivated by its use in audio source separation. For the first algorithm, a very suitable technique is the use of active set algorithms for solving several least squares problems with non-negativity inequality constraints. We have addressed the NNMF for dense matrices on multicore architectures by organizing these optimization problems for independent columns. Although the method is not as efficient in the sequential case as the block pivoting variant used by other authors, it is very effective in the parallel case, producing satisfactory results for the type of application in which it is to be used. For the Lee and Seung method, we propose a reorganization of the algorithm steps that increases the convergence speed, and a parallelization of the solution. The article also includes a theoretical and experimental study of the performance obtained with matrices similar to those that arise in the applications that motivated this work.
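For readers unfamiliar with the Lee and Seung method named in this abstract, the following is a minimal sequential sketch of the textbook multiplicative update rules for the Frobenius-norm objective. It is not the reorganized or parallel variant the authors propose, and the function names (`nnmf_multiplicative`, `frobenius_error`), the small `eps` guard, and the random initialization are assumptions made for illustration only.

```python
import random


def matmul(A, B):
    """Dense product of two list-of-lists matrices."""
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def transpose(A):
    """Transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*A)]


def frobenius_error(V, W, H):
    """Frobenius norm of V - W*H."""
    WH = matmul(W, H)
    return sum((v - x) ** 2
               for row_v, row_x in zip(V, WH)
               for v, x in zip(row_v, row_x)) ** 0.5


def nnmf_multiplicative(V, rank, iterations=200, eps=1e-9, seed=0):
    """Approximate a non-negative m x n matrix V by W (m x rank) times H (rank x n).

    Hypothetical helper: uses the standard multiplicative updates
        H <- H * (W^T V) / (W^T W H)
        W <- W * (V H^T) / (W H H^T)
    applied element-wise; eps only guards against division by zero.
    """
    rng = random.Random(seed)
    m, n = len(V), len(V[0])
    # Strictly positive starting factors keep every later iterate non-negative.
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(rank)]
    for _ in range(iterations):
        Wt = transpose(W)
        num = matmul(Wt, V)
        den = matmul(matmul(Wt, W), H)
        H = [[h * a / (d + eps) for h, a, d in zip(rh, ra, rd)]
             for rh, ra, rd in zip(H, num, den)]
        Ht = transpose(H)
        num = matmul(V, Ht)
        den = matmul(W, matmul(H, Ht))
        W = [[w * a / (d + eps) for w, a, d in zip(rw, ra, rd)]
             for rw, ra, rd in zip(W, num, den)]
    return W, H
```

Calling `nnmf_multiplicative(V, 2)` on a small non-negative matrix returns two non-negative factors, and `frobenius_error(V, W, H)` reports how closely their product matches `V`. Each update row depends only on matrix products, which is one reason the method lends itself to parallel execution.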

2.

Exploiting parallelism on a fine-grained MIMD architecture based upon channel queues

Rajiv Gupta Sunah Lee 《International journal of parallel programming》1992,21(3):169-192

We present techniques for exploiting fine-grained parallelism extracted from sequential programs on a fine-grained MIMD system. The system exploits fine-grained parallelism through parallel execution of instructions on multiple processors as well as the pipelined nature of individual processors. The processors can communicate data values via globally shared registers as well as dedicated channel queues. Compilation techniques are presented to utilize these mechanisms. A scheduling algorithm has been developed to distribute operations among the processors in a manner that reduces communication among the processors. The compiler identifies data dependencies which require synchronization and enforces them using channel queues. Delays that may result from attempting write operations to a full channel queue are avoided by spilling values from channels to local registers. If an interprocessor data dependency does not require synchronization, then the data value is passed through a shared register or shared memory. Partially supported by National Science Foundation Presidential Young Investigator Award CCR-9157371 (CCR-9249143) to the University of Pittsburgh.

3.

Aspects of parallelism in computer architecture

Wolfgang Händler 《Mathematics and computers in simulation》1977,19(4):278-283

4.

The MAJC architecture: a synthesis of parallelism and scalability

Tremblay M. Chan J. Chaudhry S. Conigliam A.W. Tse S.S. 《Micro, IEEE》2000,20(6):12-25

The MAJC architecture enhances application performance by exploiting parallelism at multiple levels: instruction, data, thread, and process. Supporting vertical multithreading, speculative multithreading, and chip multiprocessors, the scalable VLIW architecture is also capable of advanced speculation and predication and treats all data types similarly.

5.

Efficient heterogeneous execution on large multicore and accelerator platforms: Case study using a block tridiagonal solver

Alfred J. Park Kalyan S. Perumalla 《Journal of Parallel and Distributed Computing》2013

The algorithmic and implementation principles are explored in gainfully exploiting GPU accelerators in conjunction with multicore processors on high-end systems with large numbers of compute nodes, and evaluated in an implementation of a scalable block tridiagonal solver. The accelerator of each compute node is exploited in combination with multicore processors of that node in performing block-level linear algebra operations in the overall, distributed solver algorithm. Optimizations incorporated include: (1) an efficient memory mapping and synchronization interface to minimize data movement, (2) multi-process sharing of the accelerator within a node to obtain balanced load with multicore processors, and (3) an automatic memory management system to efficiently utilize accelerator memory when sub-matrices spill over the limits of device memory. Results are reported from our novel implementation that uses MAGMA and CUBLAS accelerator software systems simultaneously with ACML (2013) [2] for multithreaded execution on processors. Overall, using 940 nVidia Tesla X2090 accelerators and 15,040 cores, the best heterogeneous execution delivers a 10.9-fold reduction in run time relative to an already efficient parallel multicore-only baseline implementation that is highly optimized with intra-node and inter-node concurrency and computation–communication overlap. Detailed quantitative results are presented to explain all critical runtime components contributing to hybrid performance.

6.

Resource integration using a large knowledge base in Carnot (cited by: 2 in total, 0 self-citations, 2 by others)

Collet C. Huhns M.N. Shen W.-M. 《Computer》1991,24(12):55-62

A method for integrating separately developed information resources that overcomes incompatibilities in syntax and semantics and permits the resources to be accessed and modified coherently is described. The method provides logical connectivity among the information resources via a semantic service layer that automates the maintenance of data integrity and provides an approximation of global data integration across systems. This layer is a fundamental part of the Carnot architecture, which provides tools for interoperability across global enterprises.

7.

A dynamically reconfigurable communication architecture for multicore embedded systems

Salih Bayar Arda Yurdakul 《Journal of Systems Architecture》2012,58(3-4):140-159

8.

A pipeline virtual environment architecture for multicore processor systems

Eric Acosta Alan Liu 《The Visual computer》2012,28(11):1099-1114

We present a novel architecture to develop Virtual Environments (VEs) for multicore CPU systems. An object-centric method provides a uniform representation of VEs. The representation enables VEs to be processed in parallel using a multistage, dual-frame pipeline. Dynamic work distribution and load balancing are accomplished using a thread migration strategy with minimal overhead. This paper describes our approach and shows through performance experiments that it is efficient and scalable. Near-linear speed-ups have been observed in experiments involving up to 1,000 deformable objects on a six-core i7 CPU. This approach's practicality is demonstrated with the development of a medical simulation trainer for a craniotomy procedure.

9.

Software-defined process-near-memory architecture using 3D hybrid bonding integration

Anlin XU;Chenchen DENG;Jianfeng ZHU;Yao WANG;Shaojun WEI;Leibo LIU 《中国科学:信息科学(英文版)》2025,(1):366-378

With the unprecedented, explosive growth in the amount of global data, the development of computing chips, which encounter bottlenecks such as the power wall and the memory wall, cannot satisfy the demanding requirements. This work proposes a software-defined process-near-memory (SDPNM) computing architecture implemented using 3D hybrid bonding integration. The software-defined chip architecture, featuring spatial computations and dynamic reconfiguration, innovates in a top-down manner to achieve high energy efficiency while maintaining flexibility after fabrication. The process-near-memory integration further advances the SDPNM chip in a bottom-up way to reduce the energy consumption of data movement while improving the bandwidth. Utilizing a relatively mature fabrication and bonding process can result in feasible solutions for both data-intensive and compute-intensive applications, including digital signal processing and artificial intelligence. The logic die is fabricated in the SMIC 40 nm process and the DRAM die is fabricated in the PSMC 25 nm process. The hybrid bonding is implemented by XMC. The experimental results show that the energy efficiency of the proposed SDPNM chip is 33.1× better than that of the state-of-the-art FPGA, with improvements ranging from 8.2× to 104.1×.

10.

Virtualization of reconfigurable coprocessors in HPRC systems with multicore architecture

Ivan Gonzalez Sergio Lopez-Buedo Gustavo Sutter Diego Sanchez-Roman Francisco J. Gomez-Arribas Javier Aracil 《Journal of Systems Architecture》2012,58(6-7):247-256

HPRC (High-Performance Reconfigurable Computing) systems include multicore processors and reconfigurable devices acting as custom coprocessors. Due to economic constraints, the number of reconfigurable devices is usually smaller than the number of processor cores, which prevents a 1:1 mapping between cores and coprocessors. This paper presents a solution to this problem, based on the virtualization of reconfigurable coprocessors. A Virtual Coprocessor Monitor (VCM) has been devised for the XtremeData XD2000i In-Socket Accelerator, and a thread-safe API is available for user applications to communicate with the VCM. Two reference applications, an IDEA cipher and an Euler CFD solver, have been implemented in order to validate the proposed architecture and execution model. Results show that the benefits arising from coprocessor virtualization outweigh its overhead, especially when the code has a significant software weight.

11.

A novel folded-torus based network architecture for power-aware multicore systems

Abu Asaduzzaman Sri R. Chaturvedula Ravi Pendse 《Computers & Electrical Engineering》2013

Multicore computers are expected to be used to process a higher volume of data in the future. Current mesh-like multicore architectures are inadequate for increasing memory-level parallelism because of their poor core-to-core interconnection topology. In some architectures, each node has both communication and computation components; the switching component of such a node consumes power while the node is only computing, and vice versa. In this paper, we propose a folded-torus based topology to improve performance and energy saving. In this architecture, nodes are separated into network switches and computing cores. Using the folded-torus concept, we develop a scheme to connect the components (switches and cores) of a multicore architecture. Experimental results show that the proposed architecture outperforms the Raw Architecture Workstation (RAW), Triplet Based Architecture (TriBA), and Logic-Based Distributed Routing (LBDR) architectures, reducing the number of switches by more than 53%, the power consumption by up to 71%, and the average delay by up to 58%.

12.

Composing enterprise mashup components and services using architecture integration patterns (cited by: 1 in total, 0 self-citations, 1 by others)

Yan Liu Xin Liang Lingzhi Xu Mark Staples Liming Zhu 《Journal of Systems and Software》2011,84(9):1436-1446

Enterprise mashups leverage various sources of information to compose new situational applications. The architecture of such applications must address integration issues: it needs to deal with heterogeneous local and/or public data sources, and build value-added applications on existing corporate IT systems. In this paper, we leverage enterprise architecture integration patterns to compose reusable mashup components. We present a service-oriented architecture that addresses reusability and integration needs for building enterprise mashup applications. Key techniques to customize this architecture are developed for mashups with themed data on location maps. The usage of this architecture is illustrated by a property valuation application derived from a real-world scenario. We demonstrate and discuss how this state-of-the-art architecture design method can be applied to enhance the design and development of emerging enterprise mashups.

13.

Parallelism on multicore processors using Parallel.FX

A.L. Márquez C. Gil R. Baños J. Gómez 《Advances in Engineering Software》2011,42(5):259-265

The Parallel.FX Task Parallel Library is the latest tool developed for multicore parallelism optimization using the .NET technology. It is a managed concurrency library that provides optimized managed code for multicore processors using a new thread pool that withstands cancellation, waiting and pool isolation, among many other features. The Task Parallel Library also uses dynamic work stealing techniques for superior scalability. This paper analyzes the performance improvement of using the Task Parallel Library of Parallel.FX when applying a Multi-Objective Evolutionary Algorithm to solve a timetabling problem. For comparative purposes, this algorithm has also been parallelized using threads. The results obtained show that both alternatives allow a reduction in the runtime necessary to solve this problem. However, parallelizing the code using the Task Parallel Library of Parallel.FX has the advantage of being easier, and the resulting code is much smaller than when programming threads directly.

14.

Data locality and parallelism optimization using a constraint-based approach

Ozcan Ozturk 《Journal of Parallel and Distributed Computing》2011,71(2):280-287

Embedded applications are becoming increasingly complex and processing ever-increasing datasets. In the context of data-intensive embedded applications, there have been two complementary approaches to enhancing application behavior, namely, data locality optimizations and improving loop-level parallelism. Data locality needs to be enhanced to maximize the number of data accesses satisfied from the higher levels of the memory hierarchy. On the other hand, compiler-based code parallelization schemes require a fresh look for chip multiprocessors as interprocessor communication is much cheaper than off-chip memory accesses. Therefore, a compiler needs to minimize the number of off-chip memory accesses. This can be achieved by considering multiple loop nests simultaneously. Although compilers address these two problems, there is an inherent difficulty in optimizing both data locality and parallelism simultaneously. Therefore, an integrated approach that combines these two can generate much better results than each individual approach. Based on these observations, this paper proposes a constraint network (CN)-based formulation for data locality optimization and code parallelization. The paper also presents experimental evidence, demonstrating the success of the proposed approach, and compares our results with those obtained through previously proposed approaches. The experiments from our implementation indicate that the proposed approach is very effective in enhancing data locality and parallelization.

15.

Scheduling parallel jobs on multicore clusters using CPU oversubscription

Gladys Utrera Julita Corbalan Jesús Labarta 《The Journal of supercomputing》2014,68(3):1113-1140

16.

Exploiting fine-grained parallelism in graph traversal algorithms via lock virtualization on multi-core architecture

Jie Yan Guangming Tan Ninghui Sun 《The Journal of supercomputing》2014,69(3):1462-1490

17.

Parallel SUMIS soft detector for large MIMO systems on multicore and GPU

Ramiro Carla Simarro M. Ángeles Gonzalez Alberto Vidal Antonio M. 《The Journal of supercomputing》2019,75(3):1256-1267

The number of transmit and receive antennas is an important factor that affects the performance and complexity of a MIMO system. A MIMO system with very large...

18.

Design and implementation of large-point FFT convolution based on a heterogeneous multicore programmable system

《电子技术应用》2017,(3):16-20

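FFT convolution is now widely used in digital signal processing, and the past few years have seen the development of heterogeneous multicore programmable systems (HMPS). HMPS has, moreover, become a mainstream trend in the DSP field. Studying efficient implementations of large-point FFT convolution on HMPS is therefore very important. Based on the overlap-add FFT convolution method, an efficient pipelined overlap-add filter for input data streams is designed. An HMPS-based implementation of large-point FFT convolution is presented, which achieves high-precision filtering. In addition, the pipelined filter design improves the system's processing speed, data throughput, and task parallelism. Experiments on a Xilinx XC7V2000T FPGA development board show that the larger the number of sample points involved in the computation, the higher the system's task parallelism, processing speed, and data throughput. When the number of sample points reaches 1M, the average task parallelism of the system reaches 5.33, 2.745×10^6 system clock cycles are consumed, and the absolute error precision reaches 10^(-4).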
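The overlap-add FFT convolution method on which this abstract builds can be sketched in software as follows. This is a plain sequential illustration under stated assumptions, not the pipelined HMPS/FPGA design of the paper: the helper names (`fft`, `overlap_add_convolve`, `direct_convolve`), the recursive radix-2 FFT, and the default block size are all hypothetical choices made for the example.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def overlap_add_convolve(signal, kernel, block_size=8):
    """Linear convolution of real sequences via overlap-add.

    The signal is cut into blocks of block_size samples; each block is
    zero-padded, multiplied with the kernel in the frequency domain, and
    the overlapping tails of consecutive blocks are summed.
    """
    m = len(kernel)
    n_fft = 1
    while n_fft < block_size + m - 1:  # smallest power of two that avoids wrap-around
        n_fft *= 2
    kernel_spectrum = fft(list(kernel) + [0.0] * (n_fft - m))
    result = [0.0] * (len(signal) + m - 1)
    for start in range(0, len(signal), block_size):
        block = list(signal[start:start + block_size])
        spectrum = fft(block + [0.0] * (n_fft - len(block)))
        product = [a * b for a, b in zip(spectrum, kernel_spectrum)]
        segment = fft(product, inverse=True)
        for i in range(len(block) + m - 1):
            # Divide by n_fft because the inverse transform is unnormalized.
            result[start + i] += segment[i].real / n_fft
    return result


def direct_convolve(signal, kernel):
    """Reference O(N*M) convolution used to check the FFT-based result."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out
```

Because each block is transformed independently, the per-block work can be pipelined or distributed across cores, which is the property a streaming hardware filter would exploit. Comparing `overlap_add_convolve` against `direct_convolve` on the same inputs is a simple way to check the result up to floating-point rounding.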

19.

Achieving full parallelism using multidimensional retiming

Passos N.L. Sha E.H.-M. 《Parallel and Distributed Systems, IEEE Transactions on》1996,7(11):1150-1163

Most scientific and digital signal processing (DSP) applications are recursive or iterative. Transformation techniques are usually applied to get optimal execution rates in parallel and/or pipeline systems. The retiming technique is a common and valuable transformation tool in one-dimensional problems, when loops are represented by data flow graphs (DFGs). In this paper, uniform nested loops are modeled as multidimensional data flow graphs (MDFGs). Full parallelism of the loop body, i.e., all nodes in the MDFG executed in parallel, substantially decreases the overall computation time. It is well known that, for one-dimensional DFGs, retiming cannot always achieve full parallelism. Other existing optimization techniques for nested loops also cannot always achieve full parallelism. This paper shows an important and counter-intuitive result, which proves that we can always obtain full parallelism for MDFGs with more than one dimension. This result is obtained by transforming the MDFG into a new structure. The restructuring process is based on a multidimensional retiming technique. The theory and two algorithms to obtain full parallelism are presented in this paper. Examples of optimization of nested loops and digital signal processing designs are shown to demonstrate the effectiveness of the algorithms.

20.

Data-oriented system integration architecture

李松齐文华《计算机应用》2012,32(Z2):85-88

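To address the difficulties in system management, extension, and maintenance caused by differing implementation technologies and tight coupling among subsystems during the integration of large-scale distributed systems, a data-oriented design pattern is proposed. The design principles under this pattern are discussed, along with the middleware technologies involved in data-oriented system design. Based on an analysis of currently popular system architectures, a data-oriented system integration architecture is proposed. Finally, taking the implementation of a cross-boundary real-time package tracking system as an example, the application details of the data-oriented integration architecture are described.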