
1.

Looped schedules for dataflow descriptions of multirate signal processing algorithms

Shuvra S. Bhattacharyya, Edward A. Lee 《Formal Methods in System Design》1994,5(3):183-205


2.

Space optimization of schedule sequences for the synchronous dataflow model


刘国鑫, 谭国强, 贺也平 《计算机工程与应用》2009,45(3):198-201

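A memory optimization method for embedded DSP systems is proposed, based on the synchronous dataflow (SDF) model. Because other optimization algorithms do not apply to SDF models that contain feedback loops, the method designs and implements a heuristic scheduling algorithm for space optimization of feedback loops. It also proposes a hierarchical space optimization scheme that combines single appearance schedules (SAS) with non-SAS schedule sequences, providing a general solution for space optimization of SDF schedule sequences. Experimental results confirm the effectiveness of the scheme.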

3.

Consistency in dataflow graphs

Lee E.A. 《Parallel and Distributed Systems, IEEE Transactions on》1991,2(2):223-235

Analytical properties of programming languages with dataflow graph semantics are discussed. It is shown that one of the most serious problems with these languages is that subtle inconsistencies between parts of the dataflow graph can be inadvertently created. These inconsistencies can lead to deadlock, or in the case of nonterminating programs, to unbounded memory requirements. Consistency is defined to mean that the same number of tokens is consumed as produced on any arc, in the long run. A token-flow model is developed for testing for inconsistency. The method is a generalization of consistency checks for synchronous dataflow (SDF) graphs. The token-flow model is compared to similar tests applied to hybrid dynamical systems. It is argued that dataflow semantics make steady-state analysis possible, leading to a simpler method in most cases.
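The consistency condition in this abstract (tokens produced equal tokens consumed on every arc in the long run) can be illustrated for the plain SDF case with the usual balance equations. The following is a minimal sketch under stated assumptions, not the token-flow model of the paper: the function name `repetition_vector` and the arc tuple layout `(src, produced, dst, consumed)` are hypothetical, and every actor named in an arc is assumed to appear in `actors`.

```python
from fractions import Fraction
from math import lcm  # multi-argument lcm needs Python 3.9+


def repetition_vector(actors, arcs):
    """Solve q[src] * produced == q[dst] * consumed for every arc.

    arcs: list of (src, produced, dst, consumed) with positive integer rates.
    Returns a dict of positive integer firing counts, or None when the
    balance equations have no nonzero solution (the graph is inconsistent).
    """
    # Undirected adjacency carrying the firing-rate ratio along each arc.
    adj = {a: [] for a in actors}
    for src, p, dst, c in arcs:
        adj[src].append((dst, Fraction(p, c)))  # q[dst] = q[src] * p / c
        adj[dst].append((src, Fraction(c, p)))  # q[src] = q[dst] * c / p

    # Propagate relative rates through each connected component.
    rate = {}
    for start in actors:
        if start in rate:
            continue
        rate[start] = Fraction(1)
        stack = [start]
        while stack:
            u = stack.pop()
            for v, ratio in adj[u]:
                if v not in rate:
                    rate[v] = rate[u] * ratio
                    stack.append(v)

    # Any arc that violates its balance equation signals inconsistency.
    for src, p, dst, c in arcs:
        if rate[src] * p != rate[dst] * c:
            return None

    # Scale the rational rates to integers.
    scale = lcm(*(r.denominator for r in rate.values()))
    return {a: int(r * scale) for a, r in rate.items()}
```

For a chain in which A produces 2 tokens per firing and B consumes 3, then B produces 1 and C consumes 2, the sketch yields 3, 2, and 1 firings for A, B, and C. Adding an arc whose rates contradict the others makes it return `None`.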

4.

A hierarchical memory optimization method for SDF models in embedded DSP systems

刘国鑫, 郭烈恩, 贺也平, 郭亮 《计算机辅助设计与图形学学报》2009,21(3)

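In embedded digital signal processing (DSP) systems described by the synchronous dataflow (SDF) model, single appearance schedule (SAS) algorithms either cannot find a solution or optimize memory poorly for applications that contain feedback loops or data-intensive processing. This paper proposes a hierarchical memory optimization method that combines SAS and non-SAS scheduling algorithms. It defines data-intensive components and strongly connected components to describe loops and data-intensive processing structures, and designs a heuristic non-SAS scheduling algorithm, based on the principle of consuming data first, to optimize memory for these components. The method applies to arbitrary SDF models and optimizes memory well. Experimental results demonstrate its effectiveness.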

5.

The firefighter problem: Empirical results on random graphs

《Computers & Operations Research》2015

The firefighter problem is a deterministic discrete-time model for the spread and containment of fire on a graph. Once the fire breaks out at a set of vertices, the goal addressed in this work is to save as many vertices as possible from burning. Although the problem finds applications in various real-world problems, such as the spread of diseases or hoaxes contention in communication networks, this problem has not been addressed from a practical point of view so far, in the sense of finding a good strategy for the general case. In this work, we develop and compare several integer linear programming techniques and heuristic methods. Random graphs are used for the purpose of comparison. The obtained results shed some light on the challenges for computational tools as caused by graph topology, graph size, and the number of firefighters per iteration, when looking for the best strategy for an a priori unknown graph.
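The discrete-time model described in this abstract can be simulated in a few lines. This is only a sketch under assumptions: the function name `simulate_firefighter`, the adjacency-dict input, and the convention that defenders are placed before the fire spreads in each step are illustrative choices, and the code evaluates a given defence plan rather than searching for a good one as the ILP and heuristic methods of the paper do.

```python
def simulate_firefighter(adj, ignition, plan):
    """Return the set of vertices that never burn.

    adj: dict mapping each vertex to an iterable of its neighbours.
    ignition: vertices where the fire breaks out.
    plan: list of vertex sets; plan[t] holds the vertices defended at step t
          (assumed convention: defenders are placed before the fire spreads).
    """
    burning = set(ignition)
    defended = set()
    step = 0
    while True:
        if step < len(plan):
            # A burning vertex can no longer be defended.
            defended |= {v for v in plan[step] if v not in burning}
        # Fire spreads to every undefended, unburnt neighbour.
        frontier = {w for v in burning for w in adj[v]} - burning - defended
        if not frontier:
            break
        burning |= frontier
        step += 1
    return set(adj) - burning
```

On a path of five vertices with the fire starting at one end, defending the neighbouring vertex in the first step saves the remaining four vertices, while an empty plan saves none.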

6.

An efficient high-level synthesis method for high-performance graph computing

汤嘉武, 郑龙, 廖小飞, 金海 《计算机研究与发展》2021,58(3):467-478

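Graph computing has become a mainstream application in big data processing, and acceleration with dedicated hardware can significantly improve its performance and energy efficiency. Writing and verifying hardware code is well known to be time-consuming. Although general-purpose high-level synthesis (HLS) systems let users generate hardware structures automatically from high-level language features (for example, C), they still lack effective support for parallelism and memory access in irregular algorithms such as graph computing, which leads to poor synthesis results and low efficiency. This paper proposes an efficient HLS method for graph computing. Taking into account the nested loops, random memory accesses, data conflicts, and power-law distributions that characterize graph algorithms, it uses a dataflow architecture to implement efficient parallel pipelines and to balance the load across processing elements. Through the programming primitives provided, the method converts general graph algorithms into a modular dataflow intermediate representation, which is then mapped onto parameterized hardware templates. An implementation on the Xilinx Virtex UltraScale+ XCVU9P verifies the correctness of the method, and experiments with different types of graph algorithms on multiple datasets show that it achieves a 7.9x to 30.6x performance improvement over Spatial, an internationally used HLS system.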

7.

An evaluation of medium-grain dataflow code

Walid A. Najjar, Lucas Roh, A. P. Wim Böhm 《International journal of parallel programming》1994,22(3):209-242

In this paper, we study several issues related to the medium grain dataflow model of execution. We present bottom-up compilation of medium grain clusters from a fine grain dataflow graph. We compare the basic block and the dependence sets algorithms that partition dataflow graphs into clusters. For an extensive set of benchmarks we assess the average number of instructions in a cluster and the reduction in matching operations compared with fine grain dataflow execution. We study the performance of medium grain dataflow when several architectural parameters, such as the number of processors, matching cost, and network latency, are varied. The results indicate that medium grain execution offers a good speedup over the fine grain model, that it is scalable, and tolerates network latency and high matching costs well. Medium grain execution can benefit from a higher output bandwidth of a processor and finally, a simple superscalar processor with an issue rate of two is sufficient to exploit the internal parallelism of a cluster. This work is supported in part by NSF Grants CCR-9010240 and MIP-9113268.

8.

Efficient MapReduce algorithms for triangle listing in billion-scale graphs

Yuanyuan Zhu, Hao Zhang, Lu Qin, Hong Cheng 《Distributed and Parallel Databases》2017,35(2):149-176

This paper addresses the classical triangle listing problem, which aims at enumerating all the tuples of three vertices connected with each other by edges. This problem has been intensively studied in internal and external memory, but it is still an urgent challenge in a distributed environment, where multiple machines across the network can be utilized to achieve good performance and scalability. As one of the de facto computing methodologies in distributed environments, MapReduce has been used in some existing triangle listing algorithms. However, these algorithms usually need to shuffle a huge amount of intermediate data, which seriously hinders their scalability on large scale graphs. In this paper, we propose a new triangle listing algorithm in MapReduce, FTL, which utilizes a light weight data structure to substantially reduce the intermediate data transferred during the shuffle stage, and is also equipped with multiple-round techniques to ease the burden on memory and network bandwidth when dealing with graphs at billion scale. We prove that the size of the intermediate data can be well bounded near to the number of triangles in the graph. To further reduce the shuffle size and memory cost, we also propose improved algorithms based on a compact data structure, and present several optimization techniques to accelerate the computation and reduce the memory consumption. The extensive experimental results show that our algorithms outperform existing competitors by several times on both synthetic graphs and real world graphs.
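For readers unfamiliar with the triangle listing problem itself, the sketch below enumerates triangles on a single machine. It is not FTL and involves no MapReduce; it uses the common degree-ordered neighbour intersection technique, and the function name `list_triangles` is a hypothetical choice. Vertices are assumed to be mutually comparable (for example, all integers).

```python
def list_triangles(edges):
    """List each triangle of an undirected graph exactly once.

    edges: iterable of (u, v) pairs; self-loops and duplicates are ignored.
    Returns a sorted list of triangles, each a sorted 3-tuple of vertices.
    """
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Total order on vertices: by degree, ties broken by the vertex itself.
    rank = {v: (len(adj[v]), v) for v in adj}
    # Keep only neighbours that come later in the order.
    higher = {v: {w for w in adj[v] if rank[w] > rank[v]} for v in adj}

    triangles = []
    for u in adj:
        for v in higher[u]:
            # A common later neighbour closes the triangle, and it is found
            # only from the lowest-ranked vertex of that triangle.
            for w in higher[u] & higher[v]:
                triangles.append(tuple(sorted((u, v, w))))
    return sorted(triangles)
```

Orienting every edge from lower to higher rank ensures that each triangle is reported once, from its lowest-ranked vertex, instead of six times.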

9.

A finite volume scheme for solving elliptic boundary value problems on composite grids

M. J. H. Anthonissen, B. van ’t Hof, A. A. Reusken 《Computing》1998,61(4):285-305

This paper presents a new dataflow graph model, where only data tokens are allowed to flow. First we introduce a High-Level Dataflow System (HLDS) to describe a formal dataflow graph model, then we present a homogeneous HLDS (hHLDS) that formally describes our proposal. In this proposal the dataflow graph is obtained by employing only actors with homogeneous I/O conditions, that is, each actor, which executes an elemental operation, is characterised by having one output and two input arcs. Even though no control tokens are allowed, i.e. no T-gate, merge, and switch actors are present in this model, it is always possible to obtain dataflow graphs, which represent any programming structure and whose behaviour is well-behaved. As homogeneous I/O conditions are a severe restriction to represent the flow of a computation and the token flow in such dataflow graphs is completely asynchronous, proof is given to guarantee their determinacy. The main advantage of this representation is that it maps directly to hardware through a one-to-one correspondence between actors of the model and Functional Units of a dataflow machine.

10.

Automatic Design of Application Specific Instruction Set Extensions Through Dataflow Graph Exploration

Clark Nathan, Zhong Hongtao, Tang Wilkin, Mahlke Scott 《International journal of parallel programming》2003,31(6):429-449

General-purpose processors are often incapable of achieving the challenging cost, performance, and power demands of high-performance applications. To meet these demands, most systems employ a number of hardware accelerators to off-load the computationally demanding portions of the application. As an alternative to this strategy, we examine customizing the computation capabilities of a processor for a particular application. The processor is extended with hardware in the form of a set of custom function units and instruction set extensions. To effectively identify opportunities for creating custom hardware, a dataflow graph design space exploration engine heuristically identifies candidate computation subgraphs without artificially constraining their size or shape. The engine combines estimates of performance gain, cost, and inherent limitations of the processor to grow candidate graphs in profitable directions while pruning unprofitable paths. This paper describes the dataflow graph exploration engine and evaluates its effectiveness across a set of embedded applications.

11.

Efficient Rate-Distortion Optimized Media Streaming for Tree-Structured Packet Dependencies

Röder M., Cardinal J., Hamzaoui R. 《Multimedia, IEEE Transactions on》2007,9(6):1259-1272

When streaming packetized media data over a lossy packet network, it is desirable to use transmission strategies that minimize the expected distortion subject to a constraint on the expected transmission rate. Because the computation of such optimal strategies is usually an intractable problem, fast heuristic techniques are often used. We first show that when the graph that gives the decoding dependencies between the data packets is reducible to a tree, optimal transmission strategies can be efficiently computed with dynamic programming algorithms. The proposed algorithms are much faster than other exact algorithms developed for arbitrary dependency graphs. They are slower than previous heuristic techniques but can provide much better solutions. We also show how to apply our algorithms to find high-quality approximate solutions when the dependency graph is not tree reducible. To validate our approach, we run simulations for MPEG1 and H.264 video data. We first consider a simulated packet erasure channel. Then we implement a real video streaming system and provide experimental results for an Internet connection.

12.

Heuristic algorithms for scheduling iterative task computations on distributed memory machines

Tao Yang, Cong Fu 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(6):608-622

Many partitioned scientific programs can be modeled as iterative executions of computational tasks and represented by iterative task graphs (ITGs). An ITG may or may not have dependence cycles. In this paper, we consider the symbolic scheduling of ITGs on distributed memory architectures with nonzero communication overhead and propose heuristic algorithms for scheduling both cyclic and acyclic ITGs without searching an entire iteration space. Our approach incorporates techniques of software pipelining, graph unfolding, directed acyclic graph (DAG) scheduling, and load balancing. We analyze the asymptotic optimality of the algorithms to show that the derived schedules are competitive to optimal solutions. We also study the sensitivity of scheduling performance on inaccurate weights. Finally, we present experimental results to demonstrate the effectiveness of the optimization techniques.

13.

A Non-Stop Double Buffering Mechanism for Dataflow Architecture


Xu Tan, Xiao-Wei Shen, Xiao-Chun Ye, Da Wang, Dong-Rui Fan, Lunkai Zhang, Wen-Ming Li, Zhi-Min Zhang, Zhi-Min Tang 《计算机科学技术学报》2018,33(1):145-157

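Double buffering is an effective mechanism for hiding the latency of data transfers between on-chip and off-chip memory. In dataflow architectures, however, swapping the two buffers during the execution of many tiles reduces performance, because the dataflow accelerator is repeatedly filled and drained. In this work, we propose a non-stop double buffering mechanism for dataflow architectures. By optimizing the control logic of the dataflow architecture, the proposed non-stop mechanism assigns tiles to the processing element array without stopping the execution of the processing elements. In addition, we propose a workflow program that cooperates with the non-stop double buffering mechanism. With the optimizations to both the control logic and the workflow program, the array needs to be filled and drained only once across the execution of all tiles that belong to the same dataflow graph. Experimental results show that the proposed double buffering mechanism for dataflow architectures achieves an average efficiency improvement of 16.2% over the version without the optimization.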

14.

Machines and models for parallel computing

Jack B. Dennis 《International journal of parallel programming》1994,22(1):47-77

It is widely believed that superscalar and superpipelined extensions of RISC style architecture will dominate future processor design, and that needs of parallel computing will have little effect on processor architecture. This belief ignores the issues of memory latency and synchronization, and fails to recognize the opportunity to support a general semantic model for parallel computing. Efforts to extend the shared-memory model using standard microprocessors have led to systems that implement no satisfactory model of computing, and present the programmer with a difficult interface on which to build parallel computing applications. A more satisfactory model for parallel computing may be obtained on the basis of functional programming concepts and the principles of modular software construction. We recommend that designs for computers be built on such a general semantic model of parallel computation. Multithreading concepts and dataflow principles can frame the architecture of these new machines.

15.

Optimizing the Cray Graph Engine for performant analytics on cluster, SuperDome Flex, Shasta systems and cloud deployment

Christopher D. Rickett, Kristyn J. Maschhoff, Sreenivas R. Sukumar 《Concurrency and Computation》2024,36(10):e7982

We present updates to the Cray Graph Engine, a high performance in-memory semantic graph database, which enable performant execution across multiple architectures as well as deployment in a container to support cloud and as-a-service graph analytics. This paper discusses the changes required to port and optimize CGE to target multiple architectures, including Cray Shasta systems, large shared-memory machines such as SuperDome Flex (SDF), and cluster environments such as Apollo systems. The porting effort focused primarily on removing dependences on XPMEM and Cray PGAS and replacing these with a simplified PGAS library based upon POSIX shared memory and one-sided MPI, while preserving the existing Coarray-C++ CGE code base. We also discuss the containerization of CGE using Singularity and the techniques required to enable container performance matching native execution. We present early benchmarking results for running CGE on the SDF, Infiniband clusters and Slingshot interconnect-based Shasta systems.

16.

Optimization procedures for the bipartite unconstrained 0-1 quadratic programming problem

《Computers & Operations Research》2014

The bipartite unconstrained 0-1 quadratic programming problem (BQP) is a difficult combinatorial problem defined on a complete graph that consists of selecting a subgraph that maximizes the sum of the weights associated with the chosen vertices and the edges that connect them. The problem has appeared under several different names in the literature, including maximum weight induced subgraph, maximum weight biclique, matrix factorization and maximum cut on bipartite graphs. There are only two unpublished works (technical reports) where heuristic approaches are tested on BQP instances. Our goal is to combine straightforward search elements to balance diversification and intensification in both exact (branch and bound) and heuristic (iterated local search) frameworks. We perform a number of experiments to test individual search components and also to create new benchmarks when comparing against the state of the art, which the proposed procedure outperforms.

17.

Branch and bound for the cutwidth minimization problem

Rafael Martí, Juan J. Pantrigo, Abraham Duarte, Eduardo G. Pardo 《Computers & Operations Research》2013

The cutwidth minimization problem consists of finding a linear arrangement of the vertices of a graph where the maximum number of cuts between the edges of the graph and a line separating consecutive vertices is minimized. We first review previous approaches for special classes of graphs, followed by lower bounds and then a linear integer formulation for the general problem. We then propose a branch-and-bound algorithm based on different lower bounds on the cutwidth of partial solutions. Additionally, we introduce a Greedy Randomized Adaptive Search Procedure (GRASP) heuristic to obtain good initial solutions. The combination of the branch-and-bound and GRASP methods results in optimal solutions or a reduced relative gap (difference between upper and lower bounds) on the instances tested. Empirical results with a collection of previously reported instances indicate that the proposed algorithm is able to solve all the small instances (up to 32 vertices) as well as some of the large instances tested (up to 158 vertices) using less than 30 minutes of CPU time. We compare the results of our method with previous lower bounds, and with the best previous linear integer formulation solved using Cplex. Both comparisons favor the proposed procedure.
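The objective defined in the first sentence of this abstract can be made concrete with a short sketch. The helper names `cutwidth_of` and `min_cutwidth_bruteforce` are hypothetical, and the exhaustive search over all permutations is only a baseline for very small graphs; it is not the branch-and-bound or GRASP procedure of the paper.

```python
from itertools import permutations


def cutwidth_of(order, edges):
    """Cutwidth of one linear arrangement.

    order: sequence of vertices from left to right.
    edges: iterable of (u, v) pairs.
    Returns the largest number of edges crossing any gap between
    consecutive positions.
    """
    pos = {v: i for i, v in enumerate(order)}
    worst = 0
    for gap in range(len(order) - 1):
        # An edge crosses the gap after position `gap` when its endpoints
        # lie on opposite sides of that gap.
        cut = sum(
            1
            for u, v in edges
            if min(pos[u], pos[v]) <= gap < max(pos[u], pos[v])
        )
        worst = max(worst, cut)
    return worst


def min_cutwidth_bruteforce(vertices, edges):
    """Exact minimum cutwidth by trying every arrangement (tiny graphs only)."""
    edges = list(edges)
    return min(cutwidth_of(p, edges) for p in permutations(vertices))
```

A path laid out in its natural order has cutwidth 1, while a cycle on four vertices needs at least 2 in any arrangement. The factorial cost of the exhaustive search is one way to see why bounding and heuristics matter for larger instances.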

18.

DaSH: A benchmark suite for hybrid dataflow and shared memory programming models

《Parallel Computing》2015

The current trend in development of parallel programming models is to combine different well established models into a single programming model in order to support efficient implementation of a wide range of real world applications. The dataflow model has particularly managed to recapture the interest of the research community due to its ability to express parallelism efficiently. Thus, a number of recently proposed hybrid parallel programming models combine dataflow and traditional shared memory models. Their findings have influenced the introduction of task dependency in the OpenMP 4.0 standard. This article presents DaSH – the first comprehensive benchmark suite for hybrid dataflow and shared memory programming models. DaSH features 11 benchmarks, each representing one of the Berkeley dwarfs that capture patterns of communication and computation common to a wide range of emerging applications. DaSH also includes sequential and shared-memory implementations based on OpenMP and Intel TBB to facilitate easy comparison between hybrid dataflow implementations and traditional shared memory implementations based on work-sharing and/or tasks. Finally, we use DaSH to evaluate three different hybrid dataflow models, identify their advantages and shortcomings, and motivate further research on their characteristics.

19.

Sequential and parallel triangulating algorithms for Elimination Game and new insights on Minimum Degree

Anne Berry, Elias Dahlhaus, Pinar Heggernes, Geneviève Simonet 《Theoretical computer science》2008

Elimination Game is a well-known algorithm that simulates Gaussian elimination of matrices on graphs, and it computes a triangulation of the input graph. The number of fill edges in the computed triangulation is highly dependent on the order in which Elimination Game processes the vertices, and in general the produced triangulations are neither minimum nor minimal. In order to obtain a triangulation which is close to minimum, the Minimum Degree heuristic is widely used in practice, but until now little was known on the theoretical mechanisms involved.
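The dependence of fill on the elimination order, as noted in this abstract, is easy to see in a small sketch of Elimination Game. The function names and the adjacency-dict input are assumptions made for illustration, vertices are assumed to be mutually comparable for tie-breaking, and the Minimum Degree variant shown is the textbook greedy rule rather than any refinement studied in the paper.

```python
def elimination_game(adj, order):
    """Run Elimination Game and return the set of fill edges.

    adj: dict mapping each vertex to an iterable of its neighbours.
    order: elimination order containing every vertex exactly once.
    Each fill edge is returned as a frozenset of its two endpoints.
    """
    g = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    fill = set()
    for v in order:
        nbrs = list(g[v])
        # Turn the remaining neighbourhood of v into a clique.
        for i, a in enumerate(nbrs):
            for b in nbrs[i + 1:]:
                if b not in g[a]:
                    g[a].add(b)
                    g[b].add(a)
                    fill.add(frozenset((a, b)))
        # Remove v from the graph.
        for n in nbrs:
            g[n].discard(v)
        del g[v]
    return fill


def minimum_degree_order(adj):
    """Greedy order: always eliminate a vertex of smallest current degree."""
    g = {v: set(ns) for v, ns in adj.items()}
    order = []
    while g:
        v = min(g, key=lambda x: (len(g[x]), x))
        nbrs = list(g[v])
        for i, a in enumerate(nbrs):
            for b in nbrs[i + 1:]:
                g[a].add(b)
                g[b].add(a)
        for n in nbrs:
            g[n].discard(v)
        del g[v]
        order.append(v)
    return order
```

On a star graph, eliminating the centre first creates a fill edge between every pair of leaves, whereas the minimum degree order removes the leaves first and creates no fill at all.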

20.

The function processor: A data-driven processor array for irregular computations

Jesper Vasell, Jonas Vasell 《Future Generation Computer Systems》1992,8(4):321-335
