首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 937 毫秒
1.
Many computational-intensive problems from science and engineering are irregular in nature. This makes it difficult to develop an efficient parallel implementation, even for shared-memory machines. As a typical example, we investigate a parallel implementation of an irregular particle simulation algorithm. We concentrate on the issue which programming and system support is needed to yield an efficient implementation for a large number of processors. As an execution platform we use the SB-PRAM, a shared memory machine with up to 2048 processors. The processors of the SB-PRAM can access the global memory in unit time which is the basis for an exact performance prediction. Common approaches for parallel implementations like lock protection for concurrent accesses and sequential or distributed task queues are replaced by more efficient access mechanisms and data structures which can be realized by the powerful multiprefix operations of the SB-PRAM. Their use simplifies the implementation and yields large speedup values.  相似文献   

2.
Operations on basic data structures such as queues, priority queues, stacks, and counters can dominate the execution time of a parallel program due to both their frequency and their coordination and contention overheads. There are considerable performance payoffs in developing highly optimized, asynchronous, distributed, cache-conscious, parallel implementations of such data structures. Such implementations may employ a variety of tricks to reduce latencies and avoid serial bottlenecks, as long as the semantics of the data structure are preserved. The complexity of the implementation and the difficulty in reasoning about asynchronous systems increases concerns regarding possible bugs in the implementation. In this paper we consider postmortem, black-box procedures for testing whether a parallel data structure behaved correctly. We present the first systematic study of algorithms and hardness results for such testing procedures, focusing on queues, priority queues, stacks, and counters, under various important scenarios. Our results demonstrate the importance of selecting test data such that distinct values are inserted into the data structure (as appropriate). In such cases we present an O(n) time algorithm for testing linearizable queues, an O(n log n) time algorithm for testing linearizable priority queues, and an O( np 2 ) time algorithm for testing sequentially consistent queues, where n is the number of data structure operations and p is the number of processors. In contrast, we show that testing such data structures for executions with arbitrary input values is NP-complete. Our results also help clarify the thresholds between scenarios that admit polynomial time solutions and those that are NP-complete. Our algorithms are the first nontrivial algorithms for these problems.  相似文献   

3.
This paper presents a parallel framework for simulating fluids with the Smoothed Particle Hydrodynamics (SPH) method. For low computational costs per simulation step, efficient parallel neighbourhood queries are proposed and compared. To further minimize the computing time for entire simulation sequences, strategies for maximizing the time step and the respective consequences for parallel implementations are investigated. The presented experiments illustrate that the parallel framework can efficiently compute large numbers of time steps for large scenarios. In the context of neighbourhood queries, the paper presents optimizations for two efficient instances of uniform grids, that is, spatial hashing and index sort. For implementations on parallel architectures with shared memory, the paper discusses techniques with improved cache‐hit rate and reduced memory transfer. The performance of the parallel implementations of both optimized data structures is compared. The proposed solutions focus on systems with multiple CPUs. Benefits and challenges of potential GPU implementations are only briefly discussed.  相似文献   

4.
曹宇  徐明伟 《软件学报》2012,23(7):1924-1934
利用多路径传输协议,多宿主主机可以通过多条路径并行传输数据,从而有效提高系统的吞吐率和鲁棒性.但是由于不同路径在带宽、延迟和丢包率等方面存在差异,接收端必须缓存大量乱序到达的分组.数学分析表明,减少接收端的缓存开销有两条途径:一是最小化每条路径的发送队列中积压分组的数量,二是降低分组发送速率.由前者,提出依据每条路径的空闲发送窗口大小进行分组调度的算法SOD(Scheduling On Demand);由后者,提出利用窗口通告机制限制分组发送速率的流控方法.模拟实验结果表明:与现有算法相比,SOD的缓存开销最小;在接收端进行流控限制的情况下,SOD的吞吐率最大,并且在不同实验场景中性能表现稳定.  相似文献   

5.
We study the problem of storing an ordered set on an asynchronous shared memory parallel computer. We examine the case where we want to perform successor (least upper bound) queries efficiently on the set members that are stored. We also examine the case where processors insert and delete members of the set. Due to asynchrony, we require processors to perform queries and to maintain the structure independently. Although several such structures have been proposed, the analysis of these structures has been very limited. We here use the recently proposed QRQW PRAM model to provide upper and lower bounds on the performance of such data structures. In the asynchronous QRQW PRAM, the problem of processors concurrently and independently searching a shared data structure is very similar to the problem of routing packets through a network. Using this as a guide, we introduce the Search-Butterfly, a search structure that combines the efficient packet routing properties of the butterfly graph with the efficient search structure properties of the B-Tree. We analyze the behavior of the Search-Butterfly when the following operations are performed: arbitrary searches, random searches, and random searches, insertions, and deletions. We also provide lower bounds that show that the results are within a factor of O(\log n) of optimal where n is the number of keys in the structure. When the searches are random, the results are within a constant factor of optimal. Many of the proofs are derived from closely related results for packet routing. Others are of independent interest, most notably a method of adding queues to any network belonging to a large class of queuing networks with non-Markovian routing in a manner that allows us to bound the delay experienced by packets in the augmented network. Received October 1996, and in final form July 1997.  相似文献   

6.
Any parallel program has abstractions that are shared by the program's multiple processes. Such shared abstractions can considerably affect the performance of parallel programs, on both distributed and shared memory multiprocessors. As a result, their implementation must be efficient, and such efficiency should be achieved without unduly compromising program portability and maintainability. The primary contribution of the DSA library is its representation of shared abstractions as objects that may be internally distributed across different nodes of a parallel machine. Such distributed shared abstractions (DSA) are encapsulated so that their implementations are easily changed while maintaining program portability across parallel architectures. The principal results presented are: a demonstration that the fragmentation of object state across different nodes of a multiprocessor machine can significantly improve program performance; and that such object fragmentation can be achieved without compromising portability by changing object interfaces. These results are demonstrated using implementations of the DSA library on several medium scale multiprocessors, including the BBN Butterfly, Kendall Square Research, and SGI shared memory multiprocessors. The DSA library's evaluation uses synthetic workloads and a parallel implementation of a branch and bound algorithm for solving the traveling salesperson problem (TSP)  相似文献   

7.
In a Multi-Processor System-on-a-Chip (MPSoC) based on Network-on-Chip (NoC), which processes massive data in a distributed fashion, communication is concentrated on shared memory. This paper proposes an assignment algorithm that can minimize the total power consumption for data communication in executing application programs and a switch structure that can reduce communication congestion resulting from simultaneous accesses to the shared memory. The proposed assignment algorithm gives higher priority to the tasks transferring a larger amount of data to shared memory, so that these tasks can be assigned to the PEs close to shared memory. The proposed switch structure was designed to support multi-port memory, which is often used for shared memory. The ports of the proposed switch are dedicated to be connected with in/out ports of shared memory in order to increase communication bandwidth between PEs and shared memories. By adopting the proposed scheme, the congestion caused by the concentrated requests to the memory can be reduced. Experimental results show that power consumption for transferring data in High-Definition (HD) H.264 decoder, Motion-JPEG decoder, MP3 decoder and 2D Wavelet transform codes has been reduced by 23.9% on the average, when compared with the cases of applying the well-known FC, BN and SA algorithms. The area has been slightly increased by 1.7% compared to conventional NoC structures.  相似文献   

8.
Nonnegative matrix factorization provides a new sight into the observed signals and has been extensively applied in face recognition, text mining and spectral data analysis. Despite the success, it is inefficient for the large-scale data set, due to the notoriously slow convergence of the multiplicative updating method. In this paper, we try to solve the problem through the parallel computing technique. Considering the limitation of the shared memory platform, the parallel algorithms are implemented on the distributed memory platform with the message passing interface library. Moreover, we adopt the two-layer cascade factorization strategy to eliminate the network consumption. The parallel implementations are evaluated on a 16-node Beowulf cluster with two data sets in different scale. The experiments demonstrate that the proposed method is effective in both precision and efficiency.  相似文献   

9.
Addressing the problem of queue scheduling for the packet-switched system is a vital aspect of congestion control. In this paper, the fuzzy logic based decision method is adopted for queue scheduling in order to enforce some level of control for traffic of different quality of service requirements using predetermined values. The fuzzy scheduler proposed in this paper takes into account the dynamic nature of the Internet traffic with respect to its time-varying packet arrival process that affects the network states and performance. Three queues are defined, viz low, medium and high priority queues. The choice of prioritizing packets influences how queues are served. The fuzzy scheduler not only utilizes queue priority in the queue scheduling scheme, but also considers packet drop susceptibility and queue limit. Through simulation it is shown that the fuzzy scheduler is more appropriate for the dynamic nature of Internet traffic in a packet-switched system as compared with some existing queue scheduling methods. Results show that the scheduling strategy of the proposed fuzzy scheduler reduces packet drop, provides good link utilization and minimizes queue delay as compared with the priority queuing (PQ), first-in-first-out (FIFO), and weighted fair queuing (WFQ).  相似文献   

10.
A novel architecture to accelerate a neocortex inspired cognitive model is presented. The architecture utilizes a collection of context switchable processing elements (PEs). This enables time multiplexing of nodes in the model onto available PEs. A streaming memory system is designed to enable high-throughput computation and efficient use of memory resources. Several scheduling algorithms were examined to efficiently assign network nodes to the PEs. Multiple parallel FPGA-accelerated implementations were evaluated on a Cray XD1. Networks of varying complexity were tested and indicate that hardware acceleration can provide an average throughput gain of 184 times over equivalent parallel software implementations.  相似文献   

11.
Until now, most results reported for parallelism in production systems (rulebased systems) have been simulation results-very few real parallel implementations exist. In this paper, we present results from our parallel implementation of OPS5 on the Encore multiprocessor. The implementation exploits very finegrained parallelism to achieve significant speed-ups. For one of the applications, we achieve 12.4 fold speed-up using 13 processes. Our implementation is also distinct from other parallel implementations in that we parallelize a highly optimized C-based implementation of OPS5. Running on a uniprocessor, our C-based implementation is 10–20 times faster than the standard lisp implementation distributed by Carnegie Mellon University. In addition to presenting the performance numbers, the paper discusses the details of the parallel implementation-the data structures used, the amount of contention observed for shared data structures, and the techniques used to reduce such contention.  相似文献   

12.
为控制P2P流量,本文从数据缓冲区使用的实时状态出发,提出了一种基于模糊神经网络的拥塞控制模型,该模型把缓冲区划分为两个队列分别存放P2P和非P2P的数据包,通过模糊神经网络预测评估缓冲区队列的拥塞状况,并建立一个评估函数对各队列的空间分配作出指导,使得能够控制各队列的拥塞状况,并动态的调整缓冲区队列的分配,在缓冲区溢出前主动丢包,避免缓冲区锁定。模拟实验的结果表明,该模型在保证网络资源分配的公平性方面取得了较好的效果,它降低了数据包排队延时和丢包率,提高了路由器处理网络拥塞的能力。  相似文献   

13.
We present an efficient and practical lock-free implementation of a concurrent priority queue that is suitable for both fully concurrent (large multi-processor) systems as well as pre-emptive (multi-process) systems. Many algorithms for concurrent priority queues are based on mutual exclusion. However, mutual exclusion causes blocking which has several drawbacks and degrades the overall performance of the system. Non-blocking algorithms avoid blocking, and several implementations have been proposed. Previously known non-blocking algorithms of priority queues did not perform well in practice because of their complexity, and they are often based on non-available atomic synchronization primitives. Our algorithm is based on the randomized sequential list structure called Skiplist, and a real-time extension of our algorithm is also described. In our performance evaluation we compare our algorithm with a well-representable set of earlier known implementations of priority queues. The experimental results clearly show that our lock-free implementation outperforms the other lock-based implementations in practical scenarios for 3 threads and more, both on fully concurrent as well as on pre-emptive systems.  相似文献   

14.
Network virtualization provides the ability to run multiple concurrent virtual networks over a shared substrate. However, it is challenging to design such a platform to host multiple heterogenous and often highly customized virtual networks. Not only high degree of flexibility is desired for virtual networks to customize their functions, fast packet forwarding is also required. This paper presents PdP, a flexible network virtualization platform capable of achieving high speed packet forwarding. A PdP node has multiple machines to perform packet processing for virtual networks hosted in the system. To forward packets in high speed, the data plane of a virtual network in PdP can be allocated with multiple forwarding machines to process packets in parallel. Furthermore, a virtual network in PdP can be fully customized. Both the control plane and data plane of a virtual network run in virtual machines so as to be isolated from other virtual networks. We have built a proof-of-concept prototyping PdP platform using off-the-shelf commodity hardware and open source software. The performance evaluation results show that our system can closely match the best-known packet forwarding speed of software router running in commodity hardware.  相似文献   

15.
A portable parallelization of the Cooley–Tukey FFT algorithm for MIMD multiprocessors is presented. The implementation uses the virtual machine for multiprocessors (VMMP) and PVM portable software packages. Since VMMP provides the same set of services on all target machines, a single version of the parallel FFT code was used for shared memory (25-processor Sequent Symmetry), shared bus (MOS-running distributed UNIX) and distributed memory multiprocessor (transputer network and 64-processor IBM SP2). It is accompanied with detailed performance analysis of the implementations. The algorithm achieved high efficiencies on all target machines. The analysis indicates that most overheads are caused by the target architecture and not by VMMP or PVM inefficiencies. The portability analysis of the FFT provides several important insights. On the message passing architecture, the parallel FFT algorithm can obtain linearly increasing speedup with respect to the number of processors with only a moderate increase in the problem size. The parallel FFT can be executed by any number of processors, but generally the number of processors is much less than the length of the input data. The results indicate that the parallel FFT is portable: it achieves very good speedups on either a shared memory multiprocessor with high memory bandwidth or on a message passing multiprocessor without any change in the programs. © 1998 John Wiley & Sons, Ltd.  相似文献   

16.
设计一种基于网络处理的多核共享SDRAM控制器,提出分层优先级仲裁算法以提高多核访问共享内存的效率,针对IP包处理特点,给出一种基于指令控制的块数据传输机制来缩短IP包的读写延迟。在FPGA平台上进行验证,结果表明,当处理长度为64 Byte的IP包时,SDRAM控制器的读写效率能提高55%以上。  相似文献   

17.
Irregular parallel algorithms pose a significant challenge for achieving high performance because of the difficulty predicting memory access patterns or execution paths. Within an irregular application, fine-grained synchronization is one technique for managing the coordination of work; but in practice the actual performance for irregular problems depends on the input, the access pattern to shared data structures, the relative speed of processors, and the hardware support of synchronization primitives. In this paper, we focus on lock-free and mutual exclusion protocols for handling fine-grained synchronization. Mutual exclusion and lock-free protocols have received a fair amount of attention in coordinating accesses to shared data structures from concurrent processes. Mutual exclusion offers a simple programming abstraction, while lock-free data structures provide better fault tolerance and eliminate problems associated with critical sections such as priority inversion and deadlock. These synchronization protocols, however, are seldom used in parallel algorithm designs, especially for algorithms under the SPMD paradigm, as their implementations are highly hardware dependent and their costs are hard to characterize. Using graph-theoretic algorithms for illustrative purposes, we show experimental results on two shared-memory multiprocessors, the IBM pSeries 570 and the Sun Enterprise 4500, that irregular parallel algorithms with efficient fine-grained synchronization may yield good performance.  相似文献   

18.
With the advancements in multimedia and networking technologies, many distributed multimedia applications have been developed recently. These new applications rely on the underlying network support to transmit the multimedia real-time data between sites. Multimedia Transport Protocol (MMTP) is an experimental protocol designed for transmission of various multimedia data. MMTP uses multiple priority queues to support different levels of service requirements, and it discards packets from the transmission queue to reduce the network loading, and to ensure packets transmitted will meet the real-time constraint required by the data streams.  相似文献   

19.
Active Queue Management is a convenient way to administer the network load without increasing the complexity of end-user protocols. Current AQM techniques work in two ways; the router either drops some of its packets with a given probability or creates different queues with corresponding priorities. Head-to-Tail introduces a novel AQM approach: the packet rearrange scheme. Instead of dropping, HtT rearranges packets, moving them from the head of the queue to its tail. The additional queuing delay triggers a sending rate decrease and congestion events can be avoided. The HtT scheme avoids explicit packet drops and extensive retransmission delays. In this work, we detail the HtT algorithm and demonstrate when and how it outperforms current AQM implementations. We also approach analytically its impact on packet delay and conduct extensive simulations. Our experiments show that HtT achieves better results than Droptail and RED methods in terms of retransmitted packets and Goodput.  相似文献   

20.
《Computer Networks》2003,41(5):563-586
Network processors (NPs) are an emerging field of programmable processors that are optimized to implement data plane packet processing networking functions. Unlike the general-purpose CPUs that rely heavily on caching for improving performance, the lack of locality in packet processing and need for high-performance I/O have forced designers to come up with innovative architectures that can hide memory latency while still processing packets at high data rates. Most of these NPs use some type of multiprocessing in combination with a hierarchy of memory types to achieve high performance. In addition, to keep up with packets arriving at high data rates over multiple incoming media interfaces, an NP must perform fast I/O and memory operations such as packet storage, table lookup, and extraction of fields in packet headers. We describe an architecture that uses a combination of distributed memory architecture and one or more multithreaded processors to achieve the necessary performance. We describe the challenges in programming such a processor including the issues related to consistency and maintaining packet ordering. We also present a programming model for generic network applications that uses software pipelines. We then demonstrate the use of the programming model in implementing two applications, namely, mapping traffic management algorithms onto a multithreaded architecture and an implementation of a media gateway based on voice-over-AAL2.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号