Similar literature
 Found 20 similar documents; search took 31 ms
1.
《Microelectronics Reliability》2014,54(11):2621-2628
Given their high computational power, General Purpose Graphics Processing Units (GPGPUs) are increasingly preferred to CPUs for computationally intensive applications, not necessarily related to computer graphics. However, their sensitivity to radiation has yet to be fully evaluated. In this context, GPGPU data caches and shared memory play a key role, since they increase performance by sharing data between the parallel resources of a GPGPU and minimizing memory access overhead. In this paper we present three new algorithms designed to support radiation experiments aimed at evaluating the radiation sensitivity of GPGPU data caches and shared memory. We also report the cross-section and Failure In Time results from neutron testing experiments performed on a commercial-off-the-shelf GPGPU using the proposed algorithms, with particular emphasis on the shared memory and on the L1 and L2 data caches.
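The two reported quantities are usually related in a simple way, sketched below under stated assumptions. The abstract gives no formulas or numbers: the sketch assumes the conventional definitions (cross-section = observed errors divided by neutron fluence; Failure In Time = expected failures per 10^9 device-hours) and a reference sea-level neutron flux of roughly 13 n/(cm²·h), which is a commonly quoted figure and not a value taken from the paper. The function names are hypothetical.

```python
def cross_section_cm2(errors: int, fluence_n_per_cm2: float) -> float:
    """Cross-section in cm^2: observed errors divided by neutron fluence."""
    if fluence_n_per_cm2 <= 0:
        raise ValueError("fluence must be positive")
    return errors / fluence_n_per_cm2


def fit_rate(sigma_cm2: float, flux_n_per_cm2_h: float = 13.0) -> float:
    """Failures per 1e9 device-hours at the given neutron flux.

    The default flux is an assumed sea-level reference value,
    not a figure from the abstract.
    """
    return sigma_cm2 * flux_n_per_cm2_h * 1e9


# Illustrative (made-up) beam-test numbers: 50 errors at 1e10 n/cm^2.
sigma = cross_section_cm2(50, 1e10)
fit = fit_rate(sigma)
```

Scaling the beam-test cross-section by the natural flux is what lets a short accelerated experiment stand in for years of field exposure.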

2.
Two major performance bottlenecks in multiprocessor execution of protocols are contention for shared memory and contention for locks. Locks are used to protect shared messages and/or shared protocol state in a memory shared by competing processors. Mutual exclusion by locking can be costly, in terms of both lock contention and memory contention, if the parallel protocol code frequently accesses shared state and data. This paper presents a queueing network model for performance predictions of shared-memory multiprocessor protocol executions. Predictions from this model are compared to performance measurements from a multiprocessor implementation of two commonly used communication protocol stacks, transmission control protocol/Internet protocol (TCP/IP)/Ethernet and user datagram protocol/Internet protocol (UDP/IP)/Ethernet. These stacks are implemented on a parallelized version of the x-kernel protocol environment from the University of Arizona. A "processor-per-message" paradigm is used to partition the load among the processors. The measured speedups for the parallel implementations relative to the sequential ones are more than 11 times for UDP (using 20 processors) and three times for TCP (using five processors) on a Sequent Symmetry. We show that the model accurately captures the effects of lock and memory contention in our shared-memory multiprocessor and predicts the performance with a discrepancy of less than 10%.

3.
Michel Raynal 《电信纪事》1993,48(5-6):260-267
The advent of distributed memory parallel machines has made it feasible to implement the shared virtual memory concept in a distributed context. This paper presents a complementary aspect of such an approach, namely protocols that implement a basic centralized synchronization tool: the semaphore. Provided with implementations of the shared virtual memory and semaphore concepts, a programmer can use the classical programming model based on processes and shared variables, and then execute her program either on a shared memory multiprocessor or on a distributed memory parallel machine.

4.
The configuration of an asynchronous transfer mode (ATM) switch architecture using a shared buffer memory switch (SBMS) is discussed. The scaling factors of the ATM switching network under a condition of mixed applications, including a conventional mix and telecommunication with video, are analyzed. The use of the SBMS as the unit switch for a multistage switching network is examined. A prototype system and its performance evaluation and experimental data are presented. The data indicate excellent performance under a burst cell arrival condition. The buffer size of the SBMS can be reduced in comparison with that of an individual (nonshared) buffer memory switch. A configuration for a large-scale ATM switching network with multistage switches is proposed.

5.
Methods of parallelising grey-scale coordinate transforms for medium-grained asynchronous parallel processors are considered. The thrice-skew rotation transform, polar and log-polar transforms can all be parallelised by an image-strip method. Where communication bandwidth is limited or some form of shared memory is available, an alternative method of parallelisation is presented. Details of computational enhancements and of interpolation considerations are provided. Specialised rotation algorithms are also discussed.

6.
张凌洁  赵英 《电子设计工程》2012,20(17):15-18,22
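The Floyd-Warshall algorithm is the classic algorithm for the all-pairs shortest paths (APSP) problem in graph theory. To speed up the computation, this article proposes implementing it with general-purpose GPU computing. Starting from the principles of the algorithm, the article develops a parallel F-W algorithm that can run on the GPU. It then builds on matrix blocking and the use of GPU shared memory to implement an improved GPU-parallel F-W algorithm. Extensive tests show that the GPU-parallel program achieves a speedup of more than 100 times over a conventional CPU parallel program.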
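The matrix-blocking idea can be illustrated with a minimal sequential sketch. This is not the article's GPU code: it is the standard three-phase tiled Floyd-Warshall scheme written in plain Python, with hypothetical function names and an arbitrary tile size `b`. On a GPU, each tile touched by `_relax_tile` would typically be staged in shared memory, and the tiles within phase 2 and phase 3 would be processed in parallel.

```python
import math


def floyd_warshall(dist):
    """Plain O(n^3) Floyd-Warshall, used here as a reference."""
    n = len(dist)
    d = [row[:] for row in dist]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


def _relax_tile(d, n, bi, bj, bk, b):
    """Relax tile (bi, bj) through the intermediate vertices of block bk."""
    for k in range(bk * b, min((bk + 1) * b, n)):
        for i in range(bi * b, min((bi + 1) * b, n)):
            for j in range(bj * b, min((bj + 1) * b, n)):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]


def blocked_floyd_warshall(dist, b=2):
    """Tiled Floyd-Warshall with tile size b (an illustrative choice)."""
    n = len(dist)
    d = [row[:] for row in dist]
    nb = (n + b - 1) // b  # number of tiles per dimension
    for r in range(nb):
        # Phase 1: the diagonal tile depends only on itself.
        _relax_tile(d, n, r, r, r, b)
        # Phase 2: tiles sharing a row or column with the diagonal tile.
        for t in range(nb):
            if t != r:
                _relax_tile(d, n, r, t, r, b)
                _relax_tile(d, n, t, r, r, b)
        # Phase 3: all remaining tiles, mutually independent.
        for i in range(nb):
            for j in range(nb):
                if i != r and j != r:
                    _relax_tile(d, n, i, j, r, b)
    return d


INF = math.inf
graph = [
    [0, 3, INF, 7, INF],
    [8, 0, 2, INF, INF],
    [5, INF, 0, 1, INF],
    [2, INF, INF, 0, 4],
    [INF, 1, INF, INF, 0],
]
shortest = blocked_floyd_warshall(graph, b=2)
```

The point of the tiling is data reuse: each phase works on a few small tiles at a time, which is what makes a fast on-chip memory effective.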

7.
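The greatest challenge in getting multi-core processors to deliver their full parallel computing performance is the parallel programming model. Parallel threads currently use locks to synchronize with one another, but locks introduce errors such as deadlock, and their performance is hard to optimize. The transactional memory model treats a series of shared-memory operations as a transaction and guarantees its atomicity, consistency, and isolation. It can replace lock constructs, simplify the programming model, and improve parallel computing performance. This article introduces the structure of a software transactional memory model, Buffering Software Transactional Memory (BSTM), which mainly relies on write buffering to simplify the design of the transaction model. Experimental results show that this model has certain advantages.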
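Write buffering in general can be illustrated with a small sketch. It is a generic toy design, not the BSTM structure from the article, and every name in it (`TVar`, `Transaction`, `atomically`) is hypothetical. Writes are kept in a private buffer, reads record the version they observed, and a commit validates those versions under one global lock before publishing the buffered writes; a failed validation simply retries the transaction.

```python
import threading


class TVar:
    """A shared variable with a version counter bumped on every commit."""

    def __init__(self, value):
        self.value = value
        self.version = 0


class Conflict(Exception):
    """Raised when a transaction saw data that has since changed."""


_commit_lock = threading.Lock()  # single global lock, for simplicity only


class Transaction:
    def __init__(self):
        self.reads = {}   # TVar -> version observed at first read
        self.writes = {}  # TVar -> buffered (not yet visible) value

    def read(self, var):
        if var in self.writes:  # read-your-own-writes from the buffer
            return self.writes[var]
        with _commit_lock:
            value, version = var.value, var.version
        if self.reads.setdefault(var, version) != version:
            raise Conflict()
        return value

    def write(self, var, value):
        self.writes[var] = value  # shared memory is untouched until commit

    def commit(self):
        with _commit_lock:
            for var, version in self.reads.items():
                if var.version != version:
                    raise Conflict()
            for var, value in self.writes.items():
                var.value = value
                var.version += 1


def atomically(fn):
    """Run fn(tx) repeatedly until it commits without a conflict."""
    while True:
        tx = Transaction()
        try:
            result = fn(tx)
            tx.commit()
            return result
        except Conflict:
            continue


a, b = TVar(100), TVar(0)


def transfer(tx):
    x, y = tx.read(a), tx.read(b)
    tx.write(a, x - 1)
    tx.write(b, y + 1)


def worker():
    for _ in range(50):
        atomically(transfer)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because an aborted transaction never touched shared state, aborting needs no undo log; the buffer is just discarded. That is the simplification write buffering buys, at the cost of an extra lookup on each read.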

8.
A survey of techniques for solving geometric problems in parallel is given, both for shared memory parallel machines and for networks of processors. Parallel models are reviewed, and basic subproblems that tend to arise in the solution of geometric problems on any parallel model are discussed. PRAM techniques, techniques for mesh-connected arrays of processors, and the hybrid RAM/ARRAY model and its connection to I/O complexity are considered. Open problems are also discussed, as well as directions for future research.

9.
李爱玲  王璐  彭云峰 《电子器件》2012,35(4):453-456
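To improve the execution efficiency of parallel applications on heterogeneous platforms, parallel components are classified by paradigm and granularity and corresponding models are designed, so that paradigms of all kinds can run: serial, message-passing parallel, or shared-memory parallel, at coarse, medium, or fine granularity. Paradigms can also be programmed according to the programming language of the component. Based on the description of component paradigm and granularity and on resource-usage information, a component scheduling strategy is further proposed. Tests show that the component model and the scheduling strategy improve the execution of parallel applications and raise resource utilization on heterogeneous platforms.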

10.
We describe an integrated compile time and run time system for efficient shared memory parallel computing on distributed memory machines. The combined system presents the user with a shared memory programming model. The run time system implements a consistent shared memory abstraction using memory access detection and automatic data caching. The compiler improves the efficiency of the shared memory implementation by directing the run time system to exploit the message passing capabilities of the underlying hardware. To do so, the compiler analyzes shared memory accesses and transforms the code to insert calls to the run time system that provide it with the access information computed by the compiler. The run time system is augmented with the appropriate entry points to use this information to implement bulk data transfer and to reduce the overhead of run time consistency maintenance. In those cases where the compiler analysis succeeds for the entire program, we demonstrate that the combined system achieves performance comparable to that produced by compilers that directly target message passing. If the compiler analysis is successful only for parts of the program, for instance, because of irregular accesses to some of the arrays, the resulting optimizations can be applied to those parts for which the analysis succeeds. If the compiler analysis fails entirely, we rely on the run time maintenance of shared memory and thereby avoid the complexity and the limitations of compilers that directly target message passing. The result is a single system that combines efficient support for both regular and irregular memory access patterns.

11.
A multicast replication algorithm is proposed for shared memory switches. It uses a dedicated FIFO to multicast by replicating cells at the receiver, and the FIFO operates in parallel with the shared memory. Speedup is used to improve loss and delay performance. A new queueing analytical model is developed based on a sub-timeslot approach. The system performance in terms of cell loss and delay is analyzed and verified by simulation.

12.
Memory errors can occur in the stages of a pipelined analog-to-digital converter (ADC) due to several effects. These include capacitor dielectric absorption/relaxation, incomplete stage reset at high clock rates, and parasitic capacitance effects when opamps are shared between subsequent pipeline stages. This paper describes these sources of memory errors and the effect they have on overall ADC linearity. It is shown how these errors relate to and differ from interstage gain errors. Two new calibration algorithms are proposed that correct for memory errors by digital post-processing of the ADC output. Both algorithms operate in the background and so do not require conversion to be interrupted in order to track changes due to temperature and supply variations. The two algorithms are compared in terms of their system costs and their dependence on input signal statistics.

13.
The class of switches with shareable parallel memory modules includes those switches that use parallel memory modules which are physically separate but logically shared. The two main classes of such architectures, namely the Shared Multibuffer (SMB) based switch and the Sliding-Window (SW) based packet switch, both deploy shareable parallel memory modules; however, they differ in the switching scheme they use to store incoming packets and transfer packets among different switch ports. In this letter, we investigate and compare the performance of the switching schemes deployed by these two classes of switching architectures. We compare the throughput and packet loss performance of these two switches under conditions of identical traffic type, switch configuration and memory resource deployed.

14.
In this paper, we introduce and evaluate the parallel implementations of two video-sequence decorrelation algorithms developed based on the non-alternating three-dimensional wavelet transform (3D-WT) and the temporal-window method. The proposed algorithms have been proven to outperform the classic 3D-WT algorithm in terms of better coding efficiency and lower computational requirements while enabling lossless coding and top-quality reconstruction: the two features most relevant to medical imaging applications. The parallel implementations of the algorithms are developed and tested on a shared memory system, an SGI Origin 3800 supercomputer, making use of a message-passing paradigm. We evaluate and analyze the performance of the implementations in terms of the response time and speed-up factor by varying the number of processors and various video coding parameters. The key point enabling the development of highly efficient implementations relies on a workload distribution strategy supplemented by the use of parallel I/O primitives, to better exploit the inherent features of the application and computing platform. Two sets of I/O primitives are tested and evaluated: the ones provided by the C compiler and the ones belonging to the MPI/IO library.

15.
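Research on the parallelization of echo simulation for distributed small-satellite SAR   Total citations: 1, self-citations: 0, citations by others: 1
Echo simulation for distributed small-satellite SAR systems is enormous in both computation and storage. This paper analyzes echo simulation and its fast algorithm and proposes two parallel task-decomposition methods, one based on decomposing the simulation time and one based on decomposing the scene. Performance analysis shows that the former significantly improves the speedup efficiency of the algorithm, while the latter effectively overcomes memory shortage when processing large scenes. Finally, a set of simulation experiments on a small cluster system confirms the feasibility and effectiveness of both methods.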

16.
Local processing, which is a dominant type of processing in image and video applications, requires huge computational power to be performed in real time. However, processing locality, in space and/or in time, makes it possible to exploit data parallelism and data reuse. Although these properties can be exploited to achieve high performance image and video processing in multi-core processors, it is necessary to develop suitable models and parallel algorithms, in particular for non-shared memory architectures. This paper proposes an efficient and simple model for local image and video processing on non-shared memory multi-core architectures. This model adopts a single program multiple data approach, where data is distributed, processed and reused in an optimal way with regard to the data size, the number of cores and the local memory capacity. The model was experimentally evaluated by developing local video processing algorithms and programming the Cell Broadband Engine multi-core processor, namely for advanced video motion estimation and in-loop deblocking filtering. Furthermore, based on these experiences, the main challenges of vectorization and the reduction of branch mispredictions and computational load imbalances are also addressed. The limits and advantages of the regular and adaptive algorithms are also discussed. Experimental results show the adequacy of the proposed model for local video processing, and that real-time performance is achieved even for the most demanding parts of advanced video coding. Full-pixel motion estimation is performed over high resolution video (720×576 pixels) at a rate of 30 frames per second, considering large search areas and five reference frames.

17.
Multi-access networks are considered in which the shared channel is noisy. The authors assume a slotted-time collision-type channel, Poisson infinite-user model, and binary feedback. Due to the noise in the shared channel, the received signal may be detected as a collision even though no message or a single message is transmitted. This kind of imperfect feedback is referred to as error. A common assumption in all previous studies of multi-access algorithms in channels with errors is that the channel is memoryless. The authors consider the problem of splitting algorithms when the channel has memory. They introduce a two-state, first-order Markovian model for the channel and analyze the operation of the tree collision-resolution algorithm in this channel. They obtain a stability result, i.e., the necessary conditions on the channel parameters for stability of the algorithm. Assuming that the stability conditions hold, they calculate the throughput of the algorithm. Extensions to more general channel models are discussed.

18.
Matching output queueing with a combined input/output-queued switch   Total citations: 19, self-citations: 0, citations by others: 19
The Internet is facing two problems simultaneously: there is a need for a faster switching/routing infrastructure and a need to introduce guaranteed qualities-of-service (QoS). Each problem can be solved independently: switches and routers can be made faster by using input-queued crossbars instead of shared memory systems; QoS can be provided using weighted-fair queueing (WFQ)-based packet scheduling. Until now, however, the two solutions have been mutually exclusive: all of the work on WFQ-based scheduling algorithms has required that switches/routers use output-queueing or centralized shared memory. This paper demonstrates that a combined input/output-queueing (CIOQ) switch running twice as fast as an input-queued switch can provide precise emulation of a broad class of packet-scheduling algorithms, including WFQ and strict priorities. More precisely, we show that for an N×N switch, a "speedup" of 2-1/N is necessary, and a speedup of two is sufficient for this exact emulation. Perhaps most interestingly, this result holds for all traffic arrival patterns. On its own, the result is primarily a theoretical observation; it shows that it is possible to emulate purely OQ switches with CIOQ switches running at approximately twice the line rate. To make the result more practical, we introduce several scheduling algorithms that, with a speedup of two, can emulate an OQ switch. We focus our attention on the simplest of these algorithms, critical cells first (CCF), and consider its running time and implementation complexity. We conclude that additional techniques are required to make the scheduling algorithms implementable at a high speed and propose two specific strategies.

19.
Thermal properties of high-power transistors   Total citations: 3, self-citations: 0, citations by others: 3
The temperature of a transistor can be determined from the emitter-base voltage versus collector-current characteristic. This characteristic was used for studying the stability of parallel pairs of high-frequency high-power transistors. The thermal effect may cause the incremental emitter-base resistance to assume a negative value. This, in turn, will cause the current flow in a pair of transistors to be asymmetrical. The transition from symmetrical to asymmetrical current flow occurs at a power level which is determined by the nonshared thermal and electrical resistances. Stability to a higher current level can be obtained by increasing the nonshared emitter or base resistances or reducing the collector voltage. Higher currents can also be obtained by reducing the nonshared thermal resistances, which indicates that close thermal coupling between the two units is desirable.

20.
Automatic generation of prime length FFT programs   Total citations: 2, self-citations: 0, citations by others: 2
Describes a set of programs for circular convolution and prime length fast Fourier transforms (FFTs) that are relatively short, possess great structure, share many computational procedures, and cover a large variety of lengths. The programs make clear the structure of the algorithms and clearly enumerate independent computational branches that can be performed in parallel. Moreover, each of these independent operations is made up of a sequence of suboperations that can be implemented as vector/parallel operations. This is in contrast with previously existing programs for prime length FFTs: they consist of straight line code, no code is shared between them, and they cannot be easily adapted for vector/parallel implementations. The authors have also developed a program that automatically generates these programs for prime length FFTs. This code-generating program requires information only about a set of modules for computing cyclotomic convolutions.
