49 results found (search time: 0 ms)
31.
Lock synchronization overheads may be significant in a shared-memory multiprocessor system-on-a-chip (SoC) implementation. These overheads are observed in terms of lock latency, lock delay and memory bandwidth consumption in the system. There has been much previous work to speed up access to lock variables via specialized caches [1], software queues [2]–[5] and delayed loops, e.g., exponential backoff [2]. However, in the context of SoC, these previously reported techniques all have drawbacks not present in our technique. We present a novel, efficient, small and very simple hardware unit, SoC Lock Cache (SoCLC), which resolves the critical section (CS) interactions among multiple processors and improves the performance criteria in terms of lock latency, lock delay and bandwidth consumption in a shared-memory multiprocessor SoC. Our mechanism is capable of handling short CSs as well as long CSs. This combined support has been established at both the hardware architecture level and the software architecture level, including the real-time operating system (RTOS) kernel level facilities (such as support for preemptive versus non-preemptive synchronization, scheduling of lock variable accesses, interrupt handling and RTOS initialization). The experimental results of a microbenchmark program, which simulates an application with high-contention critical section accesses under a four-processor platform with shared memory, showed an overall speedup of 55%. Furthermore, a database application example with client–server pairs of tasks, run on the same platform, showed that our mechanism achieved an overall speedup of 27%.
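The exponential backoff technique that the abstract above cites as prior software work can be illustrated with a minimal Python sketch. It is not the SoCLC mechanism and not code from the paper: the class name `BackoffLock`, the helper `run_workers`, and the delay bounds are hypothetical, and a non-blocking `threading.Lock.acquire` stands in for a hardware test-and-set instruction.

```python
import random
import threading
import time


class BackoffLock:
    """Test-and-set style lock that backs off exponentially under contention.

    Hypothetical illustration only; the delay bounds are arbitrary assumptions.
    """

    def __init__(self, min_delay=1e-6, max_delay=1e-3):
        # threading.Lock stands in for an atomic test-and-set word.
        self._flag = threading.Lock()
        self.min_delay = min_delay
        self.max_delay = max_delay

    def acquire(self):
        delay = self.min_delay
        # Each non-blocking acquire plays the role of one test-and-set attempt.
        while not self._flag.acquire(blocking=False):
            # Wait a random time up to the current limit, then double the limit.
            time.sleep(random.uniform(0, delay))
            delay = min(2 * delay, self.max_delay)

    def release(self):
        self._flag.release()


def run_workers(num_workers=4, increments=1000):
    """Let several threads increment a shared counter inside the critical section."""
    lock = BackoffLock()
    counter = [0]

    def worker():
        for _ in range(increments):
            lock.acquire()
            try:
                counter[0] += 1
            finally:
                lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

Doubling the wait limit after each failed attempt spreads retries out in time, which reduces traffic on the lock variable at the cost of a longer handover when the lock is released.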
32.
The performance of the Global Array shared-memory nonuniform memory-access programming model is explored in a wide-area-network (WAN) distributed supercomputer environment. The Global Array model is extended by introducing a concept of mirrored arrays that, thanks to the caching and user-controlled consistency of the shared data structures, can reduce the application sensitivity to the network latency. Latencies and bandwidths for remote memory access are studied, and the performance of a large application from computational chemistry is evaluated using both fully distributed and mirrored arrays. Excellent performance can be obtained with mirroring if even modest (0.5 MB/s) network bandwidth is available.
33.
In distributed shared-memory (DSM) multiprocessors, a write operation requires multiple messages to invalidate the nodes which share and cache the memory block being written. The consequent write stall time impedes the performance of such systems. An effective means of achieving efficient invalidation is to employ multicast messages to reach the sharing nodes. This study evaluates two multicast-based invalidation schemes, dual-path and pruning, by performing application-driven simulation. Under the experimental settings used herein, multicasts improve invalidation traffic for four of the six evaluated real applications. The remaining two applications are computationally intensive, and multicast-based invalidation is less effective. However, since multicasts encourage bursty communication, our results indicate that they help relieve network congestion during these periods. Dual-path performs slightly better than pruning, because it is less sensitive to routing delay in the routers. Our results further demonstrate that cache size is an important design parameter for multicast-based invalidation, and that multicast-based invalidation is highly effective for DSM multiprocessors with larger caches.
34.
《International Journal of Parallel, Emergent and Distributed Systems》2012,27(1-2):49-58
NEUCOMP2 is a parallel Neural Network Compiler for a shared-memory parallel machine. It compiles a program written as a list of mathematical specifications of Neural Network (NN) models and then translates it into a chosen target program which contains parallel code. Performance results for character recognition problems on popular NN models are presented. The models are the backpropagation, Kohonen, Counterpropagation and ART1 network models. NEUCOMP2 was developed and run on the SEQUENT Balance 8000 computer system at PARC.
35.
Sandeep N. Bhatt Gianfranco Bilardi Kieran T. Herley Geppino Pucci Abhiram Ranade 《Journal of Parallel and Distributed Computing》1998,51(2):75-88
The list marking problem involves marking the nodes of an ℓ-node linked list stored in the memory of a (p, n)-PRAM, when only the position of the head of the list is initially known, while the remaining list nodes are stored in arbitrary memory locations. Under the assumption that cells containing list nodes bear no distinctive tags distinguishing them from other cells, we establish an Ω(min{ℓ, n/p}) randomized lower bound for ℓ-node lists and present a deterministic algorithm whose running time is within a logarithmic additive term of this bound. Such a result implies that randomization cannot be exploited in any significant way in this setting. For the case where list cells are tagged in a way that differentiates them from other cells, the above lower bound still applies to deterministic algorithms, while we establish a tight bound for randomized algorithms. Therefore, in the latter case, randomization yields a better performance for a wide range of parameter values.
36.
37.
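For Fibre Channel (FC) switches, a shared-memory-based switching fabric is studied and designed. The key techniques of the FC switching unit, including its composition, scheduling, internal frames, multicast and flow control, are described. The design was implemented and verified on a Xilinx platform, which confirmed the effectiveness of the switching unit. This work lays a solid foundation for developing independently designed FC switch products, and it also offers a useful reference for implementing various network protocols on FPGAs.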
38.
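Memory consistency models have an important influence on the correctness and performance of shared-memory systems and on the complexity of programs. Focusing on distributed shared-memory systems, this paper proposes a new framework for memory consistency models, the S^3C framework. The framework uses the concept of synchronization points to describe the correct ordering of memory access events under different models. Through the concept of consistency maintenance points, it can also distinguish and compare different implementations of the same model. Building on the S^3C framework, the paper proposes an operating-system-centric thread consistency model and discusses the correct implementation of memory consistency models, as represented by sequential consistency.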
39.
Takahashi Daisuke Sato Mitsuhisa Boku Taisuke 《International journal of parallel programming》2003,31(3):185-196
This paper reports the performance of a single node of the Hitachi SR8000 when using SPEC OMP2001 benchmarks. Each processing node of the SR8000 is a shared-memory parallel computer composed of eight scalar processors with a pseudo-vector processing feature. We ran all of the SPEC OMP2001 benchmarks on the SR8000. According to the results of this performance measurement, we found that the SR8000 has good scalability continuing up to 8 processors except for a few benchmark programs. The performance results demonstrate that the SR8000 achieves high performance especially for memory-intensive applications.
40.
D. C. S. Allison K. M. Irani C. J. Ribbens L. T. Watson 《The Journal of supercomputing》1992,5(4):347-366
Results are reported for a series of experiments involving numerical curve tracking on a shared-memory parallel computer. Several algorithms exist for finding zeros or fixed points of nonlinear systems of equations that are globally convergent for almost all starting points, that is, with probability one. The essence of all such algorithms is the construction of an appropriate homotopy map and then the tracking of some smooth curve in the zero set of this homotopy map. HOMPACK is a mathematical software package implementing globally convergent homotopy algorithms with three different techniques for tracking a homotopy zero curve, and has separate routines for dense and sparse Jacobian matrices. The HOMPACK algorithms for sparse Jacobian matrices use a preconditioned conjugate gradient algorithm for the computation of the kernel of the homotopy Jacobian matrix, a required linear algebra step for homotopy curve tracking. A parallel version of HOMPACK is implemented on a shared-memory parallel computer with various levels and degrees of parallelism (e.g., linear algebra, function, and Jacobian matrix evaluation), and a detailed study is presented for each of these levels with respect to the speedup in execution time obtained with the parallelism, the time spent implementing the parallel code, and the extra memory allocated by the parallel algorithm.
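The idea of building a homotopy map and following its zero curve, as described in the abstract above, can be illustrated with a deliberately simplified scalar sketch. It is not one of HOMPACK's three tracking techniques and not code from the paper: it assumes a convex-combination homotopy ρ(λ, x) = λ f(x) + (1 − λ)(x − a), steps λ uniformly from 0 to 1, and corrects x with Newton's method at each step. The function name `track_zero`, the step counts, and the test function are hypothetical.

```python
def track_zero(f, df, a, steps=50, newton_iters=20, tol=1e-12):
    """Follow the zero curve of rho(lam, x) = lam*f(x) + (1 - lam)*(x - a).

    At lam = 0 the zero is x = a; at lam = 1 it is a zero of f.
    Simplified natural-parameter continuation for a scalar problem; it
    assumes the curve has no turning points in lam (an assumption that
    production curve trackers do not need).
    """
    x = a
    for k in range(1, steps + 1):
        lam = k / steps
        # Newton corrector in x for the current value of lam,
        # started from the zero found at the previous lam.
        for _ in range(newton_iters):
            rho = lam * f(x) + (1 - lam) * (x - a)
            drho = lam * df(x) + (1 - lam)
            dx = rho / drho
            x -= dx
            if abs(dx) < tol:
                break
    return x


# Example: f(x) = x^3 + x - 2 has its only real zero at x = 1.
root = track_zero(lambda x: x**3 + x - 2, lambda x: 3 * x**2 + 1, a=0.0)
```

For this example the derivative of ρ with respect to x stays positive for every λ in [0, 1], so the zero curve is a single monotone branch and the simple stepping scheme suffices.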