1.

A microprogrammed occam interpreter for the HLH Orion

R. E. M. Cooper G. Jones 《Software》1988,18(1):63-71

A micro-coded interpreter has been written for the occam programming language that runs on the fully microprogrammable HLH Orion mini-computers. The resulting software gives performance which is comparable with native-code operation on similar hardware, and is able to be extended easily to support new instructions as the language is developed further.

2.

Comparative timings of three different set implementations in occam

G. A. Wilson 《Software》1989,19(3):273-281

Three different occam implementations of a Set datatype have been investigated, using arrays of bits, booleans (bytes) and full integers, and the performance of each compared. Execution times and code/data requirements are recorded, and surprisingly the best implementation is not as originally expected.
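The three representations named in this abstract can be illustrated with a minimal sketch. It is written in Python rather than occam, the class names `BitSet` and `ByteSet` are hypothetical, and it is not the paper's code; it only shows the difference between packing one element per bit and spending one byte per element. It says nothing about which variant the paper found fastest.

```python
class BitSet:
    """Set of integers in range(capacity), packed one element per bit."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.bits = 0  # bit i is 1 when i is a member

    def add(self, x):
        if not 0 <= x < self.capacity:
            raise ValueError("element out of range")
        self.bits |= 1 << x

    def discard(self, x):
        if 0 <= x < self.capacity:
            self.bits &= ~(1 << x)

    def __contains__(self, x):
        return 0 <= x < self.capacity and (self.bits >> x) & 1 == 1


class ByteSet:
    """Same interface, but one byte (used as a boolean) per element."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.flags = bytearray(capacity)  # flags[i] is 1 when i is a member

    def add(self, x):
        if not 0 <= x < self.capacity:
            raise ValueError("element out of range")
        self.flags[x] = 1

    def discard(self, x):
        if 0 <= x < self.capacity:
            self.flags[x] = 0

    def __contains__(self, x):
        return 0 <= x < self.capacity and self.flags[x] == 1
```

A representation with one full integer per element would look like `ByteSet` with a list of integers in place of the `bytearray`. The bit-packed form uses the least data space but needs shift and mask operations on every access.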

3.

Replay-based debugging of occam programs

A. Cimitile U. De Carlini U. Villano 《Software Testing, Verification and Reliability》1993,3(2):83-100

Parallel programs are intrinsically non-deterministic, and therefore the techniques of cyclical debugging that are commonly used for sequential programs are not suitable for parallel ones. This paper proposes a method to reproduce Occam program behaviour. Saving information on the timer values input by the program and the guards selected at run-time on alternative commands allows program replay, i.e. it makes it possible to re-execute the program deterministically with the same inputs following the same instruction path. This enables the software developer to use tools such as debuggers and intrusive monitors to help identify program faults. After discussing possible implementations of the proposed technique, IRD (an interactive replay debugger for Occam programs) is described. Finally, the use of the IRD in a sample debug session is presented as an example.
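The record-and-replay idea in this abstract can be sketched as follows. This is a Python illustration, not the IRD implementation: the `Recorder` and `Replayer` classes and the toy `program` are hypothetical, and a random choice stands in for the run-time selection of a ready guard in an occam ALT.

```python
import random
import time


class Recorder:
    """Runs the program normally and logs every non-deterministic input."""

    def __init__(self):
        self.log = []

    def timer(self):
        value = time.monotonic()
        self.log.append(("timer", value))
        return value

    def select(self, ready_guards):
        # Stand-in for the run-time choice among ready alternatives.
        guard = random.choice(ready_guards)
        self.log.append(("alt", guard))
        return guard


class Replayer:
    """Feeds the logged values back so the run follows the same path."""

    def __init__(self, log):
        self._events = iter(log)

    def _next(self, expected_kind):
        kind, value = next(self._events)
        if kind != expected_kind:
            raise RuntimeError("replay diverged from the recorded run")
        return value

    def timer(self):
        return self._next("timer")

    def select(self, ready_guards):
        return self._next("alt")


def program(env):
    """Toy program whose behaviour depends on timers and guard selection."""
    trace = []
    for _ in range(5):
        started = env.timer()
        trace.append((env.select(["left", "right"]), started))
    return trace
```

Running `program(Recorder())` and then `program(Replayer(recorder.log))` yields the same trace, which is what allows a debugger to be attached to the second run without changing its behaviour.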

4.

Three solutions for a robot arm controller using Pascal-Plus, occam and Edison

Jon M. Kerridge Dan Simpson 《Software》1984,14(1):3-15

Three currently available concurrent language systems, Pascal-Plus, occam and Edison, are used to implement a controller for a robot arm. The robot arm allows real parallelism of operation within the movements of the arm. The feasibility of, and restrictions placed upon, the resultant solution for each of the language systems are then analysed and discussed. A Petri-net solution is also presented for the generalized problem and it is shown that each of the solutions is a different folding of the general net.

5.

A parallel implementation of the Douglas-Peucker line simplification algorithm

Jon Vaughan Duncan Whyatt Graham Brookes 《Software》1991,21(3):331-336

As parallel machines become more widely available, many existing algorithms are being converted to take advantage of the improved speed offered by such computers. However, the method by which the algorithm is distributed is crucial to obtaining the speed-ups required for many real-time tasks. This paper presents three parallel implementations of the Douglas-Peucker line simplification algorithm on a Sequent Symmetry computer and compares the performance of each with the original sequential algorithm.
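For orientation, a minimal sketch of the sequential Douglas-Peucker algorithm is given below. It is a Python illustration with hypothetical function names, not the code evaluated in the paper, and it does not reproduce any of the three parallel versions, which the abstract does not describe.

```python
import math


def perpendicular_distance(p, a, b):
    """Distance from point p to the line through a and b."""
    (x, y), (x1, y1), (x2, y2) = p, a, b
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        # a and b coincide, so fall back to point-to-point distance.
        return math.hypot(x - x1, y - y1)
    return abs(dy * (x - x1) - dx * (y - y1)) / math.hypot(dx, dy)


def douglas_peucker(points, tolerance):
    """Return a simplified copy of the polyline `points`."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    # Find the interior point farthest from the chord first-last.
    index, farthest = 0, 0.0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], first, last)
        if d > farthest:
            index, farthest = i, d
    if farthest <= tolerance:
        # Every interior point is close enough to the chord: drop them all.
        return [first, last]
    # Otherwise keep the farthest point and simplify both halves.
    left = douglas_peucker(points[: index + 1], tolerance)
    right = douglas_peucker(points[index:], tolerance)
    return left[:-1] + right
```

The two recursive calls operate on disjoint sub-lines, so they are one natural place to divide work between processors; whether the paper's implementations split the work this way is not stated in the abstract.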

6.

A real-time messaging system for token ring networks

Alfred C. Weaver M. Alex Colvin 《Software》1987,17(12):885-897

The Computer Networks Laboratory at the University of Virginia has developed a real-time messaging service that runs on IBM PCs and PC/ATs when interconnected with a Proteon ProNET-10 token ring local area network. The system is a prototype for a real-time communications network to be used aboard ships. The system conforms to the IEEE 802.2 logical link control standard for type I (connectionless, or datagram) service, with an option for acknowledged datagrams. The application environment required substantial network throughput and bounded message delay. Thus, the development philosophy was to emphasize performance initially and to offer only primitive user services. After providing and measuring the performance of a basic datagram service, the intent is to add additional user services one at a time and to retain only those which the user can ‘afford’ in terms of their impact on throughput, delay, and CPU utilization. The current system is programmed in C. The user interface is a set of C procedure calls that initialize tables, reserve buffer space, send and receive messages, and report network status. The system is now operational, and initial performance measurements are complete. Using this system, an individual PC can transmit or receive approximately 200 short (about 100 bytes) messages per second, and the PC/AT operates at nearly 500 short messages per second.

7.

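A multi-channel polling multiple-access MAC protocol for ad hoc networks

Peng Yi, Zhao Dongfeng, Zha Guangming, Zhou Zhengzhong 《Computer Science》2005,32(12):41-43

For ad hoc networks whose nodes have multiple available channels, this paper proposes a token-ring-based multi-channel polling multiple-access MAC protocol. On the control channel, the protocol uses token-polling access so that nodes gain fair access to the channel. Data channels are reserved through token passing, so that channels are allocated dynamically and on demand for data transmission. A performance analysis of the protocol is also given.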

8.

A parallel implementation of the SCAN language

Nikolaos G. Bourbakis Christos Alexopoulos Allen Klinger 《Computer Languages, Systems and Structures》1989,14(4):239-254

SCAN is a special purpose context-free language which describes and generates a wide range of array accessing algorithms from a short set of simple ones. These algorithms may represent scan techniques for image processing, but at the same time they stand as generic data accessing strategies. In this paper we present two schemes (one sequential and one parallel) which implement the SCAN language and compare their memory requirements and execution time.

9.

Strongly adaptive token distribution

F. Meyer auf der Heide B. Oesterdiekhoff R. Wanka 《Algorithmica》1996,15(5):413-427

The token distribution (TD) problem, an abstract static variant of load balancing, is defined as follows: let M be a (parallel processor) network with set P of processors. Initially, each processor P ∈ P has a certain amount l(P) of tokens. The goal of a TD algorithm, run on M, is to distribute the tokens evenly among the processors. In this paper we introduce the notion of strongly adaptive TD algorithms, i.e., algorithms whose running times come close to the best possible runtime, the off-line complexity of the TD problem, for each individual initial token distribution l. Until now, only weakly adaptive algorithms have been considered, where the running time is measured in terms of the maximum initial load max{l(P) | P ∈ P}. We design an almost optimal, strongly adaptive algorithm on mesh-connected networks of arbitrary dimension extended by a single 1-bit bus. This result shows that an on-line TD algorithm can come close to the optimal (off-line) bound for each individual initial load. Furthermore, we exactly characterize the off-line complexity of arbitrary initial token distributions on arbitrary networks. As an intermediate result, we design almost optimal weakly adaptive algorithms for TD on mesh-connected networks of arbitrary dimension. This research was partially supported by DFG-Forschergruppe Effiziente Nutzung massiv paralleler Systeme, Teilprojekt 4, by the ESPRIT Basic Research Action No. 7141 (ALCOM II), and by the Volkswagenstiftung. A preliminary version was presented at the 20th ICALP, 1993, see [9].

10.

A vector C and Fortran compiler for the FPS T-series: Experiences with compiling to occam I

D. E. Stevenson L. K. Ammons W. G. Crosmun A. Jackson G. L. Raj 《Software》1992,22(5):371-390

We describe our implementation of C and Fortran preprocessors for the FPS T-series hypercube. The target of these preprocessors is the occam I language. We provide a brief overview of the INMOS transputer and the Weitek vector processing unit (VPU). These two units comprise one node of the T-series. Some depth of understanding of the VPU is required to fully appreciate the problems encountered in generating vector code. These complexities were not fully appreciated at the outset. The occam I language is briefly described. We focus on only those aspects of occam I which differ radically from C. The transformations used to preprocess C into occam I are discussed in detail. The special problems with the VPU, both in terms of its (non)interface with occam I and in dealing with numerical programs, are discussed separately. A lengthy discussion on the special techniques required for compilation is provided. C and Fortran are simply incompatible with the occam I model. We provide a catalogue of problems encountered. We emphasize that these problems are not so much with occam I but with preprocessing to occam I. We feel the CSP and occam I models are quite good for distributed processing. The ultimate message from this work should be seen in a larger context. Several languages, such as Ada and Modula-2, are being touted as the standards for the 1990s. These languages severely restrict parallel programming style; this may make saving dusty decks by preprocessing an impossibility.

11.

A highly efficient implementation of a backpropagation learning algorithm using matrix ISA

Mostafa I. Soliman Samir A. Mohamed 《Journal of Parallel and Distributed Computing》2008

BackPropagation (BP) is the most famous learning algorithm for Artificial Neural Networks (ANN). BP has received intensive research efforts to exploit its parallelism in order to reduce the training time for complex problems. A modified version of BP based on matrix–matrix multiplication was proposed for parallel processing. In this paper, we present the implementation of Matrix BackPropagation (MBP) using scalar, vector, and matrix Instruction Set Architectures (ISAs). Besides this, we show that the performance of the MBP is improved by switching from scalar ISA to vector ISA. It is further improved by switching from vector ISA to matrix ISA. On a practical application, speech recognition, the speedup of training a neural network using unrolling scalar ISA over scalar ISA is 1.83. On eight parallel lanes, the speedups of using vector, unrolling vector, and matrix ISAs are respectively 10.33, 11.88, and 15.36, where the maximum theoretical speedup is 16. The results obtained show that the use of matrix ISA gives a performance close to optimal, because of reusing the loaded data, decreasing the loop overhead, and overlapping the memory operations with arithmetic operations.

12.

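Implementing the communication protocol of a shipboard computer local area network using VLSI technology

Yang Yongtian, Yin Zhiwei 《Journal of Chinese Computer Systems》1993,14(9):57-61,F003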

13.

An efficient implementation of parallel eigenvalue computation for massively parallel processing (cited 4 times in total: 0 self-citations, 4 citations by others)

Takahiro Katagiri Yasumasa Kanada 《Parallel Computing》2001,27(14):1831-1845

This paper describes an efficient implementation and evaluation of a parallel eigensolver for computing all eigenvalues of dense symmetric matrices. Our eigensolver uses a Householder tridiagonalization method, which has higher parallelism and performance than conventional methods when the problem size is relatively small, e.g., of the order of 10,000. This is very important for relevant practical applications, where many diagonalizations of such matrices are required. The routine was evaluated on the 1024-processor HITACHI SR2201, giving speedup ratios of about 2–5 times compared to the ScaLAPACK library on 1024 processors of the HITACHI SR2201.

14.

An implementation of monitors

A. M. Lister K. J. Maynard 《Software》1976,6(3):377-385

Monitors and similar constructs have been suggested by Hansen¹ and Hoare² as suitable structuring concepts for operating systems. This paper describes an implementation of monitors in BCPL and shows how the scope rules of BCPL can be used to provide most of the requisite compile time checking. Some observations are made on problems of implementation, particularly in respect of mutual exclusion, and on the use and construction of monitors in practice.
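The monitor concept discussed in this abstract can be illustrated with a minimal bounded-buffer sketch. It is written in Python using the standard `threading` module rather than BCPL, the `BoundedBuffer` class is hypothetical, and it does not reflect the paper's implementation. One lock provides the mutual exclusion and two condition variables provide the waiting and signalling.

```python
import threading
from collections import deque


class BoundedBuffer:
    """Monitor-style buffer: every method runs under one shared lock."""

    def __init__(self, capacity):
        self._items = deque()
        self._capacity = capacity
        self._lock = threading.Lock()
        # Both conditions share the monitor lock.
        self._not_full = threading.Condition(self._lock)
        self._not_empty = threading.Condition(self._lock)

    def put(self, item):
        with self._lock:
            # Re-check in a loop: a woken thread must confirm the condition.
            while len(self._items) >= self._capacity:
                self._not_full.wait()
            self._items.append(item)
            self._not_empty.notify()

    def get(self):
        with self._lock:
            while not self._items:
                self._not_empty.wait()
            item = self._items.popleft()
            self._not_full.notify()
            return item
```

Python's condition variables let the signalling thread continue after `notify()`, which differs from the signalling rule in Hoare's original proposal; that is why each wait sits inside a `while` loop rather than an `if`.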

15.

High-performance implementation of regular and easily scalable sorting networks on an FPGA

Valery Sklyarov Iouliia Skliarova 《Microprocessors and Microsystems》2014

The paper is dedicated to fast FPGA-based hardware accelerators that implement sorting networks. The primary emphasis is on the uniformity of core components, feasible combinations of parallel, pipelined and sequential operations, and the regularity of the circuits and interconnections. The paper shows theoretically, and based on numerous experiments, that many existing solutions that are commonly considered to be very efficient have worthy competitors that are better for many practical problems. We compared the even–odd merge and bitonic merge sorting networks (which are among the fastest known) with the even–odd transition network, which is often characterized as significantly slower and more resource consuming. We found that the latter is the most regular network that can be implemented very efficiently in FPGA, so we are proposing new, easily scalable hardware solutions and processing techniques based on this. Finally, the paper provides four main contributions and suggests: (1) a regular hardware implementation of resource and time effective architectures based on the even–odd transition network; (2) a pipelined implementation of even–odd transition networks; (3) a pre-processing technique that enables sorting to be further accelerated; (4) combinations of this technique with a merge sort, an address-based sort, a quicksort, and a radix sort.
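The regularity this abstract attributes to the even-odd transition network can be seen in a short software model. The sketch below assumes the network corresponds to what is commonly called odd-even transposition sort; it is a sequential Python simulation with a hypothetical function name, not the paper's FPGA design.

```python
def even_odd_transition_sort(values):
    """Sort by simulating an even-odd transition network stage by stage."""
    data = list(values)
    n = len(data)
    for stage in range(n):
        # Even stages compare pairs (0,1), (2,3), ...;
        # odd stages compare pairs (1,2), (3,4), ...
        start = stage % 2
        for i in range(start, n - 1, 2):
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
    return data
```

Every stage uses the same compare-and-swap element on neighbouring positions, and the comparisons within one stage touch disjoint pairs, so in hardware they can all operate at the same time. For n inputs, n stages are enough to sort any input.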

16.

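Using symmetric multiprocessors to improve NIDS performance

Lai Haiguang, Huang Hao, Xie Junyuan 《Journal of Computer Applications》2005,25(5):1141-1144

A network intrusion detection system (NIDS) captures and analyses network packets to determine whether an attack is taking place. As network bandwidth keeps rising, it becomes increasingly difficult for the processing capacity of an NIDS to keep pace with network speed. This paper proposes a method that uses a symmetric multiprocessor (SMP) to raise NIDS processing capacity, improving system performance by having multiple CPUs process network packets in parallel. Based on an analysis of the NIDS processing flow, an effective parallel processing structure is designed that ensures threads running on different CPUs execute with a high degree of parallelism. In addition, the proposed thread synchronization scheme both preserves the functional correctness of the program and avoids mutually exclusive access to shared resources, further increasing thread parallelism. Experiments show that an NIDS implemented on a dual-CPU SMP system performs 80% better than on a single-CPU system.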

17.

Highly optimized implementation of OpenCV for the Cell Broadband Engine

Hiroki Sugano Ryusuke Miyamoto 《Computer Vision and Image Understanding》2010,114(11):1273-1281

Recently, real-time processing of image recognition is required for embedded applications such as automotive applications, robotics, entertainment, and so on. To realize real-time processing of image recognition on such systems we need optimized libraries for embedded processors. OpenCV is one of the most widely used libraries for computer vision applications and has many functions optimized for Intel processors, but no function is optimized for embedded processors. We present a parallel implementation of the OpenCV library on the Cell Broadband Engine (Cell), which is one of the most widely used high performance embedded processors. Experimental results show that most of the functions optimized for the Cell processor are faster than functions optimized for Intel Core 2 Duo E6850 3.00 GHz.

18.

An efficient protocol with synchronization accelerator for multi-processor embedded systems

Jiyang Yu Peng Liu Weidong Wang Chunming Huang Jie Yang Yingtao Jiang Qingdong Yao 《Parallel Computing》2013

With the proliferation of multi-processor core systems, parallel programming imposes a difficult challenge where current solutions are far from being considered efficient. In order to alleviate the difficulty of parallel programming, we propose a scheduler, which is part of a master–slave RTOS, to efficiently manage the parallel programs running on a multi-processor core system. We also propose an efficient protocol that serves as the interface between the operating system and application programs. This interface protocol runs on a dedicated control subnet to cut down the synchronization overhead between the parallel tasks. Such synchronization overhead incurred in these multi-core parallel systems has been recognized as one of the severe limiting factors when pushing up the performance envelope. Experimental results, obtained from the register-transfer level simulations of various benchmark parallel programs, show that the proposed protocol and the control subnet can improve the system efficiency by up to 33.5%. This protocol, as it is designed to be compatible with the minimum subset of the message-passing interface functions (MPI), scales well with the number of cores.

19.

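Design and implementation of a multiprocessor chipset for the Loongson 2E

Fang Zhibin, Hu Peng, An Xuejun, Sun Ninghui 《Application Research of Computers》2008,25(5):1465-1469

This paper presents the design of a multiprocessor chipset for high-performance computers. Its main features are two-level interconnection of multiple processors through the chipset and switch chips, a global address space, and support for multiprocessor synchronization. The chipset's structure, design principles and key techniques are described, and a multiprocessor chipset based on the Loongson 2E processor is designed and implemented. The chipset has so far been verified and tested on an FPGA platform: a four-processor prototype system built around the chipset boots the BIOS and runs an operating system. In measurements, the latency of a processor access request passing through the chipset is below 0.5 μs, and the communication bandwidth between processors within the chipset reaches 500 Mbps.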

20.

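Coarse-grained parallel decomposition of serial programs and the formation of parallel-executable packages (cited 1 time in total: 0 self-citations, 1 citation by others)

Luo Xin, Yu Yuefen, Luo Jingmin 《Journal of Chinese Computer Systems》1996,(8):35-40

This paper presents techniques and corresponding algorithms for optimizing and merging the task graph [7] produced by the partitioning phase, trading off parallelism against communication overhead so that the decomposed parallel components execute as efficiently as possible. It also gives a method for forming parallel-executable packages from the consolidated task graph and automatically inserting communication primitives into them.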