Similar literature
1.
Image processing applications require both computing and communication power. The objective of the GFLOPS project was to study all aspects of the design of such computers. The project aimed to develop a parallel architecture, together with its software environment, to implement these applications efficiently. A development environment, in particular a C data-parallel language, has been built for this purpose. The C parallel language presented here simplifies the use of such architectures by providing the programmer with a global name space and a control mechanism to exploit the fine- and medium-grain parallelism of applications. The main advantage of our paradigm is that it offers a single framework for expressing both data and control parallelism. We have implemented this programming environment on the GFLOPS machine, which supports up to 512 processor nodes. The nodes are PC motherboards connected via the PCI bus over a scalable, cost-effective network at a constant cost per node. The aim is to obtain a scalable, virtually shared memory machine at low cost. In this paper we discuss the design of the GFLOPS machine and its C parallel language, and evaluate the effectiveness of the mechanisms incorporated. The architecture's behaviour was analysed with microbenchmarks and image processing algorithms written in C.

2.
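Design of the system software of the Dawning 2000 (曙光2000) supercomputer   Total citations: 10, self-citations: 3, citations by others: 7
The Dawning 2000 supercomputer adopts a scalable cluster architecture. It is a general-purpose parallel supercomputer that supports scientific and engineering computing, network services, and data processing applications. This paper describes how the design of the Dawning 2000 system software follows the SUMA technical approach: manageability is reflected in the design of the communication software, the scalable file system, and the servers (服务器取信), while usability is reflected in the design of the single system image, the integrated parallel environment, and the foolproof user interface. The paper details the design and key technologies of the system software, including the communication system, the COSMOS scalable file system, the management software, and the user interface.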

3.
4.
《Real》1998,4(6):379-388
Parallel processing can effectively satisfy the real-time constraints of on-line machine vision applications. This paper describes a real-time automatic visual inspection (AVI) system for high-speed plane products, which is based on a reconfigurable and scalable coarse-grain distributed-memory MIMD architecture and a unique application programming interface. The code of the application algorithms is machine independent at the source level. Because our prototype is organized as a test bed, new algorithms and even new dedicated hardware processing elements can be evaluated on it quickly. The system can easily be tailored to various automatic visual surface inspection applications. An example is described that inspects whether separate small mosaics (textures) in a scene have normal shapes. The algorithm combines connected component labeling, moment calculation, and pattern recognition. This key task for most applications is well suited to the "divide and conquer" parallel paradigm. Real-time performance has been achieved on a TMS320C40 array.

5.
This paper presents a dynamically scheduled parallel DSP architecture for general-purpose DSP computations. The architecture consists of multiple DSP processors and one or more scheduling units. DSP applications are first captured as stream flow graphs, which are then statically mapped onto the parallel architecture. The ordering and starting times of DSP tasks are determined by the scheduling unit(s) using a dynamic scheduling algorithm. The main contributions of this paper are summarized as follows:
• A scalable parallel DSP architecture: The parallel DSP architecture proposed in this paper is scalable to meet signal processing requirements. For parallel DSP architectures with large configurations, the scheduling unit may become a performance bottleneck. A distributed scheduling mechanism is proposed to address this problem.
• A mapping algorithm: An algorithm is proposed to systematically map a stream flow graph onto a parallel DSP architecture.
• A dynamic scheduling algorithm: We propose a dynamic scheduling algorithm that schedules a node for execution only when both input data and output storage space are available. Such a scheduling algorithm allows buffer sizes to be determined at compile time.
• A simulation study: Our simulation study reveals the relationships among the grain size, the processor utilization, and the scheduling capability. We believe these relationships have a significant impact on parallel computer architecture design involving dynamic scheduling.
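The firing rule described in this abstract (a node runs only when input data and output storage space are both available) can be sketched as follows. This is a minimal illustration, not the paper's scheduler: the names `ready` and `simulate`, the token-count model of buffers, and the round-robin polling order are all assumptions made for the example.

```python
def ready(node, inputs, outputs, tokens, capacity):
    """A node may fire only if every input edge holds a token
    and every output edge still has a free buffer slot."""
    has_input = all(tokens[e] > 0 for e in inputs[node])
    has_space = all(tokens[e] < capacity[e] for e in outputs[node])
    return has_input and has_space


def simulate(nodes, edges, capacity, max_firings):
    """Poll nodes round-robin and fire those that are ready.

    nodes:    list of node names (polling order, an assumption)
    edges:    list of (src, dst) pairs forming the stream flow graph
    capacity: dict mapping each edge to its fixed buffer size
    Returns the firing trace and the final token count per edge.
    """
    inputs = {n: [e for e in edges if e[1] == n] for n in nodes}
    outputs = {n: [e for e in edges if e[0] == n] for n in nodes}
    tokens = {e: 0 for e in edges}
    trace = []
    while len(trace) < max_firings:
        fired = False
        for n in nodes:
            if len(trace) >= max_firings:
                break
            if ready(n, inputs, outputs, tokens, capacity):
                for e in inputs[n]:
                    tokens[e] -= 1   # consume one token per input edge
                for e in outputs[n]:
                    tokens[e] += 1   # produce one token per output edge
                trace.append(n)
                fired = True
        if not fired:
            break                    # nothing can fire: stop
    return trace, tokens


if __name__ == "__main__":
    edges = [("src", "fir"), ("fir", "sink")]
    capacity = {e: 1 for e in edges}
    print(simulate(["src", "fir", "sink"], edges, capacity, 6))
```

Because a node never fires into a full buffer, the token count on an edge cannot exceed its declared capacity, which is why buffer sizes can be fixed ahead of time in a scheme of this kind.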

6.
A single-chip multiprocessor for multimedia: the MVP   Total citations: 2, self-citations: 0, citations by others: 2
The multimedia video processor (MVP) architecture, which incorporates a variety of parallel processing techniques to deliver very high performance to a wide range of imaging and graphics applications, is described. The MVP combines, on a single semiconductor chip, multiple fully programmable processors with multiple data streams connected to shared RAMs through a crossbar network. Each of the independent processors can execute many operations in parallel every cycle. The architecture is scalable and supports different numbers of processors to meet the cost and performance requirements of different markets. MVP's target environment and the development of MVP are outlined.

7.
Neural network simulations on a parallel architecture are reported. The architecture is scalable and flexible enough to be useful for simulating various kinds of networks and paradigms. The computing device is based on an existing coarse-grain parallel framework (INMOS transputers), extended with finer-grain parallel capabilities through VLSI chips, and is called the Lneuro 1.0 (for LEP neuromimetic) circuit. The modular architecture of the circuit makes it possible to build various kinds of boards to match the expected range of applications, or to increase the power of the system by adding more hardware. The resulting machine remains reconfigurable, so that it can be adapted to a specific problem to some extent. A small-scale machine has been built using 16 Lneuros to test the behavior of this architecture experimentally. Results are presented on an integer version of Kohonen feature maps. The speedup factor increases regularly with the number of clusters involved (up to a factor of 80). Some ways to improve this family of neural network simulation machines are also investigated.

8.
Modern complex embedded applications in multiple application fields impose stringent and continuously increasing functional and parametric demands. To adequately serve these applications, massively parallel multi-processor systems on a single chip (MPSoCs) are required. This paper is devoted to the design of scalable communication architectures of massively parallel hardware multi-processors for highly demanding applications. We demonstrate that in massively parallel hardware multi-processors the influence of the communication network on both throughput and circuit area dominates that of the processors, while the traditionally used flat communication architectures do not scale well as parallelism increases. Therefore, we propose to design highly optimized, application-specific, partitioned hierarchical organizations of the communication architectures by exploiting the regularity and hierarchy of the actual information flows of a given application. We developed related communication architecture synthesis strategies and incorporated them into our quality-driven, model-based multi-processor design methodology and related automated architecture exploration framework. Using this framework we performed a large series of architecture synthesis experiments, some results of which are presented in this paper. They demonstrate many features of the synthesized communication architectures and show that our method and related framework are able to efficiently synthesize highly scalable communication architectures even for high-end massively parallel multi-processors that have to satisfy extremely stringent computation demands.

9.
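Design of a multiprocessor parallel MPEG2 decoding system   Total citations: 1, self-citations: 0, citations by others: 1
The MPEG2 standard for compressing moving pictures and associated audio is the core algorithm of many video service applications. Implementing MPEG2 decompression in software on a multiprocessor parallel system is flexible enough to provide playback for a variety of MPEG2 products and avoids the limitations of decompression in dedicated hardware chips. Moreover, as personal computers become more widespread and more powerful, such an adapter-card solution can give personal computers more MPEG2 service functions, and it also facilitates research on and testing of updated algorithms for the MPEG2 family of standards. This paper analyzes the requirements that MPEG2 decoding places on the implementing system, in particular the computational load of each stage of decompression and the requirements for data transfer and processing. Based on these data, and using several TMS320C40 parallel processing system boards, we carried out experiments and analysis on data partitioning of the MPEG2 input bitstream, storage control and communication for parallel decoding, and the complexity of the decoding algorithm, and from these obtained the corresponding design choices and data. Finally, a design for a parallel MPEG2 decoding system is proposed.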

10.
The literature has long seen efforts to use parallel algorithms and parallel architectures to improve performance, and the machine learning space is no exception. In fact, considerable effort has gone into this area in the past fifteen years. Our report attempts to bring together and consolidate such attempts. It tracks the development in this area since the inception of the idea in 1995, identifies different phases during the period 1995–2011, and marks important achievements. When it comes to performance enhancement, GPU platforms have carved a special niche for themselves. The strength of these platforms comes from their capability to speed up computations exponentially by way of parallel architecture and programming methods. While it is evident that computationally complex processes such as image processing and gaming stand to gain much from parallel architectures, studies suggest that general-purpose tasks such as machine learning, graph traversal, and finite state machines are also among the parallel applications of the future. MapReduce is another important technique that has evolved during this period and, according to the literature, has proved to be an important aid in improving the performance of machine learning algorithms on GPUs. The report summarizes the path of these developments.

11.
In this article, we present the prototypical implementation of the scalable GigaNetIC chip multiprocessor architecture. We use an FPGA-based rapid prototyping system to verify the functionality of our architecture in a network application scenario before fabricating the ASIC in a modern CMOS standard cell technology. The rapid prototyping environment gives us the opportunity to test our multiprocessor architecture with Ethernet-based data streams in a real network scenario. Our system concept is based on a massively parallel processor structure. Due to its regularity, our architecture can be easily scaled to accommodate a wide range of packet processing applications with various performance and throughput requirements at high reliability. Furthermore, the composition based on predefined building blocks guarantees fast design cycles and simplifies system verification. We present standard cell synthesis results as well as a performance analysis for a firewall application with various couplings of hardware accelerators. Finally, we compare implementations of our architecture with state-of-the-art desktop CPUs. We use simple, general-purpose applications as well as the introduced packet processing tasks to determine the performance capabilities and the resource efficiency of the GigaNetIC architecture. We show that, if supported by the application, parallelism offers more opportunities than increasing clock frequencies.

12.
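The BSP model is independent of any parallel architecture and can serve both as a parallel computation model and as a parallel programming model. This paper proposes the H-V transaction model, based on the BSP model, which is suitable for long transactions, short transactions, and mixtures of the two. A process structure for parallel transaction processing on a shared-nothing architecture is given. This structure not only achieves intra-transaction and inter-transaction parallelism but also provides availability and scalability. A timestamp-based multiversion concurrency control protocol suited to the model is then presented, and finally the execution of transactions under the superstep structure is described. Performance tests show that transaction processing with this model achieves good transaction response time and speedup.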
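For readers unfamiliar with timestamp-based multiversion concurrency control, the general idea can be sketched as below. This is the textbook multiversion timestamp-ordering rule, assumed here purely for illustration; the abstract does not give the details of the H-V protocol, and the class and method names (`MVStore`, `read`, `write`, `Abort`) are hypothetical.

```python
class Abort(Exception):
    """Raised when a write arrives too late and the transaction must restart."""


class MVStore:
    """Textbook multiversion timestamp ordering (illustrative only)."""

    def __init__(self):
        # key -> list of versions, each stored as [write_ts, read_ts, value]
        self.versions = {}

    def _visible(self, key, ts):
        """Return the version with the largest write timestamp <= ts, if any."""
        candidates = [v for v in self.versions.get(key, []) if v[0] <= ts]
        return max(candidates, key=lambda v: v[0]) if candidates else None

    def read(self, key, ts):
        """Reads never abort: they see the newest version not newer than ts."""
        version = self._visible(key, ts)
        if version is None:
            return None
        version[1] = max(version[1], ts)  # remember the latest reader
        return version[2]

    def write(self, key, ts, value):
        """Abort if a later transaction has already read the version
        that this write would logically have to precede."""
        version = self._visible(key, ts)
        if version is not None and version[1] > ts:
            raise Abort(f"write at ts={ts} invalidated by reader at ts={version[1]}")
        if version is not None and version[0] == ts:
            version[2] = value            # same transaction overwrites its own version
        else:
            self.versions.setdefault(key, []).append([ts, ts, value])


if __name__ == "__main__":
    store = MVStore()
    store.write("x", 1, "a")
    print(store.read("x", 5))
```

Keeping old versions lets long read-heavy transactions proceed without blocking short writers, which is one reason multiversion schemes are often chosen for mixed long and short workloads.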

13.
Network of Workstations (NOW) platforms built from off-the-shelf workstations and networking hardware have become a cost-effective, scalable, and flexible platform for video processing applications. Still, one has to manually schedule an algorithm onto the available processors of the NOW to make efficient use of the resources. This approach is time-consuming and impractical for a video processing system that must run a variety of different algorithms, with new algorithms constantly being developed. Improved support for program development is absolutely necessary before the full benefits of parallel architectures can be realized for video processing applications. Toward this goal, an automatic compile-time scheduler has been developed to schedule the input tasks of video processing applications with precedence constraints onto available processors. The scheduler exploits both spatial (parallelism) and temporal (pipelining) concurrency to make the best use of machine resources. Two important scheduling problems are addressed. First, given a task graph and a desired throughput, a schedule is constructed to achieve the desired throughput with the minimum number of processors. Second, given a task graph and a finite set of available resources, a schedule is constructed such that the throughput is maximized while meeting the resource constraints. Results from simulations show that the scheduler and proposed optimization techniques effectively tackle these problems by maximizing processor utilization. A code generator has been developed to generate parallel programs automatically. The tools developed in this paper make it much easier for a programmer to develop video processing applications on these parallel architectures.
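The first scheduling problem named in this abstract (fewest processors for a given target) can be illustrated with a greedy list scheduler. This is a simplified sketch, not the paper's scheduler: it uses a completion deadline for a single pass through the task graph as a stand-in for throughput, ignores pipelining and communication cost, and the names `list_schedule` and `min_procs_for_deadline` are hypothetical.

```python
def list_schedule(durations, deps, num_procs):
    """Greedy list scheduling of a task graph with precedence constraints.

    durations: dict task -> execution time
    deps:      dict task -> set of predecessor tasks
    Returns (assignment, makespan), where assignment maps each task to
    a (processor index, start time) pair.
    """
    finish = {}
    proc_free = [0] * num_procs          # time at which each processor is idle
    assignment = {}
    while len(finish) < len(durations):
        ready = sorted(
            t for t in durations
            if t not in finish and all(p in finish for p in deps.get(t, ()))
        )
        if not ready:
            raise ValueError("task graph contains a cycle")
        for t in ready:
            earliest = max((finish[p] for p in deps.get(t, ())), default=0)
            # pick the processor on which this task can start soonest
            proc = min(range(num_procs), key=lambda i: max(proc_free[i], earliest))
            start = max(proc_free[proc], earliest)
            finish[t] = start + durations[t]
            proc_free[proc] = finish[t]
            assignment[t] = (proc, start)
    return assignment, max(finish.values())


def min_procs_for_deadline(durations, deps, deadline):
    """Smallest processor count whose greedy schedule meets the deadline,
    or None if even one processor per task is not enough."""
    for p in range(1, len(durations) + 1):
        _, makespan = list_schedule(durations, deps, p)
        if makespan <= deadline:
            return p
    return None


if __name__ == "__main__":
    durations = {"a": 2, "b": 3, "c": 3, "d": 1}
    deps = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
    print(min_procs_for_deadline(durations, deps, 6))
```

The greedy heuristic is not guaranteed to be optimal, so the processor count it returns is an upper bound on the true minimum.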

14.
In order to address the problems faced in the wireless communications domain, picoChip has devised the picoArray™. The picoArray is a tiled-processor architecture containing several hundred heterogeneous processors connected through a novel, compile-time scheduled interconnect. This architecture does not suffer from many of the problems faced by conventional general-purpose parallel processors and provides an alternative to creating an ASIC. The PC102 is the second-generation device from picoChip and contains 308 processors. The devices are designed to be connected together using a seamless extension of the internal interconnect structure. This enables multi-chip solutions to be easily realised for applications that require additional processing. This paper highlights some of the difficulties encountered when building parallel systems and goes on to show how the features of the picoArray allow deterministic processing to be achieved, how the tool chain allows programming to be performed effectively in a combination of high-level assembly language and C, and how systems built around the picoArray are debugged in real time. By handling a wide variety of types of processing within the picoArray, a single design flow can be used to produce complex communications systems. The effectiveness of this approach is demonstrated through the use of the picoArray to build an 802.16 base station for commercial deployment.

15.
Scalability of parallel algorithm-machine combinations   Total citations: 2, self-citations: 0, citations by others: 2
Scalability has become an important consideration in parallel algorithm and machine designs. The word scalable, or scalability, has been widely and often used in the parallel processing community. However, there is no adequate, commonly accepted definition of scalability available. Scalabilities of computer systems and programs are difficult to quantify, evaluate, and compare. In this paper, scalability is formally defined for algorithm-machine combinations. A practical method is proposed to provide a quantitative measurement of the scalability. The relation between the newly proposed scalability and other existing parallel performance metrics is studied. A harmony between speedup and scalability has been observed. Theoretical results show that a large class of algorithm-machine combinations is scalable and that the scalability can be predicted through premeasured machine parameters. Two algorithms have been studied on an nCUBE 2 multicomputer and on a MasPar MP-1 computer. These case studies have shown how scalabilities can be measured, computed, and predicted. Performance instrumentation and visualization tools have also been used and developed to understand scalability-related behavior.
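As an illustration of what a quantitative scalability measurement can look like, the sketch below uses an isospeed-style metric: the work is grown with the processor count until the average speed per processor stays constant, and scalability is the ratio (p'·W)/(p·W'). The abstract itself does not state a formula, so this formulation, the function names, and the sample numbers are assumptions for the example.

```python
def average_unit_speed(work, procs, time):
    """Work completed per processor per unit time."""
    return work / (procs * time)


def isospeed_scalability(p, work_p, p_scaled, work_scaled):
    """Isospeed-style scalability from p to p_scaled processors.

    work_scaled is the amount of work needed on p_scaled processors to keep
    the average unit speed equal to that of work_p on p processors.
    A value of 1.0 means work only had to grow in proportion to the
    processor count; smaller values mean it had to grow faster.
    """
    return (p_scaled * work_p) / (p * work_scaled)


if __name__ == "__main__":
    # Hypothetical measurements: doubling processors from 4 to 8.
    print(isospeed_scalability(4, 1000, 8, 2000))  # work doubled
    print(isospeed_scalability(4, 1000, 8, 4000))  # work had to quadruple
```

A metric of this shape ties scalability to a measurable quantity (the work required to hold per-processor speed steady), which is what makes it comparable across algorithm-machine combinations.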

16.
A system structure supporting parallel processing in general, and parallel logic programming and expert system applications in particular, is described. It is not based on special hardware but has instead been designed as an evolutionary extension to most existing machine architectures. It is aimed at parallel processing support for PROLOG, for example, as well as for expert systems (and shells) implemented in a general-purpose language. A layered structure consisting of an extended machine interface and a macro language is chosen to support a range of applications.

17.
18.
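Construction of a volunteer computing environment based on cooperative server groups   Total citations: 4, self-citations: 0, citations by others: 4
A volunteer computing environment based on cooperative server groups, P2HP, is constructed. P2HP divides all nodes in the platform by role into monitoring server nodes, scheduling server nodes, computing nodes, and data servers, forming a scalable hierarchical network topology. P2HP is open, easy to use, fault tolerant, scalable, and cross-platform, and it provides a simple and convenient set of API (application programming interface) function calls to support the development of parallel applications. Test results show that P2HP is a feasible approach for handling high-performance parallel applications.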

19.
The architectural landscape of high-performance computing stretches from superscalar uniprocessors to explicitly parallel systems to dedicated hardware implementations of algorithms. Single-purpose hardware can achieve the highest performance, and uniprocessors can be the most programmable. Between these extremes, programmable and reconfigurable architectures provide a wide range of choices in flexibility, programmability, computational density, and performance. The UCSC Kestrel parallel processor strives to attain single-purpose performance while maintaining user programmability. Kestrel is a single-instruction stream, multiple-data stream (SIMD) parallel processor with a 512-element linear array of 8-bit processing elements. The system design focuses on efficient high-throughput DNA and protein sequence analysis, but its programmability enables high performance on computational chemistry, image processing, machine learning, and other applications. The Kestrel system has remained useful for unexpectedly long thanks to a careful design and analysis process. Experience with the system leads to the conclusion that programmable SIMD architectures can excel in both programmability and performance. This work presents the architecture, implementation, applications, and observations of the Kestrel project at the University of California at Santa Cruz.

20.
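How to effectively resolve the I/O bottleneck has long been a key open problem for high-performance parallel computers. We propose a scalable distributed shared parallel I/O system scheme, have developed our own node controller chip and router chip, and have built the prototype system SDSP604. To achieve balanced scaling of the system's computing, communication, and I/O performance with system size, the system is based on the CC-NUMA architecture and adopts a well-suited distributed shared parallel I/O system structure.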
