Similar Documents
20 similar documents found.
1.
Including multiple cores on a single chip has become the dominant mechanism for scaling processor performance. Exponential growth in the number of cores on a single processor is expected to lead in a short time to mainstream computers with hundreds of cores. Scalable implementations of parallel algorithms will be necessary in order to achieve improved single-application performance on such processors. In addition, memory access will continue to be an important limiting factor on achieving performance, and heterogeneous systems may make use of cores with varying capabilities and performance characteristics. An appropriate programming model can address scalability and can expose data locality while making it possible to migrate application code between processors with different parallel architectures and variable numbers and kinds of cores. We survey and evaluate a range of multicore processor architectures and programming models with a focus on GPUs and the Cell BE processor. These processors have a large number of cores and are available to consumers today, but the scalable programming models developed for them are also applicable to current and future multicore CPUs.

2.
The power consumption of 3D many-core processors can be reduced, and their power delivery improved, by introducing voltage island (VI) design using on-chip voltage regulators. With the dramatic growth in the number of cores integrated in a processor, however, it is infeasible to adopt per-core VI design. We propose a 3D many-core processor architecture that consists of multiple voltage clusters, each having a set of cores that share an on-chip voltage regulator. Based on this architecture, the steady-state temperature is analyzed so that the thermal characteristic of each voltage cluster is known. In the voltage-scaling and task-scheduling stages, the thermal characteristics and the communication between cores are considered. Considering the thermal characteristics enables the proposed VI formation to reduce the total energy consumption, peak temperature, and temperature gradients in 3D many-core processors.
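As a rough intuition for why cluster formation matters here: cores sharing a regulator must all receive the voltage required by the fastest member, so grouping cores with similar speed demands wastes less energy. The sketch below illustrates this under assumed parameters; the quadratic V²f power model and the linear voltage/frequency relation are illustrative stand-ins, not the paper's VI-formation algorithm.

```python
# Sketch: energy cost of sharing one voltage regulator per cluster.
# Assumes dynamic power ~ C * V^2 * f and that a cluster's regulator
# must supply the highest voltage any member core needs.
# Illustrative only; not the paper's actual VI-formation algorithm.

def cluster_energy(freqs, volt_of, capacitance=1.0):
    """Energy rate for one cluster: each core runs at its own frequency
    but at the cluster-wide (maximum) required voltage."""
    v_cluster = max(volt_of(f) for f in freqs)
    return sum(capacitance * v_cluster**2 * f for f in freqs)

def volt_of(f):
    return 0.6 + 0.4 * f  # assumed linear V(f); f normalized to [0, 1]

fast, slow = [0.9, 0.95], [0.3, 0.35]
mixed = cluster_energy(fast + slow, volt_of)
split = cluster_energy(fast, volt_of) + cluster_energy(slow, volt_of)
print(f"one mixed cluster: {mixed:.3f}, two matched clusters: {split:.3f}")
# Grouping cores with similar demands wastes less energy (split < mixed).
```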

3.
The increasing use of embedded software, often implemented on a core processor in a single-chip system, is a clear trend in the telecommunications, multimedia, and consumer electronics industries. A companion paper (Paulin et al., 1997) presents a survey of application and architecture trends for embedded systems in these growth markets. However, the lack of suitable design technology remains a significant obstacle in the development of such systems. One of the key requirements is more efficient software compilation technology. Especially in the case of fixed-point digital signal processor (DSP) cores, it is often cited that commercially available compilers are unable to take full advantage of the architectural features of the processor. Moreover, due to the shorter lifetimes and the architectural specialization of many processor cores, processor designers are often compelled to neglect the issue of compiler support. This situation has resulted in increased research activity in the area of design tool support for embedded processors. This paper discusses design technology issues for embedded systems using processor cores, with a focus on software compilation tools. Architectural characteristics of contemporary processor cores are reviewed and tool requirements are formulated. This is followed by a comprehensive survey of both existing and new software compilation techniques that are considered important in the context of embedded processors.

4.
IEEE Spectrum, 2008, 45(11): 15.
With no other way to improve the performance of processors further, chip makers have staked their future on putting more and more processor cores on the same chip. Engineers at Sandia National Laboratories, in New Mexico, have simulated future high-performance computers containing the 8-core, 16-core, and 32-core microprocessors that chip makers say are the future of the industry. The results are distressing. Because of limited memory bandwidth and memory-management schemes that are poorly suited to supercomputers, the performance of these machines would level off or even decline with more cores. The performance is especially bad for informatics applications: data-intensive programs that are increasingly crucial to the labs' national security function.

5.
Limits to chip power dissipation and power density, and limits on the benefits of hyperpipelining, threaten to stop the exponential growth in microprocessor performance that we have grown accustomed to. Multicore processors can continue to provide historical performance growth on most modern consumer and business applications. However, the power efficiency of these cores must also be improved to stay within reasonable power budgets. This can be achieved by simplifying the processor core architecture, reversing the trend toward ever more complex and less power-efficient cores. Maintaining overall performance growth with stunted per-core and per-thread performance will require an even more rapid increase in the number of cores per die. Growing performance by increasing the number of cores on a die at this rate, however, puts unprecedented requirements on the corresponding growth of off-chip bandwidth. We argue that, contrary to the International Technology Roadmap for Semiconductors (ITRS) predictions, off-chip signaling frequencies are likely to exceed the frequencies of processor cores in the not-too-distant future, consistent with the system-on-package (SOP) concept in the first paper of this issue. If this approach is followed, a 1-TFlops multiprocessor die with 1 TB/s of off-chip bandwidth is feasible at reasonable cost before the end of the decade.
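A quick check of the closing claim (my arithmetic, not the authors'): 1 TB/s of off-chip bandwidth against 1 TFlops leaves exactly one off-chip byte per floating-point operation, so workloads needing more than that would remain bandwidth-bound.

```python
# Back-of-the-envelope (not from the paper): off-chip bytes available
# per floating-point operation for the projected die.
bandwidth_bytes_per_s = 1e12   # 1 TB/s off-chip
flops_per_s = 1e12             # 1 TFlops
print(bandwidth_bytes_per_s / flops_per_s, "bytes/flop")  # -> 1.0
```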

6.
Three-dimensional (3-D) chip stacking technology provides a new approach to the so-called memory wall problem. Memory-processor chip stacking alleviates this problem, permitting faster clock rates (with suitable processor logic) or multicore access to shared memory using a large number of vertical vias between tiers in the stack, for ultrawide bit-path transfer of data and address information to and from various levels of cache. Although a limited amount of parallel access is possible using conventional two-dimensional (2-D) chip memory-processor approaches, 3-D memory-processor stacking extends this to much larger capacity memories. We evaluate high-clock-rate processors as well as shared-memory processors with a large number of cores. Various architectural design options to reduce the impact of the memory wall on processor performance are explored and validated through simulations. Certain architectural features can be implemented in a 3-D chip, such as an ultrawide, ultrashort vertical bus with low parasitic resistance and the elimination of the conventional electrostatic discharge and packaging parasitics required in multiple-package 2-D solutions. The objective is to reduce the clocks-per-instruction (CPI) figure of merit at high clock speeds in order to deliver significant performance levels. High-clock-rate processors can be designed with SiGe heterostructure bipolar transistors to obtain processors operating on the order of 16 or 32 GHz.
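The CPI figure of merit the authors target connects clock rate to delivered instruction throughput as instructions/s = clock rate / CPI; a short illustration with assumed CPI values (not the paper's measurements) shows why a 16 GHz core is only useful if stacked memory keeps CPI low:

```python
# Illustration of the clocks-per-instruction (CPI) figure of merit:
# instructions/s = clock_rate / CPI. CPI values below are assumed.
def mips(clock_hz, cpi):
    return clock_hz / cpi / 1e6

print(mips(16e9, cpi=1.2))   # short vertical bus to stacked memory
print(mips(16e9, cpi=6.0))   # the same core starved by off-chip memory
```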

7.
Today's many-core processors are manufactured in inherently unreliable technologies: massively defective fabrication is a direct consequence of feature-size shrinkage in today's CMOS (complementary metal-oxide-semiconductor) technology. Because of these reliability problems, the fault tolerance of many-core processors becomes one of the major challenges. To reduce the probability of failure, various fault-tolerance techniques can be applied; the most preferable and promising are those that can be easily implemented and have minimal cost while providing a high level of fault tolerance. One promising technique for detecting faulty cores, and thus for performing the first step in providing many-core processor fault tolerance, is mutual testing among processor cores. Mutual testing can be performed either in a random manner or according to a deterministic scheduling policy. In this paper we deal with random execution of mutual tests, whose effectiveness can be evaluated through modeling. We show how stochastic Petri nets can be used for this purpose and obtain results useful for developing and implementing a testing procedure in many-core processors.
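A rough feel for how quickly random mutual testing exposes faulty cores can be had from a small Monte-Carlo run. The PMC-style test assumption, core count, and fault count below are mine, and this stands in for, rather than reproduces, the paper's stochastic Petri net model.

```python
# Sketch of random mutual testing among cores (PMC-style assumption:
# a fault-free tester reports the testee's true state; a faulty tester's
# report is arbitrary and therefore discarded here). Monte-Carlo toy,
# not the paper's Petri-net model.
import random

def random_mutual_test(n_cores=64, n_faulty=6, rounds=200, seed=1):
    rng = random.Random(seed)
    faulty = set(rng.sample(range(n_cores), n_faulty))
    flagged = set()
    for _ in range(rounds):
        tester, testee = rng.sample(range(n_cores), 2)
        if tester not in faulty and testee in faulty:
            flagged.add(testee)      # reliable "faulty" verdict
    return len(flagged) / n_faulty   # fraction of faulty cores detected

print(f"coverage after 200 random tests: {random_mutual_test():.2f}")
```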

8.
The increasing number of standards in wireless communications has encouraged the study of programmable processors as platforms for flexible receivers. A multiple-input multiple-output (MIMO) antenna system combined with the orthogonal frequency division multiplexing (OFDM) technique has been introduced in many wireless communications standards, such as the third-generation long term evolution (3G LTE). The MIMO-OFDM system requires an efficient detector and a platform supporting parallel processing of multiple subcarriers. A K-best list sphere detector (LSD) provides near-optimal decoding performance and a fixed throughput, making it an interesting algorithm for practical implementation. In this paper, we compare implementations of the K-best LSD on four processor platforms: a digital signal processor (DSP), a software defined radio (SDR) processor, an application-specific processor (ASP), and an application-specific instruction-set processor (ASIP). The DSP is a popular very long instruction word (VLIW) device (TMS320C6455), the SDR processor employs multithreading and multiple cores (SB3500 core processor), the ASP is based on the transport triggered architecture (TTA), and the ASIP is the SDR processor enhanced with a special instruction-set extension for sorting. A 2×2 MIMO antenna system with 64-quadrature amplitude modulation (64-QAM) is assumed. The chosen list sizes, K=8 and 16, are based on simulation results carried out in a MATLAB environment with 3G LTE parameters. The proposed ASIP achieves a promising throughput of 32.0 Mbps, whereas the SDR implementation on the SB3500 suffers from an inefficient software sorter. The ASP, designed for minimal hardware complexity, achieves a throughput of 7.6 Mbps. A more essential consideration, however, is the symbol time, which places strict parallel-processing requirements on the programmable processors.
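To make the K-best idea concrete, here is a toy breadth-first detector: a real-valued system with 4-PAM symbols per layer stands in for the paper's 2×2 64-QAM fixed-point implementations, and the channel, noise level, and list size are made-up examples, not the paper's parameters.

```python
# Minimal K-best list sphere detector sketch (real-valued toy with 4-PAM
# per layer instead of the paper's 2x2 64-QAM system). After QR
# decomposition of the channel, the detection tree is searched
# breadth-first, keeping only the K smallest partial Euclidean
# distances (PEDs) per layer, which fixes the throughput.
import numpy as np

def k_best_detect(R, y, symbols, K):
    """R: upper-triangular channel factor, y: Q^T * received vector."""
    n = R.shape[0]
    survivors = [((), 0.0)]                      # (partial symbols, PED)
    for layer in range(n - 1, -1, -1):           # detect from last layer up
        candidates = []
        for path, ped in survivors:
            for s in symbols:
                x = (s,) + path                  # symbols for layers layer..n-1
                interference = sum(R[layer, layer + j] * x[j]
                                   for j in range(1, len(x)))
                err = y[layer] - R[layer, layer] * x[0] - interference
                candidates.append((x, ped + err ** 2))
        survivors = sorted(candidates, key=lambda c: c[1])[:K]
    return survivors                             # K best full-length paths

H = np.array([[1.0, 0.4], [0.3, 0.9]])           # assumed channel matrix
Q, R = np.linalg.qr(H)
x_true = np.array([3.0, -1.0])
noise = 0.05 * np.random.default_rng(0).standard_normal(2)
y = Q.T @ (H @ x_true + noise)
best = k_best_detect(R, y, symbols=[-3, -1, 1, 3], K=4)
print("best candidate:", best[0][0])             # expected: (3, -1)
```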

9.
赵运筹, 贾浩, 丁建峰, 张磊, 付鑫, 杨林. Journal of Semiconductors (《半导体学报》), 2016, 37(11): 114008-6.
With the continuous development of integrated circuits, processor performance has improved steadily. Integrating more cores in one processor is an effective way to improve performance, since it is no longer possible to do so by increasing the clock frequency alone. For a processor with multiple integrated cores, performance is determined not only by the number of cores but also by the communication efficiency between them. With more processor cores integrated on a chip, larger bandwidths are required for the communication among them. Traditional electrical interconnect has gradually become a bottleneck for improving the performance of multi-core processors due to its limited bandwidth, high power consumption, and long latency; optical interconnect is considered a potential way to solve this issue. The optical router is the key device for realizing optical interconnect; its basic function is to achieve data routing and switching between the local node and the multi-node. In this paper we present a five-port optical router for a mesh photonic network-on-chip, composed of eight thermally tuned silicon Mach-Zehnder optical switches. The experimental spectral responses indicate that the optical signal-to-noise ratios of the router are over 13 dB in the wavelength range of 1525-1565 nm for all of its 20 optical links. Each optical link can manipulate 50 wavelength channels with a channel spacing of 100 GHz and a data rate of 32 Gbps per wavelength channel in the same wavelength range. The lowest energy efficiency of the optical router is 43.4 fJ/bit.
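The quoted figures imply a substantial aggregate capacity; a quick computation derived from the numbers above (my arithmetic, not a figure from the paper):

```python
# Aggregate capacity implied by the quoted figures (my arithmetic):
channels, rate_gbps, links = 50, 32, 20
per_link_gbps = channels * rate_gbps      # 1600 Gb/s per optical link
print(per_link_gbps, "Gb/s per link,",
      per_link_gbps * links / 1000, "Tb/s total")
# At 43.4 fJ/bit, one fully loaded 1.6 Tb/s link dissipates
# roughly 1.6e12 * 43.4e-15 ~ 0.07 W of switching energy.
```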

10.
This paper describes a heterogeneous multi-core processor (HMCP) architecture that integrates general-purpose processors (CPUs) and accelerators (ACCs) to achieve exceptional performance as well as low power consumption for the SoCs of embedded systems. The memory architectures of the CPUs and ACCs were unified to improve programming and compiling efficiency. Advanced audio codec-low complexity (AAC-LC) stereo audio encoding was parallelized on a heterogeneous multi-core having homogeneous processor cores and dynamically reconfigurable processor (DRP) accelerator cores in a preliminary evaluation of the HMCP architecture. The performance evaluation revealed that 54x real-time AAC encoding was achieved on the chip with two CPUs at 600 MHz and two DRPs at 300 MHz, enabling an entire CD to be encoded within 1-2 min.
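A quick sanity check of the closing figure (my arithmetic; the CD length is assumed):

```python
# Sanity check: encoding time for a full-length audio CD at 54x real time.
cd_minutes = 74                    # assumed full-length audio CD
print(cd_minutes / 54, "min")      # ~1.4 min, within the quoted 1-2 min
```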

11.
This article proposes the design and architecture of a dynamically scalable dual-core pipelined processor. The design methodology is core fusion: two independent cores can dynamically morph into a larger processing unit, or be used as distinct processing elements, to achieve high sequential performance and high parallel performance. The processor provides two execution modes. Mode 1 is a multiprogramming mode for executing instruction streams of lower data width; each core performs 16-bit operations individually, and performance improves through parallel execution of instructions in both cores, at the cost of area. In Mode 2, the two cores are coupled and behave as a single processing unit of larger data width that performs 32-bit operations; additional core-to-core communication is needed to realise this mode. The mode can switch dynamically, so the processor provides multiple functions with a single design. The processor was designed and verified using Verilog on the Xilinx 14.1 platform, in both simulation and synthesis, with the help of test programs. The design targets a Xilinx Spartan 3E XC3S500E FPGA.
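The fused mode's essential trick, composing a wide operation from two narrow datapaths, can be shown in a few lines; the carry hand-off below stands in for the core-to-core communication the article mentions (a sketch, not the article's Verilog design).

```python
# Sketch of the fused mode's idea: two 16-bit datapaths cooperate on one
# 32-bit add, with the carry passed between "cores" standing in for the
# core-to-core communication. Illustrative only.
MASK16 = 0xFFFF

def add16(a, b, carry_in=0):
    total = a + b + carry_in
    return total & MASK16, total >> 16        # (sum, carry_out)

def fused_add32(a, b):
    lo, carry = add16(a & MASK16, b & MASK16)                     # core 0
    hi, _ = add16((a >> 16) & MASK16, (b >> 16) & MASK16, carry)  # core 1
    return (hi << 16) | lo

assert fused_add32(0x0001FFFF, 0x00000001) == 0x00020000
print(hex(fused_add32(0x12345678, 0x11111111)))  # 0x23456789
```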

12.
In a distributed computer system, a group of processors is connected by communication links into a network. Each processor (node) of the network has an identity (a unique integer value) that is not related to its position in the network (its address); a processor's identity is known only to the processor itself. In the problem of leader election, exactly one processor among a network of processors has to be distinguished as the leader. Many efficient election protocols have been proposed for networks with a sense of direction; in particular, sequential search is used for election in a reliable complete network, and a multi-token search method is used in a faulty complete network. However, election protocols on a faulty ChRgN (chordal ring network) have not been investigated by other researchers. This paper addresses this issue by: studying the problem of leader election in an asynchronous ChRgN with a sense of direction and with the presence of undetectable fail-stop processor failures; proposing a new election protocol which (a) combines the concepts of sequential search and multi-token search, and (b) uses an efficient routing algorithm to reduce the total number of messages used; and presenting a protocol for a ChRgN of n processors with l chords/processor and at most f fail-stop faulty processors, with message complexity O(n + (n/l)log(n) + k·f), where k is the number of processors starting the election process spontaneously and at most f…
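For readers unfamiliar with the problem, the classic Chang-Roberts ring election below shows the baseline idea such protocols build on: identities circulate and the largest one survives a full loop. It is a simplified fault-free illustration, not the paper's chordal-ring protocol, and the sequential message counting only approximates a concurrent run.

```python
# Classic ring leader election (Chang-Roberts), shown for contrast with
# the paper's fault-tolerant chordal-ring protocol. Each identity
# circulates until it meets a larger one; the largest identity survives
# a full loop and its owner becomes leader. Worst case O(n^2) messages.
def chang_roberts(ids):
    n = len(ids)
    messages = 0
    for start in range(n):                 # every node starts spontaneously
        pos, token = start, ids[start]
        while True:
            pos = (pos + 1) % n            # forward around the ring
            messages += 1
            if ids[pos] > token:
                break                      # swallowed by a larger identity
            if ids[pos] == token:
                return ids[pos], messages  # full loop completed: leader
    raise AssertionError("unreachable for distinct ids")

leader, msgs = chang_roberts([5, 12, 3, 9, 27, 14])
print(f"leader id {leader} elected with {msgs} messages")
```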

13.
Aggressive processor design methodology using high-speed clocks and deep-submicrometer technology is necessitating the use of at-speed delay fault testing. Although nearly all modern processors use a pipelined architecture, no method has been proposed in the literature to model these for the purpose of test generation. This paper proposes a graph-theoretic model of pipelined processors and develops a systematic approach to path delay fault testing of such processor cores using the processor instruction set. The proposed methodology generates test vectors under the extracted architectural constraints. These test vectors can be applied in the functional mode of operation; hence, self-test becomes possible, and self-test in functional mode can also be used for online periodic testing. Our approach uses a graph model for architectural constraint extraction and path classification. Test vectors are generated using constrained automatic test pattern generation (ATPG) under the extracted constraints. Finally, a test program consisting of an instruction sequence is generated for the application of the generated test vectors. We applied our method to two example processors, a 16-bit 5-stage VPRO pipelined processor and a 32-bit pipelined DLX processor, to demonstrate the effectiveness of our methodology.
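A toy version of the graph step: model the datapath as a directed acyclic graph and enumerate the structural paths that are candidates for path delay fault tests. The stage names and edges are invented for illustration; the paper's model additionally encodes the instruction-imposed constraints that this sketch omits.

```python
# Toy pipeline datapath as a DAG; each enumerated path is a candidate
# for a path delay fault test. Assumed example graph, not the paper's.
edges = {
    "IF": ["ID"], "ID": ["EX", "FWD"], "FWD": ["EX"], "EX": ["WB"], "WB": [],
}

def all_paths(graph, src, dst, prefix=()):
    prefix = prefix + (src,)
    if src == dst:
        yield prefix
        return
    for nxt in graph[src]:
        yield from all_paths(graph, nxt, dst, prefix)

for p in all_paths(edges, "IF", "WB"):
    print(" -> ".join(p))   # IF -> ID -> EX -> WB, IF -> ID -> FWD -> EX -> WB
```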

14.
A processor is any self-contained computer of at least personal-computer capability. The paper explores how much the processor mean time-to-failure can be improved by replacing it with an N-processor module, where each processor in the module consists of a copy of the original processor augmented with a communication protocol unit. The copy of the original processor is faulty with probability pc, and the protocol unit is faulty with probability p. The asynchronous N-processor module uses a Byzantine agreement (F-ID-P) algorithm to identify which of its processors disagreed with a module consensus. The identified processors are presumed faulty, and the module replaces them with duplicates from a set of standbys. The F-ID-P algorithm is a modification of Bracha's, which guarantees that in a module of 3t+1 processors, up to t faults can be identified by at least t+1 non-faulty processors. The module fails if faults in more than t of its processors prevent it from: 1) obtaining a correct consensus, or 2) executing the algorithm. The F-ID-P algorithm departs from Bracha's by using a random instead of an adversary scheduler of message delays. Simulation showed that the F-ID-P algorithm almost always correctly identified all of a module's faulty processors if more than half of them were nonfaulty; thus the F-ID-P algorithm was about 3/2 times as fault tolerant as guaranteed. Also, compared to a single processor's mean number of decisions to failure, the F-ID-P module was 841 times better when N=37, down to 5.1 times better when N=10.
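The gap between the worst-case 3t+1 guarantee and the observed roughly-majority tolerance can be compared numerically. The sketch below estimates module survival probabilities under an assumed independent per-processor fault probability; the probability value and trial count are illustrative, not the paper's parameters.

```python
# Monte-Carlo sketch: with N = 3t+1 processors, agreement is guaranteed
# for up to t faults; the paper's simulations suggest F-ID-P usually
# survives until a majority is faulty. Fault probability is assumed.
import random

def p_module_ok(n=37, p_fault=0.2, trials=100_000, seed=7):
    t = (n - 1) // 3                       # guaranteed tolerance: N = 3t+1
    rng = random.Random(seed)
    ok_guaranteed = ok_majority = 0
    for _ in range(trials):
        faults = sum(rng.random() < p_fault for _ in range(n))
        ok_guaranteed += faults <= t       # worst-case guarantee
        ok_majority += faults < n / 2      # observed ~3/2-better behavior
    return ok_guaranteed / trials, ok_majority / trials

print("P(ok), guaranteed vs majority:", p_module_ok())
```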

15.
Recently, the power consumption of integrated circuits has been attracting increasing attention. Many techniques have been studied to improve the power efficiency of digital signal processing units such as fast Fourier transform (FFT) processors, which are widely employed both in traditional research fields, such as satellite communications, and in thriving consumer electronics, such as wireless communications. This paper presents solutions based on parallel architectures for high-throughput, power-efficient FFT cores. Different combinations of hybrid low-power techniques are exploited to reduce power consumption: multiplierless units that replace the complex multipliers in FFTs, low-power commutators based on an advanced interconnection, and parallel-pipelined architectures. A number of FFT cores are implemented and evaluated for their power/area performance. The results show that up to 38% and 55% power savings can be achieved by the proposed pipelined FFTs and parallel-pipelined FFTs respectively, compared to conventional pipelined FFT processor architectures.
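To illustrate what "multiplierless" means here: a multiplication by a fixed twiddle-factor coefficient can be rewritten as a handful of shifts and adds. The constant and bit width below are my example, not a circuit from the paper.

```python
# Multiplierless constant multiply: cos(pi/4) ~ 0.70710678, quantized to
# 8 fractional bits as 181/256 = 0b10110101, so x*181 decomposes into
# shifts and adds: x<<7 + x<<5 + x<<4 + x<<2 + x. Assumed example only.
def mul_by_cos_pi_4(x):
    return (x << 7) + (x << 5) + (x << 4) + (x << 2) + x  # x * 181

x = 1000
approx = mul_by_cos_pi_4(x) >> 8           # divide by 256
print(approx, round(x * 0.70710678))       # 707 vs 707
```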

16.
In this paper, we propose a novel reconfigurable processor using dynamically partitioned single-instruction multiple-data (DP-SIMD) processing, able to handle multimedia data. SIMD processors, and parallel SIMD (P-SIMD) processors composed of a number of SIMD processors, are widely used today. However, these processors are inefficient because all processing units (PUs) must execute the same operation at all times; the PUs can process different operations only when every SIMD group operation is predefined. We propose a processor control method which can partition the parallel processors into multiple SIMD-based processors dynamically to enhance efficiency. To evaluate the proposed method, we carried out the inverse transform, inverse quantization, and motion compensation operations of H.264 using processors based on SIMD, P-SIMD, and DP-SIMD control. Experimental results show that the DP-SIMD control method is more efficient than the SIMD and P-SIMD control methods by about 15% and 14%, respectively.
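A functional sketch of the control idea: the PU array is split at run time into groups, each executing a different operation on its slice of the data. The partition sizes and operations are assumed examples, not the paper's H.264 kernels.

```python
# Dynamically partitioned SIMD, modeled functionally: one array of "PUs"
# split into groups, each group applying its own vectorized operation.
import numpy as np

def dp_simd(data, partition, ops):
    """partition: list of group sizes; ops: one vectorized op per group."""
    out, start = np.empty_like(data), 0
    for length, op in zip(partition, ops):
        out[start:start + length] = op(data[start:start + length])
        start += length
    return out

data = np.arange(8, dtype=np.float64)
# Four PUs do a dequantization-style scaling while four PUs do an add:
print(dp_simd(data, [4, 4], [lambda v: v * 0.5, lambda v: v + 10]))
```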

17.
The general-purpose, highly parallel cellular array processor (CAP) we developed features multiple-instruction stream, multiple-data stream (MIMD) processing and image display. Processor elements can number in the hundreds; the present system uses 256 processors. Each processor element consists of a general-purpose microprocessor, memory, and a special VLSI chip that performs parallel-processing-specific functions such as processor communication and synchronization. The VLSI chip has two 2M byte/s independent common bus interfaces for data broadcasting and six 15M bit/s serial communication ports for local data communication; it can also process image data in real time for multiple processors. Use of the communication interfaces enables a variety of processor networks to be configured. One CAP application has been computer graphics, in which ray tracing is used to generate high-quality images.

18.
This paper addresses embedded multiprocessor implementation of iterative, real-time applications, such as digital signal and image processing, that are specified as dataflow graphs. Scheduling dataflow graphs on multiple processors involves assigning tasks to processors (processor assignment), ordering the execution of tasks within each processor (task ordering), and determining when each task must commence execution. We consider three scheduling strategies: fully-static, self-timed, and ordered transactions, all of which perform the assignment and ordering steps at compile time. Run-time costs are small for the fully-static strategy; however, it is not robust with respect to changes or uncertainty in task execution times. The self-timed approach is tolerant of variations in task execution times, but pays the penalty of high run-time costs, because processors need to explicitly synchronize whenever they communicate. The ordered-transactions approach lies between the fully-static and self-timed strategies: the order in which processors communicate is determined at compile time and enforced at run time, retaining some of the flexibility of self-timed schedules while incurring lower run-time costs. In this paper we determine an order of processor transactions that is nearly optimal given compile-time information about task execution times, for a given processor assignment and task ordering; the criterion for optimality is the average throughput achieved by the schedule. Our main result is that it is possible to choose a transaction order such that the resulting ordered-transactions schedule incurs no performance penalty compared to the more flexible self-timed strategy, even when the higher run-time costs implied by the self-timed strategy are ignored.
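The distinction can be made concrete with a toy model: task execution times vary at run time (as in self-timed schedules), but the shared-bus transaction order is fixed at compile time, and the compiler picks the order with the best average throughput. The task-time ranges, transfer cost, and bus model below are assumed for illustration and are not the paper's interprocessor-communication model.

```python
# Toy ordered-transactions model: sends over a shared bus must occur in
# a compile-time order, while task start times still float with actual
# execution times. All parameters are assumed examples.
import itertools, random

def throughput(exec_range, order, iterations=2000, seed=3):
    """Average iterations per unit time under a fixed transaction order."""
    rng = random.Random(seed)
    bus_free = 0.0
    ready = {p: 0.0 for p in exec_range}          # when each proc may start
    for _ in range(iterations):
        for proc in order:                        # enforced bus order
            finish = ready[proc] + rng.uniform(*exec_range[proc])
            send_done = max(finish, bus_free) + 0.1   # fixed transfer cost
            bus_free = send_done
            ready[proc] = send_done
    return iterations / bus_free

ranges = {"P0": (0.8, 1.2), "P1": (0.4, 1.5)}
best = max(itertools.permutations(ranges),
           key=lambda order: throughput(ranges, order))
print("best compile-time transaction order:", best)
```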

19.
System miniaturization under strict volume, weight, and power constraints faces multiple technical challenges; to meet the requirements of high-density computing and miniaturization, high-density system integration and single-chip multi-core processors are critical. This paper discusses high-density integration and single-chip multi-core processor technologies and their research progress, including chip multiprocessors (CMP), networks-on-chip (NoC), 3D integrated circuits, and high-density packaging. Two development trends of CMPs are identified: small cores in large numbers, and hierarchical cluster structures. The paper points out that high-density integration design and high-density packaging design are gradually converging, providing the technical guarantee for the physical implementation of single-chip multi-core systems and a hardware solution for ultimately realizing high-density computing and miniaturized systems.

20.
We examine the diagnosis of processor array systems formed as two-dimensional arrays, with boundaries, and either four or eight neighbors for each interior processor. We employ a parallel test schedule: neighboring processors test each other and report the results. Our diagnostic objective is to find a fault-free processor or set of processors. The system may then be sequentially diagnosed by repairing those processors tested faulty according to the identified fault-free set, or a job may be run on the identified fault-free processors. We establish an upper bound on the maximum number of faults which can be sustained without invalidating the test results under worst-case conditions, and give test schedules and diagnostic algorithms which meet the upper bound as far as the highest-order term. We compare these near-optimal diagnostic algorithms to alternative algorithms, both new and already in the literature, and against an upper-bound ideal-case algorithm, which is not necessarily practically realizable. For eight-way array systems with N processors, an ideal algorithm has diagnosability 3N^(2/3) - 2N^(1/2) plus lower-order terms; no algorithm exists which can exceed this. We give an algorithm which starts with tests on diagonally connected processors and achieves approximately this diagnosability, so it is optimal to within the two most significant terms of the maximum diagnosability. Similarly, for four-way array systems with N processors, no algorithm can have diagnosability exceeding 3N^(2/3)/2^(1/3) - 2N^(1/2) plus lower-order terms. We give an algorithm which begins with tests arranged in a zigzag pattern, pairing nodes for tests in two different directions in two consecutive test stages; this algorithm achieves diagnosability (3/2)(5/2)^(1/3)N^(2/3) - (5/4)N^(1/2) plus lower-order terms, which is about 0.85 of the upper bound due to an ideal algorithm.
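Evaluating the reconstructed expressions at a concrete size makes the bounds tangible; the array size below is my example, not one from the paper.

```python
# Evaluating the diagnosability expressions for a concrete size
# (N = 4096 processors; lower-order terms dropped, as in the abstract).
N = 4096
n23, n12 = N ** (2 / 3), N ** 0.5
print(f"8-way ideal bound : {3 * n23 - 2 * n12:.0f} faults")
print(f"4-way ideal bound : {3 * n23 / 2 ** (1 / 3) - 2 * n12:.0f} faults")
print(f"4-way zigzag algo : {1.5 * 2.5 ** (1 / 3) * n23 - 1.25 * n12:.0f} faults")
# Leading-term ratio, zigzag vs 4-way ideal: ~0.85, as the abstract states.
print((1.5 * 2.5 ** (1 / 3)) / (3 / 2 ** (1 / 3)))
```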
