首页 | 本学科首页   官方微博 | 高级检索  
文章检索
  按 检索   检索词:      
出版年份:   被引次数:   他引次数: 提示:输入*表示无穷大
  收费全文   10篇
  免费   0篇
自动化技术   10篇
  2009年   1篇
  2004年   1篇
  2003年   2篇
  2002年   1篇
  2001年   1篇
  2000年   1篇
  1999年   1篇
  1995年   1篇
  1990年   1篇
排序方式: 共有10条查询结果,搜索用时 171 毫秒
1
1.
2.
Cheong  H. Veidenbaum  A.V. 《Computer》1990,23(6):39-47
The necessity of finding alternatives to hardware-based cache coherence strategies for large-scale multiprocessor systems is discussed. Three different software-based strategies sharing the same goals and general approach are presented. They consist of a simple invalidation approach, a fast selective invalidation scheme, and a version control scheme. The strategies are suitable for shared-memory multiprocessor systems with interconnection networks and a large number of processors. Results of trace driven simulations conducted on numerical benchmark routines to compare the performance of the three schemes are presented  相似文献   
3.
4.
Latency hiding techniques are increasingly used to minimize the effect of a long memory latency in multiprocessors. Their use requires additional network bandwidth. The network organization and its design parameters alone can significantly affect performance. With latency hiding, system performance depends on how well the interconnection network can support the use of such techniques and their interaction with network organization. This paper investigates these issues for prefetching and weak consistency in a 128-processor shared-memory system with either a 2-D torus, a multistage, or a single-stage network. The performance impact of network organization and the link bandwidth, with and without the use of latency hiding techniques is shown. The effect of caching and of limiting the number of outstanding memory requests is shown. Multistage is the most robust network and has the best performance under all conditions. Single-stage network is very close in performance when sufficient channel bandwidth is available. Torus network comes in last when channel bandwidth is high, but can exceed single stage performance when it is low. The relative performance of the three networks with prefetching remains similar, with torus gaining the most. Benchmark execution time can decrease by as much as 25% with prefetching. Further gains depend on reducing the effect of write traffic. Finally, the existence of an optimal number of outstanding requests is shown but the value is program-dependent.  相似文献   
5.
Both hardware and software prefetching have been shown to be effective in tolerating the large memory latencies inherent in shared-memory multiprocessors; however, both types of prefetching have their shortcomings. While software schemes require less hardware support than hardware schemes, they must generate address calculation instructions and a prefetch instruction for each datum that needs to be prefetched. Hardware schemes, however, must become progressively more complex to be able to compute data access strides and to increase the prefetching lookahead. In this paper, we propose an integrated hardware/software prefetching method that uses simple hardware that can handle most data accesses and software prefetching for the few remaining accesses. A compile time algorithm analyzes the access streams formed by array references and determines sequences of consecutive memory accesses to an access stream that can be prefetched by the hardware mechanism. This analysis is based on the relative memory locations of consecutive accesses to an access stream and the number of intervening data references between consecutive accesses to an access stream. In addition, the prefetching lookahead can be set separately for each access stream. Our approach yields an effective scheme that minimizes both CPU overhead and hardware costs. Execution-driven simulations show our method to be very effective.  相似文献   
6.
International Journal of Parallel Programming -  相似文献   
7.
8.
Most power reduction techniques have focused on gating the clock to unused functional units to minimize static power consumption, while system level optimizations have been used to deal with dynamic power consumption. Once these techniques are applied, register file power consumption becomes a dominant factor in the processor. This paper proposes a power-aware reconfiguration mechanism in the register file driven by a compiler. Optimal usage of the register file in terms of size is achieved and unused registers are put into a low-power state. Total energy consumption in the register file is reduced by 65% with no appreciable performance penalty for MiBench benchmarks on an embedded processor. The effect of reconfiguration granularity on energy savings is also analyzed, and the compiler approach to optimize energy results is presented.  相似文献   
9.
Even though computing systems have increased the number of transistors, the switching speed, and the number of processors, most programs exhibit limited speedup due to the serial dependencies of existing algorithms. Analysis of intrinsically parallel systems such as brain circuitry have led to the identification of novel architecture designs, and also new algorithms than can exploit the features of modern multiprocessor systems. In this article we describe the details of a brain derived vision (BDV) algorithm that is derived from the anatomical structure, and physiological operating principles of thalamo-cortical brain circuits. We show that many characteristics of the BDV algorithm lend themselves to implementation on IBM CELL architecture, and yield impressive speedups that equal or exceed the performance of specialized solutions such as FPGAs. Mapping this algorithm to the IBM CELL is non-trivial, and we suggest various approaches to deal with parallelism, task granularity, communication, and memory locality. We also show that a cluster of three PS3s (or more) containing IBM CELL processors provides a promising platform for brain derived algorithms, exhibiting speedup of more than 140 × over a desktop PC implementation, and thus enabling real-time object recognition for robotic systems.  相似文献   
10.
The success of large-scale, hierarchical and distributed shared memory systems hinges on our ability to reduce delays resulting from remote accesses to shared data. To facilitate this, we present a compile-time algorithm for analyzing programs with doall-style parallelism to determine when read and write accesses to shared data areredundant (unnecessary). One identified, redundant remote accesses can be replaced by local accesses or eliminated entirely. This optimization improves program performance in two ways. First, slow memory accesses are replaced by faster ones. Second, the time to perform other remote memory accesses may be reduced as a result of the decreased traffic level. We also show how the information obtained through redundancy analysis can be used for other compiler optimizations such as prefetching and cache management.  相似文献   
1
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号