首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 0 毫秒
1.
A procedure is described to prepare an exam schedule which meets the following goals: (1) No student should have two exams at the same time. (2) No student should take more than two exams in a day. (3) The number of exam days should be a minimum. (4) The total number of students having two exams in a day is minimum. (5) The number of students having two consecutive exams in a day is a minimum.  相似文献   

2.
A structured video consists of a collection of background objects, characters, spatial and temporal constructs, and rendering features. Assuming a platform consisting of a fixed amount of memory and a magnetic disk drive, this study presents a resource scheduler for the continuous display of structured video that minimizes both the latency observed by a display and its required amount of memory  相似文献   

3.
同时多线程(SMT)是一种允许多个独立的线程每周期发射多条指令的技术,这种技术充分利用了可能存在的指令级并行和线程级并行,提高了有限资源的利用率。文章以西北工业大学航空微电子中心自主研发的32位超标量处理器“龙腾R2”为基础,引入SMT技术,在基本不改变内部结构大小、不增加执行功能部件、仅做一些必要修改的前提条件下进行研究。通过仿真不同的线程数和各种线程组合,进行性能分析。尽管存在制约性能提升的一些因素,引入SMT技术后依然获得了最高约50%的性能增加。  相似文献   

4.
Trace-driven simulation of out-of-order superscalar processors is far from straightforward. The dynamic nature of out-of-order superscalar processors combined with the static nature of traces can lead to large inaccuracies in the results when the traces contain only a subset of executed instructions for trace reduction. In this paper, we describe and comprehensively evaluate the pairwise dependent cache miss model (PDCM), a framework for fast and accurate trace-driven simulation of out-of-order superscalar processors. The model determines how to treat a cache miss with respect to other cache misses recorded in the trace by dynamically reconstructing the reorder buffer state during simulation and honoring the dependencies between the trace items. Our experimental results demonstrate that a PDCM-based simulator produces highly accurate simulation results (less than 3% error) with fast simulation speeds (62.5× on average) compared with an execution-driven simulator. Moreover, we observed that the proposed simulation method is capable of preserving a processor’s dynamic off-core memory access behavior and accurately predicting the relative performance change when a processor’s low-level memory hierarchy parameters are changed.  相似文献   

5.
6.
While high-performance architectures have included some Instruction-Level Parallelism (ILP) for at least 25 years, recent computer designs have exploited ILP to a significant degree. Although a local scheduler is not sufficient for generation of excellent ILP code, it is necessary as many global scheduling and software pipelining techniques rely on a local scheduler. Global scheduling techniques are well-documented, yet practical discussions of local schedulers are notable in their absence. This paper strives to remedy that disparity by describing a list scheduling framework and several important practical details that, taken together, allow implementation of an efficient local instruction scheduler that is easily retargetable for ILP architectures. The foundation of our machine-independent instruction scheduler is a timing model that allows easy retargetability to a wide range of architectures. In addition to describing how a general list-scheduler can be implemented within the framework of our timing model, experimental results indicate that lookahead scheduling can profoundly improve a scheduler's ability to produce a legal schedule. Further experimental data shows that deciding to schedule a data dependence DAG (DDD) in forward or reverse order depends significantly upon that target architecture, suggesting the possibility of scheduling in each direction and using the best of the two schedules. In contrast, experiments demonstrate little difference in code quality for schedules generated by either instruction-driven or operation-driven schedulers. Thus, the inherent flexibility of operation-driven methods suggests including that approach in a retargetable instruction scheduler. List scheduling is, of course, a heuristic scheduling method. A variety of scheduling heuristics are presented. In addition, the paper describes a method, using a genetic algorithm search, to ‘fine-tune’ the weights of twenty-four individual heuristics to form a DDD-node heuristic tuned to a specific architecture. © 1998 John Wiley & Sons, Ltd.  相似文献   

7.
为了追求更高的性能,处理器核的主频不断提升,处理器核的设计日益复杂,随之而来的是功耗问题越来越突出。除了在工艺级和电路级采用低功耗技术外,在逻辑设计阶段通过分析处理器核各个功能模块的特点并采用相应的技术手段,也可以有效降低功耗。对一款乱序超标量处理器核中功耗比较突出的模块——寄存器文件和再定序缓冲——进行了逻辑设计优化,在程序运行性能几乎不受影响的情况下明显减少了面积,降低了功耗。  相似文献   

8.
Suzuki  K. Arai  T. Nadehara  K. Kuroda  I. 《Micro, IEEE》1998,18(2):36-47
The V830R/AV's real-time decoding of MPEG-2 video and audio data enables practical embedded-processor-based multimedia systems  相似文献   

9.
《Performance Evaluation》2006,63(9-10):939-955
Increasing diversity in telecommunication workloads leads to greater complexity in communication protocols. This occurs as channel bandwidth rapidly increases. These factors result in larger computational loads for network processors that are increasingly turning to high performance microprocessor designs. This paper presents an analytical method for estimating the performance of instruction level parallel (ILP) processors executing network protocol processing applications. Instruction dependency information extracted while executing an application is used to calculate upper and lower bounds for throughput, measured in instructions per cycle (IPC). Results using UDP/TCP/IP applications show that the simulated IPC values fall between the analytically derived upper and lower bounds, validating the model. The analytical method is much less expensive than cycle-accurate simulation, but reveals similar throughput performance predictions. This allows the architectural design space for network superscalar processors to be explored more rapidly and comprehensively, to reveal the maximum IPC that is possible for a given application workload and the available hardware resources.  相似文献   

10.
11.
Coarse-grained reconfigurable platforms are good for parallel data-intensive applications but inefficient for sequential control-dominated code. This article explores the integration of the general purpose Sparc-compliant Leon processor with the Extreme Processing Platform reconfigurable data path. The integration's goal is to optimize the execution of complex multimedia applications such as MPEG-4.  相似文献   

12.
Virtualization facilitates the provision of flexible resources and improves energy efficiency through the consolidation of virtualized servers into a smaller number of physical servers. As an increasingly essential component of the emerging cloud computing model, virtualized environments bill their users based on processor time or the number of virtual machine instances. However, accounting based only on the depreciation of server hardware is not sufficient because the cooling and energy costs for data centers will exceed the purchase costs for hardware. This paper suggests a model for estimating the energy consumption of each virtual machine without dedicated measurement hardware. Our model estimates the energy consumption of a virtual machine based on in-processor events generated by the virtual machine. Based on this estimation model, we also propose a virtual machine scheduling algorithm that can provide computing resources according to the energy budget of each virtual machine. The suggested schemes are implemented in the Xen virtualization system, and an evaluation shows that the suggested schemes estimate and provide energy consumption with errors of less than 5% of the total energy consumption.  相似文献   

13.
In our previously published research we discovered some very difficult to predict branches, called unbiased branches. Since the overall performance of modern processors is seriously affected by misprediction recovery, especially these difficult branches represent a source of important performance penalties. Our statistics show that about 28% of branches are dependent on critical Load instructions. Moreover, 5.61% of branches are unbiased and depend on critical Loads, too. In the same way, about 21% of branches depend on MUL/DIV instructions whereas 3.76% are unbiased and depend on MUL/DIV instructions. These dependences involve high-penalty mispredictions becoming serious performance obstacles and causing significant performance degradation in executing instructions from wrong paths. Therefore, the negative impact of (unbiased) branches over global performance should be seriously attenuated by anticipating the results of long-latency instructions, including critical Loads. On the other hand, hiding instructions’ long latencies in a pipelined superscalar processor represents an important challenge itself. We developed a superscalar architecture that selectively anticipates the values produced by high-latency instructions. In this work we are focusing on multiply, division and loads with miss in L1 data cache, implementing a dynamic instruction reuse scheme for the MUL/DIV instructions and a simple last value predictor for the critical Load instructions. Our improved superscalar architecture achieves an average IPC speedup of 3.5% on the integer SPEC 2000 benchmarks, of 23.6% on the floating-point benchmarks, and an improvement in energy-delay product (EDP) of 6.2% and 34.5%, respectively. We also quantified the impact of our developed selective instruction reuse and value prediction techniques in a simultaneous multithreaded architecture (SMT) that implies per thread reuse buffers and load value prediction tables. Our simulation results showed that the best improvements on the SPEC integer applications have been obtained with 2 threads: an IPC speedup of 5.95% and an EDP gain of 10.44%. Although, on the SPEC floating-point programs, we obtained the highest improvements with the enhanced superscalar architecture, the SMT with 3 threads also provides an important IPC speedup of 16.51% and an EDP gain of 25.94%.  相似文献   

14.

This article presents the design and implementation of an air-crew assignment system, for producing and refining a solution to this problem, based on the artificial intelligence principles and techniques of abductive reasoning as captured by the framework of abductive logic programming (ALP). The system offers a high level of flexibility in addressing both the tasks of crew scheduling and rescheduling. Itcan be used to generate a valid and good quality initial solution and then help the human operators adjust and refine further this solution in order to meet extra requirements of the problem. These additional needs can arise either due to new foreseen requirements that the company wants to have or experiment with for a particular period in time, or due to unexpected events that have occurred while the solution (crew-roster) is in operation. This work shows the ability and flexibility of abduction, and, more specifically, of ALP, in tackling problems of this type with complex and changing requirements.  相似文献   

15.
阵列众核处理器由于其较高的计算性能和能效比已经被广泛应用于高性能计算领域。而要构建未来高性能计算系统处理器必须解决严峻的"访存墙"挑战以及核心协同问题。通常的阵列处理器中,核心多采用单线程结构,以减少开销,但是对访存提出了较高的要求。在阵列众核处理器中,在单核心中引入硬件同时多线程技术,针对实验中一级指令缓存命中率随着线程数增加而显著降低的问题,提出了一种面向阵列众核处理器的冗余指令缓存存储结构,基于该结构,提出采用FIFO及类LRU替换策略。通过上述优化的高速缓存结构设计,经实验模拟,双线程整体指令Cache失效率降低了25.2%,整体CPI性能提升了30.2%。  相似文献   

16.
《电子技术应用》2016,(1):19-21
多核同时多线程处理器(SMT_PAAG)是用于图形、图像及数字信号处理的一种多核处理器。基于这种处理器提出了一种硬件线程调度器,该调度器采用同时多线程技术,最多可同时执行四个线程,支持八个线程阻塞模式下的快速上下文切换。这样避免了因阻塞带来的等待问题,能够有效提高处理器的工作效率和资源利用率。通过在处理器上运行图形处理算法进行性能评测。结果表明,SMT-PAAG处理器通过挖掘指令级并行和线程级并行,将处理器的性能提高了69.25%。  相似文献   

17.
An implementable parallel scheduler for input-queued switches   总被引:1,自引:0,他引:1  
Giaccone  P. Shah  D. Prabhakar  B. 《Micro, IEEE》2002,22(1):19-25
The Apsara algorithm is an input-queued switch scheduler that uses limited parallelism to find a matching in a single iteration, as compared to the O(N3) iterations of the more common maximum-weight matching algorithm. The Apsara algorithm also achieves a throughput of up to 100 percent and has very good delay properties  相似文献   

18.
Multi-module memory has been employed in high-end digital signal processing system (DSP). It provides high memory bandwidth and low power operating mode for energy savings. However, making full use of these architectural features is a challenging problem for code optimization. In this paper, we propose an integer linear programming (ILP) model to optimize the performance and energy consumption of multi-module memories by solving variable assignment, instruction scheduling and operating mode setting problems simultaneously. The combined effect of performance and energy saving requirements has been considered as well. Specially, we develop two optimization techniques to improve the computation efficiency of our ILP model. The experimental results show that the optimal performance and energy solution can be achieved within a reasonable amount of time.  相似文献   

19.
This paper presents a parallel algorithm for computing the inversion of a dense matrix based on modified Jordan's elimination which requires fewer calculation steps than the standard one. The algorithm is proposed for the implementation on the linear array with a small to moderate number of processors which operate in a parallel-pipeline fashion. A communication between neighboring processors is achieved by a common memory module implemented as a FIFO memory module. For the proposed algorithm we define a task scheduling procedure and prove that it is time optimal. In order to compute the speedup and efficiency of the system, two definitions (Amdahl's and Gustafson's) were used. For the proposed architecture, involving two to 16 processors, estimated Gustafson's (Amdahl's) speedups are in the range 1.99 to 13.76 (1.99 to 9.69).  相似文献   

20.
This paper proposes an enhanced method of multiple branch prediction using a per-primary branch history table. This scheme improves the previous ones based on a single global branch history register, by reducing interferences among histories of different branches caused by sharing a single register. This scheme also allows the prediction of a branch not to affect the prediction of other branches that are predicted in the same cycle, thus allowing independent and parallel prediction of multiple branches. Our experimental results indicate that these features help to achieve higher prediction accuracy than that of the previous global history scheme (which is already high) with the less hardware cost (i.e., 96.1% vs. 95.1% for integer code and 95.7% vs. 94.9% for floating-point code including nasa7, for a given hardware budget of 128K bits). Moreover, the increased prediction accuracy causes better fetch bandwidth of a superscalar machine (i.e., 7.1 vs. 6.9 instructions per clock cycle for integer code and 11.0 vs. 10.9 instructions per cycle for floating-point code).  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号