Similar Documents
 20 similar documents found (search time: 31 ms)
1.
For analytical cache modeling, stack distance theory is widely used to predict the behavior of LRU caches. Typically, stack distance histograms are collected by profiling memory references. However, the profiled references reflect only instruction fetches and load/store executions, i.e., the memory accesses that reach the first-level (L1) caches, so these traces cannot be applied directly to construct stack distance histograms for downstream (L2 and L3) caches. This paper therefore proposes a stack distance probability model that extends stack distance theory to predicting the behavior of multi-level LRU caches. The inputs of our model are the L1 cache stack distance histograms and the multi-level LRU cache configurations; the outputs are the L2 and L3 cache stack distance histograms, from which the conflict misses in the L2 and L3 caches can be quantified quickly and precisely. Fifteen benchmarks chosen from Mobybench 2.0, MiBench I and MediaBench II are used to evaluate the accuracy of our model. Compared with the simulation results from Gem5 in AtomicSimpleCPU mode, the average absolute error in predicting cache misses for the I/D-shared L2 cache is less than 5%, while that for the L3 cache misses is less than 7%. Furthermore, in contrast to the time overhead of Gem5 AtomicSimpleCPU simulations, our model speeds up cache miss prediction by about 100× on average.
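As a quick illustration of the histogram that serves as this model's input, the Python sketch below computes an LRU stack distance histogram from a toy sequence of block addresses. It is a generic textbook construction with made-up names, not the authors' profiling tool.

from collections import Counter

def stack_distance_histogram(block_addresses):
    """Return {stack_distance: count}; first-touch (cold) references map to -1."""
    lru_stack = []            # most recently used block kept at the end
    histogram = Counter()
    for block in block_addresses:
        if block in lru_stack:
            # Distance = number of distinct blocks touched since the last access
            # to this block (0 means an immediate re-reference).
            distance = len(lru_stack) - 1 - lru_stack.index(block)
            histogram[distance] += 1
            lru_stack.remove(block)
        else:
            histogram[-1] += 1
        lru_stack.append(block)
    return dict(histogram)

# In an A-way set-associative LRU cache, a reference with (per-set) stack
# distance d hits iff d < A, so the histogram directly yields hit/miss counts.
print(stack_distance_histogram([0, 1, 2, 0, 3, 0, 1]))   # {-1: 4, 2: 1, 1: 1, 3: 1}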

2.
Trace-driven simulation of out-of-order superscalar processors is far from straightforward. The dynamic nature of out-of-order superscalar processors combined with the static nature of traces can lead to large inaccuracies in the results when the traces contain only a subset of executed instructions for trace reduction. In this paper, we describe and comprehensively evaluate the pairwise dependent cache miss model (PDCM), a framework for fast and accurate trace-driven simulation of out-of-order superscalar processors. The model determines how to treat a cache miss with respect to other cache misses recorded in the trace by dynamically reconstructing the reorder buffer state during simulation and honoring the dependencies between the trace items. Our experimental results demonstrate that a PDCM-based simulator produces highly accurate simulation results (less than 3% error) with fast simulation speeds (62.5× on average) compared with an execution-driven simulator. Moreover, we observed that the proposed simulation method is capable of preserving a processor’s dynamic off-core memory access behavior and accurately predicting the relative performance change when a processor’s low-level memory hierarchy parameters are changed.
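The following Python sketch illustrates, in deliberately simplified form, the kind of decision PDCM makes: whether a cache miss recorded in the trace can overlap an earlier miss (both within the reorder-buffer window and independent) or must be serialized. The function, its parameters and the window test are illustrative assumptions, not the model's actual reconstruction of ROB state.

def classify_misses(miss_positions, deps, rob_size):
    """miss_positions: sorted instruction indices of long-latency cache misses.
    deps: set of (earlier, later) pairs with a data dependency between misses.
    Returns (miss, 'overlapped' | 'serialized') decisions."""
    decisions, prev = [], None
    for m in miss_positions:
        if prev is None:
            decisions.append((m, "serialized"))      # nothing to overlap with
        elif (prev, m) in deps:
            decisions.append((m, "serialized"))      # depends on the prior miss
        elif m - prev >= rob_size:
            decisions.append((m, "serialized"))      # prior miss already left the ROB
        else:
            decisions.append((m, "overlapped"))      # independent and in-window
        prev = m
    return decisions

print(classify_misses([10, 15, 40, 300], deps={(10, 15)}, rob_size=128))
# [(10, 'serialized'), (15, 'serialized'), (40, 'overlapped'), (300, 'serialized')]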

3.
This paper presents a novel generative model to synthesize fluid simulations from a set of reduced parameters. A convolutional neural network is trained on a collection of discrete, parameterizable fluid simulation velocity fields. Due to the capability of deep learning architectures to learn representative features of the data, our generative model is able to accurately approximate the training data set, while providing plausible interpolated in-betweens. The proposed generative model is optimized for fluids by a novel loss function that guarantees divergence-free velocity fields at all times. In addition, we demonstrate that we can handle complex parameterizations in reduced spaces, and advance simulations in time by integrating in the latent space with a second network. Our method models a wide variety of fluid behaviors, thus enabling applications such as fast construction of simulations, interpolation of fluids with different parameters, time re-sampling, latent space simulations, and compression of fluid simulation data. Reconstructed velocity fields are generated up to 700× faster than re-simulating the data with the underlying CPU solver, while achieving compression rates of up to 1300×.
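One standard way to obtain exactly divergence-free 2D velocity fields, sketched below with NumPy finite differences, is to take the curl of a scalar stream function. This is offered only as an illustration of the incompressibility constraint the loss enforces, not as the paper's network or loss implementation; grid size and field names are made up.

import numpy as np

def velocity_from_stream_function(psi):
    """2D curl of a scalar field: u = d(psi)/dy, v = -d(psi)/dx."""
    u = np.gradient(psi, axis=0)       # derivative along y (rows)
    v = -np.gradient(psi, axis=1)      # derivative along x (columns)
    return u, v

def divergence(u, v):
    return np.gradient(u, axis=1) + np.gradient(v, axis=0)   # du/dx + dv/dy

y, x = np.meshgrid(np.linspace(0, 1, 64), np.linspace(0, 1, 64), indexing="ij")
psi = np.sin(2 * np.pi * x) * np.cos(2 * np.pi * y)           # any smooth scalar field
u, v = velocity_from_stream_function(psi)
print("max |div| =", np.abs(divergence(u, v)).max())          # ~1e-16: divergence-free by construction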

4.
Shylashree  N.  Venkatesh  B.  Saurab  T. M.  Srinivasan  Tarun  Nath  Vijay 《Microsystem Technologies》2019,25(6):2349-2359

All modern computational devices contain an ALU. As software grows more complex and increasingly parallel, high-speed processors benefit from hardware support for time-consuming operations such as multiplication. Small, compact devices such as IoT nodes must run software such as security applications and be able to offload computation cost from the cloud. In this paper, a high-speed 8-bit ALU using 18 nm FinFET technology is proposed. The arithmetic and logic unit consists of fast compute units, namely a Kogge-Stone fast adder and a Dadda multiplier, along with basic logic gates. Each compute unit is optimized for speed while keeping area consumption reasonable. The Dadda multiplier uses an 8 × 8 architecture, as opposed to the conventional 4 × 4 approach, making it a true 8-bit ALU. Simulation and analysis are done using Cadence Virtuoso in the Analog Design Environment. The transistor count of the proposed design is 5298, the power consumption is 219 µW and the maximum delay is 166.8 ps. The design is also expected to consume at most one clock cycle for any computation.
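For readers unfamiliar with the adder named in the abstract, the following behavioural Python sketch reproduces the Kogge-Stone parallel-prefix carry computation for 8-bit operands (three prefix levels of generate/propagate merging). It is a bit-level illustration of the scheme, not the authors' FinFET circuit.

def kogge_stone_add(a, b, width=8):
    """Behavioural model of an 8-bit Kogge-Stone (parallel-prefix) adder."""
    bit = lambda x, i: (x >> i) & 1
    g = [bit(a, i) & bit(b, i) for i in range(width)]    # generate signals
    p = [bit(a, i) ^ bit(b, i) for i in range(width)]    # propagate signals
    span = 1
    while span < width:                                  # prefix levels: spans 1, 2, 4
        g_next, p_next = g[:], p[:]
        for i in range(span, width):
            g_next[i] = g[i] | (p[i] & g[i - span])
            p_next[i] = p[i] & p[i - span]
        g, p, span = g_next, p_next, span * 2
    carries = [0] + g                                    # carry into bit i+1 = group generate of bits [i:0]
    total = 0
    for i in range(width):
        total |= ((bit(a, i) ^ bit(b, i)) ^ carries[i]) << i
    return total, carries[width]                         # (sum mod 2**width, carry-out)

assert all(kogge_stone_add(a, b) == ((a + b) & 0xFF, (a + b) >> 8)
           for a in range(0, 256, 17) for b in range(0, 256, 13))
print(kogge_stone_add(200, 100))                         # (44, 1) because 300 = 256 + 44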


5.
《Parallel Computing》1997,23(13):1963-1986
Memory conflict is a major phenomenon that can cause dramatic loss of performance in vector pipeline multiprocessors. Various techniques have been proposed and implemented to avoid such conflicts; they rely mostly on well-tuned allocation of vector elements to memory banks (using either programming tools or hard-wired features). We tackle this problem in another way. Instead of trying to avoid memory contention, we aim to enhance the performance of the memory system by scheduling vector element accesses so as to increase the rate of memory accesses. This scheduling depends on memory bank activity at the time an access is issued, leading to out-of-order access to vector elements. An out-of-order pipeline execution is associated with this out-of-order memory access in order to maintain functional unit chaining in the processor. In this paper we study several factors that influence this execution model: vector length, number of processors and number of banks. An analysis of the model using Markov chain techniques and simulation results are also presented; they show the advantage of this model over the classical one found in pipelined vector supercomputers.
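A toy Python sketch of the scheduling idea follows: rather than issuing vector element accesses in order and stalling on busy banks, the scheduler issues any pending element whose bank is currently free. The bank count, busy time and greedy policy are illustrative assumptions rather than the paper's exact model.

def schedule_accesses(addresses, num_banks, bank_busy_cycles):
    """Issue at most one pending vector element per cycle, skipping elements whose bank is busy."""
    pending = list(range(len(addresses)))       # element indices not yet issued
    bank_free_at = [0] * num_banks              # cycle at which each bank becomes free again
    issue_order, cycle = [], 0
    while pending:
        for idx in pending:                     # out-of-order: first pending element with a free bank
            bank = addresses[idx] % num_banks
            if bank_free_at[bank] <= cycle:
                bank_free_at[bank] = cycle + bank_busy_cycles
                issue_order.append(idx)
                pending.remove(idx)
                break
        cycle += 1
    return issue_order, cycle

# Addresses 0, 4 and 8 all hit bank 0; out-of-order issue slips other elements in between.
print(schedule_accesses([0, 4, 8, 1, 5, 2], num_banks=4, bank_busy_cycles=3))
# -> ([0, 3, 5, 1, 4, 2], 7); strict in-order issue of the same stream needs 12 cycles.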

6.
Out-of-order execution is widely adopted in modern microprocessor design to improve pipeline performance, but fully out-of-order organizations, which both execute and retire instructions out of order, are not common in superscalar processors. Such a fully out-of-order organization poses a great challenge to reference-model-based verification of processor correctness. This paper presents a new approach to verifying fully out-of-order processors from the perspective of the ultimate criterion for correct program behavior: the programmer-visible architectural state must be updated in program order.

7.
We present a new technique for verification of complex hardware devices that allows both generality and a high degree of automation. The technique is based on our new way of constructing a light-weight completion function together with a new encoding of uninterpreted functions called the reference file representation. Our technique combines the completion function method and the reference file representation with compositional model checking and theorem proving. This extends the state of the art in two directions: first, we obtain a more general verification methodology; second, it is easier to use, since it has a higher degree of automation. As a benchmark, we take Tomasulo's algorithm for scheduling out-of-order instruction execution, used in many modern superscalar processors such as the Pentium-II and the PowerPC 604. The algorithm is parameterized by the processor configuration, and our approach allows us to prove its correctness in general, independent of any actual design.

8.
State-of-the-art field-programmable gate array (FPGA) technologies have provided exciting opportunities to develop more flexible, less expensive, and better-performing floating-point computing platforms for embedded systems. To better harness the full power of FPGAs and to bring FPGAs to more system designers, we investigate unique advantages and optimization opportunities in both software and hardware offered by multi-core processors on a programmable chip (MPoPCs). In this paper, we present our hardware customization and software dynamic scheduling solutions for LU factorization of large sparse matrices on in-house developed MPoPCs. Theoretical analysis is provided to guide the design. Implementation results on an Altera Stratix III FPGA for five benchmark matrices of size up to 7,917 × 7,917 are presented. Our hardware customization alone can reduce the execution time by up to 17.22%. The integrated hardware–software optimization improves the speedup by an average of 60.30%.

9.
The issue queue is the out-of-order control component of a superscalar processor and a critical part of the design, playing a decisive role in overall processor performance. This paper proposes a dual-port issue queue structure that effectively improves the performance of out-of-order superscalar processors. Based on the dependencies between instructions, the queue estimates when each instruction can issue and dispatches instructions to different queues accordingly. The effects of two different issue policies on performance are compared; the policy of tagging the execution pipeline at the input port achieves a higher I...

10.
The increasing attention on deep learning has tremendously spurred the design of intelligence processing hardware. The variety of emerging intelligence processors requires standard benchmarks for fair comparison and system optimization (in both software and hardware). However, existing benchmarks are unsuitable for benchmarking intelligence processors due to their non-diversity and non-representativeness. Also, the lack of a standard benchmarking methodology further exacerbates this problem. In this paper, we propose BenchIP, a benchmark suite and benchmarking methodology for intelligence processors. The benchmark suite in BenchIP consists of two sets of benchmarks: microbenchmarks and macrobenchmarks. The microbenchmarks consist of single-layer networks and are mainly designed for bottleneck analysis and system optimization. The macrobenchmarks contain state-of-the-art industrial networks, so as to offer a realistic comparison of different platforms. We also propose a standard benchmarking methodology built upon an industrial software stack and evaluation metrics that comprehensively reflect various characteristics of the evaluated intelligence processors. BenchIP is utilized for evaluating various hardware platforms, including CPUs, GPUs, and accelerators. BenchIP will be open-sourced soon.

11.
Modeling out-of-order processors for WCET analysis
Estimating the Worst Case Execution Time (WCET) of a program on a given processor is important for the schedulability analysis of real-time systems. WCET analysis techniques typically model the timing effects of micro-architectural features of modern processors (such as the pipeline, cache and branch prediction) to obtain safe and tight estimates. In this paper, we model out-of-order superscalar processor pipelines for WCET analysis. The analysis is, in general, difficult even for a basic block (a sequence of instructions with single-entry and single-exit points) if some of the instructions have variable latencies, because the WCET of a basic block on an out-of-order pipeline cannot be obtained by assuming maximum latencies for the individual instructions. Our timing estimation technique for a basic block proceeds by a fixed-point analysis of the time intervals at which the instructions enter/leave each pipeline stage. To extend the estimation to whole programs, we use Integer Linear Programming (ILP) to combine the timing estimates for basic blocks. Timing effects of the instruction cache and branch prediction are also modeled within our pipeline analysis framework. This forms a combined timing analysis framework that captures the out-of-order pipeline, cache, branch prediction and the mutual interaction among these micro-architectural features. The accuracy of our analysis is demonstrated via tight estimates obtained for several benchmarks. A preliminary version of parts of this paper was previously published as Li et al. (2004).

Abhik Roychoudhury received his B.E. in Computer Engineering from Jadavpur University (India) in 1995 and his M.S. and Ph.D. degrees (both in Computer Science) from the State University of New York at Stony Brook in 1997 and 2000, respectively. Since 2001 he has been an Assistant Professor at the National University of Singapore. His research interests are in models and methods for reliable development of embedded software and systems, with a specific focus on software validation, analysis and comprehension. Xianfeng Li is a postdoctoral researcher in the Department of Computer Science and Technology at Peking University, China. He received his Ph.D. from the National University of Singapore in 2005. His research interests include real-time systems, modeling and evaluation of computer architecture, and systems-on-chip. Tulika Mitra has been an Assistant Professor in the School of Computing at the National University of Singapore since January 2001. She received her Ph.D. in Computer Science from SUNY at Stony Brook in December 2000, her M.E. in Computer Science and Automation from the Indian Institute of Science in 1997, and her B.E. in Computer Engineering from Jadavpur University, India, in 1995. Her current research focuses on the design and analysis of embedded and real-time systems.

12.
We introduce an Artificial Neural Network (ANN) quantization methodology for platforms without wide accumulation registers. This enables fixed-point model deployment on embedded compute platforms that are not specifically designed for large kernel computations (i.e., accumulator-constrained processors). We formulate the quantization problem as a function of accumulator size and aim to maximize model accuracy by maximizing the bit widths of input data and weights. To reduce the number of configurations to consider, only solutions that fully utilize the available accumulator bits are tested. We demonstrate that 16-bit accumulators are able to obtain a classification accuracy within 1% of the floating-point baselines on the CIFAR-10 and ILSVRC2012 image classification benchmarks. Additionally, a near-optimal 2× speedup is obtained on an ARM processor by exploiting 16-bit accumulators for image classification with the All-CNN-C and AlexNet networks.
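The search-space pruning described above can be sketched as follows, assuming the common worst-case bound that a dot product of n terms needs roughly in_bits + w_bits + ceil(log2(n)) accumulator bits; this bound and the function names are my assumptions, not formulas quoted from the paper.

import math

def candidate_bit_widths(acc_bits, num_accumulations, min_bits=2, max_bits=16):
    """(input_bits, weight_bits) pairs that exactly fill the accumulator budget."""
    growth = math.ceil(math.log2(num_accumulations))     # worst-case growth of the running sum
    budget = acc_bits - growth                           # bits left for one product term
    return [(i, budget - i) for i in range(min_bits, max_bits + 1)
            if min_bits <= budget - i <= max_bits]

# A 16-bit accumulator and a 3x3x64 convolution kernel (576 accumulations per output).
print(candidate_bit_widths(16, 3 * 3 * 64))
# -> [(2, 4), (3, 3), (4, 2)]: only exact-fit configurations are kept, mirroring the
#    paper's pruning of the search space to solutions that fully use the accumulator.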

13.
Multicore processors, which provide greater computational capability, will be widely used in safety-critical systems. However, the performance-enhancing mechanisms used in modern processors, such as pipelining, out-of-order execution, dynamic branch prediction and caches, together with resource sharing among cores, make worst-case execution time analysis of such systems very difficult. The research community has therefore proposed the idea of time-predictable system design to reduce the difficulty of worst-case execution time analysis. Existing work focuses mainly on the hardware level and the corresponding adjustment and optimization of compilation methods, and pays less attention to the software level, i.e., how to construct time-predictable multi-threaded code and map it onto multicore hardware platforms. This paper proposes a synchronous-language, model-driven method for generating time-predictable multi-threaded code and proves that the code generator preserves semantics. It also proposes a time-predictable multicore architecture model based on AADL (Architecture Analysis and Design Language) as the target platform of this work. Finally, it presents a method for mapping the multi-threaded code onto the multicore architecture model, together with an analysis framework for system properties.

14.
Burns  Frank  Koelmans  Albert  Yakovlev  Alexandre 《Real-Time Systems》2000,18(2-3):275-288
Determining a tight WCET of a block of code to be executed on a modern superscalar processor architecture is becoming ever more difficult due to the dynamic behaviour exhibited by current processors, which include dynamic scheduling features such as speculative and out-of-order execution in the context of multiple execution units with deep pipelines. We describe the use of Coloured Petri Nets (CP-nets) in a simulation-based approach to this problem. A complex model of a generic processor architecture is described, with emphasis on the modelling strategy for obtaining the WCET and an analysis of the results.

15.
To address reorder buffer (ROB) blocking in superscalar processors, caused by the delayed retirement of long-latency instructions while decoding continues, an out-of-order instruction commit mechanism is proposed. A multi-buffer commit structure with configurable capacity is designed so that memory-access instructions and ALU-type instructions retire separately; the capacities of the target buffer and the store buffer are parameterized according to the superscalar processor's architecture and performance requirements, reducing the risk of pipeline stalls. At the same time, the destination-register encoding of instructions is used for the commit mode...

16.
Embedded systems often contain multiple applications, some of which have real-time requirements and whose performance must be guaranteed. To efficiently execute applications, modern embedded systems contain Globally Asynchronous Locally Synchronous (GALS) processors, network on chip, DRAM and SRAM memories, and system software, e.g. microkernel and communication libraries. In this paper we describe a dataflow formalisation to independently model real-time applications executing on the CompSOC platform, including new models of the entire software stack. We compare the guaranteed application throughput as computed by our tool flow to the throughput measured on an FPGA implementation of the platform, for both synthetic and real H.263 applications. The dataflow formalisation is composable (i.e. independent for each real-time application), conservative, models the impact of GALS on performance, and correctly predicts trends, such as application speed-up when mapping an application to more processors.

17.
A parallel heuristic algorithm for traffic control problems in three-stage connecting networks is presented in this paper. A three-stage connecting network consists of an input crossbar switching stage, an intermediate crossbar switching stage, and an output crossbar switching stage. The goal of our algorithm is to quickly and efficiently find a conflict-free switching assignment for the communication demands through the network. The algorithm requires n² × m processing elements for a network composed of n input/output switches and m intermediate switches, and it runs not only on a sequential machine but also on a parallel machine with up to n² × m processors. The algorithm was verified by 1100 simulation runs with network sizes from 10² × 7 to 50² × 27. The simulation results show that the algorithm can find a solution in nearly constant time with n² × m processors.
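To make the conflict-freedom constraint concrete, the Python sketch below assigns each (input switch, output switch) demand to an intermediate switch not already used by that input or that output, using a simple sequential greedy pass. The paper's contribution is a parallel heuristic over n² × m processing elements, which this toy code does not attempt to reproduce.

def assign_intermediates(demands, n, m):
    """demands: (input_switch, output_switch) pairs; returns demand index -> intermediate switch."""
    used_by_input = [set() for _ in range(n)]
    used_by_output = [set() for _ in range(n)]
    assignment = {}
    for k, (i, o) in enumerate(demands):
        free = [s for s in range(m)
                if s not in used_by_input[i] and s not in used_by_output[o]]
        if not free:
            return None                  # greedy pass stuck; a real heuristic would backtrack or repair
        assignment[k] = free[0]
        used_by_input[i].add(free[0])
        used_by_output[o].add(free[0])
    return assignment

# 3 input/output switches and 3 intermediate switches, four communication demands.
print(assign_intermediates([(0, 1), (0, 2), (1, 0), (2, 1)], n=3, m=3))
# -> {0: 0, 1: 1, 2: 0, 3: 1}: no intermediate switch is reused by the same input or output.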

18.
《Parallel Computing》2014,40(5-6):113-128
In this paper, we investigate how to exploit task parallelism during the execution of the Cholesky factorization on clusters of multicore processors with the SMPSs programming model. Our analysis reveals that the major difficulties in adapting the ScaLAPACK code for this operation to SMPSs lie in algorithmic restrictions and the semantics of the SMPSs programming model, but also that both can be overcome with a limited programming effort. The experimental results report considerable gains in performance and scalability for the routine parallelized with SMPSs when compared with conventional approaches to executing the original ScaLAPACK implementation in parallel, as well as with two recent message-passing routines for this operation. In summary, our study opens the door to reusing message-passing legacy codes/libraries for linear algebra by introducing up-to-date techniques, like dynamic out-of-order scheduling, that significantly upgrade their performance while avoiding a costly rewrite/reimplementation.
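The tile-level task decomposition that SMPSs-style runtimes schedule out of order is sketched below as plain serial NumPy code (POTRF, TRSM and SYRK/GEMM-like tile updates). The tile size and lower-triangular convention are illustrative choices; the paper's actual adaptation is message-passing ScaLAPACK code, not this Python.

import numpy as np

def tiled_cholesky(a, ts):
    """In-place lower-triangular Cholesky of a symmetric positive-definite matrix, tile by tile."""
    n = a.shape[0] // ts
    tile = lambda i, j: a[i * ts:(i + 1) * ts, j * ts:(j + 1) * ts]    # view into a
    for k in range(n):
        tile(k, k)[:] = np.linalg.cholesky(tile(k, k))                  # POTRF task
        for i in range(k + 1, n):                                       # TRSM tasks: A[i,k] <- A[i,k] L_kk^-T
            tile(i, k)[:] = np.linalg.solve(tile(k, k), tile(i, k).T).T
        for i in range(k + 1, n):                                       # SYRK/GEMM tasks on the trailing matrix
            for j in range(k + 1, i + 1):
                tile(i, j)[:] -= tile(i, k) @ tile(j, k).T
    return np.tril(a)

rng = np.random.default_rng(0)
m = rng.standard_normal((8, 8))
spd = m @ m.T + 8 * np.eye(8)                  # small SPD test matrix
l = tiled_cholesky(spd.copy(), ts=4)
print(np.allclose(l @ l.T, spd))               # True: the tiled factorization is exact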

19.
Document-level sentiment classification aims to automate the task of classifying a textual review, which is given on a single topic, as expressing a positive or negative sentiment. In general, supervised methods consist of two stages: (i) extraction/selection of informative features and (ii) classification of reviews by using learning models like Support Vector Machines (SVM) and Naïve Bayes (NB). SVMs have been extensively and successfully used as a sentiment learning approach, while Artificial Neural Networks (ANN) have rarely been considered in comparative studies in the sentiment analysis literature. This paper presents an empirical comparison between SVM and ANN regarding document-level sentiment analysis. We discuss requirements, resulting models and contexts in which both approaches achieve better levels of classification accuracy. We adopt a standard evaluation context with popular supervised methods for feature selection and weighting in a traditional bag-of-words model. Except for some unbalanced data contexts, our experiments indicated that ANNs produce superior or at least comparable results to SVMs. Especially on the benchmark dataset of movie reviews, ANN outperformed SVM by a statistically significant difference, even in the context of unbalanced data. Our results have also confirmed some potential limitations of both models, which have rarely been discussed in the sentiment classification literature, such as the computational cost of SVM at running time and of ANN at training time.
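A minimal scikit-learn sketch of the comparison setup follows, with the same bag-of-words features fed to a linear SVM and to a small feed-forward ANN. The toy reviews, vectorizer settings and network size are placeholders, not the paper's experimental configuration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC

reviews = ["great acting and a moving story", "wonderful film, loved every minute",
           "dull plot and terrible pacing", "boring, a complete waste of time"]
labels = [1, 1, 0, 0]                                   # 1 = positive, 0 = negative

vectorizer = CountVectorizer()                          # traditional bag-of-words features
x = vectorizer.fit_transform(reviews)
svm = LinearSVC().fit(x, labels)                        # linear SVM baseline
ann = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(x, labels)

test = vectorizer.transform(["loved the story", "what a dull, boring film"])
print("SVM:", svm.predict(test), "ANN:", ann.predict(test))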

20.
For memory-constrained environments, optimization for program size is often as important as, if not more important than, optimization for execution speed. Commonly, compilers try to reduce the code segment but neglect the stack segment, although the stack can grow significantly during the execution of recursive functions because a separate activation record is required for each recursive call. If a formal parameter or local variable is dead at all recursive calls, then it can be declared global so that only one instance exists, independent of the recursion depth. We found that in 70% of our benchmark functions, it is possible to reduce the stack size by declaring formal parameters and local variables global. Often, live ranges of formal parameters and local variables can be split at recursive calls through program transformations; these splitting transformations allowed us to further optimize the stack size of all our benchmark functions. If all formal parameters and local variables can be declared global, then such functions may be transformable into iterations. This was possible for all such benchmark functions.
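The transformation can be illustrated with a small Python sketch (the paper targets compiled code with per-call activation records, so this is only an analogy): a parameter whose value is never needed after the recursive call returns, and is identical in every activation, can live in a single global slot instead of occupying one slot per activation record.

# Before: 'key' occupies a slot in every activation record (stack frame).
def count_matches(node, key):
    if node is None:
        return 0
    hit = 1 if node["value"] == key else 0      # 'key' is only read before/at the recursive call, never after it returns
    return hit + count_matches(node["next"], key)

# After the transformation: 'key' lives in one global slot, frames only hold 'node'.
_key = None

def count_matches_global(node):
    if node is None:
        return 0
    hit = 1 if node["value"] == _key else 0
    return hit + count_matches_global(node["next"])

linked_list = None
for value in [3, 1, 3, 2]:
    linked_list = {"value": value, "next": linked_list}
_key = 3
print(count_matches(linked_list, 3), count_matches_global(linked_list))   # 2 2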
