首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
In our previously published research we discovered some very difficult to predict branches, called unbiased branches. Since the overall performance of modern processors is seriously affected by misprediction recovery, especially these difficult branches represent a source of important performance penalties. Our statistics show that about 28% of branches are dependent on critical Load instructions. Moreover, 5.61% of branches are unbiased and depend on critical Loads, too. In the same way, about 21% of branches depend on MUL/DIV instructions whereas 3.76% are unbiased and depend on MUL/DIV instructions. These dependences involve high-penalty mispredictions becoming serious performance obstacles and causing significant performance degradation in executing instructions from wrong paths. Therefore, the negative impact of (unbiased) branches over global performance should be seriously attenuated by anticipating the results of long-latency instructions, including critical Loads. On the other hand, hiding instructions’ long latencies in a pipelined superscalar processor represents an important challenge itself. We developed a superscalar architecture that selectively anticipates the values produced by high-latency instructions. In this work we are focusing on multiply, division and loads with miss in L1 data cache, implementing a dynamic instruction reuse scheme for the MUL/DIV instructions and a simple last value predictor for the critical Load instructions. Our improved superscalar architecture achieves an average IPC speedup of 3.5% on the integer SPEC 2000 benchmarks, of 23.6% on the floating-point benchmarks, and an improvement in energy-delay product (EDP) of 6.2% and 34.5%, respectively. We also quantified the impact of our developed selective instruction reuse and value prediction techniques in a simultaneous multithreaded architecture (SMT) that implies per thread reuse buffers and load value prediction tables. Our simulation results showed that the best improvements on the SPEC integer applications have been obtained with 2 threads: an IPC speedup of 5.95% and an EDP gain of 10.44%. Although, on the SPEC floating-point programs, we obtained the highest improvements with the enhanced superscalar architecture, the SMT with 3 threads also provides an important IPC speedup of 16.51% and an EDP gain of 25.94%.  相似文献   

2.
‘Looping’ of nondeterministic while-programs is shown to be expressible in Regular First Order Dynamic Logic with or without array assignment instructions in the programs. The expressive power of quantifier-free Dynamic Logic increases when nondeterminism is introduced in the programs that are part of formulae of Dynamic Logic. Allowing assignments of random values to variables also increases expressive power.  相似文献   

3.
We consider automatic verification of finite state concurrent programs. The global state graph of such a program can be viewed as a finite (Kripke) structure, and amodel checking algorithm can be given for determining if a given structure is a model of a specification expressed in a propositional temporal logic. In this paper, we present a unified approach for efficient model checking under a broad class of generalized fairness constraints in a branching time framework extending that of Clarke et al. (1983). Our method applies to any type of fairness expressed in a certain canonical form. Almost all ‘practical’ types of fairness from the literature, including the fundamental notions of impartiality, weak fairness, and strong fairness, can be succintly written in our canonical form. Moreover, our branching time approach can easily be adapted to handle types of fairness (such as fair reachability of a predicate) which cannot even be expressed in a linear temporal logic. We go on to argue that branching time logic is always better than linear time logic for model checking. We show that given any model checking algorithm for any system of linear time logic (in particular, for the usual system of linear time logic) there is a model checking algorithm of the same order of complexity (in both the structure and formula size) for the corresponding full branching time logic which trivially subsumes the linear time logic in expressive power (in particular, for the system of full branching time logic CTL*). We also consider an application of our work to the theory of finite automata on infinite strings.  相似文献   

4.
Victor Schneider 《Software》1989,19(11):1111-1113
This is a technical note that assumes reader familiarity with Pagan's paper on converting scaled down Pascal p-code programming language interpreters into simple compilers.1 This note discusses the extension of Pagan's methods to a full-scale Pascal-to-C translator. In order to make procedure calls and typed function calls work properly for such a translator, it was found necessary to add type information to one of the p-code calling instructions (and alter the Pascal-to-p-code translator accordingly). A table of execution times for Pagan's ‘50 primes’ bench-mark program shows the improvements obtained as a result of successive refinements in the Pascal-to-C translator, until the present version that uses C ‘register variables’ for integer arithmetic.2 Programs compiled using this system can also be linked into C programs as functions or procedures and can be debugged using standard C debuggers.  相似文献   

5.
Papamichalis  P. Simar  R.  Jr. 《Micro, IEEE》1988,8(6):13-29
The 320C30 is a fast processor with a large memory space and floating-point-arithmetic capabilities. The authors describe the 320C30 architecture in detail, discussing both the internal organization of the device and the external interfaces. They also explain the pipeline structure, addressing software-related issues and constructs, and examine the development tools and support. Finally, they present examples of applications. Some of the major features of the 320C30 are: a 60-ns cycle time that results in execution of over 16 million instructions per second (MIPS) and over 33 million floating-point operations per second (Mflops); 32-bit data buses and 24-bit address buses for a 16M-word overall memory space; dual-access, 4 K×32-bit on-chip ROM and 2 K×32-bit on-chip RAM; a 64×32-bit program cache; a 32-bit integer/40-bit floating-point multiplier and ALU; eight extended-precision registers, eight auxiliary registers, and 23 control and status registers; generally single-cycle instructions; integer, floating-point, and logical operation; two- and three-operand instructions; an on-chip DMA controller; and fabrication in 1-μm CMOS technology and packaging in a 180-pin package. These facilitate FIR (finite impulse response) and IIR (infinite impulse response) filtering, telecommunications and speech applications, and graphics and image processing applications  相似文献   

6.
As the width of the processor grows, complexity of a register file (RF) with multiple ports grows more than linearly and leads to larger register access time and higher power consumption. Analysis of SPEC2000 programs reveals that only a small portion of the instructions in a program (16% in integer and 38% in floating-point) require both the source operands. Also, when the programs are executed in an 8-wide processor only a very few (two or less) two-source instructions are executed in a cycle for a significant portion of time (more than 98% for integer and 93% for floating-point), leading to a significant under-utilization of register port bandwidth. In this paper, we propose a novel technique to significantly reduce the number of register ports, with a very minor modification in the select logic to issue only a limited number of two-source instructions each cycle. This is achieved with no significant impact on processor’s overall performance. The novelty of the technique is that it is easy to implement and succeeds in reducing the access time, power, and area of the register file, without aggravating these factors in any other logic on the chip. With this technique in an 8-wide processor, as compared to a conventional 128-entry RF with 16 read ports, for integer programs a register file can be designed with 11 or 10 read ports as these configurations result in instructions per cycle (IPC) degradation of only 0.929% and 3.38%, respectively. This significantly low degradation in IPC is achieved while reducing the register access time by 9% and 12%, respectively, and reducing power by 35% and 50%, respectively. For FP programs, a register file can be designed with 12 read ports (1.16% IPC loss, 8% less access time, and 28% less power) or with 11 read ports (3.5% IPC loss, 9% less access time, and 35% less power). The paper analyzes the performance of all the possible flavors of the proposed technique for register file in both 4-wide and 8-wide processors, and presents a choice of the performance and register port complexity combination to the designer.  相似文献   

7.
针对工程实践中时常出现超高精度计算的程序设计需求,分析超高精度乘、除法运算规则,提出超高精度整数与普通整数的乘除法运算算法及两个超高精度整数的乘除法运算算法,并分别给出时间复杂度分析及实验数据。给出3个完整的C语言程序,分别完成指定整数高次幂的精确计算并输出、按超高精度要求输出两个普通整数的商、超高精度求圆周率并输出。实验结果证明,在现有的常规软件及语言下,提出的超高精度计算的程序设计方法在工程实践中是简单可行的。  相似文献   

8.
Robert Bernstein 《Software》1986,16(7):641-652
Methods are given for finding a sequence of ‘add’, ‘subtract’ and ‘shift’ instructions to multiply the contents of a register by an integer constant. Each method generalizes the previous one and requires only a few intermediate or scratch registers. A variation of the last method is used in the PL.8 compiler and uses an unnoticeable amount of the overall compile time. Some statistics roughly indicating the effectiveness of the methods are presented.  相似文献   

9.
10.
We study the power of RAM acceptors with several instruction sets. We exhibit several instances where the availability of the division operator increases the power of the acceptors. We also show that in certain situations parallelism and stochastic features (“distributed random choices”) are provably more powerful than either parallelism or randomness alone. We relate the class of probabilistic Turing machine computations to random access machines with multiplication (but without boolean vector operations). Again, the availability of integer division seems to play a crucial role in these results.  相似文献   

11.
H.264视频编码标准采用了很多新技术,具有更优越的编码效率,同时也增加了计算复杂度,无法满足实时应用。由于单指令多数据扩展指令集2(SSE2)的并行运算能力可以提高计算机对多媒体数据的实时处理。文中主要采用了SSE2对H.264中的一些耗时较多的关键模块,例如整数像素运动估计中计算SAD、整数DCT变换、量化、Hadamard变换以及亚像素运动估计中计算SATD进行了指令级优化。实验结果表明,经过优化后,在保持视频图像质量的前提下,相应模块运行速度得到了提高,使H.264编码器整体的编码速度较好地满足实时要求。  相似文献   

12.
We present a method through which domestic service robots can comprehend natural language instructions. For each action type, a variety of natural language expressions can be used, for example, the instruction, ‘Go to the kitchen’ can also be expressed as ‘Move to the kitchen.’ We are of the view that natural language instructions are intuitive and, therefore, constitute one of the most user-friendly robot instruction methods. In this paper, we propose a method that enables robots to comprehend instructions spoken by a human user in his/her natural language. The proposed method combines action-type classification, which is based on a support vector machine, and slot extraction, which is based on conditional random fields, both of which are required in order for a robot to execute an action. Further, by considering the co-occurrence relationship between the action type and the slots along with the speech recognition score, the proposed method can avoid degradation of the robot’s comprehension accuracy in noisy environments, where inaccurate speech recognition can be problematic. We conducted experiments using a Japanese instruction data-set collected using a questionnaire-based survey. Experimental results show that the robot’s comprehension accuracy is higher in a noisy environment using our method than when using a baseline method with only a 1-best speech recognition result.  相似文献   

13.
In recent years, the conversion of residue numbers to a binary integer has been intensively studied. The Chinese Remainder Theorem (CRT) is a solution to this conversion problem of a number to the Residue Number System with a general moduli set. This paper presents a new division-free conversion approach for the conversion of residue numbers to a binary integer. The algorithm differs from others employing a great number of division instructions by using shift instructions instead. These simple instructions keep the power consumption lower. This algorithm can also be implemented with a lookup table or upon a vector machine. Both make the conversion process efficient. This division-free algorithm employs the concept of Montgomery multiplication algorithm. There are two variations of Montgomery algorithm proposed, which are algorithms MMA and IMA. The algorithm MMA is to transform the input number into the output presentation of Montgomery algorithm. Algorithm IMA is therefore inverse the computation of Montgomery algorithm to obtain the multiplicand. These two algorithms are in the complexity of O(n), where n is log2 qi. qi is a modulus. The proposed algorithm for converting the residues to a binary integer therefore runs on O(n × log m) times on O(m) processors. There are O(log m) iterations of O(n) complexity. Compared with the traditional conversion algorithm, the advantages of this proposed algorithm are not only in employing simpler operations but also in performing fewer iterations.  相似文献   

14.
There are billions of lines of sequential code inside nowadays’ software which do not benefit from the parallelism available in modern multicore architectures. Automatically parallelizing sequential code, to promote an efficient use of the available parallelism, has been a research goal for some time now. This work proposes a new approach for achieving such goal. We created a new parallelizing compiler that analyses the read and write instructions, and control-flow modifications in programs to identify a set of dependencies between the instructions in the program. Afterwards, the compiler, based on the generated dependencies graph, rewrites and organizes the program in a task-oriented structure. Parallel tasks are composed by instructions that cannot be executed in parallel. A work-stealing-based parallel runtime is responsible for scheduling and managing the granularity of the generated tasks. Furthermore, a compile-time granularity control mechanism also avoids creating unnecessary data-structures. This work focuses on the Java language, but the techniques are general enough to be applied to other programming languages. We have evaluated our approach on 8 benchmark programs against OoOJava, achieving higher speedups. In some cases, values were close to those of a manual parallelization. The resulting parallel code also has the advantage of being readable and easily configured to improve further its performance manually.  相似文献   

15.
C. Pronk 《Software》1992,22(10):885-897
During standardization discussions on Modula-2 the question was raised whether the future standard should need a clause stating a set of minimal requirements for conforming implementations regarding the ‘structuredness’ of lexical, syntactical and run-time constructs in programs written in this language. It was decided to test a number of compilers to find their actual limits on the structuredness of some of these constructs. The test results do not only show these limits, but the inability of some compilers to handle large and complex inputs is shown as well.  相似文献   

16.
为满足嵌入式设备小面积高性能的需求,设计一种基于开源RISC-V指令集的32位可综合乱序处理器.处理器包括分支预测、相关性处理等关键技术,支持RISC-V基本整数运算、乘除法以及压缩指令集.采用具有顺序单发射、乱序执行、乱序写回等特性的三级流水线结构,运用哈佛体系结构及AHB总线协议,可满足并行访问指令与数据的需求.在...  相似文献   

17.
The problem of automatic control education in developing countries is an integral part of related difficulties in the education of engineering and applied sciences in such regions. Some of the basic problems are lack of technical manpower, shortage of related industries, insufficient technical programs, and deficiency in competent and qualified faculty members. Other problems, which depend on the particular region, country or circumstance are historical traditions, economic conditions, population size, financial budgets, educational tools, media of instructions, curriculum structures and non-uniformity of the educational programs. Most of these problems are closely related and cannot be separately discussed nor can a unique solution for them be proposed. For example, the shortage of ‘technical manpower’ is commonly due to the lack of development of ‘technical training programs’, which are, in turn, a mere reflection of the very nature of their underdeveloped or developing ‘economies’ and ‘industries’.In this paper, an attempt is made to bring up some of the problems and difficulties of the education of automatic control in developing countries. The problems of control education are divided into primary and secondary classes. After the exposition of the problems and their nature in each category, a proposal is made to help as a guideline for better understanding of the difficulties and to provide likely clues for their solution.  相似文献   

18.
通过两个取指令部件消除流水线控制相关延迟   总被引:2,自引:0,他引:2  
分支预测技术能够在一定程序上消除指令间的控制相关延迟,提高微处理器的性能,是微处理器设计的一项关键技术,一般说来,静态分支预测效率低,动态分支预测硬件复杂度高,嵌入式微处理器具有功耗低、硬件复杂度低等特点,这决定了它必须采用特殊的分支处理技术,本文提出了一种面向嵌入式微处理器的分支处理技术,利用双端口指 指令Cache,加上预期指令,在编译器的共同配合下,消除由控制相关引起的延迟,模拟结果表明,该技术具有硬件复杂度低,实现简单、控制相关消除率高等优点。  相似文献   

19.
Modern computers provide excellent opportunities for performing fast computations. They are equipped with powerful microprocessors and large memories. However, programs are not necessarily able to exploit those computer resources effectively. In this paper, we present the way in which we have implemented a nearest neighbor classification. We show how performance can be improved by exploiting the ability of superscalar processors to issue multiple instructions per cycle and by using the memory hierarchy adequately. This is accomplished by the use of floating-point arithmetic which usually outperforms integer arithmetic, and block (tiled) algorithms which exploit the data locality of programs allowing for an efficient use of the data stored in the cache memory. Our results are validated with both an analytical model and empirical results. We show that regular codes could be performed faster than more complex irregular codes using standard data sets.  相似文献   

20.
Formal program development by transformations comprises not only transitions between equivalent control constructs but also suitable changes of data structures. In principle these two (conceptually different) kinds of derivation steps can be done rather independent of each other. If, however, efficient programs are aimed at, it is vitally important to suitably intertwine them in order to benefit from their mutual influence. For this way of ‘joint development’ the paper aims at elaborating a kind of guide-line the practical use of which is illustrated by means of a non-trivial example.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号