首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
In this paper, we propose a solution for a worst‐case execution time (WCET) analyzable Java system: a combination of a time‐predictable Java processor and a tool that performs WCET analysis at Java bytecode level. We present a Java processor, called JOP, designed for time‐predictable execution of real‐time tasks. The execution time of bytecodes, the instructions of the Java virtual machine, is known to cycle accuracy for JOP. Therefore, JOP simplifies the low‐level WCET analysis. A method cache, which fills whole Java methods into the cache, simplifies cache analysis. The WCET analysis tool is based on integer linear programming. The tool performs the low‐level analysis at the bytecode level and integrates the method cache analysis. An integrated data‐flow analysis performs receiver‐type analysis for dynamic method dispatches and loop‐bound analysis. Furthermore, a model checking approach to WCET analysis is presented where the method cache can be exactly simulated. The combination of the time‐predictable Java processor and the WCET analysis tool is evaluated with standard WCET benchmarks and three real‐time applications. The WCET friendly architecture of JOP and the integrated method cache analysis yield tight WCET bounds. Comparing the exact, but expensive, model checking‐based analysis of the method cache with the static approach demonstrates that the static approximation of the method cache is sufficiently tight for practical purposes. Copyright © 2010 John Wiley & Sons, Ltd.  相似文献   

2.
Performance evaluation of embedded software is essential in an early development phase so as to ensure that the software will run on the embedded device's limited computing resources. The prevailing approaches either require the deployment of the software on the embedded target, which can be tedious and may be impossible in an early development phase, or rely on simulation, which can be very slow. In this article, we introduce a customizable cross‐profiling framework for embedded Java processors, including processors featuring a method cache. The developer profiles the embedded software in the host environment, completely decoupled from the target system, on any standard Java virtual machine, but the generated profiles represent the execution time metric of the target system. Our cross‐profiling framework is based on bytecode instrumentation. We identify several pointcuts in the execution of bytecode that need to be instrumented in order to estimate the CPU cycle consumption on the target system. An evaluation using the JOP embedded Java processor as target confirms that our approach reconciles high profile accuracy with moderate overhead. Our cross‐profiling framework also enables the performance evaluation of new processor architectures before they are implemented. As a case study, we explore the performance impact of various processor design choices and optimizations, such as different cache sizes or pipeline organizations, and come up with an improved processor design that yields speedups of up to 40% on standard Java benchmarks. Copyright © 2009 John Wiley & Sons, Ltd.  相似文献   

3.
The accurate measurement of the execution time of Java bytecode is one factor that is important in order to estimate the total execution time of a Java application running on a Java Virtual Machine. In this paper we document the difficulties and solutions for the accurate timing of Java bytecode. We also identify trends across the execution times recorded for all imperative Java bytecodes. These trends would suggest that knowing the execution times of a small subset of the Java bytecode instructions would be sufficient to model the execution times of the remainder. We first review a statistical approach for achieving high precision timing results for Java bytecode using low precision timers and then present a more suitable technique using homogeneous bytecode sequences for recording such information. We finally compare instruction execution times acquired using this platform independent technique against execution times recorded using the read time stamp counter assembly instruction. In particular our results show the existence of a strong linear correlation between both techniques.  相似文献   

4.
Originally developed with a single language in mind, the JVM is now targeted by numerous programming languages—its automatic memory management, just‐in‐time compilation, and adaptive optimizations—making it an attractive execution platform. However, the garbage collector, the just‐in‐time compiler, and other optimizations and heuristics were designed primarily with the performance of Java programs in mind. Consequently, many of the languages targeting the JVM, and especially the dynamically typed languages, are suffering from performance problems that cannot be simply solved at the JVM side. In this article, we aim to contribute to the understanding of the character of the workloads imposed on the JVM by both dynamically typed and statically typed JVM languages. To this end, we introduce a new set of dynamic metrics for workload characterization, along with an easy‐to‐use toolchain to collect the metrics. We apply the toolchain to applications written in six JVM languages (Java, Scala, Clojure, Jython, JRuby, and JavaScript) and discuss the findings. Given the recently identified importance of inlining for the performance of Scala programs, we also analyze the inlining behavior of the HotSpot JVM when executing bytecode originating from different JVM languages. As a result, we identify several traits in the non‐Java workloads that represent potential opportunities for optimization. © 2015 The Authors. Software: Practice and Experience Published by John Wiley & Sons Ltd.  相似文献   

5.
Many embedded Java platforms execute two types of Java classes: those installed statically on the client device and those downloaded dynamically from service providers at run time. For achieving higher performance, the static Java classes can be compiled into machine code by ahead‐of‐time compiler (AOTC) in the server, and the translated machine code can be installed on the client device. Unfortunately, AOTC cannot be applicable to the dynamically downloaded classes. This paper proposes client‐AOTC (c‐AOTC), which performs AOTC on the client device using the just‐in‐time compiler (JITC) module installed on the device, obviating the JITC overhead and complementing the server‐AOTC. The machine code of a method translated by JITC is cached on a persistent memory of the device, and when the method is invoked again in a later run of the program, the machine code is loaded and executed directly without any translation overhead. A major issue in c‐AOTC is relocation because some of the address constants embedded in the cached machine code are not correct when the machine code is loaded and used in a different run; those addresses should be corrected before they are used. Constant pool resolution and inlining complicate the relocation problem, and we propose our solutions. The persistent memory overhead for saving the relocation information is also an issue, and we propose a technique to encode the relocation information and compress the machine code efficiently. We developed a c‐AOTC on Sun's CDC VM reference implementation, and our evaluation results indicate that c‐AOTC can improve the performance significantly, as much as an average of 12% for EEMBC and 4% for SpecJVM98, with a persistent memory overhead of 1% on average. Copyright © 2008 John Wiley & Sons, Ltd.  相似文献   

6.
The Java language is popular because of its platform independence, making it useful in a lot of technologies ranging from embedded devices to high‐performance systems. The platform‐independent property of Java, which is visible at the Java bytecode level, is only made possible thanks to the availability of a Virtual Machine (VM), which needs to be designed specifically for each underlying hardware platform. More specifically, the same Java bytecode should run properly on a 32‐bit or a 64‐bit VM. In this paper, we compare the behavioral characteristics of 32‐bit and 64‐bit VMs using a large set of Java benchmarks. This is done using the Jikes Research VM as well as the IBM JDK 1.4.0 production VM on a PowerPC‐based IBM machine. By running the PowerPC machine in both 32‐bit and 64‐bit mode we are able to compare 32‐bit and 64‐bit VMs. We conclude that the space an object takes in the heap in 64‐bit mode is 39.3% larger on average than in 32‐bit mode. We identify three reasons for this: (i) the larger pointer size, (ii) the increased header and (iii) the increased alignment. The minimally required heap size is 51.1% larger on average in 64‐bit than in 32‐bit mode. From our experimental setup using hardware performance monitors, we observe that 64‐bit computing typically results in a significantly larger number of data cache misses at all levels of the memory hierarchy. In addition, we observe that when a sufficiently large heap is available, the IBM JDK 1.4.0 VM is 1.7% slower on average in 64‐bit mode than in 32‐bit mode. Copyright © 2005 John Wiley & Sons, Ltd.  相似文献   

7.
Providing high‐performance inter‐node communication is a key capability for running high performance computing applications efficiently on parallel architectures. In fact, current systems deployments are aggregating a significant number of cores interconnected via advanced networking hardware with Remote Direct Memory Access (RDMA) mechanisms, that enable zero‐copy and kernel‐bypass features. The use of Java for parallel programming is becoming more promising thanks to some useful characteristics of this language, particularly its built‐in multithreading support, portability, easy‐to‐learn properties, and high productivity, along with the continuous increase in the performance of the Java virtual machine. However, current parallel Java applications generally suffer from inefficient communication middleware, mainly based on protocols with high communication overhead that do not take full advantage of RDMA‐enabled networks. This paper presents efficient low‐level Java communication devices that overcome these constraints by fully exploiting the underlying RDMA hardware, providing low‐latency and high‐bandwidth communications for parallel Java applications. The performance evaluation conducted on representative RDMA networks and parallel systems has shown significant point‐to‐point performance increases compared with previous Java communication middleware, allowing to obtain up to 40% improvement in application‐level performance on 4096 cores of a Cray XE6 supercomputer. Copyright © 2015 John Wiley & Sons, Ltd.  相似文献   

8.
Virtual execution environments, such as the Java virtual machine, promote platform‐independent software development. However, when it comes to analyzing algorithm complexity and performance bottlenecks, available tools focus on platform‐specific metrics, such as the CPU time consumption on a particular system. Other drawbacks of many prevailing profiling tools are high overhead, significant measurement perturbation, as well as reduced portability of profiling tools, which are often implemented in platform‐dependent native code. This article presents a novel profiling approach, which is entirely based on program transformation techniques, in order to build a profiling data structure that provides calling‐context‐sensitive program execution statistics. We explore the use of platform‐independent profiling metrics in order to make the instrumentation entirely portable and to generate reproducible profiles. We implemented these ideas within a Java‐based profiling tool called JP. A significant novelty is that this tool achieves complete bytecode coverage by statically instrumenting the core runtime libraries and dynamically instrumenting the rest of the code. JP provides a small and flexible API to write customized profiling agents in pure Java, which are periodically activated to process the collected profiling information. Performance measurements point out that, despite the presence of dynamic instrumentation, JP causes significantly less overhead than a prevailing tool for the profiling of Java code. Copyright © 2008 John Wiley & Sons, Ltd.  相似文献   

9.
Bytecode instrumentation is a widely used technique to implement aspect weaving and dynamic analyses in virtual machines such as the Java virtual machine. Aspect weavers and other instrumentations are usually developed independently and combining them often requires significant engineering effort, if at all possible. In this article, we present polymorphic bytecode instrumentation(PBI), a simple but effective technique that allows dynamic dispatch amongst several, possibly independent instrumentations. PBI enables complete bytecode coverage, that is, any method with a bytecode representation can be instrumented. We illustrate further benefits of PBI with three case studies. First, we describe how PBI can be used to implement a comprehensive profiler of inter‐procedural and intra‐procedural control flow. Second, we provide an implementation of execution levels for AspectJ, which avoids infinite regression and unwanted interference between aspects. Third, we present a framework for adaptive dynamic analysis, where the analysis to be performed can be changed at runtime by the user. We assess the overhead introduced by PBI and provide thorough performance evaluations of PBI in all three case studies. We show that pure Java profilers like JP2 can, thanks to PBI, produce accurate execution profiles by covering all code, including the core Java libraries. We then demonstrate that PBI‐based execution levels are much faster than control flow pointcuts to avoid interference between aspects and that their efficient integration in a practical aspect language is possible. Finally, we report that PBI enables adaptive dynamic analysis tools that are more reactive to user inputs than existing tools that rely on dynamic aspect‐oriented programming with runtime weaving. These experiments position PBI as a widely applicable and practical approach for combining bytecode instrumentations. © 2015 The Authors. Software: Practice and Experience Published by John Wiley & Sons Ltd.  相似文献   

10.
Energy‐efficient implementation techniques for virtual machines (VMs) have received little attention yet: conventional wisdom claims that VMs have a diametrical effect on energy consumption, and VM‐based applications are therefore short‐lived. In this paper, we argue that bytecode interpretation is affordable if we synthesize VMs specifically for energy efficiency. We present TinyVM, an execution infrastructure that seamlessly integrates with C and nesC/TinyOS‐based programming environments. TinyVM achieves high code density through the use of compressed bytecode as the primary program representation. Compressed bytecode allows rapid application deployment with low communication overhead. TinyVM executes compressed bytecode in place, which eliminates the need for a decompression stage and thereby reduces memory consumption on sensor nodes. Our infrastructure automates the creation of energy‐efficient application‐specific VMs. Applications are partitioned in machine code, bytecode, and VM instruction set extensions. Partitioning is manually controlled and/or fully guided by a discrete optimization problem that produces a partitioning with lowest energy consumption for a given program size limit. We provide experimental results for sensor network benchmarks and for selected applications on various CPU architectures including Atmega128‐based motes and the ARM‐based Intel iMote2. TinyVM has been released under the GNU General Public License. Copyright © 2011 John Wiley & Sons, Ltd.  相似文献   

11.
We present a study of the static structure of real Java bytecode programs. A total of 1132 Java jar‐files were collected from the Internet and analyzed. In addition to simple counts (number of methods per class, number of bytecode instructions per method, etc.), structural metrics such as the complexity of control‐flow and inheritance graphs were computed. We believe this study will be valuable in the design of future programming languages and virtual machine instruction sets, as well as in the efficient implementation of compilers and other language processors. Copyright © 2006 John Wiley & Sons, Ltd.  相似文献   

12.
This paper describes a new method for code space optimization for interpreted languages called LZW‐CC . The method is based on a well‐known and widely used compression algorithm, LZW , which has been adapted to compress executable program code represented as bytecode. Frequently occurring sequences of bytecode instructions are replaced by shorter encodings for newly generated bytecode instructions. The interpreter for the compressed code is modified to recognize and execute those new instructions. When applied to systems where a copy of the interpreter is supplied with each user program, space is saved not only by compressing the program code but also by automatically removing the unused implementation code from the interpreter. The method's implementation within two compiler systems for the programming languages Haskell and Java is described and implementation issues of interest are presented, notably the recalculations of target jumps and the automated tailoring of the interpreter to program code. Applying LZW‐CC to nhc98 Haskell results in bytecode size reduction by up to 15.23% and executable size reduction by up to 11.9%. Java bytecode is reduced by up to 52%. The impact of compression on execution speed is also discussed; the typical speed penalty for Java programs is between 1.8 and 6.6%, while most compressed Haskell executables run faster than the original. Copyright © 2008 John Wiley & Sons, Ltd.  相似文献   

13.
Simulation is indispensable in computer architecture research. Researchers increasingly resort to detailed architecture simulators to identify performance bottlenecks, analyze interactions among different hardware and software components, and measure the impact of new design ideas on the system performance. However, the slow speed of conventional execution‐driven architecture simulators is a serious impediment to obtaining desirable research productivity. This paper describes a novel fast multicore processor architecture simulation framework called Two‐Phase Trace‐driven Simulation (TPTS), which splits detailed timing simulation into a trace generation phase and a trace simulation phase. Much of the simulation overhead caused by uninteresting architectural events is only incurred once during the cycle‐accurate simulation‐based trace generation phase and can be omitted in the repeated trace‐driven simulations. We report our experiences with tsim, an event‐driven multicore processor architecture simulator that models detailed memory hierarchy, interconnect, and coherence protocol based on the TPTS framework. By applying aggressive event filtering, tsim achieves an impressive simulation speed of 146 millions of simulated instructions per second, when running 16‐thread parallel applications. Copyright © 2010 John Wiley & Sons, Ltd.  相似文献   

14.
Since its release, the Java programming language has attracted considerable attention from the high‐performance computing (HPC) community because of its portability, high programming productivity, and built‐in multithreading and networking support. As a consequence, several initiatives have been taken to develop a high‐performance Java message‐passing library to program distributed memory architectures, such as clusters. The performance of Java message‐passing applications relies heavily on the communications performance. Thus, the design and implementation of low‐level communication devices that support message‐passing libraries is an important research issue in Java for HPC. MPJ Express is our Java message‐passing implementation for developing high‐performance parallel Java applications. Its public release currently contains three communication devices: the first one is built using the Java New Input/Output (NIO) package for the TCP/IP; the second one is specifically designed for the Myrinet Express library on Myrinet; and the third one supports thread‐based shared memory communications. Although these devices have been successfully deployed in many production environments, previous performance evaluations of MPJ Express suggest that the buffering layer, tightly coupled with these devices, incurs a certain degree of copying overhead, which represents one of the main performance penalties. This paper presents a more efficient Java message‐passing communications device, based on Java Input/Output sockets, that avoids this buffering overhead. Moreover, this device implements several strategies, both in the communication protocol and in the HPC hardware support, which optimizes Java message‐passing communications. In order to evaluate its benefits, this paper analyzes the performance of this device comparatively with other Java and native message‐passing libraries on various high‐speed networks, such as Gigabit Ethernet, Scalable Coherent Interface, Myrinet, and InfiniBand, as well as on a shared memory multicore scenario. The reported communication overhead reduction encourages the upcoming incorporation of this device in MPJ Express ( http://mpj‐express.org ). Copyright © 2011 John Wiley & Sons, Ltd.  相似文献   

15.
Minimizing the amount of spill code is still an open problem in code generation and optimization. The amount of spill code depends on both the register allocation algorithm and the pre‐allocation instruction scheduling algorithm that controls the register pressure. In this paper, we focus on the impact of pre‐allocation instruction scheduling on the amount of spill code. Many heuristic techniques have been proposed to do instruction scheduling with the objective of minimizing register pressure and consequently the amount of spill code. However, the performance of these heuristic techniques has not been studied relative to optimality on real large‐scale programs. In this paper, we present an experimental study that evaluates the performance of several pre‐allocation scheduling heuristics. The evaluation involves computing an experimental lower bound on the size of gap between each heuristic's performance and optimal performance. We also propose a simple heuristic technique based on a specific permutation of two basic priority schemes and experimentally evaluate the performance of this technique compared with other heuristics, including the heuristics implemented in the LLVM open‐source Compiler. The evaluation is carried out by running SPEC CPU2006 on real x86‐64 hardware and measuring both the amount of spill code and the execution time. The results of our study show that the proposed heuristic technique gives better overall performance than LLVM's best heuristic on x86‐64, although it produces slightly more spilling. The proposed heuristic has better overall performance, because it achieves a better balance between register pressure and instruction‐level parallelism (ILP). This result shows the importance of ILP in pre‐allocation scheduling even on out‐of‐order machines. Furthermore, the results of the study show that there is a large gap between the performance of any of the studied heuristics and optimal performance; even the best heuristic in the study produces significantly more spill code than the optimal amount. This experimental result quantifies the intuitive belief that it is unlikely to find a heuristic that works well in all cases, thus showing the need for more rigorous solutions using combinatorial approaches. The paper discusses the challenges and complexities that are involved in developing such rigorous solutions. Copyright © 2014 John Wiley & Sons, Ltd.  相似文献   

16.
PicoJava: a direct execution engine for Java bytecode   总被引:2,自引:0,他引:2  
McGhan  H. O'Connor  M. 《Computer》1998,31(10):22-30
Key to the central promise inherent in Java technology-“write once, run anywhere”-is the fact that Java programs run on the Java virtual machine, insulating them from any contact with the underlying hardware. Consequently, Java programs must execute indirectly through a translation layer built into the Java virtual machine. Translation essentially converts Java virtual machine instructions (called bytecodes) into corresponding machine-specific binary instructions. Bytecode is a single image of a program that will execute identically (in principle) on any system equipped with a JVM. The first step toward the development of a new class of Java processors was the creation of the bytecode execution engine itself, called the picoJava core. PicoJava directly executes Java bytecode instructions and provides hardware support for other essential functions of the JVM. Executing bytecode instructions in hardware eliminates the need for dynamic translation, thus extending the useful range of Java bytecode programs to embedded environments. By the end of 1998, Java processors like Sun's microJava 701 should be available for evaluation from several licensees of the picoJava core technology  相似文献   

17.
This paper presents a novel profiling approach, which is entirely based on program transformation techniques in order to enable exact profiling, preserving complete call stacks, method invocation counters, and bytecode instruction counters. We exploit the number of executed bytecode instructions as profiling metric, which has several advantages, such as making the instrumentation entirely portable and generating reproducible profiles. These ideas have been implemented as the JP tool. It provides a small and flexible API to write portable profiling agents in pure Java, which are periodically activated to process the collected profiling information. Performance measurements point out that JP causes significantly less overhead than a prevailing tool for the exact profiling of Java code.  相似文献   

18.
Object‐oriented component engineering is increasingly used for system development, partly because it emphasizes portability and reusability. Each time a component is used, it must be retested in the new environment. Unfortunately, the data abstraction that components usually use results in low testability. First, internal variables cannot be directly set. Second, even though a test input may trigger a fault, the failure does not propagate to the output. This paper presents a technique to increase object‐oriented component testability, thereby making it easier to detect faults. Components are often sealed so that source code is not available. The program analysis is performed at the Java component bytecode level. A component's bytecode is analysed to create a control and data flow graph, which is then used to increase component testability by increasing both controllability and observability. We have implemented this technique and applied it to several components. Experimental results reveal that fault detection can be increased by using our increasing testability process. Copyright © 2008 John Wiley & Sons, Ltd.  相似文献   

19.
Many profilers for virtual execution environments, such as the Java virtual machine (JVM), are implemented with low‐level bytecode instrumentation techniques, which is tedious, error‐prone, and complicates maintenance and extension of the tools. In order to reduce the development time and cost, we promote building profilers for the JVM using high‐level aspect‐oriented programming (AOP). We show that the use of aspects yields concise profilers that are easy to develop, extend, and maintain, because low‐level instrumentation details are hidden from the tool developer. In order to build efficient profilers, we introduce inter‐advice communication, an extension to common AOP languages that enables efficient data passing between advices that are woven into the same method using local variables. We illustrate our approach with two case studies. First, we show that an existing, instrumentation‐based tool for listener latency profiling can be easily recast as an aspect. Second, we present an aspect for comprehensive calling context profiling. In order to reduce profiling overhead, our aspect parallelizes application execution and profile creation, resulting in a speedup of 110% on a machine with more than two cores, compared with a primitive, non‐parallel approach. Copyright © 2011 John Wiley & Sons, Ltd.  相似文献   

20.
基于混合模式的Java卡字节码优化器   总被引:1,自引:0,他引:1       下载免费PDF全文
Java卡是一种基于Java语言的智能卡。因为智能卡的空间和处理器速度的约束,一个应用程序在Java卡上运行时面临的最大问题是存储空间的不足和对程序执行时间的严格限制。因此,对下载到卡中的字节码进行优化是十分必要的。本文提出了一种综合使用扩展指令集和分段压缩算法的Java卡字节码优化器的设计方案,通过对字节码文件的优化,可得到占用空间较少且没有降低执行速率的字节码文件。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号