共查询到10条相似文献,搜索用时 0 毫秒
1.
Compile-time reordering of low level instructions is successful in achieving large increases in performance of programs on fine grain parallel machines. However, because of the interdependences between instruction scheduling and register allocation, a lack of cooperation between the scheduler and register allocator can result in generating code that contains excess register spills and/or a lower degree of parallelism than actually achievable. This paper describes a strategy for providing cooperation between register allocation and both global and local instruction scheduling. We experimentally compare this strategy with other cooperative and uncooperative scenarios. 相似文献
2.
Dalvik虚拟机作为Android系统上运行所有应用程序的基础,其性能瓶颈一直制约着Android系统的用户体验。通过研究Android系统中的Dalvik架构,分析其解释器和JIT模块的工作原理,发现热Trace选择过程中短Trace编译损耗大以及即时编译过程中寄存器分配不合理的情况。结合Java虚拟机技术和编译器技术,在现有热Trace选择和寄存器分配机制的基础上,提出基于Trace合并和寄存器分配的优化算法,在国产高性能嵌入式CPU CSKY体系下移植Dalvik虚拟机并实现了上述优化算法。通过实验证明优化后Dalvik执行Java程序的性能提高了近10%。 相似文献
3.
We present the first construction for sorting and counting networks of arbitrary width that requires both small depth and
small constant factors in the depth expression. Let w be the product w = p
0
⋅ p
1
⋅s p
n-1
, whose factors are not necessarily prime. We present a novel network construction of width w and depth O(n
2
) = O(log
2
w) , using comparators (or balancers) of width less than or equal to max(p
i
) . This construction is practical in the sense that the asymptotic notation does not hide any large constants.
An interesting aspect of this construction is that it establishes a family of sorting and counting networks of width w , one for each distinct factorization of w . A factorization in which max(p
i
) is large and n is small yields a network that trades small depth for large comparators (or balancers), and a factorization where max(p
i
) is small and n is large makes the opposite tradeoff.
Received June 18, 2001. Online publication October 30, 2001. 相似文献
4.
Byung-Sun Yang Junpyo Lee SeungIl Lee Seongbae Park Yoo C. Chung Suhyun Kim Kemal Ebcioglu Erik Altman Soo-Mook Moon 《Parallel and Distributed Systems, IEEE Transactions on》2007,18(1):57-69
Java just-in-time (JIT) compilers improve the performance of a Java virtual machine (JVM) by translating Java bytecode into native machine code on demand. One important problem in Java JIT compilation is how to map stack entries and local variables to registers efficiently and quickly, since register-based computations are much faster than memory-based ones, while JIT compilation overhead is part of the whole running time. This paper introduces LaTTe, an open-source Java JIT compiler that performs fast generation of efficiently register-mapped RISC code. LaTTe first maps "all" local variables and stack entries into pseudoregisters, followed by real register allocation which also coalesces copies corresponding to pushes and pops between local variables and stack entries aggressively. Our experimental results indicate that LaTTe's sophisticated register mapping and allocation really pay off, achieving twice the performance of a naive JIT compiler that maps all local variables and stack entries to memory. It is also shown that LaTTe makes a reasonable trade-off between quality and speed of register mapping and allocation for the bytecode. We expect these results will also be beneficial to parallel and distributed Java computing: 1) by enhancing single-thread Java performance; and 2) by significantly reducing the number of memory accesses which the rest of the system must properly order to maintain coherence and keep threads synchronized 相似文献
5.
基于遗传算法提出了溢出代码和访存压力敏感的机器学习来调试寄存器分配的权值函数。不同于以往采用目标程序的运行时间作为适应值,通过静态分析寄存器分配产生的溢出代码和基本块中的访存压力来构建适应值,以减少学习时间。这些分析被限定在热点函数中,在保证适应值精度的同时进一步加快了学习速度。实验表明,快速学习仅需要考虑热点函数的编译时间,整个CPU2000CINT测试集在5 h内即可学习完毕。大部分CPU2000CINT测试例子的性能得到了提高。其中perlbmk的性能提升最高可达到7.2%。 相似文献
6.
ZigBee使用的分布式地址分配算法(DAAM)为节点分配地址时没有考虑网络拓扑结构的变化。这就造成了地址空间的严重浪费,使得节点入网成功率降低。同时基于DAAM机制的树路由算法没有考虑节点的负载,负载不均衡将导致网络分割的提前到来。本文提出一种改进的分布式地址分配算法和基于它的负载均衡的树路由算法。改进的地址分配算法通过获取邻居节点的地址空间从而提高节点入网成功率。改进的树路由算法可以均衡节点能耗,延长网络寿命。 相似文献
7.
This paper presents a fundamentally new approach to global register allocation that optimally allocates registers and optimally places spill code, significantly decreasing spill code overhead compared with the traditional graph-coloring approach. The Optimal Register Allocation (ORA) approach formulates global register allocation as a 0–1 integer programming problem, incorporating all aspects of register allocation within a unified framework, including copy elimination, live range splitting, rematerialization, callee and caller register spilling, special instruction-operand requirements, and paired registers. A prototype ORA allocator is built into the Gnu C Compiler (GCC). For the SPEC92 integer benchmarks, the ORA allocator actually produces a net decrease of more than 100 million cycles across the entire benchmark set, because the dynamic copies the ORA allocator removes exceed the dynamic loads and stores that are inserted. In contrast, the GCC allocator and a Chaitin-style graph-coloring allocator each cause a net increase of more than 1 billion cycles. Because global register allocation is NP-complete, optimal register allocation has been considered intractable. However, the run-time complexity of the ORA approach is shown experimentally to be O(n3). A profile-guided hybrid allocation approach is proposed that uses the ORA allocator for the performance critical regions in the performance critical functions, while using a graph-coloring allocator for the non-critical functions and regions. An ORA-GCC hybrid allocator takes an average of 4.6 seconds per function to produce an allocation that is within 1% of optimal for 97% of the SPEC92 integer benchmark functions, showing that the hybrid allocator is practical as an advanced optimization for performance-critical codes. 相似文献
8.
9.
Abstract. Let G=(V,E) be a 2-edge connected, undirected and nonnegatively weighted graph, and let S(r) be a single source shortest paths tree (SPT) of G rooted at r ∈ V . Whenever an edge e in S(r) fails, we are interested in reconnecting the nodes now disconnected from the root by means of a single edge e' crossing the cut created by the removal of e . Such an edge e' is named a swap edge for e . Let S
e/e'
(r) be the swap tree (no longer an SPT, in general) obtained by swapping e with e' , and let S
e
be the set of all possible swap trees with respect to e . Let F be a function defined over S
e
that expresses some feature of a swap tree, such as the average length of a path from the root r to all the nodes below edge e , or the maximum length, or one of many others. A best swap edge for e with respect to F is a swap edge f such that F(S
e/f
(r)) is minimum.
In this paper we present efficient algorithms for the problem of finding a best swap edge, for each edge e of S(r) , with respect to several objectives. Our work is motivated by a scenario in which individual connections in a communication
network suffer transient failures. As a consequence of an edge failure, the shortest paths to all the nodes below the failed
edge might completely change, and it might be desirable to avoid an expensive switch to a new SPT, because the failure is
only temporary. As an aside, what we get is not even far from a new SPT: our analysis shows that the trees obtained from the
swapping have features very similar to those of the corresponding SPTs rebuilt from scratch. 相似文献