首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 187 毫秒
1.
多核处理器的性能与系统软件有着密切的联系:操作系统是处理器与应用程序之间的接口,对于充分利用处理器特性和提高应用程序的性能起着极其重要的作用;编译器与处理器体系结构密切相关,一方面要产生处理器支持的二进制代码,另一方面还要结合处理器特性产生高效运行的代码,其性能好坏直接影响着系统的整体性能.为了提高龙芯3A系统的实际性能,从操作系统和编译器着手,结合龙芯3A微结构特征,进行了一系列有效的优化.这些措施包括CC-NUMA多核操作系统的实现、操作系统二级Cache锁机制、操作系统调度共享二级Cache分配、自动向量化编译和支持预取机制的编译等.实验结果表明,在系统软件中增加对处理器特性的支持,能够充分挖掘体系结构的优势,对系统性能有较大的好处.其性能优化技术对于其他处理器的优化也有一定的借鉴价值.  相似文献   

2.
Java虚拟机即时编译器以方法为单位进行编译,编译器将字节码方法编译成可执行代码,并经过数据cache存入内存中,当再次执行到该代码段时,处理器需要从包含该代码段的内存区域取指令执行,如果该内存区域在数据cache中已经建立映射,就可以直接从数据cache中读取数据,读数据的性能就会有大幅度的提高.但是编译生成的大量可执行代码在cache中频繁替换,当生成代码被替换出cache后,代码再次执行时处理器必须访问速度较慢的主存储器,成为编译器的性能瓶颈.设计并实现了硬件cache锁机制,提出了一种软硬件协同设计的即时编译方法.通过该方法,生成代码执行时的cache失效次数降低了6.9%,SPECjvm2008中程序最高获得了17.9%的性能提升,平均性能提升4.2%.  相似文献   

3.
MIOS是一个面向大规模CCNUMA系统设计的新型高可扩展操作系统.MIOS创新地采用了多实例内核结构,每个内核实例执行相同代码,分别独立运行和管理一个处理器,多核间通过分布存储管理构成高可扩展的一致性系统映像空间,支持弱共享进程、线程并行模型.MIOS针对大规模CCNUMA系统特点和高性能并行科学计算应用的需求,采用了显式共享数据分布、层次式任务调度、自适应任务间通信以及寄存器锁等优化.在大规模CCNUMA体系结构的银河深度并行计算机上的测试表明,MIOS对MPI应用具有同传统操作系统类似的性能,并可以有效支持2048处理器规模的OMP应用高效运行,具有良好的系统可扩展性.  相似文献   

4.
page-color的研究集中在如何通过有效的cache分区技术隔离弱局部性数据与强局部性数据的访问冲突,减少数据处理过程中由弱局部性数据产生的cache污染对强局部性数据的影响.但这些优化技术依赖于特殊的处理器硬件设计、操作系统内核功能的扩展或同时依赖于硬件的特殊设计和操作系统扩展功能的支持.提出了应用软件层上基于p...  相似文献   

5.
对于节点计算、通信与存储能力不同、节点由多个多核处理器(多个片上多处理器)组成且共享L3cache的机群系统,采取计算与传输重叠模式,提出了主节点以多进程方式并发发送数据给从节点的可分负载调度模型.该调度模型自适应节点具有不同的计算、通信和存储能力,动态计算、确定调度轮数和每轮调度分配给各从节点的负载块规模,以平衡各节点的计算负载、减少节点之间的通信开销,缩短任务调度长度.依据各节点中的L3cache,L2cache和L1cache的可用存储容量,提出了对节点主存中接收到的负载块进行多级缓存划分的数据分配方法,以确保分配给节点中各个多核处理器、各个内核的负载平衡.基于提出的多核机群节点间可分负载调度模型和节点内多级存储数据分配方法,设计实现了节点拥有多个多核处理器的异构机群上通信和存储高效的k-选择并行算法.在曙光TC5000A多核机群系统上,测试了主节点并行与串行发送数据给从节点的任务调度方式、各级缓存利用率、每个核心执行不同数目的线程对并行算法运行性能的影响.实验结果表明:基于主节点并发发送数据给从节点的调度模型设计的k-选择并行算法,其运行性能优于基于主节点串行发送数据给从节点的调度模型设计的k-选择并行算法;L3cache和L2cache利用率大小对算法运行性能影响较大;当L3cache,L2cache和L1cache利用率取其优化组合值、每个核心运行3个线程时,算法所需的运行时间最短.  相似文献   

6.
一种异构多核处理器嵌入式实时操作系统构架设计   总被引:3,自引:1,他引:2  
由于异构多核处理器和多处理器系统及同构多核处理器的构架存在很大差别,应用于多处理器系统的分布式结构以及应用于同构多核系统的主从式结构操作系统不能解决异构多核处理器的实时调度和效率问题。对异构多核处理器的特点及发展趋势进行了研究,提出了一种适用异构多核处理器的多主模式实时操作系统构架。这种构架将通信总线中的多主模式引入多核操作系统构架中,采用对称式结构及组件模式设计操作系统模型,使多核处理器中每个内核都可以作为主核实现对资源、任务的实时管理,提高系统性能,同时可以解决主从式操作系统存在的由于处理器核增多而带来的主内核不能满足系统性能要求的瓶颈问题。通过这种单一构架模型可以进行灵活配置,以适应不同结构及功能要求的处理器内核,降低操作系统开发难度。  相似文献   

7.
陈志锋  李清宝  张平  丁文博 《软件学报》2016,27(12):3172-3191
内核恶意软件对操作系统的安全造成了严重威胁.现有的内核恶意软件检测方法主要从代码角度出发,无法检测代码复用、代码混淆攻击,且少量检测数据篡改攻击的方法因不变量特征有限导致检测能力受限.针对这些问题,提出了一种基于数据特征的内核恶意软件检测方法,通过分析内核运行过程中内核数据对象的访问过程,构建了内核数据对象访问模型;然后,基于该模型讨论了构建数据特征的过程,采用动态监控和静态分析相结合的方法识别内核数据对象,利用EPT监控内存访问操作构建数据特征;最后讨论了基于数据特征的内核恶意软件检测算法.在此基础上,实现了内核恶意软件检测原型系统MDS-DCB,并通过实验评测MDS-DCB的有效性和性能.实验结果表明:MDS-DCB能够有效检测内核恶意软件,且性能开销在可接受的范围内.  相似文献   

8.
任建宝  齐勇  戴月华  王晓光  宣宇  史椸 《软件学报》2015,26(8):2124-2137
操作系统漏洞经常被攻击者利用,从而以内核权限执行任意代码(返回用户态攻击,ret2user)以及窃取用户隐私数据.使用虚拟机监控器构建了一个对操作系统及应用程序透明的内存访问审查机制,提出了一种低性能开销并且无法被绕过的内存页面使用信息实时跟踪策略;结合安全加载器,保证了动态链接库以及应用程序的代码完整性.能够确保即使操作系统内核被攻击,应用程序的内存隐私数据依然无法被窃取.在Linux操作系统上进行了原型实现及验证,实验结果表明,该隐私保护机制对大多数应用只带来6%~10%的性能负载.  相似文献   

9.
目前多核架构已成为处理器的主流设计并成为各种多媒体应用的主流处理平台,而核间通信的效率是影响多核处理器性能的重要因素之一.提出了一种针对多媒体应用程序的核间通信的优化方法.该方法利用此类应用程序数据读取的规律性,通过在多核处理器上添加通信队列,实现只读数据的快速传递,从而提高多媒体应用程序的并行执行效率.实验表明使用通信队列对各种多媒体核心算法的性能都有普遍提高.同时,该方法具有良好的扩展性,当内核数目增加,通信队列所带来的好处也更加明显.  相似文献   

10.
邓良  曾庆凯 《软件学报》2016,27(5):1309-1324
在现代操作系统中,内核运行在最高特权层,管理底层硬件并向上层应用程序提供系统服务,因而安全敏感的应用程序很容易受到来自底层不可信内核的攻击.提出了一种在不可信操作系统内核中保护应用程序的方法AppFort.针对现有方法的高开销问题,AppFort结合x86硬件机制(操作数地址长度)、内核代码完整性保护和内核控制流完整性保护,对不可信内核的硬件操作和软件行为进行截获和验证,从而高效地保证应用程序的内存、控制流和文件I/O安全.实验结果表明:AppFort的开销极小,与现有工作相比明显提高了性能.  相似文献   

11.
本文阐述了数据库并发处理的关键方法,详细分析和研究了数据库并发操作产生数据不一致问题的原因,以及独占锁、共享锁和更新锁的基本特征和应用,重点讨论了产生死锁的原因,从使用者角度对如何有效提高数据的并发处理和共享效率提出了一些建议。  相似文献   

12.
Solving partial differential equations using finite element (FE) methods for unstructured meshes that contain billions of elements is computationally a very challenging task. While parallel implementations can deliver a solution in a reasonable amount of time, they suffer from low cache utilization due to unstructured data access patterns. In this work, we reorder the way the mesh vertices and elements are stored in memory using Hilbert space-filling curves to improve cache utilization in FE methods for unstructured meshes. This reordering technique enumerates the mesh elements such that parallel threads access shared vertices at different time intervals, reducing the time wasted waiting to acquire locks guarding atomic regions. Further, when the linear system resulting from the FE analysis is solved using the preconditioned conjugate gradient method, the performance of the block-Jacobi preconditioner also improves, as more nonzeros are present near the stiffness matrix diagonal. Our results show that our reordering reduces the L1 and L2 cache miss-rates in the stiffness matrix assembly step by about 50 and 10 %, respectively, on a single-core processor. We also reduce the number of iterations required to solve the linear system by about 5 %. Overall, our reordering reduces the time to assemble the stiffness matrix and to solve the linear system on a 4-socket, 48-core multi-processor by about 20 %.  相似文献   

13.
Concurrent systems often have many processes sharing a common set of resources, both memory regions and hardware devices. Among the many challenges in producing safe concurrent software are single access, atomic transactions, starvation, and deadlock. Locks are frequently used to provide single access to shared resources, but do not guarantee safe usage. This paper extends the previous work on linear, singleton, and arithmetic types and linear memory primitives. Our contributions are capabilities for shared resources, and locks to control these capabilities in provably safe ways. We present formalized locks in a lambda calculus along with the soundness properties of preservation and progress. The type system described here prevents data races. The formalized locks have also been implemented in a C‐like language and used in a network device driver. Copyright © 2010 John Wiley & Sons, Ltd.  相似文献   

14.
Multicore processors need to communicate when working on shared tasks. In classical systems, this is performed via shared objects protected by locks, which are implemented with atomic operations on the main memory. However, access to shared main memory is already a bottleneck for multicore processors. Furthermore, the access time to a shared memory is often hard to predict and therefore problematic for real-time systems.This paper presents a shared on-chip memory that is used for communication and supports atomic operations to implement locks. Access to the shared memory is arbitrated with time division multiplexing, providing time-predictable access. The shared memory supports extended time slots so that a processor can execute more than one memory operation atomically. This allows for the implementation of locking and other synchronization primitives.We evaluate this shared scratchpad memory with synchronization support on a 9-core version of the T-CREST multicore platform. Worst-case access latency to the shared scratchpad is 13 clock cycles. Access to the atomic section under full contention, when every processor core wants access to acquire a lock, is 135 clock cycles.  相似文献   

15.
实时分布式操作系统中共享对象通信模型的实现   总被引:2,自引:0,他引:2  
与消息传递、远程过程调用和共享内存等进程间通信模型相比,在分布式存储多处理器上实现的共享对象模型能更自然、更高效地描述进程间的相互作用。它通过共享对象消除了各结点之间的物理边界,为系统提供了一种透明的分布式处理能力。这样,整个系统(包括硬件和软件)在逻辑上成为一个单系统。该模型已在仪器用弱实时分布式操作系统(IOWRTDOS)的全局共享对象(GSO)层上实现。文中研究了共享对象通信模型,并详细讨论  相似文献   

16.
The performance of transaction processing systems is determined by the contention for hardware as well as software resources (database locks), due to the concurrency control mechanism of the database being accessed by transactions. We consider a transaction processing system with a set of dominant transcation classes. Each class needs to acquire a certain subset of the locks in the database before it can be processed, i.e., predeclared lock requests with static locking. Straightforward application of the decomposition method requires the numerical solution of a two-dimensional Markov chain. Equivalently, a hierarchical simulation method, where the computer system is represented by a composite queue with exponential service rates, can be used to analyze the system. We propose an inexpensive analytic solution method, also based on hierarchical decomposition, such that the throughput of the computer system ic characterized by the number of active transactions (regardless of class). Numerical results are provided to show that the new method is adequately accurate compared to the other two rather costly methods. It can be used to determine the effect of granularity of locking on system performance. The solution method is also applicable to multiresource queueing systems with multiple contention points.  相似文献   

17.
现代仪器用实时分布式操作系统   总被引:1,自引:0,他引:1  
为了给多功能,多参数,智能化,网络化的现代仪器系统提供一和中更好的硬件支撑环境,作者研制了现代仪器用弱实时分化布式操作系统(IOWRTDOS)其结构自底向上分为3层,通用硬件接口(GHI)层,微内核(Microkernel)层和全局共享对象(GSO)层,GHI封装了硬件细节,为操作系统的其余部分提供了一个理想的机器结构,Microkernel是IOWRTDOS的核心,主要提供内存管理,多线程管理和  相似文献   

18.
The authors examine the design, implementation, and experimental analysis of parallel priority queues for device and network simulation. They consider: 1) distributed splay trees using MPI; 2) concurrent heaps using shared memory atomic locks; and 3) a new, more general concurrent data structure based on distributed sorted lists, designed to provide dynamically balanced work allocation and efficient use of shared memory resources. We evaluate performance for all three data structures on a Cray-TSESOO system at KFA-Julich. Our comparisons are based on simulations of single buffers and a 64×64 packet switch which supports multicasting. In all implementations, PEs monitor traffic at their preassigned input/output ports, while priority queue elements are distributed across the Cray-TBE virtual shared memory. Our experiments with up to 60000 packets and two to 64 PEs indicate that concurrent priority queues perform much better than distributed ones. Both concurrent implementations have comparable performance, while our new data structure uses less memory and has been further optimized. We also consider parallel simulation for symmetric networks by sorting integer conflict functions and implementing a packet indexing scheme. The optimized message passing network simulator can process ~500 K packet moves in one second, with an efficiency that exceeds ~50 percent for a few thousand packets on the Cray-T3E with 32 PEs. All developed data structures form a parallel library. Although our concurrent implementations use the Cray-TSE ShMem library, portability can be derived from Open-MP or MP1-2 standard libraries, which will provide support for one-way communication and shared memory lock mechanisms  相似文献   

19.
The introduction of multicore microprocessors has enabled smaller organizations to invest in high performance shared memory parallel systems. These systems ship with standard operating systems using preset thresholds for task imbalance assessment to activate load balancing. Unfortunately, this will unnecessarily trigger task migrations when the number of tasks is a few multiples of the number of processing cores. We illustrate this unnecessary task migration behavior through simulation and introduce a dynamic threshold for task imbalance assessment that is dependent on the number of tasks and the number of processing cores. This is as a replacement for the static threshold that is used by standard operating systems. With the dynamic threshold method, we are able to illustrate a performance gain of up to 17% on a synthetic benchmark and up to 25% gain using the Integer Sort Benchmark from the National Aeronautics and Space Administration (NASA) Advanced Supercomputing Parallel Benchmark Suite.  相似文献   

20.
This modular, hierarchical operating system will capitalize on hardware features deemed important for image and digital signal processing: a fast LAN and shared memory.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号