Similar literature
 Found 20 similar documents; search time 31 ms
1.
Web search engines need to provide high throughput and short query latency. Recent results show that pipelined query processing over a term-wise partitioned inverted index may have superior throughput. However, the query processing latency and scalability with respect to the collection size are the main challenges associated with this method. In this paper, we evaluate the effect of inverted index skipping on the performance of pipelined query processing. Further, we introduce a novel idea of using Max-Score pruning within pipelined query processing and a new term assignment heuristic, partitioning by Max-Score. Our current results indicate a significant improvement over the state-of-the-art approach and lead to several further optimizations which include dynamic load balancing, intra-query concurrent processing and a hybrid combination between pipelined and non-pipelined execution. Lastly, we show how the state of term-wise partitioning relates to the industry standard document-wise partitioning. Even though there are situations in which pipelined query processing is advantageous, document-wise partitioning is still the road to follow.

2.
Pipelining is a widely used technique for implementing architectures that have inherent temporal parallelism when there is an operational requirement for high throughput. Many variations on the basic theme have been proposed, with varying degrees of success. The aim of this paper is to present a critical review of conventional pipelined architectures and put some well-known problems in sharp relief. It is argued that conventional pipelined architectures have underlying limitations that can only be dealt with by adopting a different view of pipelining. These limitations are explained in terms of discontinuities in the flow of instructions and data, and representative machines are examined in support of this argument. In a companion paper [Topham, Omondi and Ibbett, 1988] we examine an alternative approach to the design of pipelined architectures and introduce an alternative theory of pipelining, which we call Context Flow.

3.
Scheduling, in many application domains, involves optimization of multiple performance metrics. For example, application workflows with real-time constraints have strict throughput requirements and also desire a low latency or response time. In this paper, we present a novel algorithm for the scheduling of workflows that act on a stream of input data. Our algorithm focuses on the two performance metrics, latency and throughput, and minimizes the latency of workflows while satisfying strict throughput requirements. We also describe steps to use the above approach to solve the problem of meeting latency requirements while maximizing throughput. We leverage pipelined, task and data parallelism in a coordinated manner to meet these objectives and investigate the benefit of task duplication in alleviating communication overheads in the pipelined schedule for different workflow characteristics. The proposed algorithm is designed for a realistic bounded multi-port communication model, where each processor can simultaneously communicate with at most k distinct processors. Experimental evaluation using synthetic benchmarks as well as those derived from real applications shows that our algorithm consistently produces lower latency schedules that meet throughput requirements, even when previously proposed schemes fail.

4.
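The pipelined CORDIC structure is well suited to FFT butterfly computation because of its high throughput and regularity, but its drawback is high resource consumption. Starting from the fact that the twiddle factors in an FFT are fixed rather than arbitrary, this paper improves the pipelined CORDIC structure based on the correspondence between the CORDIC elementary rotation angles and the scaling factors and on the conversion rules among the scaling factors. The improvement further reduces resource consumption without changing the butterfly computation speed. In an FFT with a 16-bit word length, each twiddle factor can be replaced by a 25-bit control sequence, so the storage per twiddle factor drops from 32 bits to 25 bits.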
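
As a rough illustration of why a fixed rotation angle can be stored as a control sequence, the following is a minimal floating-point sketch of textbook CORDIC rotation. It is not the paper's 25-bit encoding or its scaling-factor scheme: the iteration count of 16 and the names `direction_sequence` and `cordic_rotate` are assumptions made for this example only.

```python
import math

ITERATIONS = 16  # assumed iteration count, picked to echo a 16-bit word length

# Elementary angles atan(2^-i) and the combined gain of all micro-rotations.
ANGLES = [math.atan(2.0 ** -i) for i in range(ITERATIONS)]
GAIN = math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(ITERATIONS))


def direction_sequence(angle):
    """Precompute the +1/-1 rotation directions for a fixed angle in radians."""
    directions, z = [], angle
    for a in ANGLES:
        d = 1 if z >= 0 else -1
        directions.append(d)
        z -= d * a  # remaining angle still to be rotated
    return directions


def cordic_rotate(x, y, directions):
    """Rotate (x, y) with shift-and-add steps driven by a stored direction sequence."""
    for i, d in enumerate(directions):
        # Both right-hand sides use the old x and y (tuple assignment).
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
    return x / GAIN, y / GAIN  # undo the constant CORDIC gain
```

Because the direction list depends only on the angle, it could be computed once per twiddle factor and stored in place of the cosine/sine pair, which is the general idea the abstract relies on. Requires Python 3.8 or later for `math.prod`.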

5.
A motion estimation architecture allowing the execution of a variety of block-matching search techniques is presented in this paper. The ability to choose the most efficient search technique with respect to speeding up the process and locating the best matching target block leads to the improvement of the quality of service and the performance of the video encoding. The proposed architecture is pipelined to efficiently support a large set of currently used block-matching algorithms including Diamond Search, 3-step, MVFAST and PMVFAST. The proposed design executes the algorithms by providing a set of instructions common for all the block-matching algorithms and a few instructions accommodating the specific actions of each technique. Moreover, the architecture supports the use of different search techniques at the block level. The results and performance measurements of the architecture have been validated on an FPGA, supporting a maximum throughput of 30 frames/s with a frame size of 1,024 × 768.

6.
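The reconfiguration mechanism has an important effect on the performance of a reconfigurable cryptographic processing system. This paper proposes a classification of reconfiguration mechanisms in pipelined reconfigurable cryptographic processing architectures along global, local, static, and dynamic dimensions, gives throughput and latency formulas for each mechanism, and analyzes the performance and implementation cost of several of them. Finally, it reports the cryptographic processing performance of a reconfigurable cryptographic processing architecture that uses local dynamic reconfiguration.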

7.
This paper addresses optimal mapping of parallel programs composed of a chain of data parallel tasks onto the processors of a parallel system. The input to the programs is a stream of data sets, each of which is processed in order by the chain of tasks. This computation structure, also referred to as a data parallel pipeline, is common in several application domains, including digital signal processing, image processing, and computer vision. The performance parameters for such stream processing are latency (the time to process an individual data set) and throughput (the aggregate rate at which data sets are processed). These two criteria are distinct since multiple data sets can be pipelined or processed in parallel. The central contribution of this research is a new algorithm to determine a processor mapping for a chain of tasks that optimizes latency in the presence of a throughput constraint. We also discuss how this algorithm can be applied to solve the converse problem of optimizing throughput with a latency constraint. The problem formulation uses a general and realistic model of intertask communication and addresses the entire problem of mapping, which includes clustering tasks into modules, assigning processors to modules, and possibly replicating modules. The main algorithms are based on dynamic programming and their execution time complexity is polynomial in the number of processors and tasks. The entire framework is implemented as an automatic mapping tool in the Fx parallelizing compiler for a dialect of High Performance Fortran.
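
The distinction between latency and throughput drawn above can be made concrete with a small sketch. It assumes an idealized linear chain with no communication cost, which is far simpler than the communication model the abstract describes; the function name `pipeline_metrics` and the `replicas` parameter are hypothetical and introduced only for illustration.

```python
from fractions import Fraction


def pipeline_metrics(stage_times, replicas=None):
    """Return (latency, throughput) for an idealized linear chain of stages.

    stage_times: time each stage needs to process one data set.
    replicas:    assumed number of copies of each stage, with successive
                 data sets dealt out to the copies in turn (default: one copy).
    """
    if replicas is None:
        replicas = [1] * len(stage_times)
    # A single data set still visits every stage once, so replication
    # does not shorten its end-to-end time in this simple model.
    latency = sum(stage_times)
    # A stage with r copies finishes r data sets every t time units;
    # the slowest stage limits the whole chain.
    rates = [Fraction(r) / Fraction(t) for t, r in zip(stage_times, replicas)]
    return latency, min(rates)
```

For stage times `[2, 4, 1]` the latency is 7 time units while the throughput is one data set every 4 time units; replicating the middle stage twice doubles the throughput and leaves the latency unchanged, which is why the two criteria have to be optimized separately.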

8.
Energy-efficient single-processor and fully pipelined architectures for the lifting-based JPEG2000's 5/3 two-dimensional (2D)-discrete wavelet transform are presented. The single processor performs both the row- and column-wise processing simultaneously, that is, full 2D transform with 100% hardware utilisation. In addition, the architecture uses minimum embedded memory. The fully pipelined architecture is obtained by replicating the single-processor block depending on the levels of decomposition with much lower memory requirement and higher throughput than the single processor involved in multi-level transforms. These architectures can be directly used in real-time image/video consumer applications to extend the battery life of portable systems.

9.
10.
A scalable backplane topology which allows a practically unlimited number of modules with identical interfaces is presented. Short, buffered, point-to-point connections overcome clock skew problems. Synchronized, pipelined data transfer operations ensure high throughput and reasonably low latency times for fine-grain parallel algorithms. A simple bus interface logic without any special hardware configuration guarantees a cheap implementation with standard FPGAs. The measured performance in our FPGA-based prototype with a 32-bit-wide data bus shows a throughput of 160 Mbytes/s for each module with 75 ns latency time between modules.

11.
12.
We describe an approach to verifying bit-level pipelined machine models using a combination of deductive reasoning and decision procedures. While theorem-proving systems such as ACL2 have been used to verify bit-level designs, they typically require extensive expert user support. Decision procedures such as those implemented in UCLID can be used to automatically and efficiently verify term-level pipelined machine models, but these models use numerous abstractions, implement a subset of the instruction set, and are far from executable. We show that by integrating UCLID with the ACL2 theorem-proving system, we can use ACL2 to reduce the proof that an executable, bit-level machine refines its instruction set architecture to a proof that a term-level abstraction of the bit-level machine refines the instruction set architecture, which is then handled automatically by UCLID. We demonstrate the efficiency of our approach by applying it to verify a complex, seven-stage, bit-level interface pipelined machine model that implements 593 instructions and has features such as branch prediction, exceptions, and predicated instruction execution. Such a proof is not possible using UCLID and would require prohibitively more effort using just ACL2. This research was funded in part by NSF grants CCF-0429924, IIS-0417413, and CCF-0438871.

13.
Communication efficient matrix multiplication on hypercubes   Total citations: 1, self-citations: 0, citations by others: 1
In a recent paper Fox, Otto and Hey consider matrix algorithms for hypercubes. For hypercubes allowing pipelined broadcast of messages they present a communication efficient algorithm. We present in this paper a similar algorithm that uses only nearest neighbour communication. This algorithm will therefore be very communication efficient also on hypercubes not allowing pipelined broadcast. We introduce a new algorithm that reduces the asymptotic communication cost from . This is achieved by regarding the hypercube as a set of subcubes and by using the cascade sum algorithm.

14.
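To implement a high-speed configurable RSA hardware accelerator, a pipelined modular multiplier architecture based on the radix-64 Montgomery algorithm and a corresponding configurable memory structure are proposed. Through parallel operation of a five-stage pipeline and flexible memory configuration, RSA operations from 256 bits to 2048 bits can be implemented efficiently. Experimental results show that, compared with related work, the proposed pipelined architecture achieves a better ratio of performance to resource consumption, and the accelerator shows a clear improvement in modular multiplier performance and data throughput. With 73 k gates of hardware resources, a data throughput of 333 kbps is achieved for 1024-bit RSA.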
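
For readers unfamiliar with radix-64 Montgomery multiplication, the following is a minimal software sketch of the standard digit-serial algorithm, consuming one 6-bit digit of the multiplier per loop iteration. It says nothing about the five-stage pipeline or the memory structure of the accelerator; the function name and parameters are assumptions chosen for this example.

```python
def montgomery_multiply(a, b, n, num_digits, digit_bits=6):
    """Return a * b * R^-1 mod n, where R = 2 ** (digit_bits * num_digits).

    Requires n odd and 0 <= a, b < n < R. digit_bits=6 gives radix 64.
    """
    radix = 1 << digit_bits
    mask = radix - 1
    # n_prime = -n^-1 mod radix; the inverse exists because n is odd.
    n_prime = (-pow(n, -1, radix)) % radix
    t = 0
    for i in range(num_digits):
        a_i = (a >> (digit_bits * i)) & mask   # next radix-64 digit of a
        t += a_i * b
        q = ((t & mask) * n_prime) & mask      # makes t + q * n divisible by radix
        t = (t + q * n) >> digit_bits          # exact division by the radix
    # t stays below 2 * n throughout, so one conditional subtraction suffices.
    return t - n if t >= n else t
```

Each iteration needs only the low digit of the running sum to choose `q`, which is what makes the loop body a natural candidate for pipelining in hardware. The three-argument `pow` with a negative exponent requires Python 3.8 or later.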

15.
Switch-based interconnects are used in a number of application domains, including parallel system interconnects, local area networks, and wide area networks. However, very few switches have been designed that are suitable for more than one of these application domains. Such a switch must offer both extremely low latency and very high throughput for a variety of different message sizes. While some architectures with output queuing have been shown to perform extremely well in terms of throughput, their performance can suffer when used in systems where a significant portion of the packets are extremely small. On the other hand, architectures with input queuing offer limited throughput or require fairly complex and centralized arbitration that increases latency. In this paper, we present a new input queue-based switch architecture called HIPIQS (HIgh-Performance Input-Queued Switch). It offers low latency for a range of message sizes and provides throughput comparable to that of output queuing approaches. Furthermore, it allows simple and distributed arbitration. HIPIQS uses a dynamically allocated multiqueue organization, pipelined access to multibank input buffers, and small cross-point buffers to deliver high performance. Our simulation results show that HIPIQS can deliver performance close to that of output queuing approaches over a range of message sizes, system sizes, and traffic. The switch architecture can therefore be used to build high performance switches that are useful for both parallel system interconnects and for building computer networks.

16.
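Using the Simulink toolbox provided with MATLAB, a 10-bit pipelined analog-to-digital converter with 1.5 bits per stage was designed and simulated at the system level. A simulation model of each stage was built in Simulink and encapsulated as a module, and nine such modules were then cascaded to form the system model. The simulation results show that a model built in this way can reach 10-bit resolution and an 80 MHz sampling rate.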
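
The stage-then-cascade modelling approach can be sketched outside Simulink as well. The following is a minimal behavioural model of an ideal 1.5-bit stage and a nine-stage cascade, assuming textbook thresholds at plus and minus one quarter of the reference voltage and no noise, offset, or gain error. It is not the authors' Simulink model, and the function names are hypothetical.

```python
def stage_1p5bit(v, vref=1.0):
    """One ideal 1.5-bit stage: return (decision, residue)."""
    if v > vref / 4:
        d = 1
    elif v < -vref / 4:
        d = -1
    else:
        d = 0
    # Subtract the decided level and amplify the remainder by two.
    return d, 2.0 * v - d * vref


def pipeline_adc(v, stages=9, vref=1.0):
    """Cascade ideal stages and combine their decisions into one signed code."""
    code = 0
    for _ in range(stages):
        d, v = stage_1p5bit(v, vref)
        code = 2 * code + d  # each later stage carries half the weight
    return code
```

With nine stages the signed code spans -511 to 511, and `vref * code / 2 ** 9` approximates the input to within `vref / 2 ** 9` for inputs inside the reference range.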

17.
The basic concept of pipelined data-parallel algorithms is introduced by contrasting the algorithms with other styles of computation and by a simple example (a pipeline image distance transformation algorithm). Pipelined data-parallel algorithms are a class of algorithms which use pipelined operations and data level partitioning to achieve parallelism. Applications which involve data parallelism and recurrence relations are good candidates for this kind of algorithm. The computations are ideal for distributed-memory multicomputers. By controlling the granularity through data partitioning and overlapping the operations through pipelining, it is possible to achieve a balanced computation on multicomputers. An analytic model is presented for modeling pipelined data-parallel computation on multicomputers. The model uses timed Petri nets to describe data pipelining operations. As a case study, the model is applied to a pipelined matrix multiplication algorithm. Predicted results match closely with the measured performance on a 64-node NCUBE hypercube multicomputer.

18.
By splitting a large broadcast message into segments and broadcasting the segments in a pipelined fashion, pipelined broadcast can achieve high performance in many systems. In this paper, we investigate techniques for efficient pipelined broadcast on clusters connected by multiple Ethernet switches. Specifically, we develop algorithms for computing various contention-free broadcast trees that are suitable for pipelined broadcast on Ethernet switched clusters, extend the parametrized LogP model for predicting appropriate segment sizes for pipelined broadcast, show that the segment sizes computed based on the model yield high performance, and evaluate various pipelined broadcast schemes through experimentation on Ethernet switched clusters with various topologies. The results demonstrate that our techniques are practical and efficient for contemporary fast Ethernet and Gigabit Ethernet clusters.
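
Why the segment size matters can be seen from a deliberately simplified cost model: a fixed startup cost per segment plus a per-byte cost, along a pipeline of a given depth. This is not the extended parametrized LogP model the abstract refers to; the function names, parameters, and the nanosecond figures in the usage note are assumptions for illustration only.

```python
import math


def pipelined_broadcast_time(message_bytes, segment_bytes, hops, startup, per_byte):
    """Estimated completion time for a pipeline `hops` links deep.

    startup and per_byte are assumed per-segment and per-byte costs,
    expressed in the same time unit.
    """
    segments = math.ceil(message_bytes / segment_bytes)
    step = startup + per_byte * segment_bytes  # time to forward one segment one hop
    # The first segment needs `hops` steps to reach the last node; each
    # remaining segment arrives one step after the previous one.
    return (segments + hops - 1) * step


def best_segment_size(message_bytes, hops, startup, per_byte, candidates):
    """Pick the candidate segment size with the smallest estimated time."""
    return min(
        candidates,
        key=lambda s: pipelined_broadcast_time(message_bytes, s, hops, startup, per_byte),
    )
```

With a 1,000,000-byte message, 4 hops, a 50,000 ns startup cost, and 10 ns per byte, the model prefers 65,536-byte segments over 1,024, 8,192, or a single unsegmented message: very small segments pay the startup cost too often, and very large ones lose the overlap between hops.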

19.
Financial time series prediction using polynomial pipelined neural networks   Total citations: 1, self-citations: 1, citations by others: 0
This paper proposes a novel type of higher-order pipelined neural network: the polynomial pipelined neural network. The proposed network is constructed from a number of higher-order neural networks concatenated with each other to predict highly nonlinear and nonstationary signals based on the engineering concept of divide and conquer. The polynomial pipelined neural network is used to predict the exchange rate between the US dollar and three other currencies. In this application, two sets of experiments are carried out. In the first set, the input data are pre-processed between 0 and 1 and passed to the neural networks as nonstationary data. In the second set of experiments, the nonstationary input signals are transformed into one step relative increase in price. The network demonstrates more accurate forecasting and an improvement in the signal to noise ratio over a number of benchmarked neural networks.

20.
Two important issues in systolic array designs are addressed: How is fault tolerance provided in systolic arrays to enhance the yield of wafer-scale integration implementations? And, how are efficient systolic arrays with two levels of pipelining designed? (The first level refers to the pipelined organization of the array at the cellular level, and the second refers to the pipelined functional units inside the cells.) The fault-tolerant scheme proposed replaces defective cells with clocked delays. This has the distinct characteristic that data can flow through the array with faulty cells at the original clock speed. It is shown that both the defective cells under this fault-tolerant scheme and the second-level pipeline stages can simply be modeled as additional delays in the data paths of “generic” systolic designs. The mathematical notion of a cut is introduced to solve the problem of how to allow for these extra delays while preserving the correctness of the original systolic array designs. The results obtained by applying these techniques are encouraging. When applied to systolic arrays without feedback cycles, the arrays can tolerate large numbers of failures (with the addition of very little hardware) while maintaining the original throughput. Furthermore, all of the pipeline stages in the cells can be kept fully utilized through the addition of a small number of delay registers. However, adding delays to systolic arrays with cycles typically induces a significant decrease in throughput. In response to this, a new class of systolic algorithms has been derived in which the data cycle around a ring of processing cells. The systolic ring architecture has the property that its performance degrades gracefully as cells fail. Use of the cut theory and ring architectures for arrays with feedback gives effective fault-tolerant and two-level pipelining schemes for most systolic arrays. 
As a side effect of developing the ring architecture approach, several new systolic algorithms have been derived. These algorithms generally require only one-third to one-half of the number of cells used in previous designs to achieve the same throughput. The new systolic algorithms include ones for LU-decomposition, QR-decomposition, and the solution of triangular linear systems.
