期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Efficient External Memory Algorithms by Simulating Coarse-Grained Parallel Algorithms

Dehne Dittrich Hutchinson 《Algorithmica》2008,36(2):97-122

Abstract. External memory (EM) algorithms are designed for large-scale computational problems in which the size of the internal memory of the computer is only a small fraction of the problem size. Typical EM algorithms are specially crafted for the EM situation. In the past, several attempts have been made to relate the large body of work on parallel algorithms to EM, but with limited success. The combination of EM computing, on multiple disks, with multiprocessor parallelism has been posted as a challenge by the ACM Working Group on Storage I/ O for Large-Scale Computing. In this paper we provide a simulation technique which produces efficient parallel EM algorithms from efficient BSP-like parallel algorithms. The techniques obtained can accommodate one or multiple processors on the EM target machine, each with one or more disks, and they also adapt to the disk blocking factor of the target machine. When applied to existing BSP-like algorithms, our simulation technique produces improved parallel EM algorithms for a large number of problems. 相似文献

2.

Practical methods for constructing suffix trees 总被引：7，自引：0，他引：7

Yuanyuan Tian Sandeep Tata Richard A. Hankins Jignesh M. Patel 《The VLDB Journal The International Journal on Very Large Data Bases》2005,14(3):281-299

Sequence datasets are ubiquitous in modern life-science applications, and querying sequences is a common and critical operation in many of these applications. The suffix tree is a versatile data structure that can be used to evaluate a wide variety of queries on sequence datasets, including evaluating exact and approximate string matches, and finding repeat patterns. However, methods for constructing suffix trees are often very time-consuming, especially for suffix trees that are large and do not fit in the available main memory. Even when the suffix tree fits in memory, it turns out that the processor cache behavior of theoretically optimal suffix tree construction methods is poor, resulting in poor performance. Currently, there are a large number of algorithms for constructing suffix trees, but the practical tradeoffs in using these algorithms for different scenarios are not well characterized. In this paper, we explore suffix tree construction algorithms over a wide spectrum of data sources and sizes. First, we show that on modern processors, a cache-efficient algorithm with O(n²) worst-case complexity outperforms popular linear time algorithms like Ukkonen and McCreight, even for in-memory construction. For larger datasets, the disk I/O requirement quickly becomes the bottleneck in each algorithm's performance. To address this problem, we describe two approaches. First, we present a buffer management strategy for the O(n²) algorithm. The resulting new algorithm, which we call “Top Down Disk-based” (TDD), scales to sizes much larger than have been previously described in literature. This approach far outperforms the best known disk-based construction methods. Second, we present a new disk-based suffix tree construction algorithm that is based on a sort-merge paradigm, and show that for constructing very large suffix trees with very little resources, this algorithm is more efficient than TDD. 相似文献

3.

A unified benchmarking and model-based framework for building QoS-aware streaming media services

Ludmila Cherkasova Wenting Tang Amin Vahdat 《Multimedia Systems》2006,11(6):532-549

A number of technology and workload trends motivate us to consider the appropriate resource allocation mechanisms and policies for streaming media services in shared cluster environments. We present MediaGuard – a model-based infrastructure for building streaming media services – that can efficiently determine the fraction of server resources required to support a particular client request over its expected lifetime. The proposed solution is based on a unified cost function that uses a single value to reflect overall resource requirements such as the CPU, disk, memory, and bandwidth necessary to support a particular media stream based on its bit rate and whether it is likely to be served from memory or disk. We design a novel, time-segment-based memory model of a media server to efficiently determine in linear time whether a request will incur memory or disk access when given the history of previous accesses and the behavior of the server's main memory file buffer cache. Using the MediaGuard framework, we design two media services: (1) an efficient and accurate admission control service for streaming media servers that accounts for the impact of the server's main memory file buffer cache, and (2) a shared streaming media hosting service that can efficiently allocate the predefined shares of server resources to the hosted media services, while providing performance isolation and QoS guarantees among the hosted services. Our evaluation shows that, relative to a pessimistic admission control policy that assumes that all content must be served from disk, MediaGuard (as well as services that are built using it) deliver a factor of two improvement in server throughput. 相似文献

4.

Simulated performance of a data-driven database machine

《Journal of Parallel and Distributed Computing》1986,3(1):1-22

One of the main limitations to high parallelism in database processing is the available 1/O bandwidth of the storage devices comprising the machine. One way to increase this bandwidth is to use multiple parallel disk units. The main problem with this approach is the lack of a computational model capable of utilizing any significant number of such devices. In this paper we present a different model, referred to as the Active Graph Model, which is based on the principles of asynchronous data-driven computation. To demonstrate the viability of this approach, we have implemented the model on a simulated multiprocessor architecture. By varying the speed of processors, memory units, communication links, and the types of queries processed, we demonstrate that the resulting database machine is capable of exploiting the I/O bandwidth of a large number of disk units as well as the computational power of the associated processors. 相似文献

5.

New Algorithms for Disk Scheduling

Andrews Bender Zhang 《Algorithmica》2008,32(2):277-301

Abstract. Processor speed and memory capacity are increasing several times faster than disk speed. This disparity suggests that disk I/ O performance could become an important bottleneck. Methods are needed for using disks more efficiently. Past analysis of disk scheduling algorithms has largely been experimental and little attempt has been made to develop algorithms with provable performance guarantees. We consider the following disk scheduling problem. Given a set of requests on a computer disk and a convex reachability function that determines how fast the disk head travels between tracks, our goal is to schedule the disk head so that it services all the requests in the shortest time possible. We present a 3/2 -approximation algorithm (with a constant additive term). For the special case in which the reachability function is linear we present an optimal polynomial-time solution. The disk scheduling problem is related to the special case of the Asymmetric Traveling Salesman Problem with the triangle inequality (ATSP-Δ ) in which all distances are either 0 or some constant α . We show how to find the optimal tour in polynomial time and describe how this gives another approximation algorithm for the disk scheduling problem. Finally we consider the on-line version of the problem in which uniformly distributed requests arrive over time. We present an algorithm related to the above ATSP-Δ . 相似文献

6.

Lightweight Data Indexing and Compression in External Memory

Paolo Ferragina Travis Gagie Giovanni Manzini 《Algorithmica》2012,63(3):707-730

In this paper we describe algorithms for computing the Burrows-Wheeler Transform (bwt) and for building (compressed) indexes in external memory. The innovative feature of our algorithms is that they are lightweight in the sense that, for an input of size n, they use only n bits of working space on disk while all previous approaches use Θ(nlog n) bits. This is achieved by building the bwt directly without passing through the construction of the Suffix Array/Tree data structure. Moreover, our algorithms access disk data only via sequential scans, thus they take full advantage of modern disk features that make sequential disk accesses much faster than random accesses. We also present a scan-based algorithm for inverting the bwt that uses Θ(n) bits of working space, and a lightweight internal-memory algorithm for computing the bwt which is the fastest in the literature when the available working space is o(n) bits. Finally, we prove lower bounds on the complexity of computing and inverting the bwt via sequential scans in terms of the classic product: internal-memory space × number of passes over the disk data, showing that our algorithms are within an O(log n) factor of the optimal. 相似文献

7.

多种哈希算法的可重构硬件架构设计

刘恒黄凯修思文李奕均严晓浪《计算机工程与科学》2016,38(3):411-417

针对现有的哈希算法硬件架构仅实现少量几种算法的问题,设计了一种可实现SM3,MD5,SHA-1以及SHA-2系列共7种哈希算法的可重构IP,以满足同一系统对安全性可选择的需求。通过分析各哈希算法及其运算逻辑的相似性,该设计最大化地重用加法器和寄存器,极大地减少了总的实现面积。此外,该设计灵活可配,可以对内存直接存取。以Altera的Stratix II为FPGA目标器件,其最高频率可达100 MHz,总面积较现有设计减少26.7%以上,且各算法单位面积吞吐率均优于现有设计。相似文献

8.

Efficient External Memory Algorithms by Simulating Coarse-Grained Parallel Algorithms

Dehne Dittrich Hutchinson 《Algorithmica》2003,36(2):97-122

External memory (EM) algorithms are designed for large-scale computational problems in which the size of the internal memory of the computer is only a small fraction of the problem size. Typical EM algorithms are specially crafted for the EM situation. In the past, several attempts have been made to relate the large body of work on parallel algorithms to EM, but with limited success. The combination of EM computing, on multiple disks, with multiprocessor parallelism has been posted as a challenge by the ACM Working Group on Storage I/ O for Large-Scale Computing. In this paper we provide a simulation technique which produces efficient parallel EM algorithms from efficient BSP-like parallel algorithms. The techniques obtained can accommodate one or multiple processors on the EM target machine, each with one or more disks, and they also adapt to the disk blocking factor of the target machine. When applied to existing BSP-like algorithms, our simulation technique produces improved parallel EM algorithms for a large number of problems. 相似文献

9.

Choosing a Random Peer in Chord

Valerie King Scott Lewis Jared Saia Maxwell Young 《Algorithmica》2007,49(2):147-169

We present two new algorithms, Arc Length and Peer Count, for choosing a peer uniformly at random from the set of all peers in Chord (Proceedings of the ACM SIGCOMM 2001 Technical Conference, 2001). We show analytically that, in expectation, both algorithms have latency O(log n) and send O(log n) messages. Moreover, we show empirically that the average latency and message cost of Arc Length is 10.01log n and that the average latency and message cost of Peer Count is 20.02log n. To the best of our knowledge, these two algorithms are the first fully distributed algorithms for choosing a peer uniformly at random from the set of all peers in a Distributed Hash Table (DHT). Our motivation for studying this problem is threefold: to enable data collection by statistically rigorous sampling methods; to provide support for randomized, distributed algorithms over peer-to-peer networks; and to support the creation and maintenance of random links, and thereby offer a simple means of improving fault-tolerance. Research of S. Lewis, J. Saia and M. Young was partially supported by NSF grant CCR-0313160 and Sandia University Research Program grant No. 191445. 相似文献

10.

A review on diagnostic autism spectrum disorder approaches based on the Internet of Things and Machine Learning

Hosseinzadeh Mehdi Koohpayehzadeh Jalil Bali Ahmed Omar Rad Farnoosh Afshin Souri Alireza Mazaherinezhad Ali Rezapour Aziz Bohlouli Mahdi 《The Journal of supercomputing》2021,77(3):2590-2608

Children with autism spectrum disorders (ASDs) have some disturbance activities. Usually, they cannot speak fluently. Instead, they use gestures and pointing words to make a relationship. Hence, understanding their needs is one of the most challenging tasks for caregivers, but early diagnosis of the disease can make it much easier. The lack of verbal and nonverbal communications can be eliminated by assistive technologies and the Internet of Things (IoT). The IoT-based systems help to diagnose and improve the patients’ lives through applying Deep Learning (DL) and Machine Learning (ML) algorithms. This paper provides a systematic review of the ASD approaches in the context of IoT devices. The main goal of this review is to recognize significant research trends in the field of IoT-based healthcare. Also, a technical taxonomy is presented to classify the existing papers on the ASD methods and algorithms. A statistical and functional analysis of reviewed ASD approaches is provided based on evaluation metrics such as accuracy and sensitivity.

相似文献

11.

Bulk Synchronous Parallel Algorithms for the External Memory Model

Dehne Dittrich Hutchinson Maheshwari 《Theory of Computing Systems》2008,35(6):567-597

Abstract. Blockwise access to data is a central theme in the design of efficient external memory (EM) algorithms. A second important issue, when more than one disk is present, is fully parallel disk I/ O. In this paper we present a simple, deterministic simulation technique which transforms certain Bulk Synchronous Parallel (BSP) algorithms into efficient parallel EM algorithms. It optimizes blockwise data access and parallel disk I/ O and, at the same time, utilizes multiple processors connected via a communication network or shared memory. We obtain new improved parallel EM algorithms for a large number of problems including sorting, permutation, matrix transpose, several geometric and GIS problems including three-dimensional convex hulls (two-dimensional Voronoi diagrams), and various graph problems. We show that certain parallel algorithms known for the BSP model can be used to obtain EM algorithms that meet well known I /O complexity lower bounds for various problems, including sorting. 相似文献

12.

Strategy and algorithms for the parallel solution of the nearest neighborhood problem in shared-memory processors

Tapia-Fern&#;ndez Santiago Alonso-Miyazaki Pablo Hiroshi Romero Ignacio Garc&#;a-Beltr&#;n Angel 《Engineering with Computers》2021,38(2):1669-1679

The neighborhood problem appears in many applications of computational geometry, computational mechanics, etc. In all these situations, the main requirement for a competitive implementation is performance, which can only be attained in modern hardware by exploiting parallelism. However, whereas the performance of serial algorithms is fairly predictable, that of parallel methods depends on delicate issues that have a huge impact (cache memory, cache misses, memory alignment, etc.), but are not easy to control. Even if there is not a simple approach to deal with these factors in shared-memory architectures, it is quite convenient to program parallel algorithms where the data are segregated on a per-thread basis. With this objective in mind, we propose a strategy to develop parallel algorithms based on a two-level design, and apply it to efficiently solve the nearest neighborhood problem. At a higher level, the proposed methods orchestrate the parallel algorithm and split the space into cells stored in a hash table; at the lower level, our methods hold serial search algorithms that are completely agnostic to the high-level counterpart. Using this strategy, we have developed a library combining different serial and parallel algorithms, optimized them, and assessed their performance. The analysis carried out allows to better understand the main bottlenecks in the algorithmic solution of the nearest neighborhood problem and come out with very fast implementations that improve existing available software.

相似文献

13.

Scalable,parallel computers: Alternatives,issues, and challenges 总被引：3，自引：0，他引：3

Gordon Bell 《International journal of parallel programming》1994,22(1):3-46

The 1990s will be the era of scalable computers. By giving up uniform memory access, computers can be built that scale over a range of several thousand. These provide highpeak announced performance (PAP), by using powerful, distributed CMOS microprocessor-primary memory pairs interconnected by a high performance switch (network). The parameters that determine these structures and their utility include: whether hardware (a multiprocessor) or software (a multicomputer) is used to maintain a distributed, or shared virtual memory (DSM) environemnt; the power of computing nodes (these improve at 60% per year); the size and scalability of the switch; distributability (the ability to connect to geographically dispersed computers including workstations); and all forms of software to exploit their inherent parallelism. To a great extent, viability is determined by a computer's generality—the ability to efficiently handle a range of work that requires varying processing (from serial to fully parallel), memory, and I/O resources. A taxonomy and evolutionary time line outlines the next decade of computer evolution, included distributed workstations, based on scalability and parallelism. Workstations can be the best scalables. 相似文献

14.

FFTs in external or hierarchical memory 总被引：2，自引：0，他引：2

David H. Bailey 《The Journal of supercomputing》1990,4(1):23-35

Conventional algorithms for computing large one-dimensional fast Fourier transforms (FFTs), even those algorithms recently developed for vector and parallel computers, are largely unsuitable for systems with external or hierarchical memory. The principal reason for this is the fact that most FFT algorithms require at least m complete passes through the data set to compute a 2^m-point FFT. This paper describes some advanced techniques for computing an ordered FFT on a computer with external or hierarchical memory. These algorithms (1) require as few as two passes through the external data set, (2) employ strictly unit stride, long vector transfers between main memory and external storage, (3) require only a modest amount of scratch space in main memory, and (4) are well suited for vector and parallel computation.Performance figures are included for implementations of some of these algorithms on Cray supercomputers. Of interest is the fact that a main memory version outperforms the current Cray library FFT routines on the CRAY-2, the CRAY X-MP, and the CRAY Y-MP systems. Using all eight processors on the CRAY Y-MP, this main memory routine runs at nearly two gigaflops.A condensed version of this paper previously appeared in the Proceedings of Supercomputing '89. 相似文献

15.

Efficient viewshed computation on terrain in external memory

Marcus V. A. Andrade Salles V. G. Magalhães Mirella A. Magalhães W. Randolph Franklin Barbara M. Cutler 《GeoInformatica》2011,15(2):381-397

The recent availability of detailed geographic data permits terrain applications to process large areas at high resolution. However the required massive data processing presents significant challenges, demanding algorithms optimized for both data movement and computation. One such application is viewshed computation, that is, to determine all the points visible from a given point p. In this paper, we present an efficient algorithm to compute viewsheds on terrain stored in external memory. In the usual case where the observer’s radius of interest is smaller than the terrain size, the algorithm complexity is θ(scan(n ²)) where n ² is the number of points in an n × n DEM and scan(n ²) is the minimum number of I/O operations required to read n ² contiguous items from external memory. This is much faster than existing published algorithms. 相似文献

16.

基于Fermi架构的Join算法

李观钊陈思桐甄真陈虎《计算机科学》2013,40(3):62-67

在列数据库中,连接操作依然是最核心和最耗时的操作,GPU强大的计算能力可为此提供新的优化手段。基于Fermi架构,提出了新的Hash Join算法和Sort merge Join算法,其基本思想是充分利用该架构新增的缓存结构来减少连接操作的cache缺失率。与CUDA stream技术相结合,新算法在输出结果较多时可以有效地隐藏主存与显存间数据传输带来的延迟,进一步提升其执行效率。实验结果证实了基于Fcrmi架构的Hash Join算法处理偏抖数据的高效性及Sort merge Join算法的稳定性,并且通过比较表明,这两种算法的性能全面优于基于多核CPU充分优化的Join算法,最大加速2.4倍,在外键分布高偏抖时新的Hash Join算法的执行速度甚至达到每秒217M元组。相似文献

17.

Location-Oblivious Distributed Unit Disk Graph Coloring

Michel Barbeau Prosenjit Bose Paz Carmi Mathieu Couture Evangelos Kranakis 《Algorithmica》2011,60(2):236-249

We present the first location-oblivious distributed unit disk graph coloring algorithm having a provable performance ratio of three (i.e. the number of colors used by the algorithm is at most three times the chromatic number of the graph). This is an improvement over the standard sequential coloring algorithm that has a worst case lower bound on its performance ratio of 4−3/k (for any k>2, where k is the chromatic number of the unit disk graph achieving the lower bound) (Tsai et al., in Inf. Process. Lett. 84(4):195–199, 2002). We present a slightly better worst case lower bound on the performance ratio of the sequential coloring algorithm for unit disk graphs with chromatic number 4. Using simulation, we compare our algorithm with other existing unit disk graph coloring algorithms. 相似文献

18.

Using difference intervals for time-varying isosurface visualization

Waters KW Co CS Joy KI 《IEEE transactions on visualization and computer graphics》2006,12(5):1275-1282

We present a novel approach to out-of-core time-varying isosurface visualization. We attempt to interactively visualize time-varying datasets which are too large to fit into main memory using a technique which is dramatically different from existing algorithms. Inspired by video encoding techniques, we examine the data differences between time steps to extract isosurface information. We exploit span space extraction techniques to retrieve operations necessary to update isosurface geometry from neighboring time steps. Because only the changes between time steps need to be retrieved from disk, I/O bandwidth requirements are minimized. We apply temporal compression to further reduce disk access and employ a point-based previewing technique that is refined in idle interaction cycles. Our experiments on computational simulation data indicate that this method is an extremely viable solution to large time-varying isosurface visualization. Our work advances the state-of-the-art by enabling all isosurfaces to be represented by a compact set of operations. 相似文献

19.

Efficient control generation for mapping nested loop programs onto processor arrays

《Journal of Systems Architecture》2007,53(5-6):300-309

Processor array architectures are optimal platforms for computationally intensive applications. Such architectures are characterized by hierarchies of parallelism and memory structures, i.e. processor arrays apart from different levels of cache have a large number of processing elements (PE) where each PE can further contain sub-word parallelism. In order to handle large scale problems, balance local memory requirements with I/O-bandwidth, and use different hierarchies of parallelism and memory, one needs a sophisticated transformation called hierarchical partitioning. Innately the applications are data flow dominant and have almost no control flow, but the application of hierarchical partitioning techniques has the disadvantage of a more complex control flow. In a previous paper, the authors presented first time a methodology for the automated control path synthesis for the mapping of partitioned algorithms onto processor arrays. However, the control path contained complex multiplication and division operators. In this paper, we propose a significant extension to the methodology which reduces the hardware cost of the global controller and memory address generators by avoiding these costly operations. 相似文献

20.

A Theoretical and Experimental Study on the Construction of Suffix Arrays in External Memory

Crauser Ferragina 《Algorithmica》2008,32(1):1-35

Abstract. The construction of full-text indexes on very large text collections is nowadays a hot problem. The suffix array [32] is one of the most attractive full-text indexing data structures due to its simplicity, space efficiency and powerful/ fast search operations supported. In this paper we analyze, both theoretically and experimentally, the I/ O complexity and the working space of six algorithms for constructing large suffix arrays. Three of them are state-of-the-art, the other three algorithms are our new proposals. We perform a set of experiments based on three different data sets (English texts, amino-acid sequences and random texts) and give a precise hierarchy of these algorithms according to their working-space versus construction-time tradeoff. Given the current trends in model design [12], [32] and disk technology [29], [30], we pose particular attention to differentiate between ``random' and ``contiguous' disk accesses, in order to explain reasonably some practical I/ O phenomena which are related to the experimental behavior of these algorithms and that would otherwise be meaningless in the light of other simpler external-memory models. We also address two other issues. The former is concerned with the problem of building word indexes; we show that our results can be successfully applied to this case too, without any loss in efficiency and without compromising the simplicity of programming to achieve a uniform, simple and efficient approach to both the two indexing models. The latter issue is related to the intriguing and apparently counterintuitive ``contradiction' between the effective practical performance of the well-known Baeza-Yates—Gonnet—Snider algorithm [17], verified in our experiments, and its unappealing worst-case behavior. We devise a new external-memory algorithm that follows the basic philosophy underlying that algorithm but in a significantly different manner, thus resulting in a novel approach which combines good worst-case bounds with efficient practical performance. 相似文献