Found 20 similar documents (search time: 703 ms)
1.
2.
The Rabin fingerprint algorithm is computationally efficient and has good randomness, and it confines the effect of a data change on the resulting sequence of fingerprints to a local range, so it is widely used in duplicate-data detection. This paper analyzes the arithmetic of Rabin fingerprints over the finite field GF(2^n) and derives a fast formula for updating the fingerprint of a fixed-length character sequence as the sliding window moves. The application of the Rabin fingerprint algorithm to duplicate-data detection is described in pseudocode and implemented in VC++. Three classes of files (Word documents, program source code, and BMP images) were extracted as test data sets on an ordinary PC; the test results show that the algorithm is effective.
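As a rough illustration of the sliding-window update described above, the sketch below uses an ordinary modular rolling hash; the base PRIME, modulus MOD, and window length WIN are illustrative stand-ins for the paper's GF(2^n) polynomial arithmetic:

```python
PRIME = 257          # illustrative base, not the paper's GF(2^n) polynomial
MOD = (1 << 31) - 1  # illustrative modulus
WIN = 4              # window length

def fingerprint(data):
    """Fingerprint of a whole byte sequence, computed directly."""
    fp = 0
    for b in data:
        fp = (fp * PRIME + b) % MOD
    return fp

def slide(fp, out_byte, in_byte, pow_w):
    """O(1) update when the window moves one byte to the right."""
    fp = (fp - out_byte * pow_w) % MOD      # drop the leaving byte
    return (fp * PRIME + in_byte) % MOD     # shift and add the entering byte

data = b"abcdefgh"
pow_w = pow(PRIME, WIN - 1, MOD)
fp = fingerprint(data[:WIN])
for i in range(1, len(data) - WIN + 1):
    fp = slide(fp, data[i - 1], data[i + WIN - 1], pow_w)
    assert fp == fingerprint(data[i:i + WIN])  # rolling update matches direct computation
```

The constant-time `slide` step is what makes computing a fingerprint at every byte offset affordable during duplicate detection.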
3.
4.
The rsync differential synchronization method is efficient, but when the differing data are scattered or the synchronization network is fast, its efficiency can fall below that of full synchronization. To avoid this problem, a fast file-synchronization method based on adaptive extremum-point chunking is proposed. The method performs content-defined variable-length chunking at data extremum points, quickly estimates the dynamic degree of difference between the source and destination under different difference distributions, and, based on the computed difference degree and a quantitative measure of the current network speed, adaptively selects the better synchronization method. Experimental results show that, under varying difference distributions and network speeds, the proposed method effectively selects the better synchronization approach, reducing synchronization time and improving synchronization efficiency.
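A minimal sketch of the kind of adaptive choice the abstract describes, under an assumed linear cost model; the function name, parameters, and the two-term time estimate are illustrative, not the paper's actual difference-degree formula:

```python
def choose_sync(file_size, diff_bytes, bandwidth, delta_overhead):
    """Pick the cheaper method from estimated times (illustrative cost model).
    file_size, diff_bytes in bytes; bandwidth in bytes/s;
    delta_overhead: seconds spent chunking and exchanging checksums."""
    t_full = file_size / bandwidth                    # send everything
    t_delta = delta_overhead + diff_bytes / bandwidth # send only differences
    return "delta" if t_delta < t_full else "full"

# Large scattered differences on a fast link favour full synchronization:
assert choose_sync(10**8, 6 * 10**7, 10**8, 2.0) == "full"
# Few differences on a slow link favour delta synchronization:
assert choose_sync(10**8, 10**6, 10**6, 2.0) == "delta"
```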
5.
Many different blocking algorithms can divide a Web page into blocks. The purpose of studying page blocking is to support further research in related areas, for example: block-based search driven by the importance of each block's content, locating a page's important topics or content, extracting a page's main content or topics, and Web archiving based on page blocks. This paper first defines and classifies the Web-page blocking problem, then analyzes the principles of several typical blocking algorithms, providing useful references for further research on Web-page blocking.
6.
Differential privacy is now a widely applied privacy-protection mechanism. Although many histogram-publication methods exist for static data sets, few address sliding-window histogram publication over data streams, and those that do suffer from high publication error. To address this, a differentially private histogram publishing algorithm for the sliding-window model over data streams (HPA-SW) is proposed. The algorithm first partitions a sliding window into k sub-blocks, using this parameter to control and tune the statistical error of the histogram; it then optimizes the privacy-budget allocation for the current window by comparing the data distributions of two adjacent histograms, quickly computing a locally optimal histogram. To validate the algorithm, rigorous theoretical derivation first confirms that it satisfies differential privacy and that its approximation error does not exceed W/2k. Experiments on real data sets then show that its publication error is low, 50% below that of the SSHP algorithm.
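The per-bucket noise addition underlying such histogram publication can be sketched with the standard Laplace mechanism; this is the generic mechanism, not the HPA-SW budget-allocation algorithm itself:

```python
import random

def laplace(scale):
    # Difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_histogram(counts, epsilon):
    """Laplace mechanism on a histogram: one record changes one bucket by 1
    (sensitivity 1), so noise of scale 1/epsilon gives epsilon-differential
    privacy for the whole histogram."""
    return [c + laplace(1.0 / epsilon) for c in counts]

random.seed(0)
noisy = private_histogram([30, 50, 20], epsilon=1.0)
assert len(noisy) == 3
```

Splitting a window into k sub-blocks, as the paper does, then becomes a question of how to divide the budget epsilon across those per-block histograms.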
7.
8.
Duplicate-data detection can greatly reduce the storage volume of a data center, save network bandwidth, and cut construction and operating costs. To overcome the tendency of content-defined chunking (CDC) to produce oversized chunks, this paper proposes a duplicate-data detection algorithm based on extremum-defined chunking (EDC). EDC first computes the fingerprints of the data in every sliding window whose right boundary falls within the lower and upper chunk-size limits, finds the last fingerprint extremum, and takes the end position of the corresponding sliding window as the chunk boundary; it then computes the chunk's hash and checks whether the chunk is a duplicate. Experimental results show that EDC's duplicate detection rate and disk utilization are 1.48 and 1.12 times those of CDC, a significant improvement.
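A simplified sketch of extremum-defined chunking, using raw byte values in place of the window fingerprints, with illustrative size limits MIN_BLOCK and MAX_BLOCK:

```python
MIN_BLOCK, MAX_BLOCK = 4, 12  # illustrative chunk-size limits

def edc_cut(data, start):
    """Cut just after the last maximum value found in the candidate
    boundary range [start + MIN_BLOCK, start + MAX_BLOCK)."""
    end = min(start + MAX_BLOCK, len(data))
    lo = start + MIN_BLOCK
    if lo >= end:                 # remainder is too short to split further
        return end
    window = data[lo:end]
    best = max(window)
    idx = len(window) - 1 - window[::-1].index(best)  # last occurrence
    return lo + idx + 1

def chunk(data):
    pos, chunks = 0, []
    while pos < len(data):
        cut = edc_cut(data, pos)
        chunks.append(data[pos:cut])
        pos = cut
    return chunks

pieces = chunk(b"the quick brown fox jumps over the lazy dog")
assert b"".join(pieces) == b"the quick brown fox jumps over the lazy dog"
assert all(len(p) <= MAX_BLOCK for p in pieces)
```

Bounding every boundary search by MAX_BLOCK is what rules out the oversized chunks that plain CDC can produce.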
9.
10.
Research on a Web-standards-based page blocking algorithm. Cited by: 1 (self-citations: 0, other citations: 1)
Page blocking plays an important role in document classification, information extraction, topic-focused crawling, and search-engine optimization. This paper proposes a page-blocking algorithm based on Web standards: it parses the page, analyzes its layout, and uses Web standards to divide the page into blocks. Experiments show that, for pages that follow Web standards, the algorithm improves blocking accuracy and its adaptability to complex pages.
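As a toy illustration of tag-based blocking, the sketch below splits a page at an assumed set of HTML5 sectioning tags using Python's standard `html.parser`; the tag set and class name are illustrative, not the paper's algorithm:

```python
from html.parser import HTMLParser

# Assumed set of HTML5 sectioning tags treated as block boundaries.
BLOCK_TAGS = {"div", "section", "article", "header", "footer", "nav", "aside"}

class BlockSplitter(HTMLParser):
    """Collect the text of each top-level sectioning element as one block."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.depth += 1
            if self.depth == 1:       # a new top-level block starts here
                self.blocks.append("")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth >= 1 and data.strip():
            self.blocks[-1] += data.strip()

splitter = BlockSplitter()
splitter.feed("<body><header>Logo</header><article>Main story</article></body>")
assert splitter.blocks == ["Logo", "Main story"]
```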
11.
To remedy the large volume of differential data transferred between client and server when the rsync algorithm performs remote file synchronization, a new remote file-synchronization method is proposed. Building on the rsync algorithm, it applies differential compression: block-move techniques and the KMP algorithm locate the differences and matches between client and server, and a sliding-window compression algorithm compresses the differential data, effectively reducing its traffic on the network. Experiments show that the method reduces the difference volume by more than 97%, cutting the amount of differential data transmitted, lowering network bandwidth consumption, and improving remote file-synchronization efficiency.
12.
13.
Chunking is a process that splits a file into smaller pieces called chunks. In some applications, such as remote data compression, data synchronization, and data deduplication, chunking is important because it determines the duplicate-detection performance of the system. Content-defined chunking (CDC) is a method of splitting files into variable-length chunks, where the cut points are defined by internal features of the files. Unlike fixed-length chunks, variable-length chunks are more resistant to byte shifting, which increases the probability of finding duplicate chunks within a file and between files. However, CDC algorithms require additional computation to find the cut points, which may be expensive for some applications. In our previous work (Widodo et al., 2016), the hash-based CDC algorithm took more processing time than any other process in the deduplication system. This paper proposes a high-throughput hash-less chunking method called Rapid Asymmetric Maximum (RAM). Instead of hashes, RAM uses byte values to declare the cut points. The algorithm uses a fixed-size window and a variable-size window to find a maximum-valued byte, which is the cut point. The maximum-valued byte is included in the chunk and located at its boundary. This configuration allows RAM to perform fewer comparisons while retaining the CDC property. We compared RAM with existing hash-based and hash-less deduplication systems. The experimental results show that the proposed algorithm achieves higher throughput and more bytes saved per second than other chunking algorithms.
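A compact sketch of the asymmetric-maximum idea with illustrative window sizes; this follows the description above, not the authors' reference implementation:

```python
FIXED = 8       # assumed fixed-window size
MAX_CHUNK = 64  # assumed upper bound on chunk length

def ram_cut(data, start):
    """The maximum byte of the fixed window sets a threshold; the first byte
    in the variable window that reaches it becomes the last byte of the
    chunk, so the maximum-valued byte sits at the chunk boundary."""
    fixed_end = min(start + FIXED, len(data))
    threshold = max(data[start:fixed_end])
    limit = min(start + MAX_CHUNK, len(data))
    for i in range(fixed_end, limit):
        if data[i] >= threshold:
            return i + 1          # include the maximum-valued byte
    return limit                  # no qualifying byte: cut at the size cap

def ram_chunk(data):
    pos, chunks = 0, []
    while pos < len(data):
        cut = ram_cut(data, pos)
        chunks.append(data[pos:cut])
        pos = cut
    return chunks
```

Because each position is compared against a single running threshold, no rolling hash has to be evaluated at every offset, which is where the throughput gain over hash-based CDC comes from.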
14.
To improve the efficiency of remote data synchronization, this paper analyzes the rsync synchronization framework and proposes a method that synchronizes data with the help of a file-system monitoring mechanism, using multithreaded pipelining for real-time synchronization. On the primary server, the file-system monitoring mechanism propagates data in real time to each target server for backup. On top of real-time synchronization, a full synchronization is performed at fixed intervals to guarantee data integrity. Test results show that the proposed improvements to rsync raise synchronization efficiency to a certain extent.
15.
In the RPKI (Resource Public Key Infrastructure), signed objects are downloaded and processed by the RP (Relying Party) into authentic authorization relationships between IP address blocks and AS (Autonomous System) numbers, which guide BGP routing. Current RPs use the rsync (Remote Sync) software for synchronization, but rsync's algorithm does not take the characteristics of RPKI files and directories into account, so its synchronization efficiency is unsatisfactory. Based on an analysis of those characteristics, this paper designs and implements htsync, an RPKI repository synchronization tool based on an ordered hash tree. Experimental results show that, compared with rsync, htsync transfers less data and takes less time; in the three experimental scenarios designed, the average synchronization-time speedups are 38.70%, 30.13%, and 3.63%, effectively reducing the time and resources consumed during synchronization.
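The core idea of a hash tree for repository synchronization can be sketched as follows; the node layout, names, and use of SHA-256 are assumptions for illustration, not htsync's actual design:

```python
import hashlib

def tree_hash(name, content):
    """Hash of a tree node: `content` is bytes for a file, or a list of
    (name, content) children for a directory, hashed in name order so the
    digest is deterministic. Equal digests mean the subtrees are already
    in sync and need no transfer."""
    h = hashlib.sha256(name.encode())
    if isinstance(content, bytes):
        h.update(content)
    else:
        for child_name, child_content in sorted(content, key=lambda c: c[0]):
            h.update(tree_hash(child_name, child_content))
    return h.digest()

repo_a = [("roa1.roa", b"sig-1"), ("certs", [("x.cer", b"c")])]
repo_b = [("roa1.roa", b"sig-1"), ("certs", [("x.cer", b"d")])]
assert tree_hash("/", repo_a) != tree_hash("/", repo_b)       # a difference is visible at the root
assert tree_hash("/", repo_a) == tree_hash("/", repo_a[::-1]) # listing order does not matter
```

Comparing digests top-down lets the two sides skip whole unchanged directories, which is why such a tool can move less data than rsync's per-file comparison.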
16.
17.
18.
This paper presents a new scheme of I/O scheduling on the storage servers of distributed/parallel file systems for better I/O performance. To this end, we first analyze the read/write requests in a storage server's I/O queue (we name them block I/Os) using our proposed technique of horizontal partitioning. All block requests are then divided into multiple groups on the basis of their offsets; that is, all requests related to the same chunk file are grouped together and satisfied within the same time slot, between opening and closing the target chunk file on the storage server. As a result, the time taken to complete block I/O requests decreases significantly, because fewer file operations are performed on the corresponding chunk files at the low-level file systems of the server machines. Furthermore, we introduce an algorithm that rates a priority for each group of block I/O requests, and the storage server dispatches groups of I/Os in priority order. Consequently, applications with higher I/O priorities, e.g., those with fewer I/O operations and smaller amounts of data, can finish at an earlier time. We implemented a prototype of this server-side scheduling in the PARTE file system to demonstrate the feasibility and applicability of the proposed scheme. Experimental results show that the new scheme achieves higher I/O bandwidth and shorter I/O times than the First Come First Served strategy and other server-side I/O scheduling approaches.
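The grouping-by-chunk and priority-ordering steps can be sketched as follows; CHUNK_SIZE and the priority function are illustrative assumptions, not PARTE's actual policy:

```python
from collections import defaultdict

CHUNK_SIZE = 64 * 1024  # assumed chunk-file size

def group_and_order(requests):
    """Group block I/Os (offset, size) by target chunk file, then order
    the groups so that lighter workloads (fewer requests, less data) are
    dispatched first."""
    groups = defaultdict(list)
    for offset, size in requests:
        groups[offset // CHUNK_SIZE].append((offset, size))

    def priority(item):
        reqs = item[1]
        return (len(reqs), sum(size for _, size in reqs))

    return [reqs for _, reqs in sorted(groups.items(), key=priority)]

# Two requests hit chunk 0 and one hits chunk 1; the lighter group goes first.
dispatch = group_and_order([(0, 10), (70000, 5), (100, 20)])
assert dispatch == [[(70000, 5)], [(0, 10), (100, 20)]]
```

Serving each group within one open/close of its chunk file is what cuts the number of low-level file operations.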
19.
20.
Journal of Parallel and Distributed Computing, 1988, 5(1): 59-81
Evaluating the performance of local area networks is a major concern of the research community and of organizations that install this type of network. In data-processing applications, sorting is one of the most important and most frequently performed operations, because database operations can be performed efficiently on sorted files. The efficiency of sorting algorithms in local networks therefore plays a significant role in determining overall network performance. This paper evaluates four alternative methods of performing an external sort in common-bus local networks. Each method has five steps: interrupt and synchronization, network data transfer by packets, local sorting, global sorting, and output. The execution time of each method is calculated for four cases using different overlaps among the time components. Each method is evaluated by observing its behavior at different network speeds, file sizes, network sizes, page sizes, I/O times, and interrupt and synchronization times. The paper shows how changing the values of these variables affects the performance of the sorting algorithms. These observations are useful in designing sorting algorithms and other general parallel algorithms for common-bus local networks.
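The local-sorting-then-global-sorting steps correspond to a classic external merge; a minimal sketch of the global merge of locally sorted runs:

```python
import heapq

def global_sort(local_runs):
    """Merge runs that each node has already sorted locally into one
    globally sorted sequence, as in the local/global sorting steps."""
    return list(heapq.merge(*local_runs))

# Three nodes each contribute a sorted run; the merge interleaves them.
assert global_sort([[1, 4, 7], [2, 5], [3, 6]]) == [1, 2, 3, 4, 5, 6, 7]
```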