Similar Literature
20 similar documents retrieved.
1.
The exponential growth of digital data in cloud storage systems is a critical issue at present, as the large amount of duplicate data in these systems places an extra load on them. Deduplication is an efficient technique that has gained attention in large-scale storage systems: it eliminates redundant data, improves storage utilization and reduces storage cost. This paper presents a broad, methodical literature review of existing data deduplication techniques, along with various existing taxonomies of deduplication techniques for cloud data storage. Furthermore, the paper investigates deduplication techniques for text and multimedia data along with their corresponding taxonomies, as these techniques face different challenges in duplicate detection. This work is useful for identifying deduplication techniques for text, image and video data. It also discusses existing challenges and significant research directions in deduplication for future researchers, and the article concludes with a summary of valuable suggestions for future enhancements in deduplication.

2.
Deduplication is an important technology in cloud storage services. To protect user privacy, sensitive data usually have to be encrypted before outsourcing, which makes secure data deduplication a challenging task. Although convergent encryption is used to securely eliminate duplicate copies of encrypted data, these secure deduplication techniques support only exact data deduplication; that is, traditional deduplication schemes tolerate no differences at all. This requirement is too strict for multimedia data such as images. For images, typical modifications such as resizing and compression change only their binary representation while preserving human visual perception, so such copies should also be eliminated as duplicates. These perceptually similar images occupy a lot of storage space on the remote server and greatly affect the efficiency of the deduplication system. In this paper, we first formalize and solve the problem of effective fuzzy image deduplication while maintaining user privacy. Our solution eliminates duplicated images based on a measure of image similarity over encrypted data. The robustness evaluation demonstrates that this fuzzy deduplication system is able to deduplicate perceptually similar images, which greatly reduces the storage and bandwidth overhead in cloud storage services.
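For orientation, here is a minimal sketch of the convergent-encryption step that schemes like this build on: the key is derived from the content itself, so identical plaintexts yield identical ciphertexts and the server can deduplicate them without seeing the data. The AES-GCM construction from the `cryptography` package and the content-derived nonce and tag are assumptions made for illustration, not details taken from the paper, and the paper's perceptual-similarity matching over ciphertexts is not shown.

```python
# Minimal convergent-encryption sketch (assumed AES-GCM construction, not the paper's exact scheme).
# Identical plaintexts yield identical keys, nonces and ciphertexts, so the server can deduplicate them.
import hashlib
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def convergent_encrypt(data: bytes):
    key = hashlib.sha256(data).digest()             # content-derived key
    nonce = hashlib.sha256(key).digest()[:12]       # deterministic 12-byte nonce (key is unique per content)
    tag = hashlib.sha256(key).hexdigest()           # dedup identifier sent to the server
    ciphertext = AESGCM(key).encrypt(nonce, data, None)
    return key, tag, ciphertext

k1, t1, c1 = convergent_encrypt(b"same image bytes")
k2, t2, c2 = convergent_encrypt(b"same image bytes")
assert c1 == c2 and t1 == t2    # duplicates are detectable without revealing the plaintext
```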

3.
Chunking is a process that splits a file into smaller pieces called chunks. In applications such as remote data compression, data synchronization, and data deduplication, chunking is important because it determines the duplicate detection performance of the system. Content-defined chunking (CDC) splits files into variable-length chunks whose cut points are defined by internal features of the files. Unlike fixed-length chunks, variable-length chunks are more resistant to byte shifting, which increases the probability of finding duplicate chunks within a file and between files. However, CDC algorithms require additional computation to find the cut points, which can be computationally expensive for some applications. In our previous work (Widodo et al., 2016), the hash-based CDC algorithm used in the system took more processing time than any other process in the deduplication system. This paper proposes a high-throughput hash-less chunking method called Rapid Asymmetric Maximum (RAM). Instead of using hashes, RAM uses byte values to declare the cut points. The algorithm utilizes a fixed-size window and a variable-size window to find a maximum-valued byte, which becomes the cut point. The maximum-valued byte is included in the chunk and located at the chunk boundary. This configuration allows RAM to perform fewer comparisons while retaining the CDC property. We compared RAM with existing hash-based and hash-less deduplication systems. The experimental results show that our proposed algorithm achieves higher throughput and more bytes saved per second than other chunking algorithms.
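A compact sketch of the RAM cut-point rule summarized above: a fixed-size window at the start of each chunk yields a maximum byte value, and the first byte in the following variable-size region that is not smaller than that maximum becomes the cut point, which stays inside the chunk. The window and maximum-chunk sizes here are illustrative defaults, not the paper's parameters.

```python
def ram_chunks(data: bytes, fixed_w: int = 2048, max_chunk: int = 16384):
    """Rapid Asymmetric Maximum (RAM) style hash-less chunking (simplified sketch)."""
    chunks, start, n = [], 0, len(data)
    while start < n:
        end = min(start + fixed_w, n)
        local_max = max(data[start:end])                 # max byte value in the fixed window
        cut = min(start + max_chunk, n)                  # fall back to a hard chunk-size limit
        for i in range(end, min(start + max_chunk, n)):
            if data[i] >= local_max:                     # first byte >= fixed-window maximum
                cut = i + 1                              # the cut-point byte stays inside the chunk
                break
        chunks.append(data[start:cut])
        start = cut
    return chunks

print([len(c) for c in ram_chunks(bytes(range(256)) * 64)])
```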

4.
To address the data-quality problems of heavy redundancy and poor query efficiency in the massive data held in social-engineering databases, this paper proposes an effective partition-based sorted-neighborhood algorithm. Social-engineering data collected through different channels and stored in different ways are first integrated into a massive data set that can be stored as a two-dimensional table; a partitioning strategy then splits this large data set into clusters, and an improved sorted-neighborhood algorithm examines the small data set within each cluster to produce the final set of detected similar (near-duplicate) records. Experiments and comparative analysis show that combining partitioning with sorted-neighborhood detection improves not only the time efficiency of near-duplicate record detection over massive data but also its detection accuracy.
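A toy sketch of the partition-plus-sorted-neighborhood idea: records are partitioned into clusters by a blocking key, each cluster is sorted, and only records inside a small sliding window are compared. The blocking key, window size, and `difflib`-based similarity measure below are illustrative stand-ins for the paper's improved algorithm.

```python
from difflib import SequenceMatcher

def similar(a: str, b: str, threshold: float = 0.9) -> bool:
    return SequenceMatcher(None, a, b).ratio() >= threshold

def partitioned_snm(records, block_key=lambda r: r[:3], window: int = 5):
    """Partition records into clusters, then run sorted-neighborhood matching inside each cluster."""
    clusters = {}
    for r in records:
        clusters.setdefault(block_key(r), []).append(r)    # partition step
    duplicates = []
    for cluster in clusters.values():
        cluster.sort()                                      # neighbor-sorting step
        for i, r in enumerate(cluster):
            for other in cluster[i + 1:i + window]:         # compare only inside the sliding window
                if similar(r, other):
                    duplicates.append((r, other))
    return duplicates

recs = ["zhang san,1380000", "zhang san,1380001", "li si,1590000"]
print(partitioned_snm(recs))
```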

5.
With the continued development of network technology and power-industry informatization, online information keeps expanding, leaving the Internet and power information networks with massive numbers of redundant web pages. This redundancy increases the complexity of data mining and fast retrieval and places heavy demands on network and storage equipment, so fast deduplication of massive web pages is well worth studying. Web page deduplication is the process of detecting redundant pages in a given large data collection and then removing them from it. URL-based deduplication of same-origin pages has made considerable progress, but there is still no good solution for deduplicating web pages at massive scale. Building on an MD5 fingerprint-database deduplication algorithm and the properties of the Counting Bloom filter, this paper proposes a fast deduplication algorithm, IMP-CMFilter, which improves the efficiency of massive web page deduplication by reducing frequent I/O operations. Experiments demonstrate the effectiveness of the IMP-CMFilter algorithm.
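A minimal Counting Bloom filter in the spirit of IMP-CMFilter: MD5-derived page fingerprints are inserted into a counter array, so membership can be tested in memory (cutting down on I/O) and fingerprints can also be removed, which a plain Bloom filter cannot do. The array size and number of hash functions are illustrative; the paper's exact construction may differ.

```python
import hashlib

class CountingBloomFilter:
    def __init__(self, size: int = 1 << 20, hashes: int = 4):
        self.size, self.hashes = size, hashes
        self.counters = [0] * size

    def _positions(self, fingerprint: str):
        # Derive k positions from the MD5 fingerprint by re-hashing with a salt index.
        for i in range(self.hashes):
            h = hashlib.md5(f"{i}:{fingerprint}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, fingerprint: str):
        for p in self._positions(fingerprint):
            self.counters[p] += 1

    def remove(self, fingerprint: str):
        for p in self._positions(fingerprint):
            self.counters[p] -= 1

    def __contains__(self, fingerprint: str) -> bool:
        return all(self.counters[p] > 0 for p in self._positions(fingerprint))

page_fp = hashlib.md5(b"<html>page body</html>").hexdigest()
cbf = CountingBloomFilter()
if page_fp not in cbf:          # unseen page: keep it and record the fingerprint
    cbf.add(page_fp)
print(page_fp in cbf)           # True (subject to the usual false-positive rate)
```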

6.
Multi-server, multi-user network file storage systems commonly suffer from uneven resource allocation, large numbers of duplicate files, and serious waste of storage space. This paper designs and implements the TNS network file storage system. Built on a multi-server storage architecture consisting of user, index, data, sharing, management, and login servers, it serves multiple users, uses consistent hashing for load balancing, and supports file-granularity deduplication on the client side. Tests in a real production environment show good load-balancing capability and effective deduplication, saving storage space and improving storage device utilization.
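A small consistent-hash ring sketch showing how a system like TNS can map file fingerprints to data servers so that load stays balanced and adding or removing a server moves only a fraction of the keys. The virtual-node count and MD5-based ring below are common defaults assumed here, not details taken from the paper.

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, servers, vnodes: int = 100):
        # Each server gets many virtual nodes on the ring to smooth out the load.
        self.ring = sorted((self._hash(f"{s}#{v}"), s) for s in servers for v in range(vnodes))
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def server_for(self, file_id: str) -> str:
        idx = bisect.bisect(self.keys, self._hash(file_id)) % len(self.keys)
        return self.ring[idx][1]

ring = ConsistentHashRing(["data1", "data2", "data3"])
print(ring.server_for("sha1:9f86d081884c7d65"))   # hypothetical file fingerprint
```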

7.
With the rapid growth of the multi-cloud storage market, more and more users choose to store their data in the cloud, and duplicate data in cloud environments is growing explosively as a result. Because cloud service brokers are independent of one another, traditional deduplication can only eliminate redundant data on the few cloud servers managed by a single broker. To further strengthen deduplication in cloud environments, this paper proposes a multi-broker joint deduplication scheme. Blockchain technology is used to foster cooperation among cloud service brokers and build a broker alliance, extending the scope of deduplication from the cloud managed by a single broker to the multi-cloud managed by many brokers, while delivering a win-win outcome for users, brokers, and cloud service providers. Experiments show that the multi-broker joint deduplication scheme significantly improves deduplication effectiveness and saves network bandwidth.

8.
Large numbers of duplicate images in a database not only hurt learner performance but also consume a great deal of storage space. For massive-scale image deduplication, this paper proposes a duplicate-image detection algorithm based on block-wise local probing of pHash values. First, the pHash of every image is computed; next, each pHash is split into several equal-length parts, and two images are treated as potential duplicates if the value of any one of their pHash parts matches; finally, the transitivity of image duplication is examined, and the algorithm is implemented for both the transitive and non-transitive cases. Experimental results show very high efficiency on massive image sets: with the similarity threshold set to 13, the transitive variant checks nearly 300,000 images for duplicates in only 2 minutes, with an accuracy of 53%.
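A sketch of the block-wise pHash probing described above: each 64-bit pHash is split into equal-length parts, images that share any one part fall into the same candidate bucket, and candidates are confirmed with a Hamming-distance threshold (13 in the reported experiments). The `imagehash`/`Pillow` tooling is an assumption; the paper does not prescribe a library.

```python
# pip install pillow imagehash   (assumed tooling, not mandated by the paper)
from collections import defaultdict
import imagehash
from PIL import Image

def phash_bits(path: str) -> str:
    """64-bit perceptual hash as a bit string."""
    return format(int(str(imagehash.phash(Image.open(path))), 16), "064b")

def find_duplicates(paths, parts: int = 4, threshold: int = 13):
    """Bucket images by equal-length pHash parts, then confirm with a Hamming-distance check."""
    hashes = {p: phash_bits(p) for p in paths}
    step = 64 // parts
    buckets = defaultdict(list)
    for p, bits in hashes.items():
        for i in range(parts):
            buckets[(i, bits[i * step:(i + 1) * step])].append(p)   # same part -> candidate pair
    duplicates = set()
    for group in buckets.values():
        for i, a in enumerate(group):
            for b in group[i + 1:]:
                hamming = sum(x != y for x, y in zip(hashes[a], hashes[b]))
                if hamming <= threshold:
                    duplicates.add((a, b))
    return duplicates
```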

9.
Data Deduplication Techniques
敖莉, 舒继武, 李明强. 《软件学报》 (Journal of Software), 2010, 21(4): 916-929.
Data deduplication techniques fall mainly into two categories: detection of identical data, and detection and encoding of similar data. This paper systematically surveys both categories and analyzes their advantages and disadvantages. In addition, because deduplication affects the reliability and performance of storage systems, it also surveys the techniques proposed to address these two concerns. From an analysis of the current state of deduplication research, the paper draws the following conclusions: (a) the problem of mining data characteristics for deduplication has not been fully solved, and how to exploit data-feature information to eliminate duplicates effectively still needs deeper study; (b) from the standpoint of storage system design, how to introduce suitable mechanisms to overcome the reliability limitations of deduplication and to reduce the extra system overhead it introduces also deserves further research.

11.
徐奕奕, 唐培和. 《计算机科学》 (Computer Science), 2015, 42(7): 174-177, 209.
Duplicate data in a cloud storage system is one form of large-scale redundancy, and removing it effectively and promptly keeps the system stable and running. Because cloud storage systems contain considerable interfering data and have a low signal-to-noise ratio, traditional deduplication algorithms produce spurious peaks in the fractional Fourier domain and cannot effectively detect, filter, and delete duplicate data. This paper therefore proposes an improved deduplication algorithm for cloud storage systems based on cumulant detection in the fractional Fourier transform domain. It first analyzes the architecture of the cloud storage deduplication mechanism, defines a fitness function for data storage points, and derives the random probability distribution of system subsets over cloud storage nodes; an empirical constraint function is used to distribute the parity data blocks across storage nodes, and the fractional Fourier transform is applied to pre-filter the residual signal of the amplitude-modulated components in the system. A fourth-order cumulant slice post-processing operator then splits each file into blocks, deduplicates each block, and applies post-detection filtering, realizing duplicate detection and deletion on the storage resources. Simulation experiments show that the algorithm improves the utilization of computing resources in a clustered cloud storage system, achieves a high rate of accurate duplicate deletion, and effectively avoids the false and missed deletions caused by interference in the data stream, giving superior performance.

12.
Delta compression is an efficient data reduction approach to removing redundancy among similar data chunks and files in storage systems. One of the main challenges facing delta compression is its low encoding speed, a problem that worsens in the face of steadily increasing storage and network bandwidth and speed. In this paper, we present Ddelta, a deduplication-inspired fast delta compression scheme that effectively leverages the simplicity and efficiency of data deduplication techniques to improve delta encoding/decoding performance. The basic idea behind Ddelta is to (1) accelerate the delta encoding and decoding processes with a novel approach that combines Gear-based chunking and Spooky-based fingerprinting for fast identification of duplicate strings for delta calculation, and (2) exploit the content locality of redundant data to detect more duplicates by greedily scanning the areas immediately adjacent to already detected duplicate chunks/strings. Our experimental evaluation of a Ddelta prototype on real-world datasets shows that Ddelta achieves an encoding speedup of 2.5×–8× and a decoding speedup of 2×–20× over the classic delta-compression approaches Xdelta and Zdelta while achieving a comparable compression ratio.
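A minimal version of the Gear-based rolling hash that Ddelta uses to find string boundaries quickly: the fingerprint is shifted and a per-byte random-table value is added, and a cut is declared when the top bits of the fingerprint hit a target value; each string is then fingerprinted for duplicate lookup (SpookyHash in the paper; plain SHA-1 from `hashlib` is substituted here for portability). The table seed and mask width are illustrative.

```python
import random
import hashlib

random.seed(42)
GEAR = [random.getrandbits(64) for _ in range(256)]    # fixed random table, one entry per byte value
MASK_BITS = 13                                         # ~8 KB average string size (illustrative)

def gear_chunk(data: bytes):
    """Split data into content-defined strings with a Gear rolling hash (simplified)."""
    fp, start, strings = 0, 0, []
    for i, b in enumerate(data):
        fp = ((fp << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
        if (fp >> (64 - MASK_BITS)) == 0:              # top bits hit the target -> string boundary
            strings.append(data[start:i + 1])
            start, fp = i + 1, 0
    if start < len(data):
        strings.append(data[start:])
    return strings

# Index strings by fingerprint so duplicates between similar chunks can be copy-encoded.
index = {hashlib.sha1(s).digest(): s for s in gear_chunk(b"example payload " * 1000)}
print(len(index), "unique strings")
```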

13.
With the development of campus informatization and the wide adoption of teaching, research, and administrative applications, unstructured data resources such as images, documents, and videos are growing rapidly. Coping with the ever-increasing storage demand in the campus network environment and improving the utilization of storage resources is an important issue in operating a campus data center. This paper describes the construction of a cloud storage platform based on the open-source software Swift, and the design and implementation of a campus cloud storage system with deduplication (Dedupe_swift). Introducing deduplication improves the utilization of the underlying storage space; a source-side deduplication mechanism shortens the upload time of duplicate files for users; and storage is offered to users as a service through a Web interface, giving them a good cloud storage experience.
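A minimal sketch of the source-side deduplication flow mentioned above: the client fingerprints the file first and uploads it only if the storage service does not already hold that fingerprint. The `object_exists`, `add_reference`, and `upload` calls stand in for whatever Swift-side API Dedupe_swift actually uses; they are hypothetical.

```python
import hashlib

def sha256_of_file(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def source_side_dedup_upload(path: str, storage) -> str:
    """Upload only if the fingerprint is unknown; otherwise just register another reference."""
    fingerprint = sha256_of_file(path)
    if storage.object_exists(fingerprint):        # hypothetical API of the storage backend
        storage.add_reference(fingerprint, path)  # duplicate file: no data transferred
        return "deduplicated"
    storage.upload(fingerprint, path)             # new content: full upload
    return "uploaded"
```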

14.
This study describes a false-alarm-probability (FAP) bounded solution for detecting and quantifying the major Heart Rate Turbulence (HRT) parameters, including heart rate (HR) acceleration/deceleration, turbulence jump, compensatory pause value and HR recovery rate. To this end, the high-resolution multi-lead Holter electrocardiogram (ECG) signal is first pre-processed via the Discrete Wavelet Transform (DWT), and a fixed-sample-size sliding window is then moved over the pre-processed trend. At each position, the area under the excerpted segment is multiplied by its curve length to generate the Area Curve Length (ACL) metric, which is used as the decision statistic (DS) for ECG event detection and delineation. The detection-delineation algorithm was applied to various existing databases and achieved average sensitivity and positive predictivity of Se = 99.95% and P+ = 99.92% for the detection of QRS complexes, with average maximum delineation errors of 7.4 ms, 4.2 ms and 8.3 ms for the P-wave, QRS complex and T-wave, respectively. Because the heart-rate time series may include fast fluctuations that do not follow a premature ventricular contraction (PVC), causing a high false-alarm probability (false-positive detections) in HRT detection, a new method for discriminating PVCs from other beats using geometrical features is proposed, based on the binary two-dimensional Neyman-Pearson radius test (a FAP-bounded classifier). The statistical performance of the proposed HRT detection-quantification algorithm was Se = 99.94% and P+ = 99.85%, showing a marginal improvement in the detection and quantification of this phenomenon. In summary, the marginal performance improvement of the ECG event detection-delineation process, high-performance PVC detection and isolation from noisy Holter data, and reliable robustness against strong Holter noise and artifacts are the main merits and capabilities of the proposed HRT detection algorithm.
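A small NumPy sketch of the Area Curve Length (ACL) decision statistic described above: for each position of a fixed-size sliding window over the pre-processed trend, the area under the excerpted segment is multiplied by its curve length. The sampling interval, window length, and toy signal are placeholders, and the paper's DWT pre-processing stage is not reproduced.

```python
import numpy as np

def acl_metric(signal: np.ndarray, window: int = 40, dt: float = 1.0 / 1000) -> np.ndarray:
    """Area-Curve-Length decision statistic over a sliding window (simplified)."""
    acl = np.zeros(len(signal) - window)
    for k in range(len(acl)):
        seg = signal[k:k + window]
        area = np.sum(np.abs(seg)) * dt                   # rectangle-rule area under the segment
        curve_len = np.sum(np.hypot(dt, np.diff(seg)))    # polygonal curve length of the segment
        acl[k] = area * curve_len                         # ACL decision statistic
    return acl

# Sharp QRS-like deflections stand out strongly in the ACL trend, which is what the detector thresholds.
t = np.linspace(0, 1, 1000)
ecg_like = 0.05 * np.sin(2 * np.pi * t) + (np.abs(t - 0.5) < 0.01) * 1.0
print(acl_metric(ecg_like).max())
```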

15.
Whenever files are modified, large parts of the existing data are unnecessarily re-written to storage because of the inefficiency in identifying those portions of the files that are actually new in the latest update. The unmodified data are considered duplicate data since they do not have to be re-written. If NAND flash memory is used for storage, it is beneficial to reduce such duplicate data as much as possible. The issue is how to identify and eliminate the duplicate regions efficiently. In this paper, an advanced flash file system architecture, called the duplication-eliminated flash file system (DeFFS), is introduced for duplicate elimination. The important design issues in supporting duplicate elimination are how to manage data blocks and how to detect duplicate regions. In DeFFS, the index entries of inodes support variable-sized blocks in order to increase the manageability and flexibility of duplicate regions. In addition, DeFFS uses a non-overlapping duplicate-checking algorithm to reduce the complexity of duplicate checking. Duplicate elimination can prolong flash memory life cycles by reducing the actual number of page writes, and it increases write bandwidth.

16.
To address the weakness that the convergent encryption used for deduplication in existing cloud storage systems is vulnerable to brute-force and guessing attacks, this paper proposes BFHDedup, a Bloom filter-based secure deduplication scheme for hybrid cloud storage. It improves the existing hybrid cloud storage model: a Key Server deployed in the private cloud uses a Bloom filter to authenticate users' authorization, providing fine-grained access control. A two-layer encryption mechanism adds an extra encryption algorithm on top of traditional convergent encryption, and file-level and block-level deduplication are combined to achieve fine-grained deduplication. In addition, BFHDedup uses a key-encryption-chain mechanism to cope with the key management problem that deduplication introduces. Security analysis and simulation results show that the scheme achieves high data confidentiality at a tolerable time cost, effectively resists brute-force and guessing attacks, improves the deduplication ratio, and reduces storage consumption.

17.
Virtual machine (VM) image backups have duplicate data blocks distributed across different physical addresses, which occupy a large amount of storage space in a cloud computing platform (Choo et al., [1] and González-Manzano et al., [2]). Deduplication is a widely used technology for reducing the redundant data in a VM backup process. However, deduplication always causes fragmentation of data blocks, which seriously affects VM restoration performance. Current approaches often rewrite data blocks to accelerate image restoration, but rewriting can cause significant performance overhead because of frequent I/O operations. To address this issue, we found through a series of experiments that the reference count is the key to the degree of fragmentation. We therefore propose a reference-count-based rewriting method to defragment VM image backups, and a caching method based on the distribution of rewritten data blocks to restore VM images. Compared with existing studies, our approach does not interfere with the deduplication process, needs no extra storage, and efficiently improves the performance of VM image restoration. We have implemented a prototype to evaluate our approach on our real cloud computing platform, OnceCloud. Experimental results show that our approach can reduce the dispersion degree of data blocks by about 57% and accelerate the image restoration of virtual machines by about 51%.
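A schematic of the reference-count-driven rewriting decision: during backup, a duplicate block whose index entry has accumulated a high reference count (a proxy for fragmentation) is rewritten next to the new backup instead of being referenced in place. The threshold and data structures are illustrative simplifications; the paper's exact policy and its restore-time cache are not reproduced.

```python
REWRITE_THRESHOLD = 32   # illustrative reference-count threshold, not the paper's value

def backup_block(block_hash, data, index, new_container):
    """index: block hash -> {'container': container id, 'refs': reference count}.
    new_container: (container_id, list of blocks being written for the current backup)."""
    container_id, blocks = new_container
    entry = index.get(block_hash)
    if entry is None:
        blocks.append(data)                              # genuinely new block: write it
        index[block_hash] = {"container": container_id, "refs": 1}
        return "written"
    entry["refs"] += 1
    if entry["refs"] > REWRITE_THRESHOLD:                # heavily referenced block: fragmentation risk
        blocks.append(data)                              # rewrite it next to the new backup
        index[block_hash] = {"container": container_id, "refs": 1}
        return "rewritten"
    return "referenced"                                  # normal dedup: point at the existing copy
```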

18.
Deduplication-enabled cloud storage systems generally use convergent encryption, taking the hash of the data as its encryption key so that identical data encrypts to identical ciphertext and duplicates can be removed; a proof of ownership (PoW) is then used to verify the authenticity of a user's data and guarantee data security. To address the performance degradation caused by the high time cost of proofs of ownership in such systems, this paper proposes an efficient and secure Bloom filter-based PoW method that quickly verifies a user's computed hash values against the initialization values. It further proposes a BF scheme supporting fine-grained deduplication: a proof of ownership is run only when file-level duplicates exist; otherwise only local block-level duplicate detection is needed. Comparative simulation shows that the proposed BF scheme has lower space overhead and lower time overhead than the classical Baseline scheme, with the performance advantage growing as data files get larger.

19.
In this paper, we present an efficient and simplified algorithm for conversion from the Residue Number System (RNS) to a weighted number system, which in turn simplifies the implementation of RNS sign detection, magnitude comparison, and overflow detection. The algorithm is based on Mixed Radix Conversion (MRC). The new algorithm simplifies the hardware implementation and improves the conversion speed by replacing a number of multiplication operations with small look-up tables, and it requires less ROM than existing algorithms. For a moduli set consisting of eight moduli, the new algorithm requires seven tables to perform the conversion, with a total table size of 519 bits, while the Szabo and Tanaka MRC algorithm [N.S. Szabo, R.I. Tanaka, Residue Arithmetic and its Application to Computer Technology, McGraw-Hill, New York, 1967; C.H. Huang, A fully parallel mixed-radix conversion algorithm for residue number applications, IEEE Transactions on Computers c-32 (4) (1983)] requires 28 tables with a total table size of 8960 bits, and the Huang MRC algorithm (Huang, 1983) requires 36 tables with a total table size of 5760 bits.
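A direct implementation of the classical Szabo-Tanaka mixed-radix conversion that the abstract builds on: the residues are turned into mixed-radix digits using modular inverses (which the proposed algorithm replaces with small look-up tables), and the digits then give the weighted value, from which sign detection and magnitude comparison follow.

```python
def mixed_radix_digits(residues, moduli):
    """Convert an RNS value to mixed-radix digits a_i with x = a0 + a1*m0 + a2*m0*m1 + ..."""
    a = list(residues)
    for i in range(1, len(moduli)):
        for j in range(i):
            inv = pow(moduli[j], -1, moduli[i])    # the proposed algorithm replaces these with look-up tables
            a[i] = ((a[i] - a[j]) * inv) % moduli[i]
    return a

def rns_to_int(residues, moduli):
    digits, value, weight = mixed_radix_digits(residues, moduli), 0, 1
    for d, m in zip(digits, moduli):
        value += d * weight
        weight *= m
    return value

moduli = (3, 5, 7)
x = 52
residues = tuple(x % m for m in moduli)            # (1, 2, 3)
assert rns_to_int(residues, moduli) == x           # sign detection/magnitude comparison use the same digits
```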

20.
Cross-user deduplication improves cloud storage efficiency and users' bandwidth utilization by reducing the amount of duplicate data uploaded from the client side. During upload, however, the deterministic deduplication response the cloud provider returns to the user creates a highly risky side channel through which an attacker can infer whether target data already exists in the cloud. Existing side-channel-resistant cross-user deduplication methods use various obfuscation strategies to confuse the attacker's judgment, but they fall short of complete obfuscation, and attackers can still steal data through dictionary attacks, appended-chunk attacks, and similar means. Preventing attackers from exploiting this side channel to learn the existence privacy of data has therefore become a pressing problem for cross-user deduplication. To meet this challenge, this paper adopts a new secure cross-user deduplication framework based on generalized deduplication: the original data is decomposed at the byte level into a base and a deviation (offset), the base is deduplicated across users, and the deviation is deduplicated in the cloud. In particular, the base is extracted using ideas from Reed-Solomon erasure coding, so that identical bases can be extracted from similar data with high probability. This not only obfuscates the attacker but also saves communication and cloud storage overhead. To further improve efficiency, a data compression algorithm is applied to the deviations before upload to reduce the redundancy among them. Experimental results show that, while effectively resisting side-channel attacks, the method offers significant advantages in communication and storage efficiency over the latest work in this area.
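A toy illustration of the base/deviation split behind this framework: each chunk is mapped to a "base" that similar chunks share, the base is deduplicated across users, and the per-chunk deviation is compressed before upload. The nibble-masking base extraction below is only a stand-in for the paper's Reed-Solomon-code-based extraction, and all names are hypothetical.

```python
import hashlib
import zlib

def extract_base(chunk: bytes) -> bytes:
    # Stand-in for the paper's Reed-Solomon-based extraction: similar chunks that differ only in the
    # low-order bits of each byte map to the same base. Real generalized deduplication would instead
    # decode the chunk to the nearest codeword.
    return bytes(b & 0xF0 for b in chunk)

def prepare_upload(chunk: bytes):
    base = extract_base(chunk)
    deviation = bytes(b & 0x0F for b in chunk)           # what the base does not capture
    return {
        "base_id": hashlib.sha256(base).hexdigest(),     # deduplicated across users
        "deviation": zlib.compress(deviation),           # compressed, deduplicated server-side
    }

a = prepare_upload(b"hello cloud storage")
b = prepare_upload(b"hemlo cloud storagd")               # similar data -> same base, different deviation
print(a["base_id"] == b["base_id"])                      # True
```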
