Article Search
Full-text access: paid full text 59, free 12, domestic free 23
Subject categories: general 5, radio & electronics 17, general industrial technology 1, automation technology 71
By publication year: 2024 (1), 2023 (2), 2022 (5), 2021 (2), 2020 (13), 2019 (7), 2018 (7), 2017 (12), 2016 (8), 2015 (7), 2014 (11), 2013 (4), 2012 (9), 2011 (2), 2010 (1), 2008 (1), 2007 (2)
94 results found in total (search time: 62 ms)
1.
A Deep Web entity identification mechanism based on semantics and statistical analysis   Total citations: 1; self-citations: 0; citations by others: 1
寇月  申德荣  李冬  聂铁铮 《软件学报》2008,19(2):194-208
This paper analyzes common entity identification methods and proposes a Deep Web entity identification mechanism based on semantics and statistical analysis (SS-EIM), which effectively addresses error correction, duplicate elimination, and integration in Deep Web data integration. SS-EIM consists of a text matching model, a semantic analysis model, and a grouping statistics model. It adopts a three-stage, coarse-to-fine refinement strategy of rough text matching, surface-form association acquisition, and grouped statistical analysis, continually refining the identification results based on textual features, semantic information, and constraint rules. Using the limited instance information available, it applies an adaptive knowledge-maintenance strategy that combines static analysis with dynamic coordination to build and improve the surface-form association knowledge base, so as to adapt to the dynamic nature of Web data and keep the association knowledge complete. Experiments verify the feasibility and effectiveness of the key techniques used in SS-EIM.
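A minimal sketch of the coarse-to-fine matching idea described in this abstract, assuming a toy alias knowledge base and thresholds of my own choosing; it illustrates the general approach only, not the authors' SS-EIM implementation (the grouping-statistics stage is omitted).

```python
# Hypothetical two-stage entity matching: rough text similarity first, then a
# fallback check against a surface-form association knowledge base.
from difflib import SequenceMatcher

# Hypothetical alias knowledge base: surface form -> canonical entity name.
ALIAS_KB = {
    "IBM": "International Business Machines",
    "Int'l Business Machines": "International Business Machines",
}

def text_score(a: str, b: str) -> float:
    """Stage 1: rough textual similarity of two surface forms."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def canonical(name: str) -> str:
    """Stage 2: resolve surface forms through the association knowledge base."""
    return ALIAS_KB.get(name, name)

def same_entity(a: str, b: str, threshold: float = 0.85) -> bool:
    """Rough text match first; fall back on the association KB for aliases."""
    if text_score(a, b) >= threshold:
        return True
    return canonical(a) == canonical(b)

print(same_entity("IBM", "International Business Machines"))  # True, via the KB
print(same_entity("HadUP", "Hadoop"))                         # False
```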
2.
Cloud storage is essential for storing and retrieving user data across distributed data centres, with the service typically billed according to the storage consumed. Because data centres hold massive amounts of data in which similar information and file structures are kept in multiple copies, duplication inflates storage space. Existing deduplication systems do not reduce data efficiently because they are inaccurate at identifying similar data, which drives up storage consumption and cost. To resolve this problem, this paper proposes an efficient storage-reduction scheme, Hash-Indexing Block-based Deduplication (HIBD), based on Segmented Bind Linkage (SBL) methods for reducing storage in a cloud environment. Preprocessing is first performed using a sparse augmentation technique. The preprocessed files are then segmented into blocks to build a hash index. Block contents are compared across files through Semantic Content Source Deduplication (SCSD), which identifies shared content between files. Based on the content-presence count, Distance Vector Weightage Correlation (DVWC) estimates the document similarity weight, and related files are grouped into clusters. Finally, segmented bind linkage compares documents to find duplicate content within each cluster using the similarity weight and a coefficient match criterion. This approach identifies data redundancy efficiently and reduces the service cost of distributed cloud storage.
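The following is a minimal sketch of the block-level hash-indexing step described above (splitting a file into blocks, fingerprinting each block, and storing each unique block once); the SCSD, DVWC, and SBL stages of HIBD are not reproduced, and the block size is an assumed parameter.

```python
# Block-level hash indexing for deduplication: files are cut into blocks, each
# block is fingerprinted, and only previously unseen blocks are stored.
import hashlib

BLOCK_SIZE = 4096          # assumed block size
hash_index = {}            # fingerprint -> stored block (the "hash index")

def store(data: bytes):
    """Return a recipe of fingerprints; add only new blocks to the index."""
    recipe = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        fp = hashlib.sha256(block).hexdigest()
        if fp not in hash_index:      # duplicate blocks are stored only once
            hash_index[fp] = block
        recipe.append(fp)
    return recipe

def restore(recipe) -> bytes:
    """Rebuild the original file from its fingerprint recipe."""
    return b"".join(hash_index[fp] for fp in recipe)

payload = b"hello world" * 2000       # highly redundant sample data
assert restore(store(payload)) == payload
```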
3.
Cloud computing and its related technologies have developed at a remarkable pace. However, centralized cloud storage faces challenges such as latency, storage overhead, and packet loss in the network. Cloud storage attracts attention for its vast capacity, but it must also ensure the security of confidential information. Although most developments in cloud storage have been positive in terms of cost model and effectiveness, data leakage remains a billion-dollar concern for consumers. Traditional data-security techniques are usually based on cryptographic methods, yet these approaches may not withstand an attack launched from inside the cloud server. We therefore propose a security model called multi-layer storage (MLS) based on elliptic curve cryptography (ECC). The model emphasizes data protection and duplicate removal at the initial stage. Following a divide-and-combine methodology, the data are divided into three parts: the first two portions are stored on the local system and on fog nodes and secured with an encoding and decoding technique, while the remaining encrypted portion is saved in the cloud. The viability of the model has been evaluated in terms of security and test performance, and it is a powerful complement to existing cloud storage methods.
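A toy sketch of the divide-and-combine placement described above: the data are split into three parts routed to local, fog, and cloud stores. A simple XOR keystream stands in for the paper's ECC-based encryption, so this illustrates only the data flow, not the cryptography.

```python
# Split a byte string into three parts; the cloud-bound part is "encrypted"
# with a placeholder XOR keystream (stand-in for ECC-based encryption).
import os

def xor_bytes(data: bytes, key: bytes) -> bytes:
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def split_and_place(data: bytes, key: bytes):
    third = len(data) // 3
    local, fog, rest = data[:third], data[third:2 * third], data[2 * third:]
    cloud = xor_bytes(rest, key)          # placeholder for ECC-based encryption
    return {"local": local, "fog": fog, "cloud": cloud}

def combine(placed, key: bytes) -> bytes:
    return placed["local"] + placed["fog"] + xor_bytes(placed["cloud"], key)

key = os.urandom(16)
data = b"example user file contents"
assert combine(split_and_place(data, key), key) == data
```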
4.
Chunking is a process that splits a file into smaller pieces called chunks. In applications such as remote data compression, data synchronization, and data deduplication, chunking is important because it determines the duplicate-detection performance of the system. Content-defined chunking (CDC) is a method that splits files into variable-length chunks, where the cut points are defined by internal features of the files. Unlike fixed-length chunks, variable-length chunks are more resistant to byte shifting, which increases the probability of finding duplicate chunks within a file and between files. However, CDC algorithms require additional computation to find the cut points, which may be computationally expensive for some applications. In our previous work (Widodo et al., 2016), the hash-based CDC algorithm used in the system took more processing time than the other processes in the deduplication system. This paper proposes a high-throughput hash-less chunking method called Rapid Asymmetric Maximum (RAM). Instead of using hashes, RAM uses byte values to declare the cut points. The algorithm uses a fixed-size window and a variable-size window to find a maximum-valued byte, which becomes the cut point. The maximum-valued byte is included in the chunk and located at the chunk boundary. This configuration allows RAM to perform fewer comparisons while retaining the CDC property. We compared RAM with existing hash-based and hash-less deduplication systems. The experimental results show that the proposed algorithm achieves higher throughput and more bytes saved per second than other chunking algorithms.
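A sketch of maximum-based hash-less chunking as I read the description above: scan a fixed-size window to record the maximum byte value, then extend a variable-size window until a byte at least as large appears; that byte becomes the cut point and stays at the end of the chunk. The window and chunk-size parameters are assumptions, not the authors' reference implementation.

```python
# Hash-less chunking: the cut point is the first byte in the variable window
# that is >= the maximum byte seen in the fixed window.
import os

def chunk_ram(data: bytes, fixed_window: int = 4096, max_chunk: int = 65536):
    chunks, start, n = [], 0, len(data)
    while start < n:
        end = min(start + fixed_window, n)
        local_max = max(data[start:end])          # max byte in the fixed window
        cut = min(start + max_chunk, n)           # fall back to a hard limit
        for i in range(end, min(start + max_chunk, n)):
            if data[i] >= local_max:              # first byte >= max is the boundary
                cut = i + 1                       # boundary byte stays in the chunk
                break
        chunks.append(data[start:cut])
        start = cut
    return chunks

blob = os.urandom(1 << 20)
parts = chunk_ram(blob)
assert b"".join(parts) == blob
print(len(parts), "chunks, average", len(blob) // len(parts), "bytes")
```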
5.
An important property of today's big data processing is that the same computation is often repeated on datasets that evolve over time, such as web and social network data. While repeating the full computation over the entire dataset is feasible with distributed computing frameworks such as Hadoop, it is obviously inefficient and wastes resources. In this paper, we present HadUP (Hadoop with Update Processing), a modified Hadoop architecture tailored to large-scale incremental processing with conventional MapReduce algorithms. Several approaches have been proposed to achieve a similar goal using task-level memoization. However, task-level memoization detects changes in datasets at a coarse-grained level, which often makes such approaches ineffective. Instead, HadUP detects and computes dataset changes at a fine-grained level using a deduplication-based snapshot differential algorithm (D-SD) and update propagation. As a result, it provides high performance, especially in environments where task-level memoization yields no benefit. HadUP requires only a small amount of extra programming effort because it can reuse the code of Hadoop's map and reduce functions. Development of HadUP applications is therefore quite easy.
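A simplified illustration of fine-grained change detection between two dataset snapshots, in the spirit of the deduplication-based snapshot differential mentioned above: records are fingerprinted, and only records whose fingerprints are new in the later snapshot are flagged for reprocessing. The MapReduce and update-propagation machinery of HadUP is not shown.

```python
# Fingerprint each record of two snapshots and return only the records that
# changed, so an incremental job can reprocess just the delta.
import hashlib

def fingerprints(records):
    return {hashlib.sha1(r.encode()).hexdigest() for r in records}

def delta(old_snapshot, new_snapshot):
    """Return the records of new_snapshot that did not exist in old_snapshot."""
    old = fingerprints(old_snapshot)
    return [r for r in new_snapshot
            if hashlib.sha1(r.encode()).hexdigest() not in old]

v1 = ["pageA link pageB", "pageB link pageC"]
v2 = ["pageA link pageB", "pageB link pageD", "pageE link pageA"]
print(delta(v1, v2))   # only the changed/added records are reprocessed
```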
6.
As deduplication operations accumulate, metadata such as the manifest files used to store the fingerprint index keeps growing, incurring a non-negligible storage overhead. Compressing the metadata produced during deduplication so as to shrink the duplicate-lookup index, without hurting the deduplication ratio, is therefore an important factor in further improving deduplication efficiency and storage utilization. Targeting the large amount of redundancy in the lookup metadata, this paper proposes Dedup2, a lookup-metadata reduction algorithm based on the condensed nearest neighbor method. The algorithm first partitions the lookup metadata into several classes with a clustering algorithm, then uses the condensed nearest neighbor algorithm to remove highly similar entries and obtain a reduced lookup subset, and finally performs deduplication of data objects on this subset using file similarity. Experimental results show that Dedup2 can compress the lookup index by more than 50% while maintaining an approximately equal deduplication ratio.
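A toy sketch of the pruning idea: after clustering, a metadata entry is dropped when its nearest already-kept neighbour is very close, leaving a condensed lookup subset. The feature vectors and distance threshold are invented for illustration; this is not the paper's exact condensed-nearest-neighbour procedure.

```python
# Condense lookup metadata: keep an entry only if it is sufficiently far from
# every entry already kept, so near-duplicate metadata is discarded.
import math

def condense(entries, threshold=0.1):
    """entries: list of (fingerprint_id, feature_vector). Keep a reduced subset."""
    kept = []
    for fid, vec in entries:
        if all(math.dist(vec, kvec) > threshold for _, kvec in kept):
            kept.append((fid, vec))      # only sufficiently novel entries survive
    return kept

metadata = [("fp1", (0.10, 0.20)), ("fp2", (0.11, 0.21)),   # near-duplicates
            ("fp3", (0.90, 0.40)), ("fp4", (0.91, 0.39))]
print(condense(metadata))   # roughly half of the entries remain
```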
7.
李建江  马占宁  张凯 《电子学报》2019,47(5):1094-1100
Over the past decades, the volume of information has grown exponentially, and storing and protecting this massive data has become a difficult problem. Cloud storage and deduplication have become the main technologies for addressing it. Deduplication is widely used in cloud storage systems, but mainstream systems suffer from index-information bloat and uncertainty in data chunking, which waste memory and make chunk sizes unpredictable. To address these problems, this paper proposes a hierarchical deduplication optimization strategy based on content-defined chunking and builds the corresponding algorithm, solving the problems of oversized index tables and chunks that are too large or too small in cloud storage systems. CNN news pages were selected as the test set for practical evaluation. Comparing deduplication ratio and deduplication time shows that, relative to current mainstream strategies, the proposed hierarchical content-chunking strategy improves the deduplication ratio by about 3% while reducing deduplication time by about 2%.
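A generic sketch of content-defined chunking with explicit minimum and maximum chunk sizes, illustrating how chunks that are too large or too small can be avoided; the rolling fingerprint and all parameters are assumptions, not the paper's hierarchical strategy.

```python
# Content-defined chunking with size bounds: a boundary is declared when a
# cheap rolling fingerprint hits a mask, but only between MIN and MAX sizes.
import os

MIN_CHUNK, MAX_CHUNK, MASK = 2048, 16384, 0x1FFF   # assumed parameters

def cdc_bounded(data: bytes):
    chunks, start = [], 0
    while start < len(data):
        end = min(start + MAX_CHUNK, len(data))
        cut, h = end, 0
        for i in range(start, end):
            h = ((h << 1) + data[i]) & 0xFFFFFFFF    # cheap rolling fingerprint
            if i - start >= MIN_CHUNK and (h & MASK) == 0:
                cut = i + 1                          # content-defined boundary
                break
        chunks.append(data[start:cut])
        start = cut
    return chunks

blob = os.urandom(200_000)
chunks = cdc_bounded(blob)
assert b"".join(chunks) == blob
print(max(len(c) for c in chunks) <= MAX_CHUNK)      # True: sizes stay bounded
```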
8.
To address the poor deduplication quality and weak feature-aggregation capability of current ciphertext deduplication and mining methods, a cross-user ciphertext deduplication and mining method under key sharing is proposed. Nonlinear statistical sequence analysis is used to sample the statistical features of cross-user ciphertext data under key sharing; linear coding of the ciphertext data is designed by identifying statistical features from different domains, and the average mutual-information feature of the cross-user ciphertext data is extracted. Matched filtering is then applied to perform deduplication of the cross-user ciphertext data. Simulation results show that the method achieves good deduplication quality and strong feature-aggregation capability.
9.
Data deduplication has been widely utilized in large-scale storage systems, particularly backup systems. Data deduplication systems typically divide data streams into chunks and identify redundant chunks by comparing chunk fingerprints. Maintaining all fingerprints in memory is not cost-effective because fingerprint indexes are typically very large. Many data deduplication systems therefore maintain a fingerprint cache in memory and exploit fingerprint prefetching to accelerate the deduplication process. Although fingerprint prefetching can improve the performance of data deduplication systems by leveraging the locality of workloads, inaccurately prefetched fingerprints may pollute the cache by evicting useful fingerprints. We observed that most of the prefetched fingerprints in a wide variety of applications are never used or used only once, which severely limits the performance of data deduplication systems. We introduce a prefetch-aware fingerprint cache management scheme for data deduplication systems (PreCache) to alleviate prefetch-related cache pollution. We propose three prefetch-aware fingerprint cache replacement policies (PreCache-UNU, PreCache-UOO, and PreCache-MIX) to handle different types of cache pollution. Additionally, we propose an adaptive policy selector to select suitable policies for prefetch requests. We implement PreCache on two representative data deduplication systems (Block Locality Caching and SiLo) and evaluate its performance using three real-world workloads (Kernel, MacOS, and Homes). The experimental results reveal that PreCache improves deduplication throughput by up to 32.22%, owing to a reduction in on-disk fingerprint index lookups and an improvement in the deduplication ratio achieved by mitigating prefetch-related fingerprint cache pollution.
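A toy sketch of the prefetch-aware idea: prefetched fingerprints enter a probationary segment and are promoted to the main LRU segment only on an actual hit, so unused prefetches cannot evict demonstrably useful entries. The concrete PreCache-UNU/UOO/MIX policies and the adaptive selector are not reproduced here.

```python
# Two-segment fingerprint cache: demand-hit entries live in an LRU "main"
# segment; prefetched entries sit in a "probation" segment until referenced.
from collections import OrderedDict

class PrefetchAwareCache:
    def __init__(self, main_size=1000, probation_size=200):
        self.main = OrderedDict()        # fingerprints with demonstrated reuse
        self.probation = OrderedDict()   # prefetched, not yet referenced
        self.main_size, self.probation_size = main_size, probation_size

    def prefetch(self, fp, value):
        self.probation[fp] = value
        if len(self.probation) > self.probation_size:
            self.probation.popitem(last=False)      # drop oldest unused prefetch

    def lookup(self, fp):
        if fp in self.main:
            self.main.move_to_end(fp)               # normal LRU promotion
            return self.main[fp]
        if fp in self.probation:
            value = self.probation.pop(fp)          # first real hit: promote
            self.main[fp] = value
            if len(self.main) > self.main_size:
                self.main.popitem(last=False)
            return value
        return None                                 # miss: fall back to on-disk index
```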
10.
Ideal homomorphic encryption is theoretically achievable but impractical in reality due to its tremendous computing overhead. Homomorphically encrypted databases, such as CryptDB, leverage replication with partially homomorphic encryption schemes to support different SQL queries directly over encrypted data. These databases strike a balance between security and efficiency, but incur considerable storage overhead, especially when making backups. Unfortunately, general data compression techniques that rely on data similarity are inefficient on encrypted data. We present CryptZip, a backup and recovery system that can greatly reduce the backup storage cost of encrypted databases. The key idea is to leverage the metadata of the encryption schemes and selectively back up one or several columns among semantically redundant columns. The experimental results show that CryptZip can reduce backup storage cost by up to 90.5% on the TPC-C benchmark.
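A toy sketch of metadata-driven selective backup: columns that, according to the encryption-scheme metadata, encode the same underlying plaintext column form a redundancy group, and only one representative per group is written to the backup. The metadata layout below is hypothetical, not CryptDB's or CryptZip's actual catalogue.

```python
# Group stored columns by the plaintext column they encode and back up only
# one representative column per group.
encryption_metadata = {
    # stored column -> (underlying plaintext column, encryption scheme)
    "salary_det":  ("salary", "DET"),
    "salary_ope":  ("salary", "OPE"),
    "salary_hom":  ("salary", "HOM"),
    "name_det":    ("name",   "DET"),
}

def columns_to_backup(metadata):
    """Pick one stored column per underlying plaintext column."""
    chosen = {}
    for column, (plain, scheme) in sorted(metadata.items()):
        chosen.setdefault(plain, column)     # first column in each group wins
    return sorted(chosen.values())

print(columns_to_backup(encryption_metadata))   # ['name_det', 'salary_det']
```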