Similar Documents
1.
Although much work has been done on duplicate document detection (DDD) and its applications, there has been no systematic study of the performance and scalability of large-scale DDD algorithms. It is still unclear how the various parameters of DDD correlate with one another, such as similarity threshold, precision/recall requirement, sampling ratio, and document size. This paper explores the correlations among several of the most important parameters in DDD, with particular interest in the impact of the sampling ratio, since it heavily affects the accuracy and scalability of DDD algorithms. An empirical analysis is conducted on a million HTML documents from the TREC .GOV collection. Experimental results show that, even with the same sampling ratio, the precision of DDD varies greatly across documents of different sizes. Based on this observation, we propose an adaptive sampling strategy for DDD that minimizes the sampling ratio under a given precision requirement. We believe the insights from this analysis will help guide future large-scale DDD work. A preliminary version of this paper appears in the Proceedings of the 10th Asia Pacific Conference on Knowledge Discovery and Data Mining, Singapore, April 2006. This work was conducted while Shaozhi Ye was visiting Microsoft Research Asia.
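The trade-off between sampling ratio and precision that this study examines can be illustrated with a minimal shingle-sampling sketch. This is not the authors' implementation; the shingle size, the hash-based sampling rule, and the `adaptive_ratio` policy below are illustrative assumptions.

```python
import hashlib

def shingles(text, k=5):
    """Overlapping k-word shingles of a document."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def sample(shingle_set, ratio):
    """Deterministically keep roughly `ratio` of the shingles via a hash rule,
    so a given shingle is kept or dropped consistently across documents."""
    cutoff = int(ratio * 2**32)
    return {s for s in shingle_set
            if int(hashlib.md5(s.encode()).hexdigest(), 16) % 2**32 < cutoff}

def resemblance(doc_a, doc_b, ratio=0.1):
    """Estimate document similarity as Jaccard overlap of the sampled shingles."""
    a, b = sample(shingles(doc_a), ratio), sample(shingles(doc_b), ratio)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def adaptive_ratio(n_words, target_samples=100):
    """Illustrative adaptive policy: choose the smallest ratio that still yields
    about `target_samples` sampled shingles, so short documents are sampled
    more densely than long ones and the precision of the estimate holds up."""
    return min(1.0, target_samples / max(n_words, 1))
```

With a fixed ratio, a short page contributes only a handful of sampled shingles and the Jaccard estimate becomes noisy, which is the effect an adaptive strategy is designed to counter.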

2.
With the explosive growth of data, storage systems face enormous pressure from the mass of redundant data created by duplicate copies or duplicate regions of files. Data deduplication is a storage-optimization technique that reduces the data footprint by eliminating multiple copies of redundant data and storing only unique data. The basis of data deduplication is duplicate data detection, which divides files into a number of parts, compares corresponding parts between files via hash techniques, and identifies the redundant data. This paper proposes an efficient sliding blocking algorithm with backtracking sub-blocks, called SBBS, for duplicate data detection. SBBS improves the duplicate data detection precision of the traditional sliding blocking (SB) algorithm by backtracking the left/right 1/4 and 1/2 sub-blocks in matching-failed segments. Experimental results show that SBBS improves duplicate detection precision by 6.5% on average compared with the traditional SB algorithm and by 16.5% compared with the content-defined chunking (CDC) algorithm, and it does not add much extra storage overhead when SBBS divides files into equal chunks of size 8 KB.
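The sliding-block idea can be sketched roughly as follows: stored files are indexed by the hash of each fixed-size block, and a block-sized window slides byte by byte over a new file, jumping ahead whenever a block hash already exists in the index. This is a simplified illustration, not SBBS itself; the backtracking of the left/right 1/4 and 1/2 sub-blocks is omitted, and a real implementation would use a cheap rolling hash before the strong hash.

```python
import hashlib

BLOCK = 8 * 1024  # 8 KB blocks, matching the chunk size used in the experiments

def block_index(data, block=BLOCK):
    """Index a stored file by the hash of each fixed-size block."""
    return {hashlib.sha1(data[i:i + block]).hexdigest(): i
            for i in range(0, len(data), block)}

def sliding_match(new_data, index, block=BLOCK):
    """Slide a block-sized window over the new file one byte at a time and
    report regions whose hash already exists in the index (duplicate data)."""
    matches = []
    i = 0
    while i + block <= len(new_data):
        h = hashlib.sha1(new_data[i:i + block]).hexdigest()
        if h in index:
            matches.append((i, i + block))
            i += block           # jump past the duplicate block
        else:
            i += 1               # slide by one byte and retry
    return matches
```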

3.
In recent years, the wide spread of tampered text images on the Internet has posed a serious threat to text image security, yet the corresponding tampered text detection (TTD) methods have not been fully explored. The TTD task aims to locate all text regions in an image and, based on the authenticity of their texture, to decide whether each text region has been tampered with. Unlike general text detection, TTD further requires fine-grained information for classifying real versus tampered text. The task faces two main challenges. On the one hand, because the textures of real and tampered text are highly similar, detection methods that learn texture features only in the spatial (RGB) domain cannot distinguish the two classes well. On the other hand, because real and tampered text differ in detection difficulty, the detection model cannot balance the learning process of the two classes, causing an imbalance in detection precision between them. Compared with spatial-domain features, the discontinuity of text texture in the frequency domain helps the network discriminate genuine from forged text instances. Based on this observation, a tampered text detection method built on joint spatial (RGB) and frequency-domain relationship modeling is proposed. Spatial- and frequency-domain feature extractors extract the respective features, and the frequency-domain information strengthens the network's ability to discriminate tampered textures; a global spatial-frequency relationship module models the texture-authenticity relationships among different text instances, using the spatial-frequency features of other text instances in the same image to help judge the current text instance...

4.
Detection of both scene text and graphic text in video images is gaining popularity in information retrieval, for efficient indexing and for understanding video. In this paper, we explore a new idea: classifying video images as low contrast or high contrast in order to detect accurate boundaries of the text lines they contain. In this work, high contrast refers to sharpness, while low contrast refers to dim intensity values in the video images. The method introduces heuristic rules, based on a combination of filters and edge analysis, for the classification. The heuristic rules are derived from the observation that the number of Sobel edge components exceeds the number of Canny edge components in high contrast video images, and vice versa for low contrast video images. To demonstrate the use of this classification for video text detection, we implement a method based on Sobel edges and texture features for detecting text in video images. Experiments are conducted on video images containing both graphic text and scene text with different fonts, sizes, languages, and backgrounds. The results show that the proposed method outperforms existing methods in terms of detection rate, false alarm rate, misdetection rate, and inaccurate boundary rate.
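The contrast classification heuristic — a frame is high contrast when its Sobel edge pixels outnumber its Canny edge pixels, and low contrast otherwise — can be sketched with OpenCV. The threshold values below are placeholders, not the paper's settings.

```python
import cv2
import numpy as np

def classify_contrast(frame_gray):
    """Label a grayscale video frame as high or low contrast by comparing
    the number of Sobel edge pixels with the number of Canny edge pixels."""
    gx = cv2.Sobel(frame_gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(frame_gray, cv2.CV_64F, 0, 1, ksize=3)
    sobel_edges = cv2.magnitude(gx, gy) > 100          # placeholder threshold
    canny_edges = cv2.Canny(frame_gray, 100, 200) > 0  # placeholder thresholds
    return "high_contrast" if sobel_edges.sum() > canny_edges.sum() else "low_contrast"
```

In practice `frame_gray` would come from a decoded video frame converted with `cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)`, and the downstream text detector would be chosen according to the returned label.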

5.
A knowledge-based approach for duplicate elimination in data cleaning
Existing duplicate elimination methods for data cleaning work by computing the degree of similarity between nearby records in a sorted database. High recall can be achieved by accepting records with low degrees of similarity as duplicates, at the cost of lower precision; high precision can be achieved analogously at the cost of lower recall. This is the recall–precision dilemma. We develop a generic knowledge-based framework for effective data cleaning that can implement any existing data cleaning strategy, and more. We propose a new method for computing transitive closure under uncertainty to handle the merging of groups of inexact duplicate records, and we explain why small changes to window sizes have little effect on the results of the sorted neighborhood method. An experimental study with two real-world datasets shows that this approach can accurately identify duplicates and anomalies with high recall and precision, thus effectively resolving the recall–precision dilemma.
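A minimal sketch of the sorted neighborhood method that this framework generalizes may be useful: records are sorted on a blocking key and only pairs inside a sliding window are compared. The blocking key and the string similarity measure below are simplified stand-ins for the paper's knowledge-based rules.

```python
from difflib import SequenceMatcher

def sorted_neighborhood(records, key, window=5, threshold=0.9):
    """Sort records on a blocking key, slide a fixed-size window over the
    sorted list, and flag pairs inside the window whose similarity exceeds
    the threshold as candidate duplicates."""
    ordered = sorted(records, key=key)
    pairs = []
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1:i + window]:
            sim = SequenceMatcher(None, " ".join(rec), " ".join(other)).ratio()
            if sim >= threshold:
                pairs.append((rec, other, sim))
    return pairs

# Toy usage: the blocking key is the first three letters of the surname.
people = [("smith", "john", "1980"), ("smyth", "john", "1980"), ("doe", "ann", "1975")]
dups = sorted_neighborhood(people, key=lambda r: r[0][:3], window=3, threshold=0.8)
```

Enlarging the window mostly adds comparisons between records that are already far apart in sort order, which is one intuition behind the observation that small window-size changes barely affect the result.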

6.
With the recent proliferation of social networks, mobile applications, and online services increasing the rate of data gathering, finding near-duplicate records efficiently has become a challenging issue. Related work on this problem mainly proposes efficient approaches on a single machine; when processing large-scale datasets, however, the performance of duplicate identification is still far from satisfactory. In this paper, we address duplicate detection with MapReduce. We argue that the performance of MapReduce-based duplicate detection depends mainly on the number of candidate record pairs and on the intermediate result size, which determines the shuffle cost among the nodes of the cluster. We propose a new signature scheme with new pruning strategies to minimize the number of candidate pairs and the intermediate result size. The proposed solution is exact: it guarantees that no duplicate record pair is lost. Experimental results over both real and synthetic datasets demonstrate that the proposed signature-based method is efficient and scalable. Copyright © 2012 John Wiley & Sons, Ltd.
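The signature idea can be simulated locally in a few lines: the map phase emits one (signature, record) pair per signature so that only records sharing a signature meet in the same reduce group, and the reduce phase verifies each group's candidate pairs exactly. The prefix-token signature used here is a common choice and only an assumption about the paper's scheme; the real pruning strategies are not reproduced, and a fixed prefix length of 2 makes this sketch approximate, whereas an exact prefix filter sizes the prefix from the similarity threshold.

```python
from collections import defaultdict
from itertools import combinations

def map_phase(record_id, text, prefix=2):
    """Emit (signature, record_id) pairs; the signature is each of the first
    `prefix` tokens of the alphabetically sorted token set (prefix filtering)."""
    tokens = sorted(set(text.lower().split()))
    return [(tok, record_id) for tok in tokens[:prefix]]

def reduce_phase(record_ids, records, threshold=0.8):
    """Within one signature group, verify candidate pairs with exact Jaccard."""
    out = []
    for a, b in combinations(sorted(set(record_ids)), 2):
        ta, tb = set(records[a].lower().split()), set(records[b].lower().split())
        if len(ta & tb) / len(ta | tb) >= threshold:
            out.append((a, b))
    return out

def mapreduce_dedup(records, threshold=0.8):
    """Local simulation of the map/shuffle/reduce stages."""
    groups = defaultdict(list)
    for rid, text in records.items():
        for sig, r in map_phase(rid, text):
            groups[sig].append(r)          # shuffle: group by signature
    pairs = set()
    for rids in groups.values():
        pairs.update(reduce_phase(rids, records, threshold))
    return pairs

recs = {"r1": "data deduplication with mapreduce",
        "r2": "data deduplication using mapreduce",
        "r3": "scene text detection"}
print(mapreduce_dedup(recs, threshold=0.6))   # {('r1', 'r2')}
```

Fewer emitted signatures mean fewer candidate pairs and a smaller shuffle, which is exactly the cost the paper's pruning strategies target.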

7.
Weibo stance detection determines whether the author of a microblog post supports, opposes, or is neutral toward a given topic. Within a supervised classification framework, this paper extends and proposes a Chinese microblog stance detection method based on the fusion of multiple text features. It first investigates word-frequency-based features (bag-of-words (BoW) features, synonym-dictionary-based bag-of-words features, and features that capture the co-occurrence of words with stance labels) as well as deep text features (word embeddings and character embeddings). Support vector machines, random forests, and gradient boosting decision trees are then used to classify stance on these features. Finally, all feature-specific classifiers are combined by late fusion. Experiments show that the proposed features improve microblog stance detection results across different topics, and that the deep text features and the word-frequency-based features capture different information from the text and are complementary for stance detection. A stance detection system based on this method achieved the best result in the Chinese microblog stance detection evaluation task of the 2016 Conference on Natural Language Processing and Chinese Computing (NLPCC 2016).
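A compact sketch of the supervised pipeline described above — word-frequency features fed to an SVM, a random forest, and a GBDT, whose predictions are then fused — using scikit-learn. The character n-gram features, the toy data, and majority-vote fusion are illustrative substitutions, not the authors' exact feature set or fusion scheme.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                              VotingClassifier)
from sklearn.pipeline import make_pipeline

# Toy stance-labelled microblog posts (FAVOR / AGAINST a topic).
texts = ["非常支持这项政策", "这项政策很有必要", "完全赞成这个做法",
         "坚决反对这项政策", "这个做法毫无道理", "强烈抗议这种规定"]
labels = ["FAVOR", "FAVOR", "FAVOR", "AGAINST", "AGAINST", "AGAINST"]

# Character n-grams stand in for the word/character features used in the paper.
bow = CountVectorizer(analyzer="char", ngram_range=(1, 2))

# Late fusion: each classifier votes and the majority label wins.
fusion = VotingClassifier(
    estimators=[("svm", SVC(kernel="linear")),
                ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
                ("gbdt", GradientBoostingClassifier(random_state=0))],
    voting="hard",
)

model = make_pipeline(bow, fusion)
model.fit(texts, labels)
print(model.predict(["我也觉得应该支持"]))
```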

8.
The WHO Collaborating Centre for International Drug Monitoring in Uppsala, Sweden, maintains and analyses the world's largest database of reports on suspected adverse drug reaction (ADR) incidents that occur after drugs are on the market. The presence of duplicate case reports is an important data quality problem, and their detection remains a formidable challenge, especially in the WHO drug safety database, where reports are anonymised before submission. In this paper, we propose a duplicate detection method based on the hit-miss model for statistical record linkage described by Copas and Hilton, which handles the limited amount of training data well and is well suited to the available data (categorical and numerical rather than free text). We propose two extensions of the standard hit-miss model: a hit-miss mixture model for errors in numerical record fields and a new method to handle correlated record fields, and we demonstrate their effectiveness both at identifying the most likely duplicate for a given case report (94.7% accuracy) and at discriminating true duplicates from random matches (63% recall with 71% precision). The proposed method allows for more efficient data cleaning in post-marketing drug safety data sets, and perhaps in other knowledge discovery applications as well.

9.
In research on detecting duplicate records in databases, the performance of the duplicate record detection algorithm based on a BP neural network (Duplicate Record Detection based on BP Neural Network, DRDBPNN) depends heavily on the initial parameter settings, which makes its performance unstable. This paper therefore proposes a duplicate record detection algorithm based on quantum-behaved particle swarm optimization and a BP neural network (Duplicate Record Detection based on Quantum Particle Swarm Optimization and BP Neural Network, DRDQPSQBPNN). Simulations show that the algorithm effectively improves the efficiency of duplicate record detection.

10.
The OSH collision detection algorithm proposed by Tescher is, owing to its effectiveness, used in many settings that need to provide spatial mapping pairs. However, the algorithm decides whether an intrusion has occurred only by solving for the weights of barycentric coordinates; it cannot measure penetration depth or compute contact normals, which limits its ability to produce a reasonable collision response. To address this problem, this paper proposes using the SDM method to solve for the intrusion parameters, adding penalty forces to provide the collision response for deformable bodies, and combining constraint forces to guarantee conservation of distance, area, and volume, thereby forming an effective OSH collision detection environment.

11.
An approximate duplicate elimination in RFID data streams
RFID technology has been applied in a wide range of areas since it does not require contact to detect RFID tags. However, because a tag is often read multiple times and multiple readers are deployed, RFID data contains many duplicates. Since RFID data is generated in a streaming fashion, it is difficult to remove duplicates in one pass with limited memory. We propose one-pass approximate methods based on Bloom filters that use a small amount of memory. We first devise Time Bloom Filters as a simple extension of Bloom filters. We then propose Time Interval Bloom Filters to reduce errors. Time Interval Bloom Filters need more space than Time Bloom Filters, so we propose a method to reduce the space they require. Since Time Bloom Filters and Time Interval Bloom Filters are based on Bloom filters, they produce no false negative errors. Experimental results show that our approaches can effectively remove duplicates from RFID data streams in one pass with a small amount of memory.
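A minimal Time Bloom Filter sketch follows: each cell stores the most recent timestamp seen instead of a single bit, and a reading counts as a duplicate only if all of its cells were set within the time window. Cell count, hash construction, and window handling here are illustrative; the paper's Time Interval Bloom Filter refinements are not reproduced.

```python
import hashlib

class TimeBloomFilter:
    """Each cell holds the most recent timestamp seen for tags hashing to it.
    A tag reading is reported as a duplicate only if every one of its k cells
    holds a timestamp within `window` of the current reading; like an ordinary
    Bloom filter this yields false positives but no false negatives."""

    def __init__(self, m=1 << 20, k=4, window=5.0):
        self.m, self.k, self.window = m, k, window
        self.cells = [float("-inf")] * m

    def _positions(self, tag_id):
        digest = hashlib.sha256(tag_id.encode()).digest()
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m
                for i in range(self.k)]

    def is_duplicate(self, tag_id, timestamp):
        pos = self._positions(tag_id)
        dup = all(timestamp - self.cells[p] <= self.window for p in pos)
        for p in pos:                      # record this reading
            self.cells[p] = max(self.cells[p], timestamp)
        return dup

tbf = TimeBloomFilter()
print(tbf.is_duplicate("EPC-0001", 10.0))   # False: first reading of the tag
print(tbf.is_duplicate("EPC-0001", 12.0))   # True: re-read within the window
```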

12.
The presence of large numbers of duplicate images in a database not only degrades learner performance but also wastes a great deal of storage space. For large-scale image deduplication, this paper proposes a duplicate image detection algorithm based on pHash blocking and local probing. First, pHash values are generated for all images; second, each pHash value is divided into several equal-length parts, and if two images agree on any one pHash part they are treated as potentially duplicate; finally, the transitivity of image duplication is discussed, and the algorithm is implemented separately for the transitive and non-transitive cases. Experimental results show that the algorithm is highly efficient on large image collections: with the similarity threshold set to 13, the transitive variant checks nearly 300,000 images for duplicates in only 2 minutes, with an accuracy of 53%.
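The blocking step can be sketched as follows: each 64-bit pHash is split into equal-length segments, images that agree exactly on any one segment become candidates, and candidates are confirmed by full Hamming distance with the threshold of 13 mentioned above. The ImageHash library's `phash` is a stand-in for the paper's own pHash computation, and blocking on exact segment equality is an approximate filter: a pair whose 13 differing bits are spread over every segment can be missed.

```python
from collections import defaultdict
from itertools import combinations

from PIL import Image
import imagehash   # pip install ImageHash

def phash_bits(path):
    """64-bit perceptual hash of an image as a bit string."""
    return format(int(str(imagehash.phash(Image.open(path))), 16), "064b")

def find_duplicates(paths, parts=4, threshold=13):
    """Block on pHash segments, then confirm candidates by Hamming distance."""
    hashes = {p: phash_bits(p) for p in paths}
    seg_len = 64 // parts
    buckets = defaultdict(set)
    for p, h in hashes.items():
        for i in range(parts):                       # one bucket per segment value
            buckets[(i, h[i * seg_len:(i + 1) * seg_len])].add(p)
    dups = set()
    for bucket in buckets.values():
        for a, b in combinations(sorted(bucket), 2):
            hamming = sum(x != y for x, y in zip(hashes[a], hashes[b]))
            if hamming <= threshold:                 # threshold 13, as in the paper
                dups.add((a, b))
    return dups
```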

13.
In automated assembly or production lines, some stations are duplicated due to their long cycle times. Material handling considerations may require these stations to be arranged in series rather than in parallel. Each job needs to be processed on any one of the duplicate stations. This study deals with scheduling of n available jobs on two serial duplicate stations in an automated production line. The performance measures considered are mean flowtime, makespan, and station idle time. After the problem is formulated, two algorithms are developed to determine the optimal schedules with respect to the performance measures.

14.
International Journal on Document Analysis and Recognition (IJDAR) - Text line segmentation is one of the key steps in historical document understanding. It is challenging due to the variety of...

15.
This paper studies the use of text signatures in string searching. Text signatures are a coded representation of a unit of text formed by hashing substrings into bit positions which are, in turn, set to one. Then instead of searching an entire line of text exhaustively, the text signature may be examined first to determine if complete processing is warranted. A hashing function which minimizes the number of collisions in a signature is described. Experimental results for two signature lengths with both a text file and a program file are given. Analyses of the results and the utility and application of the method conclude the discussion.
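The mechanism can be sketched in a few lines: hash each word of a text line into a bit position of a short signature, and scan a line exhaustively only if every bit of the query's signature is present in the line's signature. The 64-bit signature length and word-level hashing are illustrative assumptions; the paper evaluates specific signature lengths and a collision-minimizing hash function.

```python
import hashlib

SIG_BITS = 64   # illustrative signature length

def signature(text, bits=SIG_BITS):
    """Hash every word of the line into a bit position and set it to one."""
    sig = 0
    for word in text.split():
        h = int(hashlib.md5(word.lower().encode()).hexdigest(), 16) % bits
        sig |= 1 << h
    return sig

def may_contain(line_sig, pattern, bits=SIG_BITS):
    """Cheap pre-test: if any pattern bit is missing from the line's signature,
    the line cannot contain the pattern, so the exhaustive scan is skipped."""
    return signature(pattern, bits) & ~line_sig == 0

lines = ["the quick brown fox", "jumps over the lazy dog"]
sigs = [signature(l) for l in lines]
query = "brown fox"
hits = [l for l, s in zip(lines, sigs) if may_contain(s, query) and query in l]
```

Bit collisions can trigger a needless exhaustive scan (a false positive), but a whole-word pattern is never missed, mirroring the filter-then-verify behaviour the paper analyses.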

16.
In this paper, we present a new text line detection method for handwritten documents. The proposed technique is based on a strategy that consists of three distinct steps. The first step includes image binarization and enhancement, connected component extraction, partitioning of the connected component domain into three spatial sub-domains and average character height estimation. In the second step, a block-based Hough transform is used for the detection of potential text lines while a third step is used to correct possible splitting, to detect text lines that the previous step did not reveal and, finally, to separate vertically connected characters and assign them to text lines. The performance evaluation of the proposed approach is based on a consistent and concrete evaluation methodology.
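The Hough step can be illustrated by letting connected-component centroids vote in a (ρ, θ) accumulator restricted to near-horizontal angles and reading off peaks as candidate text lines. This sketch assumes OpenCV connected components and a coarse accumulator; it is not the paper's exact block-based formulation, and the vote threshold is a placeholder.

```python
import numpy as np
import cv2

def detect_text_lines(binary_img, rho_step=5, angles_deg=range(85, 96)):
    """binary_img: uint8 image with text pixels set to 255.  Component
    centroids vote in a (rho, theta) Hough accumulator limited to
    near-horizontal orientations; peak cells suggest text lines."""
    n, _, _, centroids = cv2.connectedComponentsWithStats(binary_img)
    pts = centroids[1:]                      # skip the background component
    thetas = np.deg2rad(list(angles_deg))
    diag = int(np.hypot(*binary_img.shape))
    acc = np.zeros((2 * diag // rho_step + 1, len(thetas)), dtype=int)
    votes = {}
    for (x, y) in pts:
        for j, t in enumerate(thetas):
            rho = x * np.cos(t) + y * np.sin(t)
            i = int((rho + diag) // rho_step)
            acc[i, j] += 1
            votes.setdefault((i, j), []).append((x, y))
    # A text line candidate is a cell with enough component votes (placeholder).
    return [votes[c] for c in zip(*np.where(acc >= 3))]
```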

17.
Database-backed copy recognition of image-and-text documents is an important topic in office automation. Using cluster analysis from multivariate statistics, this paper proposes a method for recognizing copies among image-and-text documents that arrive in batches. The method first converts each single-page document, already read into the computer, into a monochrome bitmap, defines a number of mutually disjoint concentric disks (with the disk centre computed from the page margins), and computes the pixel density of each annulus (the number of "on" pixels within the ring) as the feature vector of the page. A distance is defined between page feature vectors, and cluster analysis is then applied to identify document copies. In repeated MATLAB simulations on batches of image-and-text documents downloaded from the Web, the correct classification rate for single-page documents reached 85%–98%.
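A small sketch of the ring-density feature described here: count the "on" pixels of a binarised page in concentric annuli and cluster the resulting vectors. The geometric page centre, the ring count, and the average-linkage cut-off below are assumptions (the original places the disk centre using the page margins), and Python stands in for the MATLAB implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def ring_density_features(page, n_rings=10):
    """page: 2-D 0/1 array of a binarised page.  Returns the fraction of
    'on' pixels falling in each concentric annulus around the page centre."""
    h, w = page.shape
    yy, xx = np.mgrid[:h, :w]
    r = np.hypot(yy - h / 2, xx - w / 2)
    edges = np.linspace(0, r.max() + 1e-9, n_rings + 1)
    feats = np.array([page[(r >= edges[i]) & (r < edges[i + 1])].sum()
                      for i in range(n_rings)], dtype=float)
    return feats / max(page.sum(), 1)

def group_copies(pages, n_rings=10, cut=0.05):
    """Cluster pages by Euclidean distance between their ring-density vectors;
    pages falling in the same cluster are treated as copies of one another."""
    X = np.vstack([ring_density_features(p, n_rings) for p in pages])
    return fcluster(linkage(X, method="average"), t=cut, criterion="distance")
```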

18.
In this paper, we present a new approach for junction detection and characterization in line-drawing images. We formulate this problem as the search for optimal meeting points of median lines. In this context, the main contribution of the proposed approach is three-fold. First, a new algorithm for determining the support region is presented, using the linear least squares technique, making it robust to digitization effects. Second, an efficient algorithm is proposed to detect and conceptually remove all distorted zones, retaining only reliable line segments. These line segments are then locally characterized to form a local structure representation of each crossing zone. Finally, a novel optimization algorithm is presented to reconstruct the junctions, from which junction characterization is simply derived. The proposed approach is highly robust to common geometric transformations and withstands a substantial level of noise and degradation. Furthermore, it is very efficient in terms of time complexity and requires no prior knowledge of the document content. Extensive evaluations against baseline methods have been performed to validate the proposed approach. An application to symbol spotting is also provided, demonstrating quite good results.

19.
Current international research on change detection algorithms focuses mainly on optimizing efficiency or space, and the precision of change detection remains unsatisfactory; for example, changed text cannot be located accurately. By combining the tree structure of XML documents with the similarity between texts, this paper proposes a novel text-content-oriented change detection algorithm, DML-Diff, which highlights changes in text content and makes the change detection results more precise.

20.
Recently, segmentation-based scene text detection has drawn wide research interest due to its flexibility in describing scene text instances of arbitrary shapes, such as curved text. However, existing methods usually need complex post-processing stages to handle ambiguous labels, i.e., the labels of pixels near the text boundary, which may belong to either text or background. In this paper, we present a framework for segmentation-based scene text detection by learning from ambiguous labels. We use the label distribution learning method to process the label ambiguity of text annotations, which achieves good performance without an additional post-processing stage. Experiments on benchmark datasets demonstrate that our method produces better results than state-of-the-art methods for segmentation-based scene text detection.
