Similar Literature
19 similar documents found.
1.
A Survey of Copy Detection Research for Natural Language Documents   Total citations: 37 (self-citations: 1, citations by others: 37)
鲍军鹏  沈钧毅  刘晓东  宋擒豹 《软件学报》2003,14(10):1753-1760
Copy detection has important applications in intellectual property protection and information retrieval. So far, copy detection research has concentrated on document copy detection, which in its early days mainly targeted program copies and now focuses on text copies. This survey reviews the development of program copy detection and text copy detection, analyzes in detail the detection methods and technical characteristics of the known text copy detection systems, compares the similarities and differences of their key techniques, and finally points out directions for the future development of text copy detection.

2.
Text copy detection determines whether the content of a document has been plagiarized from or copied out of one or more other documents. Among the many algorithms in this field, detection based on sentence similarity combines the strengths of string-comparison methods and word-frequency statistics: it captures the global features of a document while also taking its structural information into account. Building on that algorithm, this paper improves the similarity measure and proposes a new sentence-similarity-based copy detection algorithm for Chinese documents. The algorithm fully considers the characteristics of Chinese text, chooses sentences as the feature units of a document, removes the need to set thresholds manually, and improves detection precision. Experiments show that the algorithm is feasible in terms of both efficiency and accuracy.
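As a rough illustration only (not the paper's exact similarity measure), sentence-level copy detection can be sketched as follows: each document is split into sentences, sentence pairs are scored with character-bigram Jaccard similarity, and the document-level score is the fraction of sentences with a sufficiently close match. The sentence splitter and the 0.6 threshold are arbitrary choices for the sketch.

```python
import re

def sentences(text):
    """Split Chinese or English text into sentences on common end punctuation."""
    return [s.strip() for s in re.split(r"[。！？!?.]+", text) if s.strip()]

def char_bigrams(s):
    return {s[i:i + 2] for i in range(len(s) - 1)} or {s}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def copy_score(doc_a, doc_b, sent_threshold=0.6):
    """Fraction of sentences in doc_a whose best match in doc_b reaches the threshold."""
    grams_b = [char_bigrams(s) for s in sentences(doc_b)]
    sents_a = sentences(doc_a)
    matched = sum(
        1 for s in sents_a
        if any(jaccard(char_bigrams(s), g) >= sent_threshold for g in grams_b)
    )
    return matched / len(sents_a) if sents_a else 0.0

if __name__ == "__main__":
    a = "复制检测判断文档是否抄袭。实验证明该算法可行。"
    b = "复制检测用于判断一个文档是否抄袭了另一个文档。"
    print(copy_score(a, b))
```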

3.
Copy detection determines whether documents share duplicated passages and reports the result to the user. The algorithm in this paper combines two copy-detection techniques, fingerprint comparison and word-frequency statistics. The text is first preprocessed, for example by filtering out prepositions and articles, and fingerprint comparison is used to judge the similarity between paragraphs; then each paragraph is treated as a small unit making up the whole document, and a frequency-weighted statistic is used to judge the similarity of the full texts.
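A compact sketch of how the two techniques might be combined, assuming the paragraph fingerprints are plain hashes of normalized paragraphs and the document-level measure is a term-frequency cosine (the paper's exact weighting scheme is not reproduced here; the stop-word list is a placeholder):

```python
import hashlib
import math
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "on"}  # placeholder stop list

def paragraph_fingerprints(text):
    """Hash each normalized paragraph into a fingerprint."""
    paras = [p.strip().lower() for p in text.split("\n\n") if p.strip()]
    return {hashlib.md5(p.encode("utf-8")).hexdigest() for p in paras}

def tf_vector(text):
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    return Counter(words)

def cosine(u, v):
    num = sum(u[t] * v[t] for t in set(u) & set(v))
    den = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return num / den if den else 0.0

def detect(doc_a, doc_b):
    """Paragraph-level overlap via fingerprints plus a document-level TF cosine."""
    fa, fb = paragraph_fingerprints(doc_a), paragraph_fingerprints(doc_b)
    para_overlap = len(fa & fb) / max(len(fa), 1)
    return para_overlap, cosine(tf_vector(doc_a), tf_vector(doc_b))

if __name__ == "__main__":
    print(detect("First paragraph here.\n\nSecond one.", "Second one.\n\nOther text."))
```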

4.
Document copy detection can automatically discover the overlapping information between digital documents, and is a powerful means of protecting intellectual property and improving the efficiency of information retrieval. To tackle the difficult problem of copy detection for Chinese academic papers, a copy detection algorithm for Chinese academic papers based on the similarity of their discourse structure is presented together with a mathematical model of the problem. Building on an analysis of the papers' structure, techniques such as digital fingerprinting and word-frequency statistics are used, implemented in a program, and applied to the preliminary detection of copied Chinese papers.

5.
Detecting near-duplicate texts in massive text data has wide real-world applications, for example detecting similar web pages. This paper proposes a MapReduce-based similar-text matching algorithm: given a text collection and a similarity threshold, the algorithm efficiently computes all text pairs in the collection whose similarity is not below the threshold. Experiments on real data sets show that, compared with existing work, the proposed algorithm returns similar text pairs quickly.
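A toy, single-machine imitation of such a threshold similarity join (not the paper's MapReduce algorithm): the "map" step emits (token, doc_id) pairs, grouping by token plays the role of the shuffle, candidate pairs are formed from documents that share a token, and candidates are then verified against a Jaccard threshold.

```python
from collections import defaultdict
from itertools import combinations

def similarity_join(docs, threshold=0.8):
    """docs: dict of doc_id -> set of tokens. Returns pairs with Jaccard >= threshold."""
    # "Map": emit (token, doc_id); "shuffle": group document ids by token.
    postings = defaultdict(set)
    for doc_id, tokens in docs.items():
        for tok in tokens:
            postings[tok].add(doc_id)

    # "Reduce": any two documents sharing a token form a candidate pair.
    candidates = set()
    for ids in postings.values():
        candidates.update(combinations(sorted(ids), 2))

    # Verification: keep only pairs whose Jaccard similarity reaches the threshold.
    results = []
    for a, b in candidates:
        sim = len(docs[a] & docs[b]) / len(docs[a] | docs[b])
        if sim >= threshold:
            results.append((a, b, sim))
    return results

if __name__ == "__main__":
    docs = {"d1": {"near", "duplicate", "text"}, "d2": {"near", "duplicate", "page"}}
    print(similarity_join(docs, threshold=0.4))
```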

6.
贾培武 《微电脑世界》2005,(11):159-159
Text copied from the web usually carries borders and many other formatting elements when pasted into a Word document, which is inconvenient for editing. Many people therefore use Notepad as an intermediate step before pasting web content into Word: they first paste the text into Notepad, select and copy it there, and finally paste it into the Word document.

7.
A Gradient-Difference-Based Text Line Detection Algorithm for Document Images   Total citations: 1 (self-citations: 0, citations by others: 1)
王丹  王希常  杨侠 《微型机与应用》2011,30(18):32-34,37
Based on an analysis of the characteristics of text lines, this paper proposes a text line detection algorithm for document images that uses horizontal gradient differences. The algorithm first computes horizontal gradient differences over the input document image, then finds the maximum gradient difference within a local window and merges text line regions; filtering of non-text regions removes the step-like jumps at character edges, and finally the document image is presented as line blocks. Experimental results show that, compared with projection-based methods, the algorithm works better on document images with small line spacing, has lower time complexity and higher detection accuracy, and shows a degree of robustness and good adaptability.
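A minimal numpy sketch of the general idea, i.e. horizontal gradient differences followed by a local-window maximum and a per-row vote; the window size and thresholds below are arbitrary placeholders rather than values from the paper.

```python
import numpy as np

def text_line_rows(gray, win=15, thresh=30, row_ratio=0.2):
    """gray: 2-D uint8 image. Returns a boolean array marking likely text-line rows."""
    g = gray.astype(np.float32)
    # Horizontal gradient difference between neighbouring columns.
    grad = np.abs(np.diff(g, axis=1))
    grad = np.pad(grad, ((0, 0), (0, 1)))
    # Maximum gradient difference inside a sliding horizontal window.
    h, w = grad.shape
    local_max = np.zeros_like(grad)
    for x in range(w):
        lo, hi = max(0, x - win), min(w, x + win + 1)
        local_max[:, x] = grad[:, lo:hi].max(axis=1)
    # A row is a candidate text-line row if enough of its columns respond strongly.
    row_score = (local_max > thresh).sum(axis=1)
    return row_score > row_ratio * w

if __name__ == "__main__":
    img = (np.random.rand(64, 128) * 255).astype(np.uint8)
    print(int(text_line_rows(img).sum()), "candidate text-line rows")
```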

8.
Warped document images frequently appear among pictures taken with smartphones; the distortion hampers text recognition and subsequent image processing, and existing rectification methods for warped document images handle only a single type of distortion or give unsatisfactory results. To address these problems, a rectification method for warped document images based on minimizing reprojection error is proposed. The method first detects text-region contours and merges them to obtain connected components of text lines, then uses principal component analysis (PCA) to ...

9.
Dewarping is a fundamental step in document OCR (Optical Character Recognition) and plays an important role in improving OCR accuracy. Dewarping of document images usually relies on text extraction, yet most existing rectification algorithms cannot accurately locate and analyze the text in complex documents, so their results are unsatisfactory. To address this, a text detection framework based on fully convolutional networks is proposed and trained specifically on synthetic documents, so that text information can be obtained accurately at three levels: characters, words, and text lines. The text is then sampled adaptively and the page is modeled in three dimensions with cubic functions, turning rectification into a model-parameter optimization problem and thereby enabling the rectification of complex document images. Rectification experiments on synthetic warped documents and real test data show that the proposed method extracts text from complex documents precisely and clearly improves the visual quality of rectified complex document images; compared with other algorithms, OCR accuracy after rectification is significantly higher.

10.
To address similarity computation and plagiarism detection for documents written in Uyghur, a content-based Uyghur plagiarism detection (U-PD) method is proposed. First, a preprocessing stage segments the Uyghur text into words, removes stop words, extracts stems (stemming is based on an N-gram statistical model), and substitutes synonyms. Next, the BKDRhash algorithm computes a hash value for every text block and builds the hash fingerprint of the whole document. Finally, based on the hash fingerprints, the document is matched against the document collection at the document, paragraph, and sentence levels with the RKR-GST matching algorithm to obtain document similarity and thereby detect plagiarism. An experimental evaluation on Uyghur documents shows that the proposed method detects plagiarized documents accurately and is feasible and effective.
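For reference, the BKDR hash mentioned above is a simple multiplicative string hash; a common Python rendering (seed 131 is the customary choice, though the paper's exact parameters are not stated here) and its use for per-block fingerprints might look like:

```python
def bkdr_hash(text, seed=131):
    """BKDR string hash: h = h * seed + code(ch), truncated to 31 bits."""
    h = 0
    for ch in text:
        h = (h * seed + ord(ch)) & 0xFFFFFFFF
    return h & 0x7FFFFFFF

def document_fingerprint(blocks):
    """Hash every text block to build the fingerprint set of a document."""
    return {bkdr_hash(block) for block in blocks}

if __name__ == "__main__":
    print(bkdr_hash("copy detection"))
    print(document_fingerprint(["first text block", "second text block"]))
```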

11.
Documents are co-derivative if they share content: for two documents to be co-derived, some portion of one must be derived from the other, or some portion of both must be derived from a third document. An existing technique for concurrently detecting all co-derivatives in a collection is document fingerprinting, which matches documents based on the hash values of selected document subsequences, or chunks. Fingerprinting is hampered by an inability to accurately isolate information that is useful in identifying co-derivatives. In this paper we present spex, a novel hash-based algorithm for extracting duplicated chunks from a document collection. We discuss how information about shared chunks can be used for efficiently and reliably identifying co-derivative clusters, and describe deco, a prototype package that combines the spex algorithm with other optimisations and compressed indexing to produce a flexible and scalable co-derivative discovery system. Our experiments with multi-gigabyte document collections demonstrate the effectiveness of the approach.
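spex itself extracts duplicated chunks through an iterative, memory-efficient hashing process; as a naive stand-in for the underlying idea (keep only the chunks, i.e. word shingles, that occur in more than one document), a sketch could be:

```python
from collections import defaultdict

def chunks(text, size=8):
    """Overlapping word n-gram chunks (shingles) of a document."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(max(len(words) - size + 1, 1))]

def duplicated_chunks(collection, size=8):
    """Map chunk hash -> ids of the documents containing it, keeping only chunks
    shared by at least two documents (the information useful for co-derivatives)."""
    seen = defaultdict(set)
    for doc_id, text in collection.items():
        for c in chunks(text, size):
            seen[hash(c)].add(doc_id)
    return {h: ids for h, ids in seen.items() if len(ids) > 1}

if __name__ == "__main__":
    docs = {
        "a": "the quick brown fox jumps over the lazy dog near the river bank today",
        "b": "yesterday the quick brown fox jumps over the lazy dog near the old barn",
    }
    print(len(duplicated_chunks(docs, size=6)), "shared chunks")
```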

12.
Video Copy Detection Based on Local Ordinal Ranking   Total citations: 2 (self-citations: 0, citations by others: 2)
Ordinal measures are a commonly used approach to video copy detection. To obtain better detection performance, a video copy detection scheme based on ordinal features is proposed. The scheme partitions each frame into blocks and, following the order of a Hilbert curve, computes ordinal features from the grey-level relations between adjacent blocks along the curve to generate the hash codes used for detection. To locate suspect content in the target video accurately, a hash matching scheme is proposed that takes sequence similarity as the matching criterion and introduces dynamic programming to improve matching precision. Finally, copy test samples are constructed and the scheme is compared experimentally with the traditional ordinal-signature detection scheme. The results show that the proposed scheme offers good detection performance and is suitable for copy detection of video content.
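A stripped-down sketch of a per-frame ordinal hash: partition the frame into a grid of blocks, take the mean grey level of each block, and encode the greater/less relations between consecutive blocks as bits. For brevity the blocks are visited in row-major order here rather than along a Hilbert curve, and the 4x4 grid is an arbitrary choice.

```python
import numpy as np

def block_means(frame, grid=4):
    """Average grey level of each cell in a grid x grid partition of the frame."""
    h, w = frame.shape
    return np.array([
        frame[i * h // grid:(i + 1) * h // grid,
              j * w // grid:(j + 1) * w // grid].mean()
        for i in range(grid) for j in range(grid)
    ])

def ordinal_hash(frame, grid=4):
    """Bit string encoding the grey-level relations of consecutive blocks."""
    means = block_means(frame, grid)
    bits = (means[1:] > means[:-1]).astype(int)
    return "".join(map(str, bits))

if __name__ == "__main__":
    frame = (np.random.rand(240, 320) * 255).astype(np.uint8)
    print(ordinal_hash(frame))
```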

13.
Image hashing methods currently used for copy detection have limited detection accuracy because of hand-crafted features and linear mappings. To solve this problem, an end-to-end deep-hashing copy detection algorithm, DHCD, is proposed. A multi-scale Siamese convolutional neural network is built, and spatial pyramid pooling is used to obtain salient features of image pairs; under a newly designed hash loss function, the features both preserve semantic correlation and produce outputs close to the target hash codes; and hard samples are mined and retrained, improving the model's recognition performance. Experiments on a copy-detection data set show that, compared with current mainstream image hashing algorithms, accuracy improves by about 10% without any loss of efficiency.

14.
A video copy detection scheme based on robust hashing is proposed. Feature points are classified, and the stable points that persist in the spatio-temporal domain are selected; local features are constructed by differential computation over neighbouring points. The multi-dimensional feature data are Hilbert-encoded, and the significant bits are selected as the detection hash code. To locate suspect content in the target video accurately, a hash matching scheme is proposed that takes sequence similarity as the matching criterion, improving matching precision. Experimental results show that the scheme has good detection performance and is suitable for copy detection of video content.

15.
Qi  Haifeng  Li  Jing  Wu  Qiang  Wan  Wenbo  Sun  Jiande 《Multimedia Tools and Applications》2020,79(7-8):4763-4782

Hash representation has attracted increasing attention in recent years, but hash length remains a neglected element in the evaluation of hashing. Hash length is the dimension of the hash representation and is important for the performance of a video hash. In this paper, we define the optimal hash length according to the probability of collision (PoC) of the hash. Based on this definition, we demonstrate that the optimal hash length can be predicted from a small portion of a dataset, which can serve as a reference for practical applications. Verification experiments are performed with several classical hashing methods on the task of video copy detection over different datasets. The experimental results show that each hash method has its own optimal hash length, and that performance can be improved as the length increases.
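As an illustration of the idea (not the paper's estimator), the probability of collision of a hashing scheme can be estimated empirically on a small sample for several candidate lengths, and the shortest length whose estimated PoC falls below a target can be taken as the working hash length. The hash function, sample, and target value below are placeholders.

```python
import itertools
import random

def estimate_poc(items, hash_fn, length):
    """Empirical probability that two distinct items collide on a `length`-bit hash."""
    mask = (1 << length) - 1
    pairs = list(itertools.combinations(items, 2))
    collisions = sum(1 for a, b in pairs if hash_fn(a) & mask == hash_fn(b) & mask)
    return collisions / len(pairs) if pairs else 0.0

def pick_hash_length(items, hash_fn, lengths, target_poc=1e-3):
    """Shortest candidate length whose estimated PoC does not exceed the target."""
    for length in sorted(lengths):
        if estimate_poc(items, hash_fn, length) <= target_poc:
            return length
    return max(lengths)

if __name__ == "__main__":
    sample = [str(random.random()) for _ in range(200)]
    print(pick_hash_length(sample, lambda s: hash(s) & 0xFFFFFFFF, [8, 16, 24, 32]))
```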


16.
郑丽君  李新伟  卜旭辉 《计算机应用》2017,37(12):3447-3451
To address the slow feature extraction and low matching efficiency of traditional image copy detection algorithms based on scale-invariant feature transform (SIFT) features, a fast image copy detection algorithm based on the position and orientation distributions of SIFT keypoints is proposed. First, the two-dimensional positions of the SIFT keypoints are extracted; the distance and angle between each keypoint and the image centre are computed, the number of keypoints in each interval is counted block by block, and a binary hash sequence quantized from these counts forms the first-level robust feature. Then, from the one-dimensional orientation distribution of the keypoints, the number of keypoints in each orientation sub-interval is counted to form the second-level image feature. Finally, a cascaded filtering framework decides during copy detection whether an image is a copy. Simulation results show that, compared with the traditional SIFT copy detection algorithm that builds hash sequences from the 128-dimensional descriptors, the proposed algorithm preserves robustness and distinctiveness while cutting feature extraction time to 1/20 of the original and matching time by more than half, meeting the needs of online copy detection.
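A simplified sketch of a first-level feature of this kind: given keypoint coordinates (however they were detected), count the keypoints that fall into each distance ring and angular sector around the image centre and binarize neighbouring counts into a hash. The numbers of rings and sectors and the binarization rule here are illustrative choices, not the paper's.

```python
import math

def position_hash(keypoints, width, height, rings=4, sectors=8):
    """keypoints: iterable of (x, y). Binary hash from the keypoint position distribution."""
    cx, cy = width / 2.0, height / 2.0
    max_r = math.hypot(cx, cy)
    counts = [0] * (rings * sectors)
    for x, y in keypoints:
        r = math.hypot(x - cx, y - cy)
        a = math.atan2(y - cy, x - cx) % (2 * math.pi)
        ri = min(int(r / max_r * rings), rings - 1)
        si = min(int(a / (2 * math.pi) * sectors), sectors - 1)
        counts[ri * sectors + si] += 1
    # One bit per bin: 1 if the bin holds at least as many keypoints as the next bin.
    return [int(counts[i] >= counts[(i + 1) % len(counts)]) for i in range(len(counts))]

if __name__ == "__main__":
    kps = [(10, 20), (300, 150), (200, 220), (50, 90)]
    print(position_hash(kps, width=320, height=240))
```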

17.
In information retrieval (IR) research, more and more focus has been placed on optimizing a query language model by detecting and estimating the dependencies between the query and the observed terms occurring in the selected relevance feedback documents. In this paper, we propose a novel Aspect Language Modeling framework featuring term association acquisition, document segmentation, query decomposition, and an Aspect Model (AM) for parameter optimization. Through the proposed framework, we advance the theory and practice of applying high-order and context-sensitive term relationships to IR. We first decompose a query into subsets of query terms. Then we segment the relevance feedback documents into chunks using multiple sliding windows. Finally we discover the higher-order term associations, that is, the terms in these chunks that have a high degree of association with the subsets of the query. In this process, we adopt an approach that combines the AM with Association Rule (AR) mining. In our approach, the AM not only treats the subsets of a query as "hidden" states and estimates their prior distributions, but also evaluates the dependencies between the subsets of a query and the observed terms extracted from the chunks of feedback documents. The AR mining provides a reasonable initial estimate of the high-order term associations by discovering association rules from the document chunks. Experimental results on various TREC collections verify the effectiveness of our approach, which significantly outperforms a baseline language model and two state-of-the-art query language models, namely the Relevance Model and the Information Flow model.
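Two of the framework's building blocks (segmenting feedback documents with multiple sliding windows, and counting which terms co-occur with subsets of the query inside those chunks) can be sketched as below. This is only a crude stand-in for the AM/AR machinery; window sizes, step, and the support threshold are arbitrary.

```python
from collections import Counter
from itertools import combinations

def sliding_window_chunks(text, window_sizes=(8, 16), step=4):
    """Segment a document into overlapping word chunks using several window sizes."""
    words = text.split()
    chunks = []
    for size in window_sizes:
        for start in range(0, max(len(words) - size + 1, 1), step):
            chunks.append(words[start:start + size])
    return chunks

def term_associations(chunks, query_terms, min_support=2):
    """Count terms co-occurring with subsets of the query inside chunks (a crude
    stand-in for the association-rule step); keep those above a support threshold."""
    assoc = Counter()
    subsets = [frozenset(c) for r in (1, 2) for c in combinations(sorted(query_terms), r)]
    for chunk in chunks:
        chunk_terms = set(chunk)
        for qs in subsets:
            if qs <= chunk_terms:
                for term in chunk_terms - set(query_terms):
                    assoc[(qs, term)] += 1
    return {k: v for k, v in assoc.items() if v >= min_support}

if __name__ == "__main__":
    doc = "term association helps a query language model rank feedback documents"
    chunks = sliding_window_chunks(doc, window_sizes=(5,), step=2)
    print(term_associations(chunks, {"query", "language"}, min_support=1))
```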

18.
When a file is to be transmitted from a sender to a recipient and when the latter already has a file somewhat similar to it, remote differential compression seeks to determine the similarities interactively so as to transmit only the part of the new file not already in the recipient's old file. Content-dependent chunking means that the sender and recipient chop their files into chunks, with the cutpoints determined by some internal features of the files, so that when segments of the two files agree (possibly in different locations within the files) the cutpoints in such segments tend to be in corresponding locations, and so the chunks agree. By exchanging hash values of the chunks, the sender and recipient can determine which chunks of the new file are absent from the old one and thus need to be transmitted. We propose two new algorithms for content-dependent chunking, and we compare their behavior, on random files, with each other and with previously used algorithms. One of our algorithms, the local maximum chunking method, has been implemented and found to work better in practice than previously used algorithms. Theoretical comparisons between the various algorithms can be based on several criteria, most of which seek to formalize the idea that chunks should be neither too small (so that hashing and sending hash values become inefficient) nor too large (so that agreements of entire chunks become unlikely). We propose a new criterion, called the slack of a chunking method, which seeks to measure how much of an interval of agreement between two files is wasted because it lies in chunks that don't agree. Finally, we show how to efficiently find the cutpoints for local maximum chunking.
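A small sketch of the local-maximum rule described above: a position becomes a cutpoint when its byte value is a strict maximum over a horizon of h positions on either side, so that the cutpoints depend only on local content and therefore line up in agreeing segments. The horizon value is a placeholder, and this naive O(n*h) scan ignores the paper's efficient cutpoint-finding method.

```python
import os

def local_maximum_cutpoints(data, horizon=32):
    """Positions i where data[i] is strictly greater than every byte within
    `horizon` positions on both sides (a content-dependent cutpoint rule)."""
    cutpoints = []
    for i in range(horizon, len(data) - horizon):
        window = data[i - horizon:i] + data[i + 1:i + horizon + 1]
        if all(data[i] > b for b in window):
            cutpoints.append(i)
    return cutpoints

def chunk(data, horizon=32):
    """Split a byte string into chunks at the local-maximum cutpoints."""
    cuts = [0] + local_maximum_cutpoints(data, horizon) + [len(data)]
    return [data[cuts[k]:cuts[k + 1]] for k in range(len(cuts) - 1)]

if __name__ == "__main__":
    blob = os.urandom(4096)
    print([len(c) for c in chunk(blob, horizon=32)])
```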

19.
Cross-media retrieval is an imperative approach to handling the explosive growth of multimodal data on the web. However, existing approaches to cross-media retrieval are computationally expensive due to high dimensionality. To retrieve efficiently from multimodal data, it is essential to reduce the proportion of irrelevant documents. In this paper, we propose a fast cross-media retrieval approach (FCMR) based on locality-sensitive hashing (LSH) and neural networks. One modality of the multimodal information is projected by the LSH algorithm so that similar objects fall into the same hash bucket and dissimilar objects into different ones, and the other modality is then mapped into these hash buckets using hash functions learned through neural networks. Given a textual or visual query, it can be efficiently mapped to a hash bucket whose stored objects are likely near neighbors of the query. Experimental results show that, in the set of near neighbors obtained by the proposed method for the queries, the proportion of relevant documents is much higher, which indicates that retrieval based on near neighbors can be conducted effectively. Further evaluations on two public datasets demonstrate the efficacy of the proposed retrieval method compared to the baselines.
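A minimal random-hyperplane LSH sketch for indexing one modality (the learned neural-network mapping of the second modality is omitted); the dimensionality and the number of hash bits below are arbitrary.

```python
import numpy as np

class HyperplaneLSH:
    """Random-hyperplane LSH: the sign pattern of the projections is the bucket key."""

    def __init__(self, dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_bits, dim))
        self.buckets = {}

    def key(self, vec):
        return tuple((self.planes @ vec > 0).astype(int))

    def insert(self, item_id, vec):
        self.buckets.setdefault(self.key(vec), []).append(item_id)

    def query(self, vec):
        """Items in the same bucket are candidate near neighbours of the query."""
        return self.buckets.get(self.key(vec), [])

if __name__ == "__main__":
    lsh = HyperplaneLSH(dim=64)
    data = np.random.default_rng(1).normal(size=(100, 64))
    for i, v in enumerate(data):
        lsh.insert(i, v)
    print(lsh.query(data[0]))
```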
