Similar literature
 20 similar documents found (search time: 0 ms)
1.
In this paper we introduce a dynamic programming algorithm that performs linear text segmentation by globally minimizing a segmentation cost function incorporating two factors: (a) within-segment word similarity and (b) prior information about segment length. We evaluate the segmentation accuracy of the algorithm by precision, recall and Beeferman's segmentation metric. On a segmentation task involving Choi's text collection, the algorithm achieves the best segmentation accuracy so far reported in the literature. The algorithm also achieves high accuracy on a second task involving previously unused texts.
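The global minimization described in this abstract can be illustrated with a small dynamic program. This is only a sketch under stated assumptions, not the authors' cost function: sentences are modeled as sets of words, within-segment similarity is taken to be the share of sentence pairs that have a word in common, and the length prior is a quadratic penalty around a hypothetical `expected_len`. The names `segment_cost`, `segment`, `expected_len` and `length_weight` are illustrative.

```python
from typing import List, Sequence, Set


def segment_cost(sentences: Sequence[Set[str]], start: int, end: int,
                 expected_len: float, length_weight: float) -> float:
    """Cost of treating sentences[start:end] as one segment (lower is better)."""
    n = end - start
    pairs = n * (n - 1) / 2
    # Assumed similarity measure: fraction of sentence pairs sharing a word.
    shared = sum(
        1
        for i in range(start, end)
        for j in range(i + 1, end)
        if sentences[i] & sentences[j]
    )
    cohesion = shared / pairs if pairs else 0.0
    # Assumed length prior: quadratic penalty around the expected length.
    length_penalty = ((n - expected_len) / expected_len) ** 2
    return length_weight * length_penalty - cohesion


def segment(sentences: Sequence[Set[str]], expected_len: float = 3.0,
            length_weight: float = 1.0) -> List[int]:
    """Return the interior boundary positions of the minimum-cost segmentation."""
    n = len(sentences)
    best = [0.0] + [float("inf")] * n   # best[k]: minimal cost of the first k sentences
    back = [0] * (n + 1)                # back[k]: start of the last segment in that optimum
    for end in range(1, n + 1):
        for start in range(end):
            cost = best[start] + segment_cost(
                sentences, start, end, expected_len, length_weight)
            if cost < best[end]:
                best[end] = cost
                back[end] = start
    # Walk the back-pointers from the end to recover the boundaries.
    boundaries = []
    pos = n
    while pos > 0:
        pos = back[pos]
        if pos > 0:
            boundaries.append(pos)
    return boundaries[::-1]


sentences = [
    {"cat", "purr"}, {"cat", "sleep"}, {"cat", "milk"},
    {"stock", "market"}, {"stock", "price"}, {"stock", "trade"},
]
print(segment(sentences))  # boundary positions between segments
```

Because every candidate segment is scored independently and the total cost is additive, the table `best` yields a globally optimal segmentation in O(n²) evaluations of the segment cost.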

2.
Digitalization has changed the way information is processed, and new techniques for legal data processing are evolving. Text mining helps to analyze and search court cases available as digital text documents in order to extract case reasoning and related data. This kind of case processing helps professionals and researchers refer to previous cases more accurately and in less time. The rapid development of judicial ontologies appears to offer interesting solutions for legal knowledge formalization. Mining context information from corpora through ontologies is a challenging and interesting field. This research paper presents a three-tier contextual text mining framework that uses ontologies for judicial corpora. The framework comprises the judicial corpus, text mining processing resources, and ontologies for mining contextual text from corpora, with the aim of making text and data mining more reliable and faster. A top-down ontology construction approach has been adopted in this paper. The judicial corpus has been selected with a sufficient dataset to process and evaluate the results. The experimental results and evaluations show significant improvements in comparison with the available techniques.

3.
Text Retrieval from Document Images Based on Word Shape Analysis   Total citations: 2, self-citations: 1, citations by others: 2
In this paper, we propose a method of text retrieval from document images using a similarity measure based on word shape analysis. We extract image features directly instead of using optical character recognition. Document images are segmented into word units, and features called vertical bar patterns are then extracted from these word units through detection of local extrema points. All vertical bar patterns are used to build document vectors. Lastly, we obtain the pair-wise similarity of document images by means of the scalar product of the document vectors. Four corpora of news articles were used to test the validity of our method. During the test, the similarity of document images obtained with this method was compared with the result obtained for the ASCII versions of those documents using the N-gram algorithm for text documents.
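The final step of this abstract, pair-wise similarity as the scalar product of document vectors, can be sketched as follows. The sketch assumes each vertical bar pattern has already been reduced to a hashable label (the strings `"p1"`, `"p2"`, ... are placeholders); the image processing itself is not shown, and the abstract does not say whether the authors normalize the vectors, so the product here is left unnormalized. `document_vector` and `scalar_product` are illustrative names.

```python
from collections import Counter
from typing import Hashable, Iterable


def document_vector(word_features: Iterable[Hashable]) -> Counter:
    """Count how often each feature label occurs in one document image."""
    return Counter(word_features)


def scalar_product(a: Counter, b: Counter) -> int:
    """Unnormalized scalar product of two sparse count vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    # Counter returns 0 for labels it does not contain.
    return sum(count * b[label] for label, count in a.items())


doc_a = document_vector(["p1", "p2", "p1"])
doc_b = document_vector(["p1", "p3"])
print(scalar_product(doc_a, doc_b))
```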

4.
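Web Information Retrieval Based on Anchor Text and Its Context   Total citations: 20, self-citations: 0, citations by others: 20
The hyperlink structure among documents is one of the biggest differences between Web information retrieval and traditional information retrieval, and it has given rise to retrieval techniques based on hyperlink structure. This paper describes the concept of the anchor-description document and, on that basis, studies the role of anchor text and its context in retrieval. Tests on a large-scale real-world data set of more than 1.69 million web pages, using the relevant documents and evaluation methods provided by TREC 2001, lead to the following conclusions. First, anchor-description documents summarize the topic of a page with high precision but describe the content of the page very incompletely. Second, compared with traditional retrieval methods, using anchor text improves system performance by 96% on the task of locating a known page, but anchor text and its context do not improve retrieval performance on the task of querying for unknown information. Finally, combining the anchor-text-based method with traditional methods improves retrieval performance by nearly 16%.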

5.
6.
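杨华, 陈波. 《中文信息学报》 (Journal of Chinese Information Processing), 2015, 29(4): 103-110
Based on a limited number of documents, this paper builds UBEN, an undirected network whose nodes are the heads and modifiers of basic elements, investigates the connectivity of the UBEN of topic-related documents, and identifies the properties of such networks. It discusses the effect of stop words on UBEN connectivity and compares the connectivity of UBENs built from related document sets with that of UBENs built from random document sets, showing that connectivity is, to some extent, a merging effect caused by related content across documents. The conclusions are relevant to tasks such as multi-document summarization and information retrieval.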

7.
A map of text documents arranged using the Self-Organizing Map (SOM) algorithm (1) is organized in a meaningful manner so that items with similar content appear at nearby locations of the 2-dimensional map display, and (2) clusters the data, resulting in an approximate model of the data distribution in the high-dimensional document space. This article describes how a document map that is automatically organized for browsing and visualization can also be used successfully to speed up document retrieval. Furthermore, experiments on the well-known CISI collection [3] show significantly improved performance compared to Salton's vector space model, measured by average precision (AP) when retrieving a small, fixed number of best documents. Regarding comparison with Latent Semantic Indexing, the results are inconclusive. This revised version was published online in August 2006 with corrections to the Cover Date.

8.
1 Introduction. The explosive growth of the Internet and other sources of networked information has made automatic mediation of access to networked information sources an increasingly important problem. Much of this information is expressed as electronic text in English. However, most Chinese users can read English but cannot write it fluently, so they would like to express their queries in Chinese to retrieve the relevant English documents. The use of such systems can also be benefici…

9.
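This paper proposes a latent document similarity model (LDSM) that treats each pair of documents as a bipartite graph, treats the latent topics of the documents as the vertices of the graph, weights each edge with the weighted similarity between the corresponding topics, and represents the similarity of the two documents by the optimal matching of the bipartite graph. Experimental results show that the average precision and average recall of LDSM are both better than those of a document similarity model built with TextTiling and bipartite optimal matching.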
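The optimal-matching step of such a model can be sketched in a few lines. This is an illustration only, not the LDSM implementation: it assumes the topic-to-topic similarities have already been computed and arranged in a square matrix (rows are topics of one document, columns are topics of the other), and it finds the best matching by brute force over permutations, which is practical only for a handful of topics. A real system would more likely use the Hungarian (Kuhn-Munkres) algorithm. `best_matching_score` is a hypothetical name.

```python
from itertools import permutations
from typing import Sequence


def best_matching_score(weights: Sequence[Sequence[float]]) -> float:
    """Maximum total edge weight over all perfect matchings of a square matrix."""
    n = len(weights)
    if n == 0:
        return 0.0
    # Each permutation assigns row i (a topic of document A) to column
    # perm[i] (a topic of document B); keep the heaviest assignment.
    return max(
        sum(weights[i][perm[i]] for i in range(n))
        for perm in permutations(range(n))
    )


# Hypothetical topic-similarity matrix for two documents with two topics each.
topic_similarity = [
    [0.5, 0.25],
    [0.25, 0.75],
]
print(best_matching_score(topic_similarity))
```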

10.
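Research on Full-Text Retrieval Technology for Electronic Publications   Total citations: 4, self-citations: 0, citations by others: 4
Electronic publications are a new, rapidly developing and eye-catching form of publication that overcomes the shortcomings of ordinary printed books: small capacity, large physical volume, and difficulty of retrieval. This paper explores how to implement full-text retrieval in electronic publications by analyzing the structure of full-text retrieval and the implementation of its functions.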

11.
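A Method for Querying Document Databases by Content and Structure   Total citations: 4, self-citations: 0, citations by others: 4
Documents have a logical structure; concepts such as titles, chapters, and paragraphs form their internal logic. Different users have different needs when retrieving documents, and how a retrieval system can provide meaningful information has long been a central research task. Combining document structure and content, this paper proposes a new method for computing similarity in the retrieval of structured documents. The method supports retrieval of document content at multiple granularities, from words and phrases to paragraphs or chapters. A question answering system was implemented based on this method, with Microsoft's encyclopedia Encarta as the test collection; experimental comparison with traditional methods shows that the passages retrieved by this method are more reasonable and more effective.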

12.
Websom for Textual Data Mining   Total citations: 6, self-citations: 0, citations by others: 6
New methods that are user-friendly and efficient are needed for guidance among the masses of textual information available in the Internet and the World Wide Web. We have developed a method and a tool called the WEBSOM which utilizes the self-organizing map algorithm (SOM) for organizing large collections of text documents onto visual document maps. The approach to processing text is statistically oriented, computationally feasible, and scalable: over a million text documents have been ordered on a single map. In the article we consider different kinds of information needs and tasks regarding organizing, visualizing, searching, categorizing and filtering textual data. Furthermore, we discuss and illustrate with examples how document maps can aid in these situations. An example is presented where a document map is utilized as a tool for visualizing and filtering a stream of incoming electronic mail messages.

13.
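Regions and Techniques for Extracting Feature Information from Chinese Text   Total citations: 30, self-citations: 3, citations by others: 30
This paper discusses the regions of Chinese text from which feature information can be extracted and the techniques for doing so. Taking news corpora, scientific papers, and official documents as examples, it describes in detail the regions and techniques for extracting feature information from each type of text; for scientific papers, it also gives some practical production rules. The methods and conclusions should be of reference value to researchers in automatic indexing, automatic classification, and automatic summarization.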

14.
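Document analysis and recognition (document recognition for short) converts unstructured document data such as images and online handwriting into structured data that computers can process and understand, and it has a very wide range of applications. Since the 1960s, research on and application of document recognition methods have attracted wide attention and made great progress. Thanks to the development and application of deep learning, the performance of document recognition has improved rapidly, and the technology is widely used in document digitization, bill processing, handwriting input, intelligent transportation, document retrieval, and information extraction. This paper first introduces the background and technical scope of document recognition and reviews the history of the field, then surveys research since the rise of deep learning methods, analyzes the shortcomings of current technology, and suggests research directions that deserve attention. The survey of the state of the art is organized by the main technical stages of document analysis and recognition (document image preprocessing, layout analysis, scene text detection, text recognition, recognition of structured symbols and graphics, and document retrieval and information extraction); for each stage it briefly describes representative work on traditional methods and focuses on recent progress in deep learning methods. Overall, the objects of current research are expanding in depth and breadth, processing methods have shifted entirely to deep neural network models and deep learning, recognition performance has improved greatly, and application scenarios keep expanding. Based on this analysis, the paper points out that current technology still falls clearly short in recognition accuracy and reliability, interpretability, learning ability, and adaptability. Finally, it proposes research directions from three perspectives: improving performance, extending applications, and improving learning ability. For improving performance, the research problems include the reliability and interpretability of text recognition, all-element recognition, the long-tail problem, multilingual recognition, segmentation and understanding of complex layouts, and analysis and recognition of deformed documents. Application extension covers new applications (such as robotic process automation (RPA), transcription of textual information, and archaeology) and new technical problems (semantic information extraction, cross-modal fusion, application-oriented reasoning and decision making, and so on). For improving learning ability, the relevant problems include few-shot learning, transfer learning, multi-task learning, domain adaptation, structured prediction, weakly supervised learning, self-supervised learning, open-set learning, and cross-modal learning.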

15.
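Research on Text-Based Information Hiding Techniques   Total citations: 2, self-citations: 0, citations by others: 2
Information hiding is a new technique that has emerged in the field of information security in recent years and differs from traditional cryptography. It mainly studies how to hide secret information inside other, public information and then convey the secret information by transmitting the public information. The public information that serves as the carrier can be an ordinary text file, a digital image, digital video, digital audio, and so on. This paper introduces several existing text-based information hiding techniques.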

16.
The Cambridge University Multimedia Document Retrieval (CU-MDR) Demo System is a web-based application that allows the user to query a database of radio broadcasts that are available on the Internet. The audio from several radio stations is downloaded and transcribed automatically. This gives a collection of text and audio documents that can be searched by a user. The paper describes how speech recognition and information retrieval techniques are combined in the CU-MDR Demo System and shows how the user can interact with it.

17.
18.
We present a software tool called seft which balances the convenience of search tools such as grep with the functionality of full-text index-based information retrieval. Based on a novel retrieval heuristic which uses term locality as a guide to relevance, seft combines the freedom of natural language queries with the benefits of a ranked answer list and easy inspection of retrieval results. While not as fast as grep-style tools, seft provides a valuable facility for impromptu personal information retrieval tasks. Copyright © 2004 John Wiley & Sons, Ltd.

19.
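An Anchor-Text-Based Parallel Retrieval Strategy   Total citations: 1, self-citations: 0, citations by others: 1
高珊, 何婷婷, 胡文敏. 《计算机工程》 (Computer Engineering), 2008, 34(19): 30-31,3
In Web information retrieval, the anchor text in a page is strongly correlated with the page body, yet most retrieval systems ignore the contribution of anchor text to the page body. This paper proposes a method for improving retrieval precision: it builds one index on page body text and one index on anchor text for the document collection and applies a parallel retrieval strategy to them. Experimental results show that the method handles collections of web pages with specific structures effectively.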
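The two-index arrangement in this abstract can be sketched as follows. The abstract does not give the scoring or merging scheme, so everything below is an assumption made for illustration: both indices are simple inverted indices over whitespace-separated terms, a page's score in each index is the number of query terms it matches, the two lookups run concurrently in a thread pool, and the scores are merged with a hypothetical `anchor_weight`. Anchor text is keyed by the page it points to.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Set

Index = Dict[str, Set[str]]  # term -> ids of pages whose text contains it


def build_index(texts: Dict[str, str]) -> Index:
    """Build an inverted index from page id -> text."""
    index: Index = defaultdict(set)
    for page_id, text in texts.items():
        for term in text.lower().split():
            index[term].add(page_id)
    return index


def search(index: Index, query: str) -> Dict[str, int]:
    """Score each page by the number of query terms it matches."""
    scores: Dict[str, int] = defaultdict(int)
    for term in query.lower().split():
        for page_id in index.get(term, ()):
            scores[page_id] += 1
    return scores


def parallel_search(body_index: Index, anchor_index: Index, query: str,
                    anchor_weight: float = 2.0) -> List[str]:
    """Query both indices concurrently and merge the scores into one ranking."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        body_future = pool.submit(search, body_index, query)
        anchor_future = pool.submit(search, anchor_index, query)
        body_scores = body_future.result()
        anchor_scores = anchor_future.result()
    combined: Dict[str, float] = defaultdict(float)
    for page_id, score in body_scores.items():
        combined[page_id] += score
    for page_id, score in anchor_scores.items():
        combined[page_id] += anchor_weight * score  # assumed weighting
    # Highest score first; ties broken by page id for a stable order.
    return sorted(combined, key=lambda pid: (-combined[pid], pid))


body_index = build_index({"a": "text retrieval methods", "b": "cooking recipes"})
anchor_index = build_index({"a": "home", "b": "text retrieval portal"})
print(parallel_search(body_index, anchor_index, "text retrieval"))
```

Keeping the two indices separate means the anchor-text weight can be tuned, or the anchor index dropped entirely, without rebuilding the body index.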

20.
Document management inside an organization is a complex and broadly scoped problem. This paper approaches the technical and social issues of Intranet document management by developing a straightforward document lifecycle model consisting of five phases: creation, publication, organization, access, and destruction. A document management system (DMS) which encompasses these areas should also have an evaluation component so its effectiveness can be measured. The document lifecycle is visualized as a waterfall model to help explore the discrete phases of an idealized Intranet DMS. The discussion of this model pinpoints where traditional DMS have fallen short, most notably in the areas of user-to-user and user-to-evaluator communication and coordination. From the document lifecycle, we derive an agent framework to integrate technical and social considerations and guide the design, implementation, and evaluation of a flexible and efficient DMS. The lifecycle model and agent framework are useful to organize both technical and social perspectives in this area.
