首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到19条相似文献,搜索用时 93 毫秒
1.
中文在线百科包含大量有价值的信息,很多工作成功地将其用于各类知识获取任务。例如,拥有相似话题的文档可以被归为一个概念。从这些在线百科中构建出的针对某一概念的层次话题对于搜索与浏览、信息组织和检索等应用都有很大的帮助。然而,目前尚未出现对在线百科中某一概念层次话题构建的研究。针对中文在线百科的异构性与粗糙性的问题,提出了一种基于贝叶斯网络的话题层次构建方法。该方法同时综合文档的结构化目录信息和非结构化文本信息,采用最大树形图算法自动地在文档所属概念的贝叶斯话题网络中建立层次话题。实验证明,与原有的百科话题结构相比较,所提方法在保持75%的准确性的同时扩充了4倍的内容。  相似文献   

2.
随着文本表现形式越来越丰富,文本分类研究的对象正从平文本逐渐转变为富文本,传统的平文本分类方法不能满足实际需要.分析了富文本中的结构化信息和文本内容信息,把它们作为两个重要的因素,综合考虑了其在分类中的作用,提出并实现了标签组件法、结构组件法和综合法三种富文本分类的方法.实验表明,所提出的方法有较好的分类表现,能解决OpenDocument的分类问题.  相似文献   

3.
半结构化、层次数据的模式发现   总被引:10,自引:0,他引:10  
Web数据资源及数据集成引发了半结构化数据问题,半结构化数据指其结构隐含或不规整的自描述数据。由于缺乏独立于数据的模式,有效地查询划浏览该类数据比较困难,半结构化数据的模式发现成为解决该问题的基础步骤。本文提出的算法能够快速有效地发现半结构化层次数据中的规整结构。它采用自顶向下的生成,结合有效的剪枝策略,从OEM模型表达的半结构化层次数据中构建模式树。  相似文献   

4.
文本结构分析与基于示例的文本过滤   总被引:13,自引:0,他引:13  
本文简要介绍了文本过滤的背景和发展,提出了基于示例的中文文本过滤模型.其基本思想是首先对于用户提出的示例文本进行文本结构分析,采用本文提出的文本层次分析方法,提取文本特征,形成主题词表示的用户模版(user profile),然后进行了文本过滤,同时引进段落匹配机制,提高过滤效率.通过用户反馈,改进用户模版.  相似文献   

5.
文本层次分析与文本浏览   总被引:7,自引:2,他引:5  
本文简要描述了文本的物理结构和逻辑结构以及相应的向量空间模型。研制了具有导航机制的文本浏览系统。提出了文本结构分析中的层次分析方法,它采用有序划分层次的方法。并在此基础上,给出了文本结构中各单元的标记信息,由此形成了文本的可视化表示。利用文本、层次、段落的超文本连接,根据浏览的需要,逐级展现文本细节,帮助用户有目的、有选择地浏览文本。最后给出评价的结果。  相似文献   

6.
针对传统地理信息系统(GIS)结构化或半结构化属性查询方法对查询语句输入的精度及查询范围的限制,提出了以哈尔滨工业大学《同义词词林》扩展版文本相关度计算为核心的非结构化文本数据GIS描述性查询方法。基本过程是根据描述性查询语句计算其与地理要素所关联的文本的相关度,进而以相关度值得出概括性查询结果。对比实验结果表明,描述性查询方法不但支持查询语句输入的多样化,而且能够有效地得出与输入的描述性查询相关联的地理要素。  相似文献   

7.
病理文本作为一类重要的非结构化临床文档,对临床诊断至关重要。针对具体的中文病理文本数据,提出一种简单有效结构化处理方法。首先对中文病理历史文本数据进行预处理,包括数据清洗、短句切分及主干提取等步骤,从中提取出各个样本所对应的文本信息;然后通过短句聚类和统计参数筛选实现样本描述模板的提取;最后利用模板对病理文本进行即时结构化处理,得到最终的结构化处理结果。实验证明,该方法对同类文本可以达到很好的结构化效果;同时提取的模板会被定期优化以适应最新的数据结构化需求。  相似文献   

8.
互联网的兴起带来了大量的文本信息。在半结构化和非结构化的文本中提取对用户有用的信息,主要采用文本挖掘技术.本文对文本挖掘常用的方法进行比较分析,总结文本挖掘目前主要的应用领域  相似文献   

9.
为了高速度、高质量地浏览网络上的大量中文文本,提出了一种文本凹凸树结构的可视化浏览机制,并给出其彤式描述.通过以关键字和概念词典标注的最小概念集标识结点建立文本分类的层次树结构,为用户快速洲览文本提供有效路径.通过统计方法进行文本摘要抽取,按大纲、逻辑主题词段落和摘要洲览文本内容,提高了搜索查询速度与阅读效率,满足了用户快速、主动浏览文本的需求.  相似文献   

10.
面向用户需求的非结构化P2P资源定位泛洪策略   总被引:1,自引:0,他引:1  
何明  张玉洁  孟祥武 《软件学报》2015,26(3):640-662
在非结构化P2P网络中,如何对用户所需资源进行快速、准确定位是当前研究的热点问题,也是P2P应用领域面临的核心问题之一.相关的非结构化P2P资源定位算法在查准率、查全率和查询成本上难以同时被优化,这会造成严重的网络带宽负担以及巨大的索引维护开销.为此,提出一种面向用户需求的非结构化P2P资源定位策略(user requirements resource location strategy,简称U2RLS).该策略的创新点是:在原有非结构化P2P网络资源定位泛洪算法的基础上,融入用户需求、用户偏好、用户兴趣度等因素,首先进行用户资源子网划分;采用带有用户需求信息的泛洪和查询索引机制,对用户所需资源进行精确定位.该策略有效避免了因海量信息引起的网络风暴、信息重叠和资源搜索偏覆盖等问题,从而解决了查询节点盲目使用中继节点的现象.实验结果表明:面向用户需求的非结构化P2P资源定位策略U2RLS以其高搜索成功率、有限网络资源消耗和短查询时间响应等优势,能够显著地提高用户资源定位效率.  相似文献   

11.
陈娟  王贤  黄青松 《现代计算机》2006,(9):19-21,62
近几年,网络被在线数据库迅速地深化.在深网中,大量的资料提供了丰富的数据模式,这些模式详细说明了它们的目标领域和查询性能,因此对大规模数据的整合是当前面临的挑战.在数据挖掘中,聚类分析是一个重要方法.本文论述通过查询接口采用凝聚层次聚类方法聚类结构化的Web资源,并采用先聚类后分类的方法稍加改进.实验显示对于聚类Web查询模式,凝聚的层次聚类能正确地组织资料.  相似文献   

12.
Since documents on the Web are naturally partitioned into many text databases, the efficient document retrieval process requires identifying the text databases that are most likely to provide relevant documents to the query and then searching for the identified text databases. In this paper, we propose a neural net based approach to such an efficient document retrieval. First, we present a neural net agent that learns about underlying text databases from the user's relevance feedback. For a given query, the neural net agent, which is sufficiently trained on the basis of the BPN learning mechanism, discovers the text databases associated with the relevant documents and retrieves those documents effectively. In order to scale our approach with the large number of text databases, we also propose the hierarchical organization of neural net agents which reduces the total training cost at the acceptable level. Finally, we evaluate the performance of our approach by comparing it to those of the conventional well-known approaches. Received 5 March 1999 / Revised 7 March 2000 / Accepted in revised form 2 November 2000  相似文献   

13.
基于潜在语义索引的文本分析方法   总被引:1,自引:0,他引:1  
本文分析是文本处理领域中的重要内容,它可以有效地改进文本检索、文本过滤以及文本摘要的精度.本文简要描述了文本的物理结构和逻辑结构以及文本分析的背景,将潜在语义索引引入文本分析中,提出了基于潜在语义索引的层次分析方法.该方法保证了层次划分的有序性和聚合性,可操作性强,便于解释,并给出了在文本检索、文本过滤和文本摘要中的应用.  相似文献   

14.
The recent decades have witnessed an unprecedented expansion in the volume of unstructured data in digital textual formats. Companies are now starting to recognize the potential economic value lying untapped in their text data repositories and sources, including external ones, such as social media platforms, and internal ones, such as safety reports and other company-specific document collections. Information extracted from these textual data sources is valuable for a range of enterprise application and for informed decision making. In this article we provide a systematic review of the current state of the art in the application of text analytics in industry. Our review is structured along three dimensions: the application context, the methods and techniques utilized, and the evaluation procedure. Based on the review, we identify the different challenges and constraints that an real-world, industrial environment imposes on text analytics techniques, as opposed to their deployment in more controlled, research environments. In addition, we formulate a set of desiderata that text analytics techniques should satisfy in order to alleviate these challenges and to ensure their successful deployment in industry. Furthermore, we discuss future trends in text analytics and their potential application in industry.  相似文献   

15.
一种面向专利文献数据的文本自动分类方法   总被引:1,自引:0,他引:1  
中文专利文献自动分类目前尚无成熟适用的方法。分析了文本自动分类的关键技术,并结合专利数据的特点对无词典分词和权重计算进行了改进,提出了一种适用于专利数据分类的层次分类方法,给出了面向专利文献数据的文本自动分类系统的框架模型。实验表明,该系统具有较好的分类精度与效率。  相似文献   

16.
Most documents have a hierarchical structure, which can be made explicit by markup languages such as SGML. In this paper we propose a formal model for representation of hierarchically structured documents, to be used as the basis for document query languages. The model uses a redundant representation of the document elements to simplify the expression of common queries. As an illustration of the power of the model we show how queries might be expressed, both as set-theoretic expressions and in a simple algebra, and outline how queries might be evaluated in a practical system.  相似文献   

17.
Data integration systems on the Deep Web offer a transparent means to query multiple data sources at once. Result merging– the generation of an overall ranked list of results from different sources in response to a query– is a key component of a data integration system. In this work we present a result merging model, called Active Relevance Weight Estimation model. Different from the existing techniques for result merging, we estimate the relevance of a data source in answering a query at query time. The relevances for a set of data sources are expressed with a (normalized) weighting scheme: the larger the weight for a data source the more relevant the source is in answering a query. We estimate the weights of a data source in each subset of the data sources involved in a training query. Because an online query may not exactly match any training query, we devise methods to obtain a subset of training queries that are related to the online query. We estimate the relevance weights of the online query from the weights of this subset of training queries. Our experiments show that our method outperforms the leading merging algorithms with comparable response time.  相似文献   

18.
When performing queries in web search engines, users often face difficulties choosing appropriate query terms. Search engines therefore usually suggest a list of expanded versions of the user query to disambiguate it or to resolve potential term mismatches. However, it has been shown that users find it difficult to choose an expanded query from such a list. In this paper, we describe the adoption of set‐based text visualization techniques to visualize how query expansions enrich the result space of a given user query and how the result sets relate to each other. Our system uses a linguistic approach to expand queries and topic modeling to extract the most informative terms from the results of these queries. In a user study, we compare a common text list of query expansion suggestions to three set‐based text visualization techniques adopted for visualizing expanded query results – namely, Compact Euler Diagrams, Parallel Tag Clouds, and a List View – to resolve ambiguous queries using interactive query expansion. Our results show that text visualization techniques do not increase retrieval efficiency, precision, or recall. Overall, users rate Parallel Tag Clouds visualizing key terms of the expanded query space lowest. Based on the results, we derive recommendations for visualizations of query expansion results, text visualization techniques in general, and discuss alternative use cases of set‐based text visualization techniques in the context of web search.  相似文献   

19.
Structured text is a general concept that is implicit in a variety of approaches to handling information. Syntactically, an item of structured text is a number of grammatically simple phrases together with a semantic label for each phrase. Items of structured text may be nested within larger items of structured text. The semantic labels in a structured text are meant to parameterize a stereotypical situation, and so a particular item of structured text is an instance of that stereotypical situation. Much information is potentially available as structured text including tagged text in XML, text in relational and object-oriented databases, and the output from information extraction systems in the form of instantiated templates. In this paper, we formalize the concept of structured text, and then focus on how we can identify inconsistency in the logical representation of items of structured text. We then present a new framework for merging logical theories that can be employed to merge inconsistent items of structured text. To illustrate, we consider the problem of merging reports such as weather reports.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号