Similar literature
 Found 20 similar articles; search took 15 ms
1.
Integrating a large number of Web information sources may significantly increase the utility of the World-Wide Web. A promising approach to this integration is a Web information mediator that provides seamless, transparent access for clients. Information mediators need wrappers to access a Web source as a structured database, but building wrappers by hand is impractical. Previous work on wrapper induction is too restrictive to handle the large number of Web pages that contain tuples with missing attributes, multiple values, variant attribute permutations, exceptions and typos. This paper presents SoftMealy, a novel wrapper representation formalism based on a finite-state transducer (FST) and contextual rules. This approach can wrap a wide range of semistructured Web pages because FSTs can encode each different attribute permutation as a path. A SoftMealy wrapper can be induced from a handful of labeled examples using our generalization algorithm. We have implemented this approach in a prototype system and tested it on real Web pages. The performance statistics show that the sizes of the induced wrappers, as well as the required training effort, are linear with regard to the structural variance of the test pages. Our experiment also shows that the induced wrappers can generalize over unseen pages.

2.
3.
Adjectives are common in natural language, and their usage and semantics have been studied broadly. In recent years, with the rapid growth of knowledge bases (KBs), many knowledge-based question answering (KBQA) systems have been developed to answer users’ natural language questions over KBs. A fundamental task of such systems is to transform natural language questions into structured queries, e.g., SPARQL queries. Thus, such systems require knowledge about how natural language expressions, including adjectives, are represented in KBs. In this paper, we specifically address the problem of representing adjectives over KBs. We propose a novel approach, called Adj2SP, to represent adjectives as SPARQL query patterns. Adj2SP contains a statistic-based approach and a neural network-based approach, both of which can effectively reduce the search space for adjective representations and overcome the lexical gap between input adjectives and their target representations. Two adjective representation datasets are built for evaluation, with adjectives used in QALD and Yahoo! Answers, as well as their representations over DBpedia. Experimental results show that Adj2SP can generate representations of high quality and significantly outperform several alternative approaches in F1-score. Furthermore, we publish Lark, a lexicon for adjective representations over KBs. Current KBQA systems show an improvement of over 24% in F1-score by integrating Adj2SP.

4.
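Information Extraction from Web Pages with Multiple Information Blocks   Total citations: 13, self-citations: 0, citations by others: 13
Proposes a wrapper that uses a new kind of extraction rule. It combines the advantages of wrappers based on document-structure extraction rules and wrappers based on feature-pattern-matching extraction rules, and it is applicable to Web pages that contain multiple information blocks.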

5.
6.
This article provides a comprehensive and comparative overview of question answering technology. It presents the question answering task from an information retrieval perspective and emphasises the importance of retrieval models, i.e., representations of queries and information documents, and retrieval functions which are used for estimating the relevance between a query and an answer candidate. The survey suggests a general question answering architecture that steadily increases the complexity of the representation level of questions and information objects. On the one hand, natural language queries are reduced to keyword-based searches; on the other hand, knowledge bases are queried with structured or logical queries obtained from the natural language questions, and answers are obtained through reasoning. We discuss different levels of processing yielding bag-of-words-based and more complex representations integrating part-of-speech tags, classification of the expected answer type, semantic roles, discourse analysis, translation into a SQL-like language and logical representations.

7.
The semantic web vision is one in which rich, ontology-based semantic markup will become widely available. The availability of semantic markup on the web opens the way to novel, sophisticated forms of question answering. AquaLog is a portable question-answering system which takes queries expressed in natural language and an ontology as input, and returns answers drawn from one or more knowledge bases (KBs). We say that AquaLog is portable because the configuration time required to customize the system for a particular ontology is negligible. AquaLog presents an elegant solution in which different strategies are combined in a novel way. It makes use of the GATE NLP platform, string metric algorithms, WordNet and a novel ontology-based relation similarity service to make sense of user queries with respect to the target KB. Moreover, it includes a learning component, which ensures that the performance of the system improves over time, in response to the particular community jargon used by end users.

8.
We present a method to automatically discover meaningful features in unlabeled image collections. Each image is decomposed into semi-local features that describe neighborhood appearance and geometry. The goal is to determine for each image which of these parts are most relevant, given the image content in the remainder of the collection. Our method first computes an initial image-level grouping based on feature correspondences, and then iteratively refines cluster assignments based on the evolving intra-cluster pattern of local matches. As a result, the significance attributed to each feature influences an image’s cluster membership, while related images in a cluster affect the estimated significance of their features. We show that this mutual reinforcement of object-level and feature-level similarity improves unsupervised image clustering, and apply the technique to automatically discover categories and foreground regions in images from benchmark datasets.

9.
Information extraction from unstructured, ungrammatical data such as classified listings is difficult because traditional structural and grammatical extraction methods do not apply. Previous work has exploited reference sets to aid such extraction, but it did so using supervised machine learning. In this paper, we present an unsupervised approach that both selects the relevant reference set(s) automatically and then uses them for unsupervised extraction. We validate our approach with experimental results that show our unsupervised extraction is competitive with supervised machine learning approaches, including the previous supervised approach that exploits reference sets.

10.
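Unsupervised Feature Selection Based on Mutual Information   Total citations: 5, self-citations: 0, citations by others: 5
In data analysis, feature selection can be used to reduce feature redundancy, improve the interpretability of analysis results, and uncover structure hidden in high-dimensional data. This paper proposes an unsupervised feature selection method based on mutual information (UFS-MI). UFS-MI evaluates the importance of features with UmRMR (unsupervised minimum redundancy maximum relevance), a selection criterion that takes both relevance and redundancy into account. Relevance and redundancy use mutual information to measure, respectively, the dependence between a feature and the latent class variable and the dependence between features. UFS-MI applies to both numerical and non-numerical features. The effectiveness of UFS-MI is proved theoretically, and experimental results show that UFS-MI can achieve performance comparable to, or even better than, traditional feature selection methods.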
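
As a rough illustration of the quantity this abstract builds on, the sketch below estimates mutual information between two discrete features from co-occurrence counts and averages it into a redundancy term. It is not the UmRMR criterion itself, whose exact form the abstract does not give; the function names `mutual_information` and `redundancy` and the dict-of-columns layout are assumptions made for this example.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y), in bits, of two discrete sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2(p(x,y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


def redundancy(features, candidate, selected):
    """Mean mutual information between a candidate feature and the selected ones.

    `features` maps a feature name to its column of discrete values
    (a hypothetical layout chosen for this sketch).
    """
    if not selected:
        return 0.0
    total = sum(
        mutual_information(features[candidate], features[name])
        for name in selected
    )
    return total / len(selected)
```

A feature that duplicates an already selected one gets a high redundancy score, while a statistically independent feature scores zero, which is the behaviour a minimum-redundancy criterion penalises and rewards respectively.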

11.
POLYPHONET: An advanced social network extraction system from the Web   Total citations: 1, self-citations: 0, citations by others: 1
Social networks play important roles in the Semantic Web: knowledge management, information retrieval, ubiquitous computing, and so on. We propose a social network extraction system called POLYPHONET, which employs several advanced techniques to extract relations of persons, to detect groups of persons, and to obtain keywords for a person. Search engines, especially Google, are used to measure co-occurrence of information and obtain Web documents.

Several studies have used search engines to extract social networks from the Web, but our research advances the following points. First, we reduce the related methods to simple pseudocode using Google so that we can build up integrated systems. Second, we develop several new algorithms for social network mining, such as those to classify relations into categories, to make extraction scalable, and to obtain and utilize person-to-word relations. Third, every module is implemented in POLYPHONET, which has been used at four academic conferences, each with more than 500 participants; we give an overview of that system. Finally, a novel architecture called Iterative Social Network Mining is proposed. It utilizes simple modules using Google and is characterized by scalability and relate–identify processes: identification of each entity and extraction of relations are repeated to obtain a more precise social network.
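
The idea of measuring co-occurrence through search-engine hit counts can be sketched as follows. This is only an illustration under stated assumptions: `hit_count` is a hypothetical callable standing in for a search-engine query, the Jaccard and overlap coefficients are common choices rather than measures confirmed by the abstract, and the threshold is arbitrary.

```python
def jaccard(hits_x, hits_y, hits_xy):
    """Jaccard coefficient from hit counts of X, Y, and X together with Y."""
    union = hits_x + hits_y - hits_xy
    return hits_xy / union if union > 0 else 0.0


def overlap(hits_x, hits_y, hits_xy):
    """Overlap coefficient: joint hits relative to the rarer of the two names."""
    smaller = min(hits_x, hits_y)
    return hits_xy / smaller if smaller > 0 else 0.0


def build_edges(names, hit_count, threshold):
    """Return (name_a, name_b, score) for every pair scoring at least `threshold`.

    `hit_count` is a hypothetical function mapping a query string to a hit count.
    """
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            # The joint query format is an assumption made for this sketch.
            joint = hit_count(f'"{a}" "{b}"')
            score = overlap(hit_count(a), hit_count(b), joint)
            if score >= threshold:
                edges.append((a, b, score))
    return edges
```

In practice `hit_count` would wrap whatever search service is available; passing a dictionary lookup such as `counts.get` is enough to exercise the logic offline.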


12.
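A Survey of Techniques for Automatic Extraction of Text Information from Web Pages   Total citations: 2, self-citations: 0, citations by others: 2
Provides a fairly comprehensive survey of techniques for automatically extracting text information from Web pages. By analyzing the development of the three information extraction models and four classes of machine learning algorithms commonly used in this field, it describes the current mainstream techniques, compares the scope of application of the various methods, and finally discusses current hot topics and development trends in the field.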

13.
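Question answering over knowledge base (KBQA) is an important component of question answering systems. In recent years, with the successful application of representation learning techniques, typified by deep learning, in many fields, many researchers have begun to study KBQA techniques based on representation learning. The basic assumption is that KBQA can be treated as a semantic matching process. Representation learning is used to obtain semantic representations of the knowledge base and of the user's question, converting the entities and relations in the knowledge base and the question text into numerical vectors in a low-dimensional semantic space; on this basis, numerical computation directly matches the answer that is semantically most similar to the user's question. Judging from current results, KBQA systems based on representation learning already outperform traditional KBQA methods. This paper surveys research progress on representation-learning-based KBQA, including representative work on knowledge base representation learning and question (text) representation learning, and analyzes and discusses the difficulties and the open research problems.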
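
The matching step described above, comparing a question vector with candidate answer vectors in a shared space, might be sketched as below. The vectors here are hand-written placeholders, not learned embeddings, and cosine similarity is one common choice of similarity that the abstract does not specifically name; `cosine` and `best_answer` are names invented for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_answer(question_vec, candidates):
    """Return the candidate name whose vector is most similar to the question.

    `candidates` maps an answer name to its embedding (hypothetical layout).
    """
    return max(candidates, key=lambda name: cosine(question_vec, candidates[name]))
```

A real system would obtain both sides from trained encoders; the point of the sketch is only that, once everything lives in one vector space, answer selection reduces to a nearest-neighbour lookup.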

14.
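The updating of Web page information is a very important property of the Web. As with other network applications, as WWW content is updated ever faster, how to effectively track updates to specific websites and pages has increasingly become a topic of concern. This paper discusses ChangeSpider, an adaptive Web page information tracking system, and examines its architecture, key techniques, and related aspects. Experiments show that ChangeSpider can effectively track changes in Web page information and deliver the changed content to users in a timely manner.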
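
One simple way to track page changes adaptively is sketched below: each page's content is fingerprinted, and the revisit interval shrinks when a change is seen and grows when it is not. This is an assumption-laden illustration, not ChangeSpider's design, which the abstract does not describe; the class name `PageTracker`, the halving and doubling policy, and the interval bounds are all hypothetical.

```python
import hashlib


def fingerprint(text):
    """SHA-256 digest of the text after collapsing runs of whitespace."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class PageTracker:
    """Remembers one digest per URL and adapts a per-URL revisit interval."""

    def __init__(self, min_interval=1.0, max_interval=64.0):
        self.min_interval = min_interval
        self.max_interval = max_interval
        self.digests = {}
        self.intervals = {}

    def observe(self, url, content):
        """Record a fetch and return "new", "changed", or "unchanged"."""
        digest = fingerprint(content)
        previous = self.digests.get(url)
        interval = self.intervals.get(url, self.min_interval)
        if previous is None:
            status = "new"
        elif previous != digest:
            status = "changed"
            # Page is active: revisit sooner (hypothetical policy).
            interval = max(self.min_interval, interval / 2)
        else:
            status = "unchanged"
            # Page is quiet: back off (hypothetical policy).
            interval = min(self.max_interval, interval * 2)
        self.digests[url] = digest
        self.intervals[url] = interval
        return status
```

Fetching pages and reporting the changed portions to users are deliberately left out; a whole-page digest only says that something changed, not what.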

15.
In this paper we propose a new unsupervised dimensionality reduction algorithm that looks for a projection that optimally preserves the clustering data structure of the original space. Formally we attempt to find a projection that maximizes the mutual information between data points and clusters in the projected space. In order to compute the mutual information, we neither assume the data are given in terms of distributions nor impose any parametric model on the within-cluster distribution. Instead, we utilize a non-parametric estimation of the average cluster entropies and search for a linear projection and a clustering that maximizes the estimated mutual information between the projected data points and the clusters. The improved performance is demonstrated on both synthetic and real world examples.

16.
The purpose of this exploratory study was to investigate the influence of two individual characteristics of adolescents (Web experience and academic focus) on their perception of the Web, using off-line questionnaires (with a Likert response scale) constructed on the basis of a series of interviews. Questions concerned perceptions about the nature of information found on the Web, 'strategies' for accessing interesting Internet sites, and the reliability of different information resources (libraries, television, Web, etc.). Results lead to the assumption that adolescents with high Web experience became more critical, less confident and less enthusiastic than adolescents with low Web experience and that, in some dimensions, perceptions of literature students differ from those of science students. Even if some interesting results were obtained, further research is needed to explore users' perceptions related to individuals' characteristics and to determine the generalisability of the influences identified in this exploratory study.

17.
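A Web crawler is a program that downloads network resources and is the most important component of a search engine. Considering the wide variety of information on the Web, this paper studies a Web page information extraction technique based on a Web crawler, presents a corresponding design, and validates it; the results indicate that the design is feasible.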

18.
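Automatic Extraction of Multi-Record Information from Semi-Structured Web Pages   Total citations: 1, self-citations: 0, citations by others: 1
Zhu Ming (朱明), Wang Qingwei (王庆伟). 《计算机仿真》 (Computer Simulation), 2005, 22(12): 95-98
Accurately and automatically extracting the required information from multi-record Web pages is an important research topic in Web information processing. To address the sensitivity of existing methods to noise, this paper proposes discovering the record pattern from the maximum similarity of record subtrees, so that records are identified correctly even when the presentation patterns of records of the same type differ to some degree. On this basis, an automatic extraction system for multi-record Web pages was implemented; the system can automatically fetch result pages from multiple academic paper search sites and automatically extract the records in them. Experiments on common paper search sites show that the system is effective and accurate.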

19.
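Collocations are of great value in language learning, dictionary compilation, and natural language processing applications, and the automatic acquisition of collocations is one of the basic research areas of natural language computing. Using log-likelihood, chi-square, and mutual information as measures of association strength, collocation candidates are automatically acquired from the Penn Treebank corpus in order to compare the characteristics of the three measures. The experimental results show that, because the three measures follow different distributional assumptions and tendencies, the extracted collocations have different distributional characteristics.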
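
The three association measures named above can all be computed from a 2x2 contingency table of bigram counts. The sketch below uses the standard textbook formulas (pointwise mutual information, Pearson's chi-square, and the log-likelihood ratio G²); whether the paper uses exactly these variants is an assumption, and the function name and the counts in the usage note are made up for illustration.

```python
import math


def association_scores(o11, o12, o21, o22):
    """Association scores for a word pair (w1, w2) from a 2x2 contingency table.

    o11: count of w1 followed by w2     o12: w1 followed by another word
    o21: another word followed by w2    o22: neither w1 nor w2
    """
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22  # row totals
    c1, c2 = o11 + o21, o12 + o22  # column totals
    if 0 in (r1, r2, c1, c2):
        raise ValueError("degenerate table: a row or column total is zero")

    # Pointwise mutual information in bits: log2(observed / expected) for cell 11.
    pmi = math.log2(o11 * n / (r1 * c1)) if o11 > 0 else float("-inf")

    # Pearson's chi-square for a 2x2 table.
    chi2 = n * (o11 * o22 - o12 * o21) ** 2 / (r1 * r2 * c1 * c2)

    # Log-likelihood ratio G^2 = 2 * sum(O * ln(O / E)); empty cells contribute 0.
    llr = 0.0
    for obs, row, col in ((o11, r1, c1), (o12, r1, c2), (o21, r2, c1), (o22, r2, c2)):
        if obs > 0:
            llr += obs * math.log(obs * n / (row * col))
    llr *= 2.0

    return {"pmi": pmi, "chi2": chi2, "llr": llr}
```

Calling `association_scores(10, 10, 10, 10)` describes a pair that co-occurs exactly as often as chance predicts, so all three scores are zero. Ranking the same candidate list by each score in turn is one way to see how the measures favour different kinds of pairs.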

20.
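Ou Li (区力), Wang Xinxu (王新旭), Chen Min (陈敏). 《现代计算机》 (Modern Computer), 2007, (10): 110-112
Building on the theory and techniques of Web text mining, this paper presents an overall framework design for a Web text mining system. It combines Web text mining with intelligent documents and EIP technology, using the latter as front-end presentation tools for the former, which greatly enhances the applicability of Web text mining.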
