Found 20 similar documents; search time: 46 ms
1.
2.
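Div+CSS is widely used for Web page layout. Under this layout, many data records in a page are clustered at a single level as repeated structures. This paper proposes a Web data extraction method based on attribute tags: it builds a DOM tree annotated with attribute tags, mines repeated patterns by comparing the values of those tags, applies three rules to exclude interfering patterns, locates the data region, and then extracts the data records from that region.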
3.
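Web Information Extraction Based on Repeated Patterns  Cited by: 1 in total, self-citations: 1, citations by others: 1
Large numbers of data records in Web pages are often organized regularly using repeated HTML structures, which gives them a consistent presentation. Based on this characteristic, this paper presents a Web content extraction method based on repeated patterns. It uses a data structure called a suffix tree to analyze the repeated patterns contained in the page structure, and then extracts the corresponding data records from the instances of those patterns.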
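The idea of mining repeated structure from a page can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract uses a suffix tree, whereas this sketch swaps in a brute-force count of tag n-grams, which finds the same kind of repeated substrings far less efficiently. The names `TagCollector`, `tag_sequence`, and `repeated_patterns` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects start-tag names in document order; text and end tags are ignored."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Flatten an HTML string into its sequence of start-tag names."""
    collector = TagCollector()
    collector.feed(html)
    return collector.tags


def repeated_patterns(tags, min_len=2, min_count=2):
    """Return {pattern: count} for tag n-grams seen at least min_count times.

    Brute-force stand-in for a suffix tree (assumption of this sketch);
    occurrences are counted even when they overlap.
    """
    counts = Counter()
    for n in range(min_len, len(tags) // min_count + 1):
        for i in range(len(tags) - n + 1):
            counts[tuple(tags[i:i + n])] += 1
    return {p: c for p, c in counts.items() if c >= min_count}


html = ("<ul><li><a>x</a><b>1</b></li>"
        "<li><a>y</a><b>2</b></li>"
        "<li><a>z</a><b>3</b></li></ul>")
patterns = repeated_patterns(tag_sequence(html))
```

Here the pattern `('li', 'a', 'b')` occurs three times, once per record, which is the kind of signal a repeated-pattern extractor would use to locate data records.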
4.
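The large number of accessible data sources on the Web makes it much easier to obtain useful information. As a necessary step in integrating Web data sources, duplicate Web entities that appear in different sources under different representations must be identified accurately. Existing work on duplicate entity identification mainly operates between two data sources. Because the number of Web data sources is large, these methods cannot be applied to duplicate entity identification across multiple Web data sources. To address this problem, a Web duplicate entity identification method based on iterative training is proposed, which can identify duplicate entities across multiple Web data sources using a relatively small training sample. Extensive experiments on multiple Web data sources in two different domains, books and computer products, demonstrate the effectiveness of the proposed method.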
5.
6.
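A new unified method for recognizing proper nouns based on attribute tagging is proposed. The basic idea is as follows: according to the word-formation characteristics of proper nouns, an annotated corpus is re-annotated with word attributes set as the standard attributes; instances of proper-noun formation structures and formation contexts are extracted from this corpus; and a transformation-based error-driven method is used to extract applicable rules from the extracted instances. Attribute tagging is then performed on the basis of the extracted instances and rules. The method is a new shallow-parsing-based approach to proper noun recognition that combines transformation-based error-driven rule self-learning with instance-based learning. Experiments show that the method achieves a precision of 95.3% and a recall of 92.5% on the test sample set, making it an effective method for proper noun recognition.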
7.
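In the Deep Web, extracting the attributes of query interfaces is an indispensable step in Deep Web data integration. This paper translates the Chinese text of interface attributes into Hanyu Pinyin and English and uses an N-Gram method to extract the attributes of Chinese query interfaces. Experiments on query interfaces from multiple domains show that the method extracts query interface attributes effectively.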
8.
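A new unified method for recognizing proper nouns based on attribute tagging is proposed. The basic idea is as follows: according to the word-formation characteristics of proper nouns, an annotated corpus is re-annotated with word attributes set as the standard attributes; instances of proper-noun formation structures and formation contexts are extracted from this corpus; and a transformation-based error-driven method is used to extract applicable rules from the extracted instances. Attribute tagging is then performed on the basis of the extracted instances and rules. The method is a new shallow-parsing-based approach to proper noun recognition that combines transformation-based error-driven rule self-learning with instance-based learning. Experiments show that the method achieves a precision of 95.3% and a recall of 92.5% on the test sample set, making it an effective method for proper noun recognition.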
9.
10.
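Representing the semantics of a Web page according to the repeated information that appears in it can improve the accuracy of Web page classification. Based on this idea, this paper analyzes traditional repeated-pattern representations and proposes a repeated-pattern-based semantic representation of Web information. Building on a formal description of repeated patterns, the method extracts the repeated patterns in Web information to build a correlation matrix that expresses its semantic features, computes the weights of the repeated patterns with a γ similarity matching algorithm, and then classifies the Web information. Experiments show that the repeated-pattern-based semantic representation captures the topical features of Web page information well and can improve the accuracy of Web information classification.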
11.
Data incompleteness is one of the most important data quality problems in enterprise information systems. Most existing data imputation techniques only deduce approximate values for the incomplete attributes by means of specific data quality rules or mathematical methods. Unfortunately, an approximation may be far from the truth. Furthermore, when the observed data is inadequate, these techniques do not work well. The World Wide Web (WWW) has become the most important and most widely used information source, and several recent works have shown that Web data can improve the quality of databases. In this paper, we propose a Web-based relational data imputation framework, which tries to automatically retrieve real values from the WWW for the incomplete attributes. We try to take full advantage of the relations among different kinds of objects, based on the idea that things of the same kind must have the same kinds of relations with their relatives in a specific world. Our proposed techniques consist of two automatic query formulation algorithms and one graph-based candidate extraction model. Several evaluations are conducted on two high-quality real datasets and one poor-quality real dataset to demonstrate the effectiveness of our approaches.
12.
Conflict resolution in a knowledge-based system using multiple attribute decision-making  Cited by: 1 in total, self-citations: 0, citations by others: 1
The metarules useful in conflict resolution are often directly related to the multiple, conflicting, and non-commensurate objectives associated with the problem domain. However, in its current form, the use of metarules for conflict resolution has some drawbacks. Above all, the use of metarules in rule selection is not tailored to the current situation or the specific user; it is tailored to the domain expert, whose domain expertise and preferences were used to construct the knowledge-based system. In this paper, we present a new method for resolving rule conflicts in a knowledge-based system, using decision analysis techniques that explicitly incorporate a user's preference judgments about the rules. To this end, we treat the conflict resolution problem as a multiple attribute decision-making problem. Further, the proposed method allows the user's preference judgments to be specified in a user-friendly format rather than a rigid one, which reduces the burden of information specification. We have applied the proposed methodology to organizational information-oriented service and resource planning, in which multiple conflicting objectives must be considered.
13.
We propose a (meta‐)search engine, called SnakeT (SNippet Aggregation for Knowledge ExtracTion), which queries more than 18 commodity search engines and offers two complementary views on their returned results. One is the classical flat‐ranked list; the other is a hierarchical organization of these results into folders created on‐the‐fly at query time and labeled with intelligible sentences that capture the themes of the results contained in them. Users can browse this hierarchy with various goals: knowledge extraction, query refinement, and personalization of search results. In this novel form of personalization, the user is asked to interact with the hierarchy by selecting the folders whose labels (themes) best fit her query needs. SnakeT then personalizes the original ranked list on‐the‐fly by filtering out the results that do not belong to the selected folders. Consequently, this form of personalization is carried out by the users themselves and is thus fully adaptive, privacy preserving, scalable, and non‐intrusive for the underlying search engines. We have extensively tested SnakeT and compared it against the best available Web‐snippet clustering engines. SnakeT is efficient and effective, and shows that a mutual reinforcement relationship between ranking and Web‐snippet clustering does exist. In fact, the better the ranking of the underlying search engines, the more relevant the results from which SnakeT distills the hierarchy of labeled folders, and hence the more useful this hierarchy is to the user. Conversely, the more intelligible the folder hierarchy, the more effective the personalization offered by SnakeT on the ranking of the query results. Copyright © 2007 John Wiley & Sons, Ltd.
14.
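Attribute reduction in information systems is an important step in rough-set knowledge discovery. This work studies feature selection and the removal of redundant attributes in an information system. Starting from attribute importance, the new algorithm adopts an iterative feature selection criterion so that the selected set of feature attributes shrinks step by step, yielding a reduct of the information system. Experiments show that the method is feasible and effective.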
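The abstract does not give the algorithm itself, so the following Python sketch only illustrates the general idea of shrinking an attribute set by importance. It assumes a common textbook formulation: importance is measured by the rough-set dependency degree (the fraction of rows in the positive region), and attributes are dropped, least important first, whenever the dependency degree is preserved. The names `dependency` and `backward_reduct` and the toy table are hypothetical.

```python
from collections import defaultdict


def dependency(rows, attrs, decision):
    """Fraction of rows whose decision is fully determined by their values on attrs."""
    decisions = defaultdict(set)
    sizes = defaultdict(int)
    for row in rows:
        key = tuple(row[a] for a in attrs)
        decisions[key].add(row[decision])
        sizes[key] += 1
    consistent = sum(sizes[k] for k, ds in decisions.items() if len(ds) == 1)
    return consistent / len(rows)


def backward_reduct(rows, conditions, decision):
    """Shrink the attribute set while keeping the full-set dependency degree."""
    target = dependency(rows, conditions, decision)

    def significance(attr):
        # Drop in dependency when attr alone is removed from the full set.
        rest = [b for b in conditions if b != attr]
        return target - dependency(rows, rest, decision)

    kept = list(conditions)
    # Try to remove the least significant attributes first.
    for attr in sorted(conditions, key=significance):
        rest = [b for b in kept if b != attr]
        if dependency(rows, rest, decision) == target:
            kept = rest
    return kept


table = [
    {"a": 0, "b": 0, "c": 0, "d": "no"},
    {"a": 0, "b": 1, "c": 0, "d": "yes"},
    {"a": 1, "b": 0, "c": 1, "d": "no"},
    {"a": 1, "b": 1, "c": 1, "d": "yes"},
]
reduct = backward_reduct(table, ["a", "b", "c"], "d")
```

In this toy table the decision `d` depends only on `b`, so `a` and `c` are removed as redundant and the reduct is `['b']`.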
15.
16.
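Terms such as essential attribute, subsidiary attribute, and qualifying attribute are defined in a formal language, their properties and interrelationships are studied, and a new classification of attribute sets is given. Using a new operation on attribute subsets, the characteristics of essential attributes are discussed in particular, and the concept of kind in IDEF5 is formally revised on this basis. The study also finds that when there are several essential attributes, only one needs to be kept; every other essential attribute is both a reducible attribute and an unnecessary attribute. Because essential attributes are simple to identify, some attributes can be removed before an attribute reduction algorithm is applied. Finally, an example illustrates the advantage of this preprocessing.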
17.
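To address the slow search speed, large storage footprint, and poor deduplication of repeated pages that search engines exhibit on massive data, a deduplication method based on the Rabin fingerprint algorithm is proposed. It deduplicates not only the URLs found by the search but also the page content behind non-duplicate URLs, removing both identical and similar pages. Experiments show that the method effectively increases search speed, saves storage space, and improves search precision.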
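The two-stage deduplication described above (URLs first, then page content) can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's implementation: the fingerprint is the remainder of the input, read as a polynomial over GF(2), modulo the degree-64 polynomial x^64 + x^4 + x^3 + x + 1 chosen here for the example; only exact duplicates are handled, so the similar-page step mentioned in the abstract is not covered. The names `rabin_fingerprint` and `deduplicate` and the example URLs are hypothetical.

```python
# Modulus chosen for this sketch: x^64 + x^4 + x^3 + x + 1 over GF(2).
POLY = (1 << 64) | 0x1B


def rabin_fingerprint(data: bytes) -> int:
    """Remainder of the bit string `data`, read as a GF(2) polynomial, modulo POLY."""
    fp = 0
    for byte in data:
        for bit in range(7, -1, -1):
            # Shift in the next message bit, most significant bit first.
            fp = (fp << 1) | ((byte >> bit) & 1)
            # Reduce as soon as the degree reaches 64.
            if fp >> 64:
                fp ^= POLY
    return fp


def deduplicate(pages):
    """Keep the first page per URL fingerprint and per content fingerprint.

    `pages` is an iterable of (url, content) string pairs; returns kept URLs.
    """
    seen_urls, seen_content, kept = set(), set(), []
    for url, content in pages:
        url_fp = rabin_fingerprint(url.encode("utf-8"))
        if url_fp in seen_urls:
            continue  # same URL already processed
        seen_urls.add(url_fp)
        content_fp = rabin_fingerprint(content.encode("utf-8"))
        if content_fp in seen_content:
            continue  # different URL, identical content
        seen_content.add(content_fp)
        kept.append(url)
    return kept


crawled = [
    ("http://a.example/1", "hello"),
    ("http://a.example/1", "hello again"),
    ("http://a.example/2", "hello"),
    ("http://a.example/3", "world"),
]
unique_urls = deduplicate(crawled)
```

Storing 64-bit fingerprints instead of full URLs and page bodies is what saves space; the cost is a small probability that two different inputs collide on the same fingerprint.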
18.
Queries to Web search engines are usually short and ambiguous, so they convey too little about users' information needs to retrieve relevant Web pages effectively. To address this problem, most search engines implement query suggestion. However, existing methods (e.g. Google's ‘Search related to’ and Yahoo's ‘Also Try’) do not balance the trade-off between accuracy and computational complexity appropriately. In this paper, the recommended words are extracted from the search results of the query, which keeps query suggestion real-time. A scheme for ranking words based on semantic similarity presents a list of words as the query suggestion results, which ensures the accuracy of query suggestion. Moreover, the experimental results show that the proposed method significantly improves the quality of query suggestion over some popular Web search engines (e.g. Google and Yahoo). Finally, an offline experiment that compares the accuracy of snippets in capturing the number of words in a document is performed, which increases confidence in the proposed method. Copyright © 2010 John Wiley & Sons, Ltd.
19.
Attribute reduction is one of the key topics in the field of rough set theory. Based on this theory, the concept of ensemble attribute reduction has been proposed. Ensemble reduction divides the sample into multiple decision systems according to the decision categories and then computes a reduct for each separately. Although ensemble attribute reduction balances the requirements of the various decision classes, it increases the time needed for attribute reduction. To solve this problem, an attribute reduction acceleration method based on sequential three-way decisions is proposed. The specific steps are as follows: (1) The importance of each attribute in the decision system is calculated. (2) The attributes are divided into three groups according to their significance degree: the attributes with maximal significance degree are classified into the positive domain, the attributes with zero significance degree are classified into the negative domain, and the remaining attributes are classified into the boundary domain. (3) The significance degree of the attributes in the boundary domain is recalculated cyclically and the result is divided again, until the constraint is satisfied. Eight UCI data sets are selected to conduct experiments in the traditional attribute reduction and ensemble reduction environments, respectively. The experimental results show that, while preserving classification performance, the proposed method can effectively reduce the time of attribute reduction in both environments.
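Step (2) above, the split into positive, negative, and boundary domains, can be illustrated with a small Python sketch. It covers a single partition step only and takes the significance degrees as given; how they are computed and what stopping constraint is used are not specified in the abstract. The function name `three_way_partition` and the sample values are hypothetical.

```python
def three_way_partition(significance):
    """Split attributes into (positive, negative, boundary) lists.

    `significance` maps attribute name -> significance degree.
    Zero significance goes to the negative domain, maximal significance
    to the positive domain, and everything else to the boundary domain.
    """
    top = max(significance.values())
    positive, negative, boundary = [], [], []
    for attr, value in significance.items():
        if value == 0:
            negative.append(attr)
        elif value == top:
            positive.append(attr)
        else:
            boundary.append(attr)
    return positive, negative, boundary


positive, negative, boundary = three_way_partition(
    {"a": 0.5, "b": 0.0, "c": 0.2, "d": 0.5}
)
```

In a sequential scheme, only the attributes left in `boundary` would have their significance recomputed in the next round, which is where the time saving described in the abstract would come from.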