Similar literature
 Found 20 similar articles (search time: 152 ms)
1.
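Entity attribute value extraction is an important part of information extraction. To address the diversity of quantitative attribute types and the variability of their values, a system for automatically extracting quantitative attribute values based on meta-properties is designed and implemented. The system's structure, functional framework, and core techniques are discussed in detail, including selection of the text to extract from, extraction and evaluation of candidate values, and automatic verification of results. On quantitative attribute values for entities in 9 subclasses across five major categories of Baidu Baike, the system reaches an average precision of 71% and an average recall of 89%, higher than both a simple search-based method and the traditional lexical and sentence-pattern method. The method is suited to acquiring quantitative attribute values in open domains and readily obtains exact values for single-valued attributes.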

2.
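Relation patterns in unstructured free text are complex, and relation extraction performance is correspondingly low. To address this, the paper proposes using the LM algorithm, an optimization algorithm for BP neural networks, to extract entity-attribute relations of domain concepts from unstructured free text. The corpus is first preprocessed; a CRFs model then recognizes instances, attributes, and attribute values of domain concepts as entities; features are then extracted according to the characteristics of each relation type in the domain, a BP neural network model is constructed, and the LM algorithm extracts the corresponding relations. Compared with SVM, which is suited to binary classification, the optimized artificial neural network has strong self-learning ability and high recognition accuracy, and is better suited to multi-class problems. Several groups of experiments show that the method performs well on entity-attribute relation extraction for domain concepts, improving the F-score by 12.8%.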

3.
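Patent information extraction is the basis of patent analysis, and recognizing and extracting attributes and attribute values is its key problem. Little research in Chinese patent information extraction has addressed extracting attributes and attribute values simultaneously. Using Chinese patent abstracts as the experimental corpus and drawing on statistical learning, this paper proposes an extraction method based on conditional random fields. The method treats attributes and attribute values as named entities, trains a conditional random field model on the corpus to extract them, and then uses mined association rules to match attributes with attribute values. The experiments yield a precision, recall, and F-score of 80.8%, 81.2%, and 81.0% respectively, indicating that the method extracts attributes and attribute values simultaneously and efficiently. Building on the extraction results, the paper also analyzes patents and compares patents of the same kind, demonstrating the method's practical value.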

4.
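To address the application of Web information extraction (WIE) in the health domain, a WebHarvest-based method for extracting health-domain Web information is proposed. Extraction rules for health entities are designed by analyzing the structure of different health websites, and an algorithm based on WebHarvest automatically extracts health entities and their attributes. The extracted entities and attributes are checked for consistency and stored in a relational database; entity recognition is then applied, using the Ansj natural language processing method, to attribute values in the database that implicitly contain health entities, and the relationships between health entities are extracted. The technique reaches an average F-score of 99.9% in the health entity extraction experiment and 80.51% in the entity relationship extraction experiment. The results show that the health information extracted by the proposed technique is of high quality and reliability.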

5.
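A framework for constructing a pet knowledge graph is proposed. The schema (concept) layer is designed and built top-down, and the data layer is built by extracting knowledge from semi-structured and unstructured data. For entity extraction from unstructured data, a symptom named entity recognition method is proposed that combines conditional random fields (CRF) with a pet symptom dictionary. The method uses the symptom dictionary to recognize text and obtain semantic category information, and the CRF combines this semantic information to recognize and extract symptom entities. Experimental results demonstrate the effectiveness of the method. For knowledge representation, the property graph model supported by the OrientDB database is chosen. The knowledge graph is stored in the OrientDB graph database, and the constructed pet knowledge graph is illustrated with examples.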

6.
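Information extraction is one of the important tasks in natural language processing. To address the difficulty that the diversity, ambiguity, and structure of natural language cause for information extraction, a hierarchical lexical-semantic pattern method for extracting financial event information is proposed. First, a financial event representation model is defined; then a deep-learning-based word vector method is applied to generate a dictionary of synonymous concepts automatically; finally, hierarchical lexical-semantic rule patterns driven by a finite state machine are used to extract information on each type of financial event automatically. Experimental results show that the proposed method accurately extracts financial event information from financial news text; over 26 types of financial events, it reaches a micro-averaged precision of 93.9%, a micro-averaged recall of 86.9%, and a micro-averaged F1 of 90.3%.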
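
The three figures reported in this abstract can be checked against each other, since micro-averaged F1 is the harmonic mean of micro-averaged precision and recall. The following is a minimal sketch of that check; the function name `f1_score` is an illustrative choice and is not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the abstract, converted from percentages to fractions.
micro_f1 = f1_score(0.939, 0.869)
print(f"{micro_f1:.1%}")  # prints 90.3%
```

The computed value agrees with the reported micro-averaged F1 to one decimal place.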

7.
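Acquiring the attribute information of concepts helps to build relations between concepts and thereby improves applications such as concept-based information retrieval. This work studies how to acquire attribute information for definition entries in machine-readable dictionaries and implements a corresponding system, DAE (Dictionary Attribute Extractor). The system follows the bootstrapping idea and iterates between template extraction and tuple extraction. For template acquisition, a method based on multiple sequence alignment from bioinformatics is introduced; for template generalization, lexical semantic similarity computation and synonym expansion are introduced to improve template coverage. In the experiments, the system extracted three attributes, "function", "color", and "composition", with good results.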

8.
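Instance expansion and attribute value augmentation are important research topics in Web extraction and integration. Web data lists and instances are modeled as a bipartite graph; based on the quality scores of the expanded instances, the expansion set is updated iteratively until its quality score is maximal and the set no longer changes, which achieves instance expansion. To complete the attribute information of the expanded instances, structured numeric or discrete attributes are extracted, and an attribute value augmentation method based on integer linear programming is proposed. Experiments show that, compared with previous methods, this method handles Web pages containing noisy data better and improves extraction precision and recall.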

9.
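In research on extending attribute knowledge bases, existing open information extraction methods rely heavily on deep syntactic analysis or effective dictionary rules; they perform poorly on short texts and have low recall. This paper proposes an iterative, self-incrementing algorithm for extending attribute knowledge bases based on a word co-occurrence graph. It uses the co-occurrence of attributes and attribute values to extend the knowledge base and designs a graph-based community detection algorithm to find the core nodes of communities. Finally, a model based on convolutional neural networks is designed to denoise the extraction results. Experiments on two real datasets show that the method outperforms existing methods in extraction quality.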

10.
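Semantic information plays an important role in extracting semantic relations between named entities. Taking 《同义词词林》 (Tongyici Cilin) as an example, this paper systematically studies how effective lexical semantic information is for tree-kernel-based Chinese semantic relation extraction, examines how different levels of semantic information and phenomena such as polysemy affect relation extraction, and analyzes in detail the redundancy between lexical semantic information and entity type information. Relation extraction experiments on the ACE2005 Chinese corpus show that semantic information significantly improves extraction performance when entity types are unknown; when entity types are known, it still clearly improves extraction performance for some relation types. This indicates that Cilin semantic information and entity type information are complementary to some degree in Chinese semantic relation extraction.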

11.
We propose a novel classification learning method called customized support pattern learner (CSPL). Given an instance to be classified, CSPL explores and discovers support patterns (SPs), which are essentially attribute value subsets of the instance to be classified. The final prediction of the class label is performed by combining some statistics of the discovered useful SPs. One advantage of the CSPL method is that it can explore a richer hypothesis space and discover useful classification patterns involving attribute values with almost indistinguishable information gain. The customized learning characteristic also allows the target class to vary for different instances to be classified, and it makes maintaining and updating training instances extremely easy. We have evaluated our method with real-world problems and benchmark data sets. The results demonstrate that CSPL can achieve good performance and high reliability.

12.
Instance-based attribute identification in database integration   Total citations: 3 (self-citations: 0, by others: 3)
Most research on attribute identification in database integration has focused on integrating attributes using schema and summary information derived from the attribute values. No research has attempted to fully explore the use of attribute values to perform attribute identification. We propose an attribute identification method that employs schema and summary instance information as well as properties of attributes derived from their instances. Unlike other attribute identification methods that match only single attributes, our method matches attribute groups for integration. Because our attribute identification method fully explores data instances, it can identify corresponding attributes to be integrated even when schema information is misleading. Three experiments were performed to validate our attribute identification method. In the first experiment, the heuristic rules derived for attribute classification were evaluated on 119 attributes from nine public domain data sets. The second was a controlled experiment validating the robustness of the proposed attribute identification method by introducing erroneous data. The third experiment evaluated the proposed attribute identification method on five data sets extracted from online music stores. The results demonstrated the viability of the proposed method. Received: 30 August 2001. Accepted: 31 August 2002. Published online: 31 July 2003. Edited by L. Raschid.

13.
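Zeng Xin, Li Xiaowei, Yang Jian. 《计算机科学》 (Computer Science), 2018, 45(Z6): 482-486, 464
In practical applications, spatial features carry not only spatial information; their instances also come with attribute information, which plays a major role in knowledge discovery and scientific decision-making. When existing co-location pattern mining algorithms compute the proximity distance between two instances of different features, they do not consider the weight that the values of different attributes take in that distance. Some attributes are therefore weighted too heavily, which affects the co-location mining results. This paper normalizes attribute values so that all attributes have equal weight and proposes DNRA, a join-based data normalization algorithm. It also studies in depth the difficulty of choosing a distance threshold and derives the range of the distance threshold in DNRA to help users choose an appropriate value. Finally, extensive experiments analyze and compare the performance of DNRA.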
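
The effect of normalizing attribute values before computing a distance can be illustrated with a short sketch. The abstract does not say which normalization DNRA uses, so min-max scaling is an assumption here, and the function names and sample values are hypothetical.

```python
import math


def min_max_normalize(rows):
    """Rescale each attribute column to [0, 1] so no attribute dominates the distance.

    Min-max scaling is an assumed choice; the abstract does not specify one.
    """
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical instances with one large-range and one small-range attribute.
instances = [[1000.0, 0.1], [3000.0, 0.9], [2000.0, 0.5]]
scaled = min_max_normalize(instances)

raw_distance = euclidean(instances[0], instances[1])   # dominated by the first attribute
scaled_distance = euclidean(scaled[0], scaled[1])      # both attributes contribute equally
```

On the raw values the first attribute accounts for almost all of the distance; after scaling, each attribute contributes the same amount.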

14.
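To reduce the influence of instances on attribute selection, this paper proposes a PSO-based attribute selection method. The method uses the PSO algorithm to find the optimal entropy value of the instance population, obtains the corresponding attribute thresholds, uses the thresholds to determine attribute priorities, and finally selects attributes by priority. In the experiments, the algorithm's performance is verified by determining the priorities of concept attributes in an ontology. The results show that the method reduces dependence on instances and also lowers the computational cost.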

15.
This study proposes a method, designated as the GRP-index method, for the classification of continuous value datasets in which the instances do not provide any class information and may be imprecise and uncertain. The proposed method discretizes the values of the individual attributes within the dataset and achieves both the optimal number of clusters and the optimal classification accuracy. The proposed method consists of a genetic algorithm (GA) and an FRP-index method. In the FRP-index method, the conditional and decision attribute values of the instances in the dataset are fuzzified and discretized using the Fuzzy C-means (FCM) method in accordance with the cluster vectors given by the GA specifying the number of clusters per attribute. Rough set (RS) theory is then applied to determine the lower and upper approximate sets associated with each cluster of the decision attribute. The accuracy of approximation of each cluster of the decision attribute is then computed as the cardinality ratio of the lower approximate sets to the upper approximate sets. Finally, the centroids of the lower approximate sets associated with each cluster of the decision attribute are determined by computing the mean conditional and decision attribute values of all the instances within the corresponding sets. The cluster centroids and accuracy of approximation are then processed by a modified form of the PBMF-index function, designated as the RP-index function, in order to determine the optimality of the discretization/classification results. In the event that the termination criteria are not satisfied, the GA modifies the initial population of cluster vectors and the FCM, RS and RP-index function procedures are repeated. The entire process is repeated iteratively until the termination criteria are satisfied. The maximum value of the RP cluster validity index is then identified, and the corresponding cluster vector is taken as the optimal classification result. 
The validity of the proposed approach is confirmed by cross validation, and by comparing the classification results obtained for a typical stock market dataset with those obtained by non-supervised and pseudo-supervised classification methods. The results show that the proposed GRP-index method not only has a better discretization performance than the considered methods, but also achieves a better accuracy of approximation, and therefore provides a more reliable basis for the extraction of decision-making rules.
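
The accuracy of approximation mentioned in this abstract, the cardinality ratio of the lower to the upper approximate set, can be sketched for the crisp case. This is a simplified illustration only: it assumes the attribute values are already discretized and leaves out the fuzzy C-means step, the genetic algorithm, and the RP-index function. The function name and sample data are hypothetical.

```python
from collections import defaultdict


def approximation_accuracy(condition, decision, target):
    """Return (lower, upper, accuracy) for the objects whose decision equals `target`.

    condition: dict mapping object -> tuple of discretized condition attribute values
    decision:  dict mapping object -> decision cluster label
    """
    # Objects with identical condition values are indiscernible from one another.
    blocks = defaultdict(set)
    for obj, values in condition.items():
        blocks[values].add(obj)

    target_set = {obj for obj, label in decision.items() if label == target}
    lower, upper = set(), set()
    for block in blocks.values():
        if block <= target_set:   # block lies entirely inside the target set
            lower |= block
        if block & target_set:    # block overlaps the target set
            upper |= block
    accuracy = len(lower) / len(upper) if upper else 0.0
    return lower, upper, accuracy


# Hypothetical discretized data: "a" and "b" are indiscernible but disagree on the decision.
condition = {"a": (0, 1), "b": (0, 1), "c": (1, 0), "d": (1, 1)}
decision = {"a": "up", "b": "down", "c": "up", "d": "down"}
lower, upper, accuracy = approximation_accuracy(condition, decision, "up")
```

Here only "c" certainly belongs to the "up" cluster, while "a", "b", and "c" possibly belong to it, so the accuracy of approximation is 1/3.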

16.
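Attribute neural network model   Total citations: 7 (self-citations: 0, by others: 7)
Based on an analysis of general neural network models and attribute theory, a new attribute neural network model is proposed. It stores data information in attribute neurons and connection functions, which makes the learning process simple and deterministic and guarantees convergence. The graph properties of attribute neural networks are discussed, showing that graph-theoretic methods can be used to study their role as classifiers. The equivalence between attribute neural networks and the attribute coordinate system is also proved, providing an operable numerical derivation method for attribute reasoning.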

17.
Work in inductive learning has mostly been concentrated on classifying. However, there are many applications in which it is desirable to order rather than to classify instances. For modelling ordering problems, we generalize the notion of information tables to ordered information tables by adding order relations in attribute values. Then we propose a data analysis model by analyzing the dependency of attributes to describe the properties of ordered information tables. The problem of mining ordering rules is formulated as finding association between orderings of attribute values and the overall ordering of objects. An ordering rule may state that "if the value of an object x on an attribute a is ordered ahead of the value of another object y on the same attribute, then x is ordered ahead of y". For mining ordering rules, we first transform an ordered information table into a binary information table, and then apply any standard machine learning and data mining algorithms. As an illustration, we analyze in detail Maclean's universities ranking for the year 2000.
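
The transformation from an ordered information table to a binary information table can be sketched as follows. This is one plausible reading of the abstract rather than the authors' exact construction: each ordered pair of distinct objects becomes a row, and each attribute records whether the first object is ordered ahead of the second. The function name, attribute names, and sample values are hypothetical.

```python
from itertools import permutations


def to_binary_table(table, ahead):
    """Build a binary table with one row per ordered pair of distinct objects.

    table: dict mapping object -> dict of attribute -> value
    ahead: dict mapping attribute -> function(v1, v2) that returns True
           when v1 is ordered ahead of v2 on that attribute
    """
    binary = {}
    for x, y in permutations(table, 2):
        binary[(x, y)] = {
            attr: int(ahead[attr](table[x][attr], table[y][attr]))
            for attr in ahead
        }
    return binary


# Hypothetical ranking data: higher reputation is better, a smaller rank is ahead.
table = {
    "u1": {"reputation": 5, "rank": 1},
    "u2": {"reputation": 3, "rank": 2},
}
ahead = {
    "reputation": lambda v1, v2: v1 > v2,
    "rank": lambda v1, v2: v1 < v2,
}
binary = to_binary_table(table, ahead)
```

A standard rule learner applied to the binary rows could then look for attributes whose 1-entries coincide with the 1-entries of the overall ordering, which corresponds to an ordering rule of the form quoted in the abstract.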

18.
Crowdsourcing provides an effective and low-cost way to collect labels from crowd workers. Due to the lack of professional knowledge, the quality of crowdsourced labels is relatively low. A common approach to addressing this issue is to collect multiple labels for each instance from different crowd workers and then a label integration method is used to infer its true label. However, to our knowledge, almost all existing label integration methods merely make use of the original attribute information and do not pay attention to the quality of the multiple noisy label set of each instance. To solve these issues, this paper proposes a novel three-stage label integration method called attribute augmentation-based label integration (AALI). In the first stage, we design an attribute augmentation method to enrich the original attribute space. In the second stage, we develop a filter to single out reliable instances with high-quality multiple noisy label sets. In the third stage, we use majority voting to initialize integrated labels of reliable instances and then use cross-validation to build multiple component classifiers on reliable instances to predict all instances. Experimental results on simulated and real-world crowdsourced datasets demonstrate that AALI outperforms all the other state-of-the-art competitors.
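
Majority voting, which the abstract uses to initialize integrated labels in the third stage, can be sketched in a few lines. The sketch covers only that initialization step; attribute augmentation, the reliability filter, and the cross-validated component classifiers are not shown. The function name, the tie-breaking rule, and the sample labels are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(noisy_labels):
    """Return the most frequent label; ties go to the label encountered first."""
    return Counter(noisy_labels).most_common(1)[0][0]


# Hypothetical multiple noisy label sets, one list of worker labels per instance.
crowd_labels = {
    "x1": ["cat", "cat", "dog"],
    "x2": ["dog", "cat", "dog"],
    "x3": ["dog", "cat"],  # tie, resolved in favour of the first label seen
}
integrated = {instance: majority_vote(labels) for instance, labels in crowd_labels.items()}
```

The tie on "x3" shows why the quality of each noisy label set matters: with few or conflicting labels, the vote alone gives an arbitrary answer.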

19.
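Zhou Liang, Yan Li. 《计算机应用研究》 (Application Research of Computers), 2010, 27(8): 2899-2901
To overcome the limited effectiveness and scalability of existing decision tree classification algorithms on large datasets, a new decision tree algorithm based on rough set theory is proposed. First, a prototype abstraction method based on representative instances is proposed; it extracts representative instances from the original dataset to form an abstract prototype, reducing the number of instances and removing irrelevant attributes so that the algorithm can handle large datasets. Then the concept of attribute classification value is introduced and used as the heuristic measure for attribute selection; the measure describes how much an attribute contributes to classification and emphasizes the relations among attributes and between instances and classes. Experiments show that the new algorithm generates smaller decision trees than other algorithms and significantly improves accuracy, especially on large datasets.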

20.
In this article, a filter feature weighting technique for attribute selection in classification problems is proposed (LIA). It has two main characteristics. First, unlike other feature weighting methods, it is able to consider attribute interactions in the weighting process, rather than only evaluating single features. Attribute subsets are evaluated by projecting instances into a grid defined by attributes in the subset. Then, the joint relevance of the subset is computed by measuring the information present in the cells of the grid. The final weight for each attribute is computed by taking into account its performance in each of the grids in which it participates. Second, many real problems contain low signal-to-noise ratios, due, for instance, to high noise levels, class overlap, class imbalance, or small training samples. LIA computes reliable local information for each of the cells by estimating the number of target class instances not due to chance, given a confidence value. In order to study its properties, LIA has been evaluated with a collection of 18 real datasets and compared to two feature weighting methods (Chi-Squared and ReliefF) and a subset feature selection algorithm (CFS). Results show that the method is significantly better in many cases, and never significantly worse. LIA has also been tested with different grid dimensions (1, 2, and 3). The method works best when evaluating attribute subsets larger than 1, hence showing the usefulness of considering attribute interactions.
