首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 203 毫秒
1.
基于无秩树自动机的信息抽取技术研究   总被引:1,自引:0,他引:1  
针对目前基于网页结构的信息抽取方法的缺陷,提出了一种基于无秩树自动机的信息抽取技术,其核心思想是通过将结构化(半结构化)文档转换成无秩树,然后利用(k,l)-contextual树构造样本自动机,依据树自动机接收和拒绝状态来对网页进行数据的抽取.该方法充分利用结构,依托树自动机将传统的以单一结构途径的信息抽取方法与文法推理原则相结合,得到信息抽取规则.实验结果表明,该方法与同类抽取方法相比在准确率、召回率以及抽取所需时间上均有所提高.  相似文献   

2.
针对现有基于网页结构信息抽取技术的不足,提出一种基于确定性树自动机DTA(deterministic tree automaton)的信息抽取技术。其核心思想是通过将HTML文档转换成二叉树的形式,然后依据树自动机对待抽取网页的接收和拒绝状态进行数据的抽取。该方法充分利用了HTML文档的树状结构。依托树自动机将传统的以单一结构途径的信息抽取与文法推理两者相结合。经实验证明与同类抽取方法相比在准确率、召回率以及抽取所需时间上均有所提高。  相似文献   

3.
网页视图的重构与转化   总被引:1,自引:0,他引:1  
兰东俊  朱精南 《计算机应用》2003,23(Z2):158-159
文中提出一种用于描述网页结构化信息的数据模型--区域树模型和一种便于计算机处理,表示网页信息中间数据结构--标记树.讨论了从网页文本生成网页的标记树和区域树的过程和方法,以及使用网页结构化信息对网页视图进行重构和转化.网页版面重构解决了PAD,SMART PHONE等智能终端上网浏览Web信息中遇到的一系列的问题.  相似文献   

4.
Web信息抽取中需要对目标网站的网页进行聚类分析,以检测并生成信息抽取所需的模板。传统的基于DOM树编辑距离的网页聚类算法不适合文档对象模型(DOM)树结构复杂的动态模板网页,提出了一种基于局部标签树匹配的改进网页聚类算法,利用标签树中模板节点和非模板节点的层次差异性,根据节点对布局影响的大小赋予节点不同的匹配权值,使用局部树匹配完成对网页结构相似性的有效计算。实验结果表明,改进的算法较传统的基于DOM树编辑距离的网页聚类算法,在对采用模板生成的动态网页进行聚类分析时具有更高的准确率,且时间复杂度低。  相似文献   

5.
针对金融类公告中的结构化数据难以被高效快速提取的问题,提出一种基于文档结构与Bi-LSTM-CRF网络模型的信息抽取方法。自定义一种文档结构树生成算法,利用规则从文档结构树中抽取所需节点信息;构建基于信息句触发词的局部句子规则,抽取包含结构化字段信息的信息句;将字段的结构化信息抽取看作序列标注问题,分词时加入领域知识词典,构建基于Bi-LSTM-CRF的神经网络模型进行字段信息识别。实验结果表明,该信息抽取方法可以满足多类型公告的结构化信息提取,最终的信息句与字段信息抽取的平均F1值均可达到91%以上,验证了该方法在产品业务中的可行性和实用性。  相似文献   

6.
针对现有网上论坛信息抽取的不足,提出一种基于后缀树的论坛信息抽取方法.将标准化后的HTML文档转换为后缀树,查找出其中的重复模式并产生分装器,将分装器转换为NFA(非确定型有穷自动机)达到抽取论坛信息的目的.该方法运用构造后缀树的技术来抽取论坛信息,较好地解决了现有的抽取方法准确性较差、通用性不强的问题.实验结果表明,该方法具有较高的准确性和实用性.  相似文献   

7.
文档有效性检验是XML领域的一个基本问题.Active XML(AXML)文档在XML文档中引入Web服务,传统用于解决XML文档有效性的检验方法并不适用于AXML文档,为文档有效性检验提出了新的挑战.研究了AXML文档有效性检验问题,在原始树自动机的基础上,定义了AXML模式树自动机-ASTA机,该树自动机能够有效地描述满足AXML模式约束的文档集合.基于ASTA机,提出了一种多项式时间的AXML文档有效性检验算法.实验数据表明,基于提出的算法能够有效的完成对AXML文档的有效性检验.  相似文献   

8.
基于多层模式的多记录网页信息抽取方法   总被引:3,自引:0,他引:3  
为有效解决网页信息抽取所需知识的获取问题,提出一种基于多层模式的网信息抽取方法,(简称HPIE方法)。将网页信息抽取知识分为若干层,由抽象到具体逐层描述信息识别模式知识。HPIE方法能够利用各抽取对象之间存在的相互联系,以及抽取过程与结构所表成的新学习样本,不断完善多层模式的知识内容,并帮助最终从多个信息内容类似但其描述格式各异的HTML网页中,抽取出所需的多记录信息内容,有关多个(美国大学教员)论文目录网页的抽取实验结果表明,HPIE方法具有较强的网而信息自适应抽取能力。  相似文献   

9.
随着CSS+DIV布局方式逐渐成为网页结构布局的主流,对此类网页进行高效的主题信息抽取已成为专业搜索引擎的迫切任务之一。提出一种基于DIV标签树的网页主题信息抽取方法,首先根据DIV标签把HTML文档解析成DIV森林,然后过滤掉DIV标签树中的噪声结点并且建立STU-DIV模型树,最后通过主题相关度分析和剪枝算法,剪掉与主题信息无关的DIV标签树。通过对多个新闻网站的网页进行分析处理,实验证明此方法能够有效地抽取新闻网页的主题信息。  相似文献   

10.
李涛  李延增  孙伟  赵钢 《计算机应用研究》2010,27(10):3829-3833
针对当前产品结构树生成和更新中存在的数据重复录入问题,给出了产品结构化信息树的定义,能够表达产品任意零部件的产品结构及其关联文档数据,采取文档版本变化驱动产品结构树的版本变化的方法,建立了产品结构化信息树版本模型,提出了一套基于控制锁的版本控制规则,能够对产品结构化信息树的版本变化进行控制,避免了版本冗余。基于版本控制规则,论述了任意零部件文档签出或修订前后产品结构化信息树版本变化的求解方法。按照版本的不同,将项目所属文档按照产品结构树的版本分文件夹地保存在物理硬盘上,便于物理文档的维护和备份。最后通过企  相似文献   

11.
We describe an algorithm that allows the incremental addition or removal of unranked ordered trees to a minimal frontier-to-root deterministic finite-state tree automaton (DTA). The algorithm takes a tree t and a minimal DTA A as input; it outputs a minimal DTA A′ which accepts the language L(A) accepted by A incremented (or decremented) with the tree t. The algorithm can be used to efficiently maintain dictionaries which store large collections of trees or tree fragments.  相似文献   

12.
Regular (tree) model checking (RMC) is a promising generic method for formal verification of infinite-state systems. It encodes configurations of systems as words or trees over a suitable alphabet, possibly infinite sets of configurations as finite word or tree automata, and operations of the systems being examined as finite word or tree transducers. The reachability set is then computed by a repeated application of the transducers on the automata representing the currently known set of reachable configurations. In order to facilitate termination of RMC, various acceleration schemas have been proposed. One of them is a combination of RMC with the abstract-check-refine paradigm yielding the so-called abstract regular model checking (ARMC). ARMC has originally been proposed for word automata and transducers only and thus for dealing with systems with linear (or easily linearisable) structure. In this paper, we propose a generalisation of ARMC to the case of dealing with trees which arise naturally in a lot of modelling and verification contexts. In particular, we first propose abstractions of tree automata based on collapsing their states having an equal language of trees up to some bounded height. Then, we propose an abstraction based on collapsing states having a non-empty intersection (and thus “satisfying”) the same bottom-up tree “predicate” languages. Finally, we show on several examples that the methods we propose give us very encouraging verification results.  相似文献   

13.
Caterpillar expressions have been introduced by Brüggemann-Klein and Wood for applications in markup languages. Caterpillar expressions provide a convenient formalism for specifying the operation of tree-walking automata on unranked trees. Here we give a formal definition of determinism of caterpillar expressions that is based on the language of instruction sequences defined by the expression. We show that determinism of caterpillar expressions can be decided in polynomial time.  相似文献   

14.
There are many decision problems in automata theory (including membership, emptiness, inclusion and universality problems) that are NP-hard for some classes of tree automata (TA). The study of their parameterized complexity allows us to find new bounds of their nonpolynomial time algorithmic behaviours. We present results of such a study for classical TA, rigid tree automata, TA with global equality and disequality and t-DAG automata. As parameters we consider the number of states, the cardinality of the signature, the size of the term or the t-dag and the size of the automaton.  相似文献   

15.
A method for inferring of tree automata from sample set of trees is presented. The procedure, which is based on the concept ofk-follower of a tree with respect to the sample tree set, produces a tree automaton capable of accepting all the sample trees as well as other trees similar in structure. The behavior of the inferred tree automaton for varying values of parameterk is also discussed.This work was supported in part by a Scientific Research Grant-In-Aid (Grant No. 57460129) from the Ministry of Education, Science and Culture, Japan.  相似文献   

16.
We introduce the class of rigid tree automata (RTA), an extension of standard bottom-up automata on ranked trees with distinguished states called rigid. Rigid states define a restriction on the computation of RTA on trees: RTA can test for equality in subtrees reaching the same rigid state. RTA are able to perform local and global tests of equality between subtrees, non-linear tree pattern matching, and some inequality and disequality tests as well. Properties like determinism, pumping lemma, Boolean closure, and several decision problems are studied in detail. In particular, the emptiness problem is shown decidable in linear time for RTA whereas membership of a given tree to the language of a given RTA is NP-complete. Our main result is the decidability of whether a given tree belongs to the rewrite closure of an RTA language under a restricted family of term rewriting systems, whereas this closure is not an RTA language. This result, one of the first on rewrite closure of languages of tree automata with constraints, is enabling the extension of model checking procedures based on finite tree automata techniques, in particular for the verification of communicating processes with several local non-rewritable memories, like security protocols. Finally, a comparison of RTA with several classes of tree automata with local and global equality tests, with dag automata and Horn clause formalisms is also provided.  相似文献   

17.
分类回归树多吸引子细胞自动机分类方法及过拟合研究   总被引:1,自引:0,他引:1  
基于多吸引子细胞自动机的分类方法多是二分类算法,难以克服过度拟合问题,在生成多吸引子细胞自动机时如何有效地处理多分类及过度拟合问题还缺乏可行的方法.从细胞空间角度对模式空间进行分割是一种均匀分割,难以适应空间非均匀分割的需要.将CART算法同多吸引子细胞自动机相结合构造树型结构的分类器,以解决空间的非均匀分割及过度拟合问题,并基于粒子群优化方法提出树节点的最优多吸引子细胞自动机特征矩阵的构造方法.基于该方法构造的多吸引子细胞自动机分类器能够以较少的伪穷举域比特数获得好的分类性能,减少了分类器中的空盆数量,在保证分类正确率的同时改善了过拟合问题,缩短了分类时间.实验分析证明了所提出方法的可行性和有效性.  相似文献   

18.
Automata for unranked trees form a foundation for XML Schemas, querying and pattern languages. We study the problem of efficiently minimizing such automata. First, we study unranked tree automata that are standard in database theory, assuming bottom-up determinism and that horizontal recursion is represented by deterministic finite automata. We show that minimal automata in that class are not unique and that minimization is np-complete. Second, we study more recent automata classes that do allow for polynomial time minimization. Among those, we show that bottom-up deterministic stepwise tree automata yield the most succinct representations. Third, we investigate abstractions of XML schema languages. In particular, we show that the class of one-pass preorder typeable schemas allows for polynomial time minimization and unique minimal models.  相似文献   

19.
如何在XML数据流上高效地执行XPath查询,是XML数据流管理的关键问题。DTD结构信息对提高XML查询效率有很大帮助,已有的大部分算法没有利用这一资源。提出了一种使用DTD进行XML数据流查询处理的方法,具有以下特征:利用树自动机表示XPath;通过XPath树自动机与DTD树匹配,预先标识不匹配查询结构的DTD节点;给出一种利用DTD的XML流索引方法DBXSI;执行查询时,根据流索引信息直接跳过某些与查询不匹配的节点及子树。实验结果表明:该方法可有效支持Xpath查询,效率优于传统算法。  相似文献   

20.
虞蕾  陈火旺 《软件学报》2010,21(1):34-46
PSL(property specification language)是一种用于描述并行系统的属性规约语言,包括线性时序逻辑FL(foundation language)和分支时序逻辑OBE(optional branching extension)两部分.由于OBE就是CTL(computation tree logic),并且具有时钟声明的公式很容易改写成非时钟公式,因此重点研究了非时钟FL逻辑.为便于进行模型检验,每个FL公式必须转化成为一种可验证形式,通常是自动机(非确定自动机).构造非确定自动机的过程主要是通过中间构建交换自动机来实现.详细给出了由非时钟FL构造双向交换自动机的构造规则.构造规则的核心逻辑不仅仅局限于是在LTL(linear temporal logic)基础上的正规表达式,而且全面而充分地考虑了各种FL操作算子的可能性.并且给出了将双向交换自动机转化为非确定自动机的一种方法.最后,编写了将PSL转化为上述自动机的实现工具.FL双向交换自动机的构造规则计算复杂度仅是FL公式长度的线性表达式,验证了构造规则的正确性.在此基础上,证明了双向交换自动机与其转化的等价的非确定自动机接受的语言相同.上述工作对解决复杂并行系统建模和模型验证问题具有重要的理论意义和应用价值.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号