
Information Integration Based on Structural Analysis and Entity Recognition

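Cite this article:	Su Zhihua, Yang Dongqing, Tang Shiwei, Wang Tengjiao. Information Integration Based on Structural Analysis and Entity Recognition[J]. Journal of Computer Research and Development, 2004, 41(10): 1823-1828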

Authors:	Su Zhihua, Yang Dongqing, Tang Shiwei, Wang Tengjiao

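Affiliations:	Department of Computer Science and Technology, Peking University, Beijing 100871; Department of Computer Science and Technology, Peking University, Beijing 100871; State Key Laboratory of Visual and Auditory Information Processing, Peking University, Beijing 100871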

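Funding:	National "973" Key Basic Research Development Program of China (G1999032705); National "863" High Technology Research and Development Program of China, major special project on database management systems and their applications (2002AA4Z3440)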

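Abstract:	To handle massive volumes of Web data, this paper proposes a Web information extraction and integration method based on document structure analysis and entity recognition. It uses the strong data description capabilities of XML to organize the integrated Web document content flexibly. The method first converts semi-structured HTML documents into XML documents with a schema structure, then applies entity recognition techniques to extract well-formatted data from the different topic regions, and finally integrates the resulting information of multiple data types into a database to support further analysis and querying. Experimental results demonstrate the practicality and effectiveness of the method.
Keywords:	information extraction; information integration; XML; Wrapper; entity recognition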
Information Integration Based on Structural Analysis and Entity Recognition

SU Zhi Hua, YANG Dong Qing, TANG Shi Wei, and WANG Teng Jiao. Information Integration Based on Structural Analysis and Entity Recognition[J]. Journal of Computer Research and Development, 2004, 41(10): 1823-1828

Authors:	SU Zhi Hua, YANG Dong Qing, TANG Shi Wei, and WANG Teng Jiao

Affiliation:	SU Zhi Hua (1), YANG Dong Qing (1), TANG Shi Wei (1, 2), and WANG Teng Jiao (1)

Abstract:	Web information is growing rapidly with the expansion of the Internet. This paper proposes a Web information extraction and integration method based on structural analysis and entity extraction. First, it uses XML technology to convert semi-structured HTML documents into formal XML documents with a schema. Then, meaningful information is extracted from the regions of interest through an entity recognition process. Finally, the large volume of resulting well-formed information is integrated into a database, which supports advanced query and analysis. The approach also defines patterns that handle the heterogeneity of Web documents and allow the integrated documents to be customized. Experimental results validate the feasibility of the approach.

Keywords:	information extraction; information integration; XML; wrapper; entity extraction
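
The abstract names three stages: converting HTML to structured XML, recognizing entities within regions of the document, and integrating the results into a database. The following is a minimal Python sketch of that overall flow, not the authors' implementation. Everything in it is an assumption made for illustration: the `HtmlToXml` class, the rule that each heading opens a new `section`, the two regular expressions standing in for entity recognition, and the `entity` table layout are all hypothetical and do not come from the paper.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class HtmlToXml(HTMLParser):
    """Stage 1 (simplified): group text under headings into <section> elements."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("document")
        self.section = None      # section currently being filled
        self.in_heading = False  # True while inside <h1>/<h2>/<h3>

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.section = ET.SubElement(self.root, "section", title="")
            self.in_heading = True

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3"):
            self.in_heading = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self.section is None:
            return  # ignore whitespace and text before the first heading
        if self.in_heading:
            self.section.set("title", self.section.get("title") + text)
        else:
            ET.SubElement(self.section, "text").text = text


# Stage 2 (assumption): regular expressions stand in for entity recognition.
ENTITY_PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def extract_entities(root):
    """Return (section title, entity kind, value) tuples found in the XML tree."""
    rows = []
    for section in root.iter("section"):
        for node in section.iter("text"):
            for kind, pattern in ENTITY_PATTERNS.items():
                for value in pattern.findall(node.text):
                    rows.append((section.get("title"), kind, value))
    return rows


def integrate(rows, conn):
    """Stage 3: store the extracted entities in a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entity (section TEXT, kind TEXT, value TEXT)"
    )
    conn.executemany("INSERT INTO entity VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(html, conn):
    """Run all three stages on one HTML string and return the XML root."""
    parser = HtmlToXml()
    parser.feed(html)
    parser.close()
    integrate(extract_entities(parser.root), conn)
    return parser.root
```

A caller would open a database connection (for example `sqlite3.connect(":memory:")`), pass it to `run_pipeline` together with an HTML string, and then query the `entity` table. A real system along the lines the abstract describes would need a much richer structural analysis and a proper entity recognizer; the sketch only shows how the three stages hand data to one another.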
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.
