
Untagged Table Extraction in Semi-structured Documents (半结构化文档中非标记化表格的抽取)

Citation:	SONG Qiang, XU Peng, LI Juanzi. Untagged Table Extraction in Semi-structured Documents[J]. Computer Engineering (计算机工程), 2005, 31(18): 81-83, 171

Authors:	SONG Qiang (宋强), XU Peng (徐鹏), LI Juanzi (李涓子)

Affiliation:	Department of Computer Science, Tsinghua University, Beijing 100084 (all three authors)

Abstract:	This paper builds a data model of the untagged table and, using the structural distribution features of untagged tables in documents, proposes an extraction algorithm. The algorithm first splits an untagged table into rows and columns, then induces headers and merges cells. Experimental results indicate that the accuracy of the proposed algorithm is satisfactory.

Keywords:	untagged table; information extraction; hierarchical clustering
Article ID:	1000-3428(2005)18-0081-03
Received:	2004-08-05
Revised:	2004-08-05
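
The abstract names three steps (row and column splitting, header induction, cell merging) without giving details. The sketch below is not the authors' algorithm. It is a minimal illustration of the same pipeline on a space-aligned plain-text table, and it uses a simple blank-gap heuristic instead of the hierarchical clustering listed in the keywords. The function names, the `min_gap` parameter, the "first row is the header" rule, and the "empty first cell means continuation line" rule are all assumptions made for this example. The input is assumed to be aligned with spaces rather than tabs.

```python
from typing import List, Tuple


def find_column_spans(lines: List[str], min_gap: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) character spans of columns (end is exclusive).

    A column is a run of character positions where at least one line has a
    non-space character; columns are separated by at least `min_gap`
    positions that are blank on every line (assumed heuristic).
    """
    width = max((len(line) for line in lines), default=0)
    padded = [line.ljust(width) for line in lines]
    occupied = [any(row[i] != " " for row in padded) for i in range(width)]

    spans = []
    start = None  # start of the column currently being scanned
    gap = 0       # length of the current run of blank positions
    for i, occ in enumerate(occupied):
        if occ:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                spans.append((start, i - gap + 1))
                start = None
                gap = 0
    if start is not None:
        spans.append((start, width - gap))
    return spans


def split_table(lines: List[str]) -> List[List[str]]:
    """Split non-blank lines into rows, then cut each row at the column spans."""
    rows = [line for line in lines if line.strip()]
    spans = find_column_spans(rows)
    return [[row[s:e].strip() for s, e in spans] for row in rows]


def merge_continuations(table: List[List[str]]) -> List[List[str]]:
    """Merge a row with an empty first cell into the row above it
    (assumed rule for wrapped cell text)."""
    merged: List[List[str]] = []
    for cells in table:
        if merged and cells[0] == "":
            prev = merged[-1]
            merged[-1] = [(p + " " + c).strip() for p, c in zip(prev, cells)]
        else:
            merged.append(list(cells))
    return merged


# Hypothetical sample input: an untagged table embedded in plain text.
sample = [
    "Name        City       Score",
    "Alice Wu    Beijing    91",
    "Bob         New York   78",
    "            (USA)",
]

# Assumed header rule: the first merged row is the header.
header, *body = merge_continuations(split_table(sample))
print(header)
for row in body:
    print(row)
```

Requiring a gap of at least two blank positions keeps multi-word cells such as "Alice Wu" and "New York" intact. Real documents with ragged alignment would need a more tolerant grouping of cell positions, which is where a clustering approach becomes useful.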
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.
