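Chinese Layout Segmentation Based on Minimum Spanning Tree Clustering
Cite as: 张充, 苗秀芬, 司建辉, 史青宣, 田学东. Chinese Layout Segmentation Based on Minimum Spanning Tree Clustering[J]. 计算机工程 (Computer Engineering), 2008, 34(15): 211-213.
Authors: 张充 (ZHANG Chong)  苗秀芬 (MIAO Xiu-fen)  司建辉 (SI Jian-hui)  史青宣 (SHI Qing-xuan)  田学东 (TIAN Xue-dong)
Affiliation: College of Mathematics and Computer, Hebei University (河北大学), Baoding 071002
Funding: National Natural Science Foundation of China; Hebei Province Science and Technology Research and Development Program
Abstract: Chinese page layouts often mix horizontally and vertically set text. To handle this, a layout segmentation method based on minimum spanning tree clustering is proposed. The original image is smoothed with horizontal and vertical run-length smoothing, and the connected components obtained after smoothing are pre-classified so that text is separated into horizontal and vertical classes. Each class of text is then clustered with a minimum spanning tree clustering algorithm. In experiments the accuracy reached 97%, indicating that the method segments Chinese documents well.

Keywords: layout segmentation  run-length smoothing  minimum spanning tree clustering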
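
The run-length smoothing step named in the abstract can be illustrated with a minimal sketch. The abstract gives no thresholds and does not say how the horizontal and vertical passes are combined, so the function names, the 0/1 pixel convention (1 = ink), and the `threshold` parameter below are assumptions for illustration only, not the authors' implementation.

```python
def smooth_run(pixels, threshold):
    """Fill background gaps of at most `threshold` pixels that lie
    between two ink pixels (1 = ink, 0 = background)."""
    out = list(pixels)
    last_ink = None  # index of the most recent ink pixel seen
    for i, px in enumerate(pixels):
        if px == 1:
            gap = i - last_ink - 1 if last_ink is not None else 0
            if 0 < gap <= threshold:
                for j in range(last_ink + 1, i):
                    out[j] = 1
            last_ink = i
    return out


def smooth_horizontal(image, threshold):
    """Apply run-length smoothing to every row of a binary image."""
    return [smooth_run(row, threshold) for row in image]


def smooth_vertical(image, threshold):
    """Apply run-length smoothing to every column of a binary image."""
    columns = [list(col) for col in zip(*image)]
    smoothed = [smooth_run(col, threshold) for col in columns]
    return [list(row) for row in zip(*smoothed)]


if __name__ == "__main__":
    page = [
        [1, 0, 0, 1, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 0, 0, 1, 0, 0, 0, 0, 1],
    ]
    for row in smooth_horizontal(page, 2):
        print(row)
```

With a small threshold, characters on the same line merge into one connected component while wider gaps (for example between columns) stay open; comparing the shapes of the components from the two directions is one plausible basis for a horizontal/vertical pre-classification.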

Chinese Document Layout Segmentation Method Based on Minimal Spanning Tree Clustering
ZHANG Chong,MIAO Xiu-fen,SI Jian-hui,SHI Qing-xuan,TIAN Xue-dong.Chinese Document Layout Segmentation Method Based on Minimal Spanning Tree Clustering[J].Computer Engineering,2008,34(15):211-213.
Authors:ZHANG Chong  MIAO Xiu-fen  SI Jian-hui  SHI Qing-xuan  TIAN Xue-dong
Affiliation:(College of Mathematics and Computer, Hebei University, Baoding 071002)
Abstract:Chinese document layouts often mix horizontally and vertically set text. To handle this, a layout segmentation method based on minimal spanning tree clustering is presented. A run-length smoothing algorithm is applied to the document image in the horizontal and vertical directions. The connected components produced by the smoothing are then pre-classified so that body text is separated into horizontally aligned and vertically aligned classes. A minimal spanning tree clustering algorithm is applied to each class of body text. In experiments, the accuracy reaches 97%, which shows that the method segments Chinese documents well.
Keywords:layout segmentation  run-length smoothing  minimal spanning tree clustering
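
As a rough illustration of minimal spanning tree clustering, the sketch below builds a spanning tree over component centroids with Kruskal's algorithm and then drops tree edges longer than a cut length, so the remaining connected groups form the clusters. The abstract does not state the distance measure or the cutting rule, so the Euclidean distance, the `cut` parameter, and all function names here are assumptions, not the authors' algorithm.

```python
import math
from itertools import combinations


def find(parent, i):
    """Return the root of i in a union-find forest (with path halving)."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def mst_edges(points):
    """Kruskal's algorithm on the complete graph over `points`.
    Returns the tree edges as (length, index_a, index_b)."""
    edges = sorted(
        (math.dist(points[a], points[b]), a, b)
        for a, b in combinations(range(len(points)), 2)
    )
    parent = list(range(len(points)))
    tree = []
    for length, a, b in edges:
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:  # edge joins two separate subtrees
            parent[root_a] = root_b
            tree.append((length, a, b))
    return tree


def mst_clusters(points, cut):
    """Cluster points by removing MST edges longer than `cut`."""
    parent = list(range(len(points)))
    for length, a, b in mst_edges(points):
        if length <= cut:
            parent[find(parent, a)] = find(parent, b)
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(parent, i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    # Hypothetical centroids of text components: two blocks far apart.
    centroids = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
    print(mst_clusters(centroids, cut=3))
```

Building the complete graph costs quadratic time in the number of components, which is acceptable for a sketch; a practical system would likely restrict candidate edges to nearby components.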
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.