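Construction of a Chinese E-mail Corpus Based on E-mail Filtering
Cite this article: Li Junhui, Zhu Qiaoming, Li Peifeng. Construction of a Chinese E-mail Corpus Based on E-mail Filtering [J]. Computer Applications and Software, 2007, 24(8): 56-58, 121.
Authors: Li Junhui, Zhu Qiaoming, Li Peifeng
Affiliation: School of Computer Science and Technology, Soochow University, Suzhou 215006, Jiangsu, China
Funding: Jiangsu Province High-Technology Research and Development Program; Natural Science Foundation of the Jiangsu Provincial Department of Education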
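Abstract: The paper first analyzes the main current techniques for e-mail filtering and the state of e-mail corpus construction, and raises several issues involved in building a Chinese e-mail corpus. It recommends preserving message header information and not excluding duplicate copies of e-mails during corpus construction. It then presents the implementation framework of the e-mail corpus system, which consists of four steps: parsing and preprocessing of the e-mail source, initial annotation of e-mails, word categorization, and second-round annotation of e-mails. A management tool is provided for managing the corpus. Finally, it describes an e-mail corpus that has already been built.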

Keywords: e-mail filtering; Chinese e-mail corpus; annotation; XML
Revised: 2005-08-04

CONSTRUCTION AND APPLICATION OF CHINESE E-MAIL CORPUS BASED ON E-MAIL FILTERING
Li Junhui,Zhu Qiaoming,Li Peifeng.CONSTRUCTION AND APPLICATION OF CHINESE E-MAIL CORPUS BASED ON E-MAIL FILTERING[J].Computer Applications and Software,2007,24(8):56-58,121.
Authors:Li Junhui  Zhu Qiaoming  Li Peifeng
Affiliation:School of Computer Science and Technology,Soochow University,Suzhou 215006, Jiangsu, China
Abstract: The techniques used in e-mail filtering and the status of e-mail corpora are analyzed, and some problems related to constructing a Chinese e-mail corpus are raised. It is argued that message header information should be preserved and that it is unnecessary to exclude duplicate e-mails. The structure of the Chinese e-mail corpus system is introduced. The system involves four steps: parsing and preprocessing of the mail source, first mail annotation, word categorization, and second mail annotation. A management tool is also provided to manage the mail corpus. Finally, the constructed mail corpus is described.
Keywords: E-mail filtering; Chinese e-mail corpus; Annotation; XML
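The abstract names parsing of the mail source, retention of header information, and XML as parts of the corpus system, but gives no schema or code. The following is only a minimal sketch of how the first step and an initial label might look, using Python's standard `email` and `xml.etree.ElementTree` modules. The element and attribute names (`mail`, `headers`, `header`, `body`, `label`), the function `mail_to_xml`, and the sample message are all assumptions made for illustration, not the authors' format.

```python
from email import message_from_string
from email.policy import default
import xml.etree.ElementTree as ET

# Hypothetical sample message; not taken from the paper's corpus.
RAW = """\
From: alice@example.com
To: bob@example.com
Subject: Meeting
Date: Mon, 01 Aug 2005 10:00:00 +0800
Content-Type: text/plain; charset=utf-8

See you at 3 pm.
"""


def mail_to_xml(raw_source, label, mail_id):
    """Parse raw mail source into an XML element (assumed schema).

    All header fields are kept, following the abstract's recommendation
    to preserve message header information.
    """
    msg = message_from_string(raw_source, policy=default)

    # 'label' stands in for the result of the first annotation pass.
    mail = ET.Element("mail", id=str(mail_id), label=label)

    headers = ET.SubElement(mail, "headers")
    for name, value in msg.items():
        header = ET.SubElement(headers, "header", name=name)
        header.text = str(value)

    # Prefer the plain-text part; fall back to an empty body if none exists.
    part = msg.get_body(preferencelist=("plain",))
    body = ET.SubElement(mail, "body")
    body.text = part.get_content().strip() if part is not None else ""
    return mail


if __name__ == "__main__":
    element = mail_to_xml(RAW, label="ham", mail_id=1)
    print(ET.tostring(element, encoding="unicode"))
```

Because every header is copied into the record, later steps such as word categorization or a second annotation pass could still use sender, routing, or date fields. Duplicate messages would simply become separate `mail` elements with different `id` values.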
This article is indexed in CNKI, VIP (维普), Wanfang Data, and other databases.