Normalization of Chinese chat language
Authors: Kam-Fai Wong, Yunqing Xia
Affiliation: (1) Department of Systems Engineering & Engineering Management, The Chinese University of Hong Kong, Shatin, NT, Hong Kong; (2) Centre for Speech and Language Technologies, RIIT, Tsinghua University, Beijing, 100084, China
Abstract:
Real-time communication platforms such as ICQ, MSN and online chat rooms are more popular than ever on the Internet. There is, however, a real risk that criminals and terrorists will use them to perpetrate illegal abuses. This highlights the security significance of accurately detecting chat language and translating it to its standard language counterpart. The language used on these platforms differs significantly from the standard language. This language, referred to as chat language, is comparatively informal, anomalous and dynamic. Such features render conventional language resources such as dictionaries, and processing tools such as parsers, ineffective. In this paper, we present the NIL corpus, a chat language text collection annotated to facilitate training and testing of chat language processing algorithms. We analyse the NIL corpus to study the linguistic characteristics and contextual behaviour of a chat language. First, we observe that the majority of chat terms, i.e. informal words in a chat text, are formed by phonetic mapping. We then propose the eXtended Source Channel Model (XSCM) for the normalization of the chat language, which is the process of converting messages expressed in a chat language to their standard language counterparts. Experimental results indicate that the performance of XSCM in terms of chat term recognition and normalization accuracy is superior to that of its Source Channel Model (SCM) counterparts, and is also more consistent over time.
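To make the idea of source-channel normalization concrete, the following is a minimal sketch of a plain word-level source channel decoder, which picks the standard word c that maximizes p(c) · p(t | c) for an observed chat term t. It is not the XSCM described in the paper and does not model phonetic mapping explicitly. The tables `LANGUAGE_MODEL` and `CHANNEL_MODEL`, the function names, and every probability value are hypothetical and invented for illustration; they are not taken from the NIL corpus.

```python
import math

# Hypothetical toy tables; all numbers are invented for illustration only.
# LANGUAGE_MODEL[c]: prior probability p(c) of a standard-language word c.
LANGUAGE_MODEL = {"拜拜": 0.02, "爸爸": 0.03, "这样子": 0.01}

# CHANNEL_MODEL[c][t]: probability p(t | c) that standard word c
# is written as the observed term t in a chat message.
CHANNEL_MODEL = {
    "拜拜": {"88": 0.6, "拜拜": 0.4},
    "爸爸": {"88": 0.05, "爸爸": 0.95},
    "这样子": {"酱紫": 0.5, "这样子": 0.5},
}


def normalize_term(term):
    """Return the standard word c maximizing log p(c) + log p(term | c).

    If no candidate can generate the term, the term is returned unchanged.
    """
    best, best_score = term, -math.inf
    for candidate, prior in LANGUAGE_MODEL.items():
        likelihood = CHANNEL_MODEL.get(candidate, {}).get(term, 0.0)
        if likelihood == 0.0:
            continue  # this candidate cannot produce the observed term
        score = math.log(prior) + math.log(likelihood)
        if score > best_score:
            best, best_score = candidate, score
    return best


def normalize_message(tokens):
    """Normalize each token of a pre-segmented chat message independently."""
    return [normalize_term(token) for token in tokens]
```

Under these toy tables, the chat term "88" is scored against both "拜拜" and "爸爸", and the channel probabilities decide between them. A realistic system would also use sentence context in the language model rather than scoring each token in isolation.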
Contact Information: Yunqing Xia (Corresponding author). Email:
Keywords: Chinese chat language; Phonetic mapping; Chat language modelling; Chat term normalization; Natural language processing
This article is indexed in SpringerLink and other databases.