Domain phrase identification using atomic word formation in Chinese text
Authors: Qingtang Liu, Linjing Wu, Zongkai Yang, Yaoyao Liu
Affiliation: 1. National Engineering Research Center for E-Learning, Huazhong Normal University, Wuhan, Hubei 430079, China; 2. Department of Information Technology, Huazhong Normal University, Wuhan, Hubei 430079, China
Abstract: Chinese word segmentation is challenging because Chinese text has no white space to mark word boundaries, and its results depend largely on the quality of the segmentation dictionary. Many domain phrases are split into single words because they are not contained in the general dictionary. This paper presents a Chinese domain phrase identification algorithm based on atomic word formation. First, an atomic word formation algorithm extracts candidate strings from the preprocessed corpus; these strings are stored as the candidate domain phrase set. Second, several strategies, including repeated substring screening, part-of-speech (POS) combination filtering, and prefix and suffix filtering, are used to filter the candidate domain phrases. Third, a domain phrase refining method determines whether a string is a domain phrase by calculating its domain relevance. Finally, the identified strings are sorted and presented to users. Guided by morphological rules, the method combines statistical information with rules instead of relying on corpus-based machine learning. Experiments showed that the method obtains better results than traditional n-gram methods.
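The four steps in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the function names (`count_ngrams`, `screen_substrings`, `identify`), the smoothed frequency-ratio definition of domain relevance, and the thresholds are all assumptions made for illustration. The input is assumed to be already segmented into atomic words, and POS combination filtering is omitted because it would need a tagger.

```python
from collections import Counter


def count_ngrams(sentences, max_len=3):
    """Count every run of 2..max_len adjacent atomic words (candidate strings)."""
    counts = Counter()
    for words in sentences:
        for n in range(2, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts


def contains(longer, shorter):
    """True if the word tuple `shorter` occurs contiguously inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def screen_substrings(cands):
    """Repeated substring screening (assumed rule): drop a candidate that is
    part of a longer candidate with the same frequency."""
    kept = {}
    for cand, freq in cands.items():
        absorbed = any(
            len(other) > len(cand) and other_freq == freq and contains(other, cand)
            for other, other_freq in cands.items()
        )
        if not absorbed:
            kept[cand] = freq
    return kept


def identify(domain, background, bad_prefixes=(), bad_suffixes=(),
             max_len=3, min_freq=2, threshold=1.0):
    """Return (phrase, relevance) pairs sorted by descending relevance."""
    # Step 1: candidate extraction from the segmented domain corpus.
    cands = {c: f for c, f in count_ngrams(domain, max_len).items() if f >= min_freq}
    # Step 2: filtering (substring screening, then prefix and suffix filtering).
    cands = screen_substrings(cands)
    cands = {c: f for c, f in cands.items()
             if c[0] not in bad_prefixes and c[-1] not in bad_suffixes}
    # Step 3: domain relevance, assumed here to be the ratio of the relative
    # frequency in the domain corpus to an add-one-smoothed relative frequency
    # in a general (background) corpus.
    bg_counts = count_ngrams(background, max_len)
    n_domain = sum(len(s) for s in domain)
    n_background = sum(len(s) for s in background)
    scored = []
    for cand, freq in cands.items():
        relevance = (freq / n_domain) / ((bg_counts[cand] + 1) / (n_background + 1))
        if relevance >= threshold:
            scored.append(("".join(cand), relevance))
    # Step 4: sort the identified strings for presentation.
    return sorted(scored, key=lambda item: (-item[1], item[0]))
```

For example, calling `identify(domain, background, bad_prefixes={"的"})` on a small segmented corpus in which 远程 / 教育 / 平台 always occur together would keep 远程教育平台 as one phrase and discard its sub-strings 远程教育 and 教育平台. The ratio-based relevance is only one plausible choice; the paper may define domain relevance differently.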
Keywords:
This article is indexed in ScienceDirect and other databases.