首页 | 本学科首页   官方微博 | 高级检索  
     


A text representation model using Sequential Pattern-Growth method
Authors:Suraya Alias  Siti Khaotijah Mohammad  Gan Keng Hoon  Tan Tien Ping
Affiliation:1.Faculty of Computing and Informatics,UMS,Kota Kinabalu,Malaysia;2.School of Computer Sciences,USM,Gelugor,Malaysia
Abstract:Text representation is an essential task in transforming the input from text into features that can be later used for further Text Mining and Information Retrieval tasks. The commonly used text representation model is Bags-of-Words (BOW) and the N-gram model. Nevertheless, some known issues of these models, which are inaccurate semantic representation of text and high dimensionality of word size combination, should be investigated. A pattern-based model named Frequent Adjacent Sequential Pattern (FASP) is introduced to represent the text using a set of sequence adjacent words that are frequently used across the document collection. The purpose of this study is to discover the similarity of textual pattern between documents that can be later converted to a set of rules to describe the main news event. The FASP is based on the Pattern-Growth’s divide-and-conquer strategy where the main difference between FASP and the prior technique is in the Pattern Generation phase. This approach is tested against the BOW and N-gram text representation model using Malay and English language news dataset with different term weightings in the Vector Space Model (VSM). The findings demonstrate that the FASP model has a promising performance in finding similarities between documents with the average vector size reduction of 34% against the BOW and 77% against the N-gram model using the Malay dataset. Results using the English dataset is also consistent, indicating that the FASP approach is also language independent.
Keywords:
本文献已被 SpringerLink 等数据库收录!
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号