

Text feature selection with a robust weight scheme and dynamic dimension reduction to text document clustering
Affiliation: 1. School of Computer Sciences, Universiti Sains Malaysia, 11800 Pinang, Malaysia; 2. Department of Information Technology, Al-Huson University College, Al-Balqa Applied University, P.O. Box 50, Al-Huson, Irbid, Jordan
Abstract: This paper proposes three feature selection algorithms that combine a feature weighting scheme with dynamic dimension reduction for the text document clustering problem. Text document clustering is a recent trend in text mining: documents are partitioned into coherent clusters according to carefully selected informative features, using a suitable evaluation function that usually depends on term frequency. The informative features in each document are chosen by feature selection methods. The genetic algorithm (GA), harmony search (HS), and particle swarm optimization (PSO), which are among the most successful feature selection methods, are each combined with a novel weighting scheme, length feature weight (LFW), which depends on a term's frequency and on its appearance in other documents. A new dynamic dimension reduction (DDR) method is also introduced to reduce the number of features used in clustering and thereby improve the performance of the algorithms. Finally, k-means, a popular clustering method, clusters the documents based on the terms (features) retained after dynamic reduction. The methods are evaluated on seven benchmark text datasets of different sizes and complexities. The k-means results show that PSO with LFW and DDR produces the best results on almost all datasets tested. The paper thus offers the text mining community new alternatives for clustering text documents using cohesive and informative features.
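The pipeline summarized in the abstract (weight the terms, select informative features, reduce the dimension, then cluster with k-means) can be illustrated with a minimal sketch. This is not the paper's method: the abstract does not give the LFW formula, so plain TF-IDF is used as a stand-in weight, and a simple "keep the highest-weighted terms" rule stands in for both the GA/HS/PSO search and DDR. The function names `tfidf_matrix`, `select_features`, and `kmeans`, the toy documents, and the deterministic seeding are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, rows); rows[i][j] is the TF-IDF weight of term j in doc i.

    TF-IDF is a stand-in here; the paper's LFW scheme is not specified in the abstract.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for doc in tokenized for t in set(doc))
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        rows.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, rows


def select_features(vocab, rows, keep):
    """Keep the `keep` terms with the highest total weight.

    A simple ranking rule standing in for the metaheuristic search and DDR.
    """
    totals = [sum(row[j] for row in rows) for j in range(len(vocab))]
    top = sorted(range(len(vocab)), key=lambda j: (-totals[j], vocab[j]))[:keep]
    top.sort()  # restore vocabulary order for readability
    return [vocab[j] for j in top], [[row[j] for j in top] for row in rows]


def kmeans(rows, k, iters=20):
    """Plain k-means with squared Euclidean distance.

    The first k rows seed the centroids (an assumption that keeps the run deterministic).
    """
    centroids = [list(r) for r in rows[:k]]
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assignment step: each row goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(r, centroids[c])))
            for r in rows
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy corpus (hypothetical): two documents about pets, two about finance.
docs = [
    "cat dog pet cat",
    "stock market trade stock",
    "dog pet animal dog",
    "market trade price market",
]
vocab, weights = tfidf_matrix(docs)
selected, reduced = select_features(vocab, weights, keep=4)
labels = kmeans(reduced, k=2)
print(selected)
print(labels)
```

On this toy corpus the eight-term vocabulary is cut to four terms before clustering, which mirrors the idea of clustering on a reduced feature set. A real experiment would replace the ranking rule with the search procedure and weighting scheme described in the paper.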
This article is indexed in ScienceDirect and other databases.