Using clustering and dynamic mutual information for topic feature selection |
| |
Authors: | Jian‐min Xu, Shu fang Wu, Jie Zhu |
| |
Affiliation: | 1. College of Management, Hebei University, Baoding, 071002 China; 2. Department of Information Engineering, Hebei Software Institute, Baoding, 071000 China; 3. Department of Information Management, The Central Institute for Correctional Police, Baoding, 071000 China |
| |
Abstract: | A good feature selection method should take into account both category information and high‐frequency information, so that the selected features effectively represent the target. Because basic mutual information (BMI) favors low‐frequency features and ignores high‐frequency ones, clustering mutual information is proposed: it uses clustering to make effective high‐frequency features unique, and so better integrates category information with useful high‐frequency information. Time is an important factor in topic detection and tracking (TDT). To improve TDT performance, the time difference is integrated into clustering mutual information to adjust the mutual information dynamically, which yields a second algorithm, dynamic clustering mutual information (DCMI). To obtain the optimal feature subsets for representing topic information, an objective function is proposed, based on the idea that a good feature subset should have the smallest within‐class distance and the largest across‐class distance. Experiments using this objective function are performed on the TDT4 corpora, comparing the performance of BMI, DCMI, and the only existing topic feature selection algorithm, Incremental Term Frequency‐Inverted Document Frequency (ITF‐IDF); the results are presented in four figures. The computation time of DCMI is lower than that of BMI and ITF‐IDF. The optimal normalized‐detection performance (Cdet)norm of DCMI is 0.3044 and 0.0970 lower than those of BMI and ITF‐IDF, respectively. |
| |
Keywords: | feature selection; topic; mutual information; time difference; dynamic |
|
|
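The abstract's claim that basic mutual information favors low‐frequency features can be illustrated with a small sketch. It assumes the standard pointwise formulation used in text feature selection, MI(t, c) = log2(P(t, c) / (P(t) P(c))), estimated from document counts; the abstract does not give the paper's exact definition of BMI, and the function name and corpus counts below are hypothetical.

```python
import math


def pointwise_mi(n_tc: int, n_t: int, n_c: int, n: int) -> float:
    """Pointwise mutual information between a term t and a category c.

    n_tc: documents in category c that contain t
    n_t:  documents that contain t
    n_c:  documents in category c
    n:    documents in the corpus

    Assumed formulation (not taken from the paper):
    log2(P(t, c) / (P(t) * P(c))) with probabilities estimated as count / n.
    """
    return math.log2((n_tc * n) / (n_t * n_c))


# Hypothetical corpus: 1000 documents, 100 of which belong to the topic.
N, N_TOPIC = 1000, 100

# Rare term: occurs in only 2 documents, both inside the topic.
rare = pointwise_mi(n_tc=2, n_t=2, n_c=N_TOPIC, n=N)

# Frequent term: occurs in 120 documents, 80 of them inside the topic.
frequent = pointwise_mi(n_tc=80, n_t=120, n_c=N_TOPIC, n=N)

print(f"rare term MI:     {rare:.4f}")
print(f"frequent term MI: {frequent:.4f}")
```

Under these assumed counts the rare term scores higher, although the frequent term appears in 80 of the 100 topic documents and describes the topic far better. This is the bias that the clustering and time‐difference adjustments in the abstract are meant to counter.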