Feature Selection Method Based on the Term Distribution Among Paragraphs and Categories
YANG Feng-qin1,2, FAN Na1, SUN Hong-guang1,2, SUN Tie-li1,2, PENG Yang1
1 (School of Computer Science and Information Technology, Northeast Normal University, Changchun 130117, China)
2 (Key Lab of Intelligent Information Processing of Jilin Universities, Changchun 130117, China)
Abstract: Feature selection is an important step in addressing the high dimensionality of automatic text classification. Existing feature selection methods are mainly based on term frequency or document frequency. To some extent they quantify the significance of a term, but they cannot characterize how the term is distributed within a given document. To address this problem, this paper proposes the paragraph frequency of a term, which treats each paragraph as a statistical unit and measures how evenly the term is distributed in a document. A novel feature selection method, called FSPC, is proposed by integrating the paragraph frequency with intra-category and inter-category distribution information. To verify the method, FSPC is compared with chi-square (CHI), document frequency (DF), information gain (IG) and CMFS on the Fudan corpus and the SogouCS corpus, using support vector machine and Naive Bayes classifiers. Experimental results show that, in terms of the F1 measure, the proposed method outperforms the competing methods.
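The abstract does not state the formula for paragraph frequency, so the following is only a minimal sketch of one plausible reading: the fraction of a document's paragraphs that contain the term. The function name `paragraph_frequency`, the blank-line paragraph delimiter, and whitespace tokenization are all assumptions made for illustration and are not taken from the paper.

```python
def paragraph_frequency(document: str, term: str) -> float:
    """Fraction of paragraphs in `document` that contain `term`.

    Assumptions (not from the paper): paragraphs are separated by blank
    lines, and tokens are obtained by splitting on whitespace.
    """
    # Tokenize each non-empty paragraph.
    paragraphs = [p.split() for p in document.split("\n\n") if p.strip()]
    if not paragraphs:
        return 0.0
    # Count the paragraphs in which the term occurs at least once.
    hits = sum(1 for tokens in paragraphs if term in tokens)
    return hits / len(paragraphs)


doc = "svm kernel margin\n\nkernel trick\n\nnaive bayes prior"
# "kernel" occurs in two of the three paragraphs of this toy document.
print(paragraph_frequency(doc, "kernel"))
```

Under this reading, a term that recurs across most paragraphs scores close to 1, while a term concentrated in a single paragraph scores low even if its raw term frequency is high, which is the distinction that term frequency and document frequency alone do not capture.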
YANG Feng-qin, FAN Na, SUN Hong-guang, SUN Tie-li, PENG Yang. Feature Selection Method Based on the Term Distribution Among Paragraphs and Categories[J]. Journal of Chinese Computer Systems, 2018, 39(1): 17-22.