1 (College of Mathematics and Information Science, Hebei University, Baoding 071000, China) 2 (Hebei Key Laboratory of Machine Learning and Computational Intelligence, Hebei University, Baoding 071000, China)
Abstract: With the explosive growth of data, big data problems have attracted increasing attention. However, because big data is high-dimensional, complex, and rapidly changing, traditional machine learning algorithms are no longer applicable, and feature selection for big data has become an urgent problem. Based on a voting mechanism and the decision tree algorithm, this paper proposes a voting feature selection algorithm for big data environments. The specific steps are as follows: randomly divide the large data set U into L subsets, send the L subsets to L map nodes, and use the decision tree algorithm to select features on each map node. In the reduce node, the features selected by each map node are put to a vote, and the features whose vote counts exceed the threshold are selected. The proposed algorithm is tested on two open-source big data platforms, Hadoop and Spark, and the two platforms are found to have both similarities and differences in their operating mechanisms. In addition, a feature selection algorithm based on the genetic algorithm and a univariate feature selection algorithm are compared with the proposed voting feature selection algorithm on five high-dimensional data sets. Analysis of the experimental results shows that the proposed algorithm outperforms the two related algorithms in both classification accuracy and execution efficiency, indicating that it can effectively solve the feature selection problem for high-dimensional data.
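The map and reduce steps described in the abstract can be illustrated with a minimal single-machine sketch in Python. This is not the paper's Hadoop or Spark implementation. The function names, the parameter `k` (features kept per subset), and the per-subset selector are assumptions made for illustration: in place of a full decision tree, each subset ranks features by the best Gini impurity reduction of a one-split decision stump.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def stump_gain(column, labels):
    """Best Gini reduction achievable by one threshold split on a feature."""
    base, n = gini(labels), len(labels)
    best = 0.0
    for t in sorted(set(column))[:-1]:
        left = [y for x, y in zip(column, labels) if x <= t]
        right = [y for x, y in zip(column, labels) if x > t]
        gain = base - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
        best = max(best, gain)
    return best


def map_select(rows, labels, k):
    """Map step (assumed stand-in for a decision tree): keep the k best features."""
    n_features = len(rows[0])
    gains = [stump_gain([r[j] for r in rows], labels) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: gains[j], reverse=True)
    return set(ranked[:k])


def reduce_vote(selections, threshold):
    """Reduce step: keep features whose vote count exceeds the threshold."""
    votes = Counter(f for sel in selections for f in sel)
    return sorted(f for f, v in votes.items() if v > threshold)


def voting_feature_selection(rows, labels, n_subsets, k, threshold, seed=0):
    """Randomly split U into n_subsets parts, select on each part, then vote."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    parts = [idx[i::n_subsets] for i in range(n_subsets)]  # L disjoint subsets
    selections = [
        map_select([rows[i] for i in part], [labels[i] for i in part], k)
        for part in parts
    ]
    return reduce_vote(selections, threshold)
```

Here `n_subsets` plays the role of L, and `threshold` is the vote threshold applied in the reduce step. On a real cluster, each call to `map_select` would run on a separate map node, and only the small sets of selected feature indices would be sent to the reduce node.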