Investigating random undersampling and feature selection on bioinformatics big data

T Hasanin, TM Khoshgoftaar, J Leevy… - 2019 IEEE Fifth International Conference on Big Data Computing …, 2019 - ieeexplore.ieee.org
This paper aims to address a key research issue regarding the ECBDL'14 bioinformatics big data competition. The ECBDL'14 dataset was the big data target of the competition; it consisted of 631 attributes and about 32 million instances, of which about 98% belonged to the negative class. The ECBDL'14 competition dataset has recently been used in the literature to assess the effect of class imbalance on big data analytics. The contribution of our paper is two-fold. First, we present a survey of several works in the literature that utilized the ECBDL'14 dataset, either fully or partially. Second, in contrast to the Random Oversampling approach used by the competition's winning algorithm, we utilize a Random Undersampling approach in conjunction with a Feature Selection approach. Through Random Undersampling, different class distributions were generated, ranging from slightly imbalanced to balanced. Prior to sampling, we perform Feature Selection by computing Feature Importance with the Random Forest learner within the Apache Spark framework. Subsequently, classification performance is computed for Random Forest, Logistic Regression, and Gradient-Boosted Trees in the same big data analytics framework. The key results of our study indicate that our proposed solution achieved a (minimally) higher prediction performance than the best value obtained by the winning algorithm. Moreover, Random Undersampling, compared to Random Oversampling, imposes a lower computational burden and results in a faster training time, which is beneficial to data analytics. We conclude that our solution clearly outperforms the ECBDL'14 winning algorithm.
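Below is a minimal PySpark sketch of the workflow the abstract describes: Random Forest feature importance used for feature selection, random undersampling of the majority (negative) class to a chosen class distribution, and training of the three learners in Spark. The input path, column names, importance cutoff, and 50:50 target ratio are illustrative assumptions, not details taken from the paper.

```python
# Sketch of the abstract's pipeline: RF-based feature selection, then
# random undersampling, then training RF / LR / GBT in Apache Spark.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorSlicer
from pyspark.ml.classification import (
    RandomForestClassifier,
    LogisticRegression,
    GBTClassifier,
)

spark = SparkSession.builder.appName("ecbdl14-rus-fs").getOrCreate()

# Assumed: a DataFrame with a "features" vector column and a binary
# "label" column (1 = minority/positive, 0 = majority/negative).
df = spark.read.parquet("ecbdl14.parquet")  # hypothetical path

# --- Feature Selection via Random Forest feature importance ---
rf_fs = RandomForestClassifier(featuresCol="features", labelCol="label",
                               numTrees=100)
importances = rf_fs.fit(df).featureImportances
# Keep features above a small cutoff; the paper does not state the
# exact threshold, so 0.001 here is purely illustrative.
selected = [i for i, v in enumerate(importances.toArray()) if v > 0.001]
slicer = VectorSlicer(inputCol="features", outputCol="selected",
                      indices=selected)
df_sel = slicer.transform(df)

# --- Random Undersampling of the negative (majority) class ---
pos = df_sel.filter("label = 1")
neg = df_sel.filter("label = 0")
# Sample the majority class down to a 50:50 balance; the study sweeps
# several distributions from slightly imbalanced to balanced.
fraction = pos.count() / neg.count()
train = pos.union(neg.sample(withReplacement=False,
                             fraction=fraction, seed=42))

# --- Train the three learners on the reduced, undersampled data ---
for clf in [RandomForestClassifier(featuresCol="selected", labelCol="label"),
            LogisticRegression(featuresCol="selected", labelCol="label"),
            GBTClassifier(featuresCol="selected", labelCol="label")]:
    model = clf.fit(train)
```

Because undersampling shrinks the training set rather than inflating it, each fit above runs on far fewer rows than an oversampled equivalent, which is the source of the training-time advantage the abstract highlights.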