Resampling algorithm for imbalanced data based on their neighbor relationship

LI Rui-feng; LI Wen-hai; SUN Yan-li; WU Yang-yong

doi:10.13374/j.issn2095-9389.2020.04.05.002

Volume 43 Issue 6

Jun. 2021

LI Rui-feng, LI Wen-hai, SUN Yan-li, WU Yang-yong. Resampling algorithm for imbalanced data based on their neighbor relationship[J]. Chinese Journal of Engineering, 2021, 43(6): 862-869. doi: 10.13374/j.issn2095-9389.2020.04.05.002

Naval Aviation University, Yantai 264001, China

Corresponding author: E-mail: dongzhi1110@foxmail.com
Received Date: 2020-04-05
Publish Date: 2021-06-25

Abstract

The classification of imbalanced data has become a crucial research issue in many data-intensive applications. The minority samples in such applications usually contain important information that plays a key role in data analysis. At present, two approaches, algorithm improvement and data set reconstruction, are used in machine learning and data mining to address data set imbalance. Data set reconstruction, also known as resampling, modifies the proportion of each class in the training data set without modifying the classification algorithm and has been widely used. However, artificially adding or removing samples inevitably introduces noise or loses original data information, which reduces classification accuracy, so well-designed oversampling and undersampling algorithms are the core of the resampling method. To improve the classification accuracy of imbalanced data sets, a resampling algorithm based on the neighbor relationship of the sample space was proposed. The method first evaluated the security level of each minority sample according to its spatial neighbor relations and oversampled the minority class through the synthetic minority oversampling technique (SMOTE) guided by that security level. Then, the local density of the majority samples was calculated according to their spatial neighbor relations, and the majority samples in sample-dense areas were undersampled. Together, these two steps balance the data set while keeping the data size under control, which prevents overfitting and equalizes the classification performance on the two classes. The training and test sets were generated via 5 × 10-fold cross validation. After the training set was resampled, a kernel extreme learning machine (KELM) was trained as the classifier, and the test set was used for verification. The experimental results on a UCI imbalanced data set and measured circuit fault diagnosis data show that the proposed method is superior to other resampling algorithms.
Keywords:
- imbalanced data
- neighbor relationship
- resample
- local density
- classification
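
The two resampling steps described in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything concrete here is an assumption rather than the authors' method: the security level is taken to be the number of minority samples among a sample's k nearest neighbors (in the spirit of Safe-Level-SMOTE), the local density is taken to be the inverse of the mean distance to the k nearest majority neighbors, and the densest majority samples are dropped in a single pass. The function names (`safety_levels`, `safe_level_oversample`, `density_undersample`, `resample`) and the `target` parameter are hypothetical.

```python
import numpy as np


def pairwise_dist(A, B):
    """Euclidean distance between every row of A and every row of B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)


def safety_levels(X_min, X_maj, k):
    """Assumed security level: number of minority samples among the k nearest
    neighbors of each minority sample, searched over both classes."""
    X = np.vstack([X_min, X_maj])          # minority rows come first
    d = pairwise_dist(X_min, X)
    rows = np.arange(len(X_min))
    d[rows, rows] = np.inf                 # a sample is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]
    return (nn < len(X_min)).sum(axis=1)


def safe_level_oversample(X_min, X_maj, n_new, k=5, seed=None):
    """SMOTE-style interpolation whose step size is biased toward the safer
    of the two endpoints (assumption modeled on Safe-Level-SMOTE)."""
    rng = np.random.default_rng(seed)
    sl = safety_levels(X_min, X_maj, k)
    d = pairwise_dist(X_min, X_min)
    np.fill_diagonal(d, np.inf)
    nn_min = np.argsort(d, axis=1)[:, :min(k, len(X_min) - 1)]
    # Minority samples with no minority neighbor are treated as noise.
    candidates = np.flatnonzero(sl > 0)
    if n_new > 0 and candidates.size == 0:
        raise ValueError("no safe minority sample to interpolate from")
    synthetic = []
    while len(synthetic) < n_new:
        p = rng.choice(candidates)
        n = rng.choice(nn_min[p])
        if sl[n] == 0:
            gap = 0.0                              # noisy neighbor: stay at p
        else:
            ratio = sl[p] / sl[n]
            if ratio >= 1:
                gap = rng.uniform(0.0, 1.0 / ratio)  # p is safer: stay near p
            else:
                gap = rng.uniform(1.0 - ratio, 1.0)  # n is safer: move toward n
        synthetic.append(X_min[p] + gap * (X_min[n] - X_min[p]))
    return np.array(synthetic).reshape(len(synthetic), X_min.shape[1])


def density_undersample(X_maj, n_keep, k=5):
    """Keep the n_keep majority samples with the lowest assumed local density
    (inverse mean distance to the k nearest majority neighbors)."""
    d = pairwise_dist(X_maj, X_maj)
    np.fill_diagonal(d, np.inf)
    knn_d = np.sort(d, axis=1)[:, :k]
    density = 1.0 / (knn_d.mean(axis=1) + 1e-12)
    keep = np.sort(np.argsort(density)[:n_keep])   # drop the densest samples
    return X_maj[keep]


def resample(X_min, X_maj, target, k=5, seed=None):
    """Bring both classes to roughly `target` rows each."""
    n_new = max(target - len(X_min), 0)
    synth = safe_level_oversample(X_min, X_maj, n_new, k, seed)
    X_min_new = np.vstack([X_min, synth])
    X_maj_new = density_undersample(X_maj, min(target, len(X_maj)), k)
    return X_min_new, X_maj_new
```

Calling `resample(X_min, X_maj, target=30)` on a training fold would return a minority array extended with synthetic points and a thinned majority array. Setting `target` between the two original class sizes is one way to balance the classes without letting the data size grow unchecked. In a cross-validation setup like the one in the abstract, only the training fold would be resampled and the test fold would be left untouched.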
