
Classification Methods on Imbalanced Data: a Survey

南京师范大学学报(工程技术版) (Journal of Nanjing Normal University, Engineering and Technology Edition) [ISSN:1006-6977/CN:61-1281/TN]

Issue:
2008, No. 4
Page:
7-12
Research Field:
Publishing date:

Info

Title:
Classification Methods on Imbalanced Data: a Survey
Author(s):
Yang Ming; Yin Junmei; Ji Genlin
School of Mathematics and Computer Science, Nanjing Normal University, Nanjing 210097, China
Keywords:
imbalanced data; over-sampling; under-sampling; cost-sensitive; one-class classifier; feature selection; subspace
PACS:
TP311.13
DOI:
-
Abstract:
Classification is one of the most important research topics in machine learning. Traditional classification methods are relatively mature and perform well when the training data are well balanced, but real-world data are usually imbalanced. Because existing classification methods are often designed under the assumption that the training set is well balanced, their performance may degrade when dealing with imbalanced data, so research on imbalanced data is quite important. To give readers a clear picture of current and future work on the issue of imbalanced data classification, this paper presents a brief survey of studies on the issue and points out some key problems attracting researchers.
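
The keywords above list over-sampling and under-sampling among the standard remedies for class imbalance. As a purely illustrative aside, not drawn from the surveyed paper itself, the sketch below shows the simplest resampling variant, random over-sampling, which duplicates minority-class examples until every class matches the majority count; the function name and the toy data are hypothetical.

import numpy as np

def random_oversample(X, y, rng=None):
    # Duplicate minority-class samples at random until every class
    # has as many training examples as the majority class.
    rng = np.random.default_rng(rng)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_major = counts.max()
    parts_X, parts_y = [X], [y]
    for c, n in zip(classes, counts):
        if n == n_major:
            continue
        idx = np.flatnonzero(y == c)
        # sample with replacement to make up the deficit for class c
        extra = rng.choice(idx, size=n_major - n, replace=True)
        parts_X.append(X[extra])
        parts_y.append(y[extra])
    perm = rng.permutation(sum(len(p) for p in parts_y))
    return np.concatenate(parts_X)[perm], np.concatenate(parts_y)[perm]

# Toy illustration: 10 majority samples versus 2 minority samples.
X = np.arange(24).reshape(12, 2)
y = np.array([0] * 10 + [1] * 2)
X_bal, y_bal = random_oversample(X, y, rng=0)
print(np.bincount(y_bal))  # -> [10 10]

In practice, informed methods such as synthetic over-sampling (SMOTE) or cluster-based under-sampling are usually preferred over plain duplication.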

Memo

Memo:
-
Last Update: 2013-04-24