Classification - Azure Machine Learning even sampling
I'm trying to do some basic multi-label classification in Azure ML. I have basic data in the following format:
value_x  value_y  label
x1       y1       label1
x2       y2       label1
x3       y3       label2
.....
My problem is that some labels in my data (out of a total of five) are overrepresented: about 40% of the data is label1, about 20% is label2, and the rest are around 10% each.
I would like to take a sample out of these to train my model, in which each label is represented in equal amounts.
I tried the stratification option in the sampling module on the labels column, but that just gives me a sample with the same distribution of labels as in the initial dataset.
Any idea how I could do this with a module?
I was able to do this using a combination of the "Split Data", "Partition and Sample", and "Add Rows" modules. There may be an easier way to do it, but I did confirm that this works. :) I published my work at http://gallery.azureml.net/details/1245147fd7004e91bc7a3683cda19cc7 so you can grab it directly from there and run it to confirm it does what you expect.
Since you said you wanted a sampling of the data, I reduced each of the labels to 10% so that all labels are represented equally. Since you have an understanding of the distribution in your dataset, I leave labels 3, 4, and 5 at 10%, and reduce label 1 to 1/4 and label 2 to 1/2 of their rows to get 10% of them as well.
To explain what I did in the workspace linked above:
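The 1/4 and 1/2 figures follow from dividing the target share by each label's current share. A small Python sketch of that arithmetic, assuming the approximate percentages quoted in the question (the names label3 to label5 are only illustrative):

```python
from fractions import Fraction

# Approximate label shares quoted in the question (percent of all rows).
shares = {"label1": 40, "label2": 20, "label3": 10, "label4": 10, "label5": 10}

# Target: bring every label down to the share of the rarest label.
target = min(shares.values())

# Fraction of each label's rows to keep so that it ends up at the target share.
keep = {label: Fraction(target, share) for label, share in shares.items()}

for label, frac in keep.items():
    print(label, frac)
```

With other shares, the same division gives the sampling rate to enter for each label.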
- I used "Split Data" modules to filter out the label1 and label2 data. In the Split Data module, change the splitting mode to "Regular Expression" and set the regular expression to \"label" ^label1 (to get the label1 data, for example).
- Then I used "Partition and Sample" modules to reduce the size of the label1 and label2 data appropriately.
- Finally, I used "Add Rows" modules to join all of the data back together again.
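For readers who want to see the logic outside the designer, the three steps above can be sketched in plain Python. This is only an illustration of the same idea, not what the modules run internally; the function name `downsample`, the toy rows, and the row counts are all made up for the example.

```python
import random
from collections import Counter

def downsample(rows, keep_fraction, label_index=2, seed=0):
    """Keep a random fraction of the rows of each label, then recombine."""
    rng = random.Random(seed)
    # Like "Split Data": group the rows by their label.
    groups = {}
    for row in rows:
        groups.setdefault(row[label_index], []).append(row)
    balanced = []
    for label, group in groups.items():
        # Like "Partition and Sample": keep only a fraction of this label's rows.
        # Labels without an entry in keep_fraction are kept in full.
        n_keep = round(len(group) * keep_fraction.get(label, 1.0))
        # Like "Add Rows": append the kept rows to the combined output.
        balanced.extend(rng.sample(group, n_keep))
    rng.shuffle(balanced)
    return balanced

# Toy data with roughly the proportions from the question (hypothetical values).
counts = {"label1": 400, "label2": 200, "label3": 100, "label4": 100, "label5": 100}
rows = [(i, i * 2, label) for label, n in counts.items() for i in range(n)]

balanced = downsample(rows, {"label1": 0.25, "label2": 0.5})
print(Counter(row[2] for row in balanced))
```

Fixing the seed keeps the sample reproducible between runs, which makes it easier to compare models trained on it.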
Finally, I didn't include this in my work, but you can also look at the SMOTE module. It will increase the number of low-occurring samples using synthetic minority oversampling.
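To give a feel for what synthetic minority oversampling does, here is a minimal plain-Python sketch of the general idea: each new point is placed somewhere on the line between a minority sample and one of its nearest minority neighbours. It is an assumption-laden illustration, not the SMOTE module's implementation; the function name `smote_sketch`, the parameter `k`, and the sample points are hypothetical.

```python
import math
import random

def smote_sketch(points, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority point as the starting point.
        i = rng.randrange(len(points))
        base = points[i]
        # Find its k nearest neighbours among the other minority points.
        others = points[:i] + points[i + 1:]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Place the new point at a random position on the connecting segment.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class samples as (value_x, value_y) pairs.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (2.5, 2.5)]
new_points = smote_sketch(minority, n_new=6)
print(new_points)
```

Unlike the downsampling approach, this keeps all of the original rows and adds new ones for the rare labels.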