I ran into class imbalance in my own project as well. In pathology it is very common: many healthy samples but few diseased ones, or many deficiency-constitution (体虚) cases but few excess-constitution (体实) cases. Applying data-balancing methods can considerably improve the reliability of an experiment, and raise both recall and precision.
Reference: Yang Ming, Yin Junmei, Ji Genlin. A survey of classification methods for imbalanced data [J]. Journal of Nanjing Normal University (Engineering and Technology Edition), 2008, 8(4): 7-12.
In the machine-learning library sklearn, SVM offers two different ways to handle imbalanced data: per-sample weights (sample_weight) and per-class weights (class_weight).
The first is weighting individual samples. For example, if you have very few disease samples, you can give the disease samples with clearly distinguishing features a large weight, and give the disease samples that are hard to tell apart from healthy ones a smaller weight. Concretely, it is used like this:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm


def plot_decision_function(classifier, sample_weight, axis, title):
    # evaluate the decision function on a grid
    xx, yy = np.meshgrid(np.linspace(-4, 5, 500), np.linspace(-4, 5, 500))
    Z = classifier.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # plot the contours and the points; marker size reflects the sample weight
    axis.contourf(xx, yy, Z, alpha=0.75, cmap=plt.cm.bone)
    axis.scatter(X[:, 0], X[:, 1], c=y, s=100 * sample_weight, alpha=0.9,
                 cmap=plt.cm.bone, edgecolors='black')
    axis.axis('off')
    axis.set_title(title)


# create 20 points: 10 of class +1 shifted to (1, 1), 10 of class -1 at the origin
np.random.seed(0)
X = np.r_[np.random.randn(10, 2) + [1, 1], np.random.randn(10, 2)]
y = [1] * 10 + [-1] * 10

# random per-sample weights, with bigger weights for some outliers
sample_weight_last_ten = abs(np.random.randn(len(X)))
sample_weight_constant = np.ones(len(X))
sample_weight_last_ten[15:] *= 5
sample_weight_last_ten[9] *= 15

# fit one model with the modified sample weights and, for reference, one without
clf_weights = svm.SVC()
clf_weights.fit(X, y, sample_weight=sample_weight_last_ten)

clf_no_weights = svm.SVC()
clf_no_weights.fit(X, y)

fig, axes = plt.subplots(1, 2, figsize=(14, 6))
plot_decision_function(clf_no_weights, sample_weight_constant, axes[0],
                       "Constant weights")
plot_decision_function(clf_weights, sample_weight_last_ten, axes[1],
                       "Modified weights")
plt.show()
The resulting figure looks like this:
The left plot is the fit without weights ("Constant weights"); the right plot is the fit with the modified weights. Compare the two decision boundaries to see the difference. Note, however, how sample_weight is defined in the documentation:
sample_weight : array-like, shape = [n_samples], optional
    Sample weights.
The input X (that is, the point coordinates) looks like this:
[[ 2.76405235 1.40015721]
[ 1.97873798 3.2408932 ]
[ 2.86755799 0.02272212]
[ 1.95008842 0.84864279]
[ 0.89678115 1.4105985 ]
[ 1.14404357 2.45427351]
[ 1.76103773 1.12167502]
[ 1.44386323 1.33367433]
[ 2.49407907 0.79484174]
[ 1.3130677 0.14590426]
[-2.55298982 0.6536186 ]
[ 0.8644362 -0.74216502]
[ 2.26975462 -1.45436567]
[ 0.04575852 -0.18718385]
[ 1.53277921 1.46935877]
[ 0.15494743 0.37816252]
[-0.88778575 -1.98079647]
[-0.34791215 0.15634897]
[ 1.23029068 1.20237985]
[-0.38732682 -0.30230275]]
and the corresponding sample weights look like this:
[ 1.04855297 1.42001794 1.70627019 1.9507754 0.50965218 0.4380743
1.25279536 0.77749036 1.61389785 3.1911042 0.89546656 0.3869025
0.51080514 1.18063218 0.02818223 2.14165935 0.33258611 1.51235949
3.17161047 1.81370583]
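Before moving on, it helps to connect sample_weight back to class imbalance. Below is a minimal sketch (the toy data, the 90/10 split, and all variable names are my own, not part of the example above) that weights each sample inversely to its class frequency, so the rare disease class contributes as much total weight to the fit as the healthy class. It uses the same inverse-frequency idea as the 'balanced' mode described further down:

import numpy as np
from sklearn import svm

# hypothetical toy data: 90 healthy (label 0) and 10 diseased (label 1) samples
rng = np.random.RandomState(42)
X = np.r_[rng.randn(90, 2), rng.randn(10, 2) + [2, 2]]
y = np.array([0] * 90 + [1] * 10)

# weight each sample inversely to its class frequency:
# each class then contributes the same total weight to the fit
class_counts = np.bincount(y)                              # [90, 10]
per_class_weight = len(y) / (len(class_counts) * class_counts)
sample_weight = per_class_weight[y]                        # disease samples get weight 5.0, healthy ~0.56

clf = svm.SVC()
clf.fit(X, y, sample_weight=sample_weight)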
The second method is the class weight, which is aimed squarely at the case where one class has very few samples: if the diseased class is rare, for instance, you increase the class weight of the disease class. sklearn defines it as follows:
class_weight : {dict, 'balanced'}, optional
    Set the parameter C of class i to class_weight[i]*C for SVC. If not given, all classes are supposed to have weight one. The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
Note that it is passed as a dict, for example:
svc = svm.SVC(kernel='linear', C=C, class_weight={1: 6, 0: 1}).fit(X_train, y_train)
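If you would rather not pick the dict values by hand, the 'balanced' mode from the docstring above computes them from the class frequencies. Here is a small sketch with made-up numbers (the 80/20 split and C=1.0 are my own assumptions) showing the automatic mode and the equivalent explicit dict:

import numpy as np
from sklearn import svm

# hypothetical toy data: 80 healthy (0) vs 20 diseased (1) samples
rng = np.random.RandomState(0)
X_train = np.r_[rng.randn(80, 2), rng.randn(20, 2) + [2, 2]]
y_train = np.array([0] * 80 + [1] * 20)

# 'balanced' applies n_samples / (n_classes * np.bincount(y)):
# here 100 / (2 * 80) = 0.625 for healthy and 100 / (2 * 20) = 2.5 for diseased
svc = svm.SVC(kernel='linear', C=1.0, class_weight='balanced').fit(X_train, y_train)

# equivalent explicit dict
svc_dict = svm.SVC(kernel='linear', C=1.0,
                   class_weight={0: 0.625, 1: 2.5}).fit(X_train, y_train)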
And that's it for this introduction to data-balancing methods.