I ran into class imbalance in my own project as well. In pathology it is very common: many healthy samples but few diseased ones, or many deficiency-constitution (体虚) cases but few excess-constitution (体实) cases. Applying data-balancing methods can considerably improve the reliability of an experiment, and raise both recall and precision.
Reference: Yang Ming, Yin Junmei, Ji Genlin. A survey of classification methods for imbalanced data [J]. Journal of Nanjing Normal University (Engineering and Technology Edition), 2008, 8(4): 7-12.
In the machine-learning library sklearn, SVM offers two different ways to handle imbalanced data: per-sample weights (sample_weight) and per-class weights (class_weight).
The first is weighting individual samples. For example, if you have very few disease samples, you can give the disease samples with clearly distinguishing features a large weight, and give the disease samples that are hard to tell apart from healthy ones a smaller weight. Concretely, it is used like this:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm


def plot_decision_function(classifier, sample_weight, axis, title):
    # evaluate the decision function on a grid
    xx, yy = np.meshgrid(np.linspace(-4, 5, 500), np.linspace(-4, 5, 500))
    Z = classifier.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # plot the contours and the points; marker size reflects the sample weight
    axis.contourf(xx, yy, Z, alpha=0.75, cmap=plt.cm.bone)
    axis.scatter(X[:, 0], X[:, 1], c=y, s=100 * sample_weight, alpha=0.9,
                 cmap=plt.cm.bone, edgecolors='black')
    axis.axis('off')
    axis.set_title(title)


# create 20 points: 10 of class +1 shifted to (1, 1), 10 of class -1 at the origin
np.random.seed(0)
X = np.r_[np.random.randn(10, 2) + [1, 1], np.random.randn(10, 2)]
y = [1] * 10 + [-1] * 10

# random per-sample weights, with bigger weights for some outliers
sample_weight_last_ten = abs(np.random.randn(len(X)))
sample_weight_constant = np.ones(len(X))
sample_weight_last_ten[15:] *= 5
sample_weight_last_ten[9] *= 15

# fit one model with the modified sample weights and, for reference, one without
clf_weights = svm.SVC()
clf_weights.fit(X, y, sample_weight=sample_weight_last_ten)

clf_no_weights = svm.SVC()
clf_no_weights.fit(X, y)

fig, axes = plt.subplots(1, 2, figsize=(14, 6))
plot_decision_function(clf_no_weights, sample_weight_constant, axes[0],
                       "Constant weights")
plot_decision_function(clf_weights, sample_weight_last_ten, axes[1],
                       "Modified weights")
plt.show()
The resulting figure looks like this:
The left plot is the fit without weights ("Constant weights"); the right plot is the fit with the modified weights. Compare the two decision boundaries to see the difference. Note, however, how sample_weight is defined in the documentation:
sample_weight : array-like, shape = [n_samples], optional
    Sample weights.
The input X (that is, the point coordinates) looks like this:
[[ 2.76405235 1.40015721]
[ 1.97873798 3.2408932 ]
[ 2.86755799 0.02272212]
[ 1.95008842 0.84864279]
[ 0.89678115 1.4105985 ]
[ 1.14404357 2.45427351]
[ 1.76103773 1.12167502]
[ 1.44386323 1.33367433]
[ 2.49407907 0.79484174]
[ 1.3130677 0.14590426]
[-2.55298982 0.6536186 ]
[ 0.8644362 -0.74216502]
[ 2.26975462 -1.45436567]
[ 0.04575852 -0.18718385]
[ 1.53277921 1.46935877]
[ 0.15494743 0.37816252]
[-0.88778575 -1.98079647]
[-0.34791215 0.15634897]
[ 1.23029068 1.20237985]
[-0.38732682 -0.30230275]]
and the corresponding sample weights look like this:
[ 1.04855297 1.42001794 1.70627019 1.9507754 0.50965218 0.4380743
1.25279536 0.77749036 1.61389785 3.1911042 0.89546656 0.3869025
0.51080514 1.18063218 0.02818223 2.14165935 0.33258611 1.51235949
3.17161047 1.81370583]
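Before moving on, it helps to connect sample_weight back to class imbalance. Below is a minimal sketch (the toy data, the 90/10 split, and all variable names are my own, not part of the example above) that weights each sample inversely to its class frequency, so the rare disease class contributes as much total weight to the fit as the healthy class. It uses the same inverse-frequency idea as the 'balanced' mode described further down:

import numpy as np
from sklearn import svm

# hypothetical toy data: 90 healthy (label 0) and 10 diseased (label 1) samples
rng = np.random.RandomState(42)
X = np.r_[rng.randn(90, 2), rng.randn(10, 2) + [2, 2]]
y = np.array([0] * 90 + [1] * 10)

# weight each sample inversely to its class frequency:
# each class then contributes the same total weight to the fit
class_counts = np.bincount(y)                              # [90, 10]
per_class_weight = len(y) / (len(class_counts) * class_counts)
sample_weight = per_class_weight[y]                        # disease samples get weight 5.0, healthy ~0.56

clf = svm.SVC()
clf.fit(X, y, sample_weight=sample_weight)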
The second method is the class weight, which is aimed squarely at the case where one class has very few samples: if the diseased class is rare, for instance, you increase the class weight of the disease class. sklearn defines it as follows:
class_weight : {dict, 'balanced'}, optional
    Set the parameter C of class i to class_weight[i]*C for SVC. If not given, all classes are supposed to have weight one. The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
Note that it is passed as a dict, for example:
svc = svm.SVC(kernel='linear', C=C, class_weight={1: 6, 0: 1}).fit(X_train, y_train)
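If you would rather not pick the dict values by hand, the 'balanced' mode from the docstring above computes them from the class frequencies. Here is a small sketch with made-up numbers (the 80/20 split and C=1.0 are my own assumptions) showing the automatic mode and the equivalent explicit dict:

import numpy as np
from sklearn import svm

# hypothetical toy data: 80 healthy (0) vs 20 diseased (1) samples
rng = np.random.RandomState(0)
X_train = np.r_[rng.randn(80, 2), rng.randn(20, 2) + [2, 2]]
y_train = np.array([0] * 80 + [1] * 20)

# 'balanced' applies n_samples / (n_classes * np.bincount(y)):
# here 100 / (2 * 80) = 0.625 for healthy and 100 / (2 * 20) = 2.5 for diseased
svc = svm.SVC(kernel='linear', C=1.0, class_weight='balanced').fit(X_train, y_train)

# equivalent explicit dict
svc_dict = svm.SVC(kernel='linear', C=1.0,
                   class_weight={0: 0.625, 1: 2.5}).fit(X_train, y_train)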
And that's it for this introduction to data-balancing methods.