二分类器的RP曲线,ROC曲线

MNIST

Setup

# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "classification"

def save_fig(fig_id, tight_layout=True):
    path = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID, fig_id + ".png")
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format='png', dpi=300)
### 训练二元分类器 ### 性能考核 - 自行实施交叉验证StratifiedKFold - cross_val_score()返回准确度(评估分数)
#返回精确度
cross_cal_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring='accuracy')

混淆矩阵

  • cross_val_predict()返回预测结果

    y_train_pred = cross_val_predict(sgd_clf, X_train, y_tarin, cv=3)

  • confusion_matrix()返回混淆矩阵

    confusion_matrix(y_train_5, y_train_pred)

RP曲线

  • 精确率

    precision_score(y_train_5, y_train_pred)

  • 召回率

    recall_score(y_train_5, y_train_pred)

  • F1分数

    f1_score

  • 制作RP曲线

    cross_val_predict获得分数


    #代码是MNIST的二元分类器的

    First step, Get Score ,preparing for Thresholds

    y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
    method='decision_function')
    
    #当cross_val_predict含method时返回函数值
    #随机梯度下降的decision_function函数返回的是分数
    #随机森林的predict_proba返回概率,不过也可切片后用作分数,见下ROC

    precision_recall_curve计算精度和召回率

    Next,caculate

    from sklearn.matrics import precision_recall_curve(y_train,y_scores)
    precisions, recalls,thresholds = precision_recall_curve(y_train_5, y_scores)
    #recalls.shape = precisions.shape = thresholds.shape

    (Nums_of_Samples,)

    绘制RP曲线,寻找最优阈值

    Third,plot

    def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], “b–”, label=”Precision”, linewidth=2)
    plt.plot(thresholds, recalls[:-1], “g-“, label=”Recall”, linewidth=2)
    plt.xlabel(“Threshold”, fontsize=16)
    plt.legend(loc=”upper left”, fontsize=16)
    plt.ylim([0, 1])

plt.figure(figsize=(8, 4))
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.xlim([-700000, 700000])
save_fig(“precision_recall_vs_threshold_plot”)
plt.show()

优化,阈值调整,重新训练

#Change
y_train_pred_90 = (y_score > 70000)

ROC曲线

概念:

召回率/假正类率FPR(错误被分为正类的负类实例比率)

得到score

y_probs_forest = cross_val_predict(froest_clf, X_train, y_train, cv=3, method='predict_proba')
y_scores_forest = y_probas_forest[:,1]

roc_curve

fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

plot

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate', fontsize=16)
    plt.ylabel('True Positive Rate', fontsize=16)

plt.figure(figsize=(8, 6))
plot_roc_curve(fpr, tpr)
#save_fig("roc_curve_plot")
plt.show()
  • 另类:ROC AUC曲线下面积

ROC与RP的优异,选取

RP曲线

  • 当正类非常少见或者你更关注假正类而不是假负类

    ROC曲线

  • 当正类非常多见或者你更关注假负类而不是假正类