Getting Started with Scikit-learn 0.20

Imputer -> SimpleImputer

When processing the Boston housing price data:

  • The old version filled missing numeric values with Imputer, called like this:

    from sklearn.preprocessing import Imputer

    imputer = Imputer(strategy="median")

Or handle the missing values by hand in pandas, for example by filling in a constant (here the median):

sample_incomplete_rows.dropna(subset=["total_bedrooms"])    # option 1
sample_incomplete_rows.drop("total_bedrooms", axis=1)       # option 2
median = housing["total_bedrooms"].median()
sample_incomplete_rows["total_bedrooms"] = sample_incomplete_rows["total_bedrooms"].fillna(median)  # option 3
  • New version:

    from sklearn.impute import SimpleImputer

    imputer = SimpleImputer(strategy="median")

Or fill with a constant:

    si = SimpleImputer(strategy="constant", fill_value="MISSING")
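To see both strategies side by side, here is a minimal sketch on a made-up four-row table. The data and the variable names (df, num_imputer, cat_imputer) are invented for illustration; only SimpleImputer and its strategy and fill_value parameters come from the text above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "total_bedrooms": [2.0, np.nan, 4.0, 10.0],
    # object dtype keeps plain Python strings plus NaN for the missing entry
    "ocean_proximity": pd.Series(["INLAND", np.nan, "NEAR BAY", "INLAND"], dtype=object),
})

# Median strategy on a numeric column (double brackets give 2D input).
num_imputer = SimpleImputer(strategy="median")
num_filled = num_imputer.fit_transform(df[["total_bedrooms"]])

# Constant strategy on a string column.
cat_imputer = SimpleImputer(strategy="constant", fill_value="MISSING")
cat_filled = cat_imputer.fit_transform(df[["ocean_proximity"]])

print(num_imputer.statistics_)  # the learned median
print(num_filled.ravel())
print(cat_filled.ravel())
```

The median is learned in fit() and stored in statistics_, so the same value can later be applied to a test set with transform().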

Scikit-Learn gotcha: the data must be 2D

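Most Scikit-Learn estimators strictly require 2D data. Technically, if we select a column as train["HouseStyle"], we get a Pandas Series, which has only one dimension. We can force Pandas to create a single-column DataFrame instead by passing a one-item list inside the square brackets.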

e.g.:

    hs_train = train[["HouseStyle"]].copy()
    hs_train.ndim
    # 2

Note: the extra dimension is added when making the copy, so do not wrap the column in another pair of [] inside fit_transform().
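The difference between the two selections can be checked directly. This is a small sketch with an invented three-row train frame; the names series, frame and rejected_1d are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for the training data, invented for illustration.
train = pd.DataFrame({
    "HouseStyle": pd.Series(["1Story", "2Story", np.nan], dtype=object),
})

series = train["HouseStyle"]          # single brackets: 1D Series
frame = train[["HouseStyle"]].copy()  # double brackets: 2D single-column DataFrame
print(series.ndim, frame.ndim)

si = SimpleImputer(strategy="constant", fill_value="MISSING")

# Estimators expect 2D input, so the 1D Series is rejected with a ValueError.
rejected_1d = False
try:
    si.fit_transform(series)
except ValueError:
    rejected_1d = True

# The single-column DataFrame is accepted as is, with no extra [] needed.
filled = si.fit_transform(frame)
print(rejected_1d, filled.shape)
```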

future_encoders removed

Before:


from future_encoders import OrdinalEncoder  # (or LabelBinarizer)
from future_encoders import OneHotEncoder
from future_encoders import ColumnTransformer

Now:

from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

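Notes:

# To turn text categories into numbers, the old version used LabelEncoder; for feature columns this is now OrdinalEncoder.

# OneHotEncoder can now handle strings directly, so OneHotEncoder alone is enough for string columns.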
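As a quick check of both encoders on string input, here is a minimal sketch. The four-row cats frame is invented for illustration; the encoders are used with their default settings.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy categorical column, invented for illustration.
cats = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]})

# OrdinalEncoder maps each category to an integer code (stored as floats).
# Categories are sorted, so INLAND -> 0, ISLAND -> 1, NEAR BAY -> 2.
ordinal = OrdinalEncoder()
codes = ordinal.fit_transform(cats)
print(ordinal.categories_)
print(codes.ravel())

# OneHotEncoder accepts the strings directly and returns a sparse matrix by default.
onehot = OneHotEncoder()
dense = onehot.fit_transform(cats).toarray()
print(dense)
```

Ordinal codes imply an order between categories, which is usually not meaningful for something like ocean_proximity, so one-hot encoding tends to be the safer choice for such columns.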

Transformation pipelines

New version: ColumnTransformer

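  • Pipeline() is fed the individual transformers.
  • In full_pipeline, ColumnTransformer() is fed each transformer together with the list of column names it applies to.

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# CombinedAttributesAdder and housing_num are defined earlier in the book's notebook.
num_pipeline = Pipeline([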
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

from sklearn.compose import ColumnTransformer
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])

housing_prepared = full_pipeline.fit_transform(housing)
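The snippet above depends on objects from the book's notebook, so here is a self-contained sketch of the same pattern on a tiny invented housing table. It leaves out the custom CombinedAttributesAdder step, and the three columns and their values are assumptions made only so the example runs.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the housing DataFrame, invented for illustration.
housing = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "INLAND"],
})

num_attribs = ["total_rooms", "total_bedrooms"]
cat_attribs = ["ocean_proximity"]

# Numeric columns: impute the median, then standardize.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Each entry is (name, transformer, columns); no selector step is needed.
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

housing_prepared = full_pipeline.fit_transform(housing)
print(housing_prepared.shape)  # 2 numeric columns + 2 one-hot columns
```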

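Old version: OldDataFrameSelector, FeatureUnion

  • Pipeline() first uses OldDataFrameSelector to pull out the values of the chosen columns, then feeds them to the individual transformers.
  • full_pipeline is fed the pipelines themselves.

# OldDataFrameSelector() is hand-written: it takes a list of attribute names and returns their values.
# Core: return X[self.attribute_names].values
# old_cat_pipeline (used below) is assumed to be built the same way: a selector followed by the encoder.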
old_num_pipeline = Pipeline([
        ('selector', OldDataFrameSelector(num_attribs)),
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

from sklearn.pipeline import FeatureUnion

old_full_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", old_num_pipeline),
        ("cat_pipeline", old_cat_pipeline),
    ])   
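For comparison, the old pattern can be sketched end to end as follows. Only the transform() body of OldDataFrameSelector is taken from the comment above; the rest of the class, the toy data, and the shape of old_cat_pipeline are assumptions, and the CombinedAttributesAdder step is again left out.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class OldDataFrameSelector(TransformerMixin, BaseEstimator):
    """Select DataFrame columns by name and return them as a NumPy array."""

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X[self.attribute_names].values


# Toy stand-in for the housing DataFrame, invented for illustration.
housing = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "INLAND"],
})
num_attribs = ["total_rooms", "total_bedrooms"]
cat_attribs = ["ocean_proximity"]

# Every branch must start with its own selector step.
old_num_pipeline = Pipeline([
    ("selector", OldDataFrameSelector(num_attribs)),
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
old_cat_pipeline = Pipeline([
    ("selector", OldDataFrameSelector(cat_attribs)),
    ("cat_encoder", OneHotEncoder()),
])

# FeatureUnion runs both branches on the full DataFrame and stacks the results side by side.
old_full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", old_num_pipeline),
    ("cat_pipeline", old_cat_pipeline),
])

old_housing_prepared = old_full_pipeline.fit_transform(housing)
print(old_housing_prepared.shape)
```

The practical difference is where column selection lives: here each branch carries its own selector, whereas ColumnTransformer takes the column list as part of each entry.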

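I only started looking at the 0.20 features today. These notes come from comparing the old and new code in the book and its Jupyter notebooks, and I do not know much about 0.20 yet, so updates will be slow.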