Imputer -> SimpleImputer
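When preprocessing the housing price data (the code below uses columns of the California housing dataset, such as total_bedrooms and ocean_proximity):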
- Old version: missing values in numeric columns were filled with Imputer, called like this:
from sklearn.preprocessing import Imputer
imputer = Imputer(strategy="median")
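Or handle the missing values directly in pandas (drop the rows, drop the column, or fill with the median):
sample_incomplete_rows.dropna(subset=["total_bedrooms"])    # option 1: drop rows with a missing value
sample_incomplete_rows.drop("total_bedrooms", axis=1)       # option 2: drop the whole column
median = housing["total_bedrooms"].median()
sample_incomplete_rows["total_bedrooms"].fillna(median, inplace=True)  # option 3: fill with the median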
- New version:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")
Or fill with a constant:
si = SimpleImputer(strategy="constant", fill_value="MISSING")
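As a minimal sketch of the two strategies above, the snippet below runs SimpleImputer on a tiny made-up DataFrame. The column names `rooms` and `style` and all values are hypothetical and are not taken from the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values, only for illustration)
df = pd.DataFrame({
    "rooms": [1.0, np.nan, 3.0, 10.0],
    "style": pd.Series(["1Story", np.nan, "2Story", "1Story"], dtype=object),
})

# Median imputation for the numeric column; the median of [1, 3, 10] is 3
median_imputer = SimpleImputer(strategy="median")
rooms_filled = median_imputer.fit_transform(df[["rooms"]])

# Constant imputation for the string column
constant_imputer = SimpleImputer(strategy="constant", fill_value="MISSING")
style_filled = constant_imputer.fit_transform(df[["style"]])

print(rooms_filled.ravel())
print(style_filled.ravel())
```

Both calls receive a single-column DataFrame (double brackets), so the input is already 2D.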
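Scikit-Learn gotcha: the data must be 2D
Most Scikit-Learn estimators strictly require 2D data. Technically, if we select the column above as train["HouseStyle"], we get a pandas Series, which has a single dimension. We can force pandas to create a single-column DataFrame instead by passing a one-item list inside the square brackets.
For example:
hs_train = train[["HouseStyle"]].copy()
hs_train.ndim
# 2
Note: make the data 2D at the copy step. Do not wrap it in another pair of [] inside fit_transform().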
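To make the difference concrete, here is a small sketch that compares single and double brackets and then feeds each result to an imputer. The `train` DataFrame is a hypothetical stand-in with made-up values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the `train` DataFrame
train = pd.DataFrame({
    "HouseStyle": pd.Series(["1Story", np.nan, "2Story", "1Story"], dtype=object),
})

series = train["HouseStyle"]             # single brackets -> Series, 1D
hs_train = train[["HouseStyle"]].copy()  # double brackets -> DataFrame, 2D
print(series.ndim, hs_train.ndim)

imputer = SimpleImputer(strategy="constant", fill_value="MISSING")

# The 1D Series is rejected by the estimator's input validation
try:
    imputer.fit_transform(series)
except ValueError as err:
    print("1D input rejected:", type(err).__name__)

# The 2D DataFrame works as-is; no extra [] is needed here
hs_filled = imputer.fit_transform(hs_train)
print(hs_filled.shape)
```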
future_encoders removed
Before:
from future_encoders import OrdinalEncoder  # (or LabelBinarizer)
from future_encoders import OneHotEncoder
from future_encoders import ColumnTransformer
Now:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
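Notes:
# The transformer that turns text labels into numbers: older code used LabelEncoder, now use OrdinalEncoder instead.
# OneHotEncoder can now handle strings directly, so for string categories OneHotEncoder alone is enough.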
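The notes above can be sketched as follows: both encoders are applied directly to a string column, with no intermediate integer-encoding step. The four category values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical categorical column (values chosen only for illustration)
cat = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]})

# OrdinalEncoder: one numeric code per category, categories sorted alphabetically
ordinal = OrdinalEncoder()
codes = ordinal.fit_transform(cat)
print(ordinal.categories_[0])
print(codes.ravel())

# OneHotEncoder: accepts the strings directly and returns a sparse matrix by default
onehot = OneHotEncoder()
dense = onehot.fit_transform(cat).toarray()
print(dense)
```

OrdinalEncoder implies an order between categories, so for unordered categories such as these the one-hot form is usually the safer choice.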
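Transformation pipelines
New version: ColumnTransformer
- Pipeline() is given the individual transformers
- In full_pipeline, ColumnTransformer() is given each transformer together with the list of column names it applies to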
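from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# CombinedAttributesAdder is a custom transformer defined elsewhere (not shown here)
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

num_attribs = list(housing_num)      # names of the numeric columns
cat_attribs = ["ocean_proximity"]    # names of the categorical columns

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

housing_prepared = full_pipeline.fit_transform(housing)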
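For a version that runs on its own, here is a reduced sketch of the same ColumnTransformer layout. It assumes a tiny made-up `housing` DataFrame and leaves out the custom CombinedAttributesAdder step, so it only illustrates how columns are routed to transformers.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the housing DataFrame (hypothetical values)
housing = pd.DataFrame({
    "total_rooms": [10.0, 20.0, 30.0, 40.0],
    "total_bedrooms": [2.0, np.nan, 6.0, 8.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"],
})

num_attribs = ["total_rooms", "total_bedrooms"]
cat_attribs = ["ocean_proximity"]

# Numeric columns: fill missing values, then standardize
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Each entry is (name, transformer, columns); the DataFrame is passed in whole
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

housing_prepared = full_pipeline.fit_transform(housing)
print(housing_prepared.shape)  # 2 scaled numeric columns + 3 one-hot columns
```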
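Old version: OldDataFrameSelector, FeatureUnion
- Inside Pipeline(), OldDataFrameSelector first pulls out the values of the given columns, which are then passed on to the transformers
- full_pipeline (a FeatureUnion) is given the per-type pipelines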
# OldDataFrameSelector() is a hand-written transformer that selects columns by attribute name
# Its core is: return X[self.attribute_names].values
old_num_pipeline = Pipeline([
    ('selector', OldDataFrameSelector(num_attribs)),
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

from sklearn.pipeline import FeatureUnion

# old_cat_pipeline is the categorical counterpart (not shown here)
old_full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", old_num_pipeline),
    ("cat_pipeline", old_cat_pipeline),
])
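The old-style layout can be sketched end to end as below. The OldDataFrameSelector class here is a guess built around the one line quoted above (`return X[self.attribute_names].values`); the toy data, the definition of `old_cat_pipeline`, and the omission of CombinedAttributesAdder are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class OldDataFrameSelector(BaseEstimator, TransformerMixin):
    """Select DataFrame columns and pass them on as a NumPy array."""

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X[self.attribute_names].values


# Toy stand-in for the housing DataFrame (hypothetical values)
housing = pd.DataFrame({
    "total_rooms": [10.0, 20.0, 30.0, 40.0],
    "total_bedrooms": [2.0, np.nan, 6.0, 8.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"],
})
num_attribs = ["total_rooms", "total_bedrooms"]
cat_attribs = ["ocean_proximity"]

# Every pipeline starts by selecting its own columns
old_num_pipeline = Pipeline([
    ("selector", OldDataFrameSelector(num_attribs)),
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
old_cat_pipeline = Pipeline([
    ("selector", OldDataFrameSelector(cat_attribs)),
    ("cat_encoder", OneHotEncoder()),
])

# FeatureUnion runs both pipelines on the full DataFrame and stacks the results side by side
old_full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", old_num_pipeline),
    ("cat_pipeline", old_cat_pipeline),
])

old_prepared = old_full_pipeline.fit_transform(housing)
print(old_prepared.shape)
```

Compared with ColumnTransformer, the column selection lives inside each pipeline rather than in the combining step, which is why the selector class was needed at all.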
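Wrap-up
I only ran into the 0.20 features today. These notes come from comparing the old code in the book with the newer Jupyter tutorial, and I do not know much about 0.20 yet, so updates will be slow.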