- ①见Hands-on-ml 03
- ②见 Kaggle-Appache
归纳通用的
按照属性取数据
- 用于后继Pipeline(出自①)
#Now let’s build our preprocessing pipelines. We will reuse the DataframeSelector we built in the previous chapter to select specific attributes from the DataFrame:
from sklearn.base import BaseEstimator, TransformerMixin
A class to select numerical or categorical columns
since Scikit-Learn doesn’t handle DataFrames yet
class DataFrameSelector(BaseEstimator, TransformerMixin):
def init(self, attribute_names):
self.attribute_names = attribute_names
def fit(self, X, y=None):
return self
def transform(self, X):
return X[self.attribute_names]
Let’s build the pipeline for the numerical attributes:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
imputer = Imputer(strategy=”median”)
num_pipeline = Pipeline([
(“select_numeric”, DataFrameSelector([“Age”, “SibSp”, “Parch”, “Fare”])),
(“imputer”, Imputer(strategy=”median”)),
])
一些函数
- loc
感jio要赋值了第二个参数就不加[],否则对DataFrame操作加[]
train_test.loc[train_test[‘Name2_sum’] == 1 , ‘Name2_new’] = ‘one’
train.loc[train[‘TotalBsmtSF’]==0,’TotalBsmtSF’] = 1
train.loc[train[‘TotalBsmtSF’]==0,’TotalBsmtSF’] = 1
print(train.loc[train[‘TotalBsmtSF’]==0, [‘TotalBsmtSF’]].count())
tmp = np.array(tmp.loc[tmp[‘TotalBsmtSF’]>0, [‘TotalBsmtSF’]])[:, 0]
- get_dummies
pd.get_dummies(train_test,columns=[“Sex”])
- isnull
(DataFrame[DataFrame[..].isnull().values==False]
- mean
train.groupby([‘Pclass’])[‘Pclass’,’Survived’].mean()
- drop和del
(drop删除列,axis=1,inplace=True)
train = train.drop(train[(train[‘GrLivArea’]>4000) & (train[‘SalePrice’]<300000)].index)
del train_test[‘Name2’]
missing_age_X_train = missing_age_train.drop([‘Age’], axis=1)
- apply
train_test['Ticket_Letter'] = train_test['Ticket_Letter'].apply(lambda x:np.nan if x.isnumeric() else x)
- count()
age_null_count = dataset[‘Age’].isnull().sum()
print(train.loc[train[‘TotalBsmtSF’]==0, [‘TotalBsmtSF’]].count())
percent = (train_test.isnull().sum()/train_test.isnull().count()).sort_values(ascending=False)
- value_counts()
Name2_sum = train_test[‘Name2’].value_counts().reset_index()
分析做法不同的
Kaggle
数据分析
- info(),describe(),value_counts()
- corr(), heatmap
- 进一步探索分析各个数据与结果的关系
train.groupby([‘Pclass’])[‘Pclass’,’Survived’].mean()
train[[‘Parch’,’Survived’]].groupby([‘Parch’]).mean().plot.bar()
g = sns.FacetGrid(train, col=’Survived’,size=5)
g.map(plt.hist, ‘Age’, bins=40)
sns.countplot(‘Embarked’,hue=’Survived’,data=train)
特征工程
总体来说,1.将少量的缺失值用分层级后的平均值填充,2.文字特征提取后分类分列,3.将大量的缺失值用回归模型预测值填充
- pd.get_dummies分列
- 在数据的Name项中包含了对该乘客的称呼,将这些关键词提取出来,然后做分列处理
- ⑤ Name
- 从姓名中提取出姓做特征
- ⑥ Fare
- 该特征有缺失值,先找出缺失值的那调数据,然后用平均数填充
- ⑦ Ticket
- 该列和名字做类似的处理,先提取,然后分列
- ⑧ Age
- 该列有大量缺失值,考虑用一个回归模型进行填充;在模型修改的时候,考虑到年龄缺失值可能影响死亡情况,用年龄是否缺失值来构造新特征
- ⑨Cabin
- cabin项缺失太多,只能将有无Cain首字母进行分类,缺失值为一类,作为特征值进行建模,也可以考虑直接舍去该特征
- ⑩ 特征工程处理完了,划分数据集
建立模型
Hands-ml
数据分析
- info(),describe(),value_counts()
特征工程
- ①处理数值属性的缺失值
- 使用Imputer将缺失值替换成中位数(我觉得没有考虑层级关系,不如Kaggle对Fare的做法)
- ②处理文本属性的缺失值
- 使用OneHotEncoder
- 使用流水线:1.对数值型数据,先处理缺省值,然后增加三属性,最后特征缩放2.对文本OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
num_pipeline = Pipeline([
(‘imputer’, SimpleImputer(strategy=”median”)),
(‘attribs_adder’, CombinedAttributesAdder()),
(‘std_scaler’, StandardScaler()),
])
housing_num_tr = num_pipeline.fit_transform(housing_num)
from sklearn.compose import ColumnTransformer
num_attribs = list(housing_num)
cat_attribs = [“ocean_proximity”]
full_pipeline = ColumnTransformer([
(“num”, num_pipeline, num_attribs),
(“cat”, OneHotEncoder(), cat_attribs),
])
housing_prepared = full_pipeline.fit_transform(housing)
建立模型
- 线性回归LinearRegression()
- DecisionTreeRegressor决策树
- 使用交叉验证评估
- RandomForestRegressor随机森林
- 网格下降微调
- RandomForestRegressor可以指出每个属性的相对重要程度,调整属性