泰坦尼克

归纳通用的

按照属性取数据

  • 用于后继Pipeline(出自①)

    #Now let’s build our preprocessing pipelines. We will reuse the DataframeSelector we built in the previous chapter to select specific attributes from the DataFrame:

from sklearn.base import BaseEstimator, TransformerMixin

A class to select numerical or categorical columns

since Scikit-Learn doesn’t handle DataFrames yet

class DataFrameSelector(BaseEstimator, TransformerMixin):
def init(self, attribute_names):
self.attribute_names = attribute_names
def fit(self, X, y=None):
return self
def transform(self, X):
return X[self.attribute_names]

Let’s build the pipeline for the numerical attributes:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer

imputer = Imputer(strategy=”median”)

num_pipeline = Pipeline([
(“select_numeric”, DataFrameSelector([“Age”, “SibSp”, “Parch”, “Fare”])),
(“imputer”, Imputer(strategy=”median”)),
])

一些函数

  • loc
    感jio要赋值了第二个参数就不加[],否则对DataFrame操作加[]

    train_test.loc[train_test[‘Name2_sum’] == 1 , ‘Name2_new’] = ‘one’

train.loc[train[‘TotalBsmtSF’]==0,’TotalBsmtSF’] = 1

train.loc[train[‘TotalBsmtSF’]==0,’TotalBsmtSF’] = 1

print(train.loc[train[‘TotalBsmtSF’]==0, [‘TotalBsmtSF’]].count())

tmp = np.array(tmp.loc[tmp[‘TotalBsmtSF’]>0, [‘TotalBsmtSF’]])[:, 0]

  • get_dummies

    pd.get_dummies(train_test,columns=[“Sex”])

  • isnull

    (DataFrame[DataFrame[..].isnull().values==False]

  • mean

    train.groupby([‘Pclass’])[‘Pclass’,’Survived’].mean()

  • drop和del

    (drop删除列,axis=1,inplace=True)

    train = train.drop(train[(train[‘GrLivArea’]>4000) & (train[‘SalePrice’]<300000)].index)

del train_test[‘Name2’]

missing_age_X_train = missing_age_train.drop([‘Age’], axis=1)

  • apply
    train_test['Ticket_Letter'] = train_test['Ticket_Letter'].apply(lambda x:np.nan if x.isnumeric() else x)
    
  • count()

    age_null_count = dataset[‘Age’].isnull().sum()

print(train.loc[train[‘TotalBsmtSF’]==0, [‘TotalBsmtSF’]].count())

percent = (train_test.isnull().sum()/train_test.isnull().count()).sort_values(ascending=False)

  • value_counts()

    Name2_sum = train_test[‘Name2’].value_counts().reset_index()

分析做法不同的

Kaggle

数据分析

  • info(),describe(),value_counts()
  • corr(), heatmap
  • 进一步探索分析各个数据与结果的关系

    train.groupby([‘Pclass’])[‘Pclass’,’Survived’].mean()

train[[‘Parch’,’Survived’]].groupby([‘Parch’]).mean().plot.bar()

g = sns.FacetGrid(train, col=’Survived’,size=5)
g.map(plt.hist, ‘Age’, bins=40)

sns.countplot(‘Embarked’,hue=’Survived’,data=train)

特征工程

总体来说,1.将少量的缺失值用分层级后的平均值填充,2.文字特征提取后分类分列,3.将大量的缺失值用回归模型预测值填充
  • pd.get_dummies分列
  • 在数据的Name项中包含了对该乘客的称呼,将这些关键词提取出来,然后做分列处理
  • ⑤ Name
  • 从姓名中提取出姓做特征
  • ⑥ Fare
  • 该特征有缺失值,先找出缺失值的那调数据,然后用平均数填充
  • ⑦ Ticket
  • 该列和名字做类似的处理,先提取,然后分列
  • ⑧ Age
  • 该列有大量缺失值,考虑用一个回归模型进行填充;在模型修改的时候,考虑到年龄缺失值可能影响死亡情况,用年龄是否缺失值来构造新特征
  • ⑨Cabin
  • cabin项缺失太多,只能将有无Cain首字母进行分类,缺失值为一类,作为特征值进行建模,也可以考虑直接舍去该特征
  • ⑩ 特征工程处理完了,划分数据集

建立模型

  • SVM
  • 随机森林
  • Logisitic
  • CBDT
  • xgboost
  • 注:3和1网格调参
    模型融合 voting(除了SVM)
    模型融合 stacking(除了SVM)

Hands-ml

数据分析

  • info(),describe(),value_counts()

特征工程

  • ①处理数值属性的缺失值
  • 使用Imputer将缺失值替换成中位数(我觉得没有考虑层级关系,不如Kaggle对Fare的做法)
  • ②处理文本属性的缺失值
  • 使用OneHotEncoder
  • 使用流水线:1.对数值型数据,先处理缺省值,然后增加三属性,最后特征缩放2.对文本OneHotEncoder

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
(‘imputer’, SimpleImputer(strategy=”median”)),
(‘attribs_adder’, CombinedAttributesAdder()),
(‘std_scaler’, StandardScaler()),
])

housing_num_tr = num_pipeline.fit_transform(housing_num)

from sklearn.compose import ColumnTransformer
num_attribs = list(housing_num)
cat_attribs = [“ocean_proximity”]

full_pipeline = ColumnTransformer([
(“num”, num_pipeline, num_attribs),
(“cat”, OneHotEncoder(), cat_attribs),
])

housing_prepared = full_pipeline.fit_transform(housing)

建立模型

  • 线性回归LinearRegression()
  • DecisionTreeRegressor决策树
  • 使用交叉验证评估
  • RandomForestRegressor随机森林
  • 网格下降微调
  • RandomForestRegressor可以指出每个属性的相对重要程度,调整属性