【ML】Daguan Cup (达观杯) Experiments

Click through for the original author's source code on GitHub.

  • LSA feature experiments (a minimal feature-extraction sketch follows this list)
  • LDA feature experiments
  • TF-IDF feature experiments
  • LDA theory
  • TF-IDF theory (plenty of write-ups online)
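To make the feature experiments above concrete, here is a minimal extraction sketch with scikit-learn. The file names train_set.csv / test_set.csv, the word_seg column, and all hyper-parameter values are assumptions for illustration, not taken from the original code.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

# assumed file names and column name; adjust to the actual competition data
train_df = pd.read_csv('train_set.csv')
test_df = pd.read_csv('test_set.csv')

# TF-IDF features
tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=3, max_df=0.9, sublinear_tf=True)
x_train_tfidf = tfidf.fit_transform(train_df['word_seg'])
x_test_tfidf = tfidf.transform(test_df['word_seg'])

# LSA: truncated SVD of the TF-IDF matrix
lsa = TruncatedSVD(n_components=200, random_state=1)
x_train_lsa = lsa.fit_transform(x_train_tfidf)
x_test_lsa = lsa.transform(x_test_tfidf)

# LDA: topic distributions fitted on raw term counts
counts = CountVectorizer(min_df=3, max_df=0.9)
x_train_counts = counts.fit_transform(train_df['word_seg'])
lda = LatentDirichletAllocation(n_components=20, random_state=1)
x_train_lda = lda.fit_transform(x_train_counts)
x_test_lda = lda.transform(counts.transform(test_df['word_seg']))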

Various classifiers


"""Change clf_name to select the learning algorithm; change base_clf to change the base classifier used by the ensemble methods."""
import xgboost as xgb
import lightgbm as lgb
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              AdaBoostClassifier, GradientBoostingClassifier)

clf_name = 'svm'

base_clf = LinearSVC()

clfs = {
    'lr': LogisticRegression(penalty='l2', C=1.0),
    'svm': LinearSVC(penalty='l2', dual=True),
    'bagging': BaggingClassifier(base_estimator=base_clf, n_estimators=60, max_samples=1.0,
                                 max_features=1.0, random_state=1, n_jobs=1, verbose=1),
    'rf': RandomForestClassifier(n_estimators=10, criterion='gini'),
    # LinearSVC has no predict_proba, so AdaBoost needs the discrete SAMME algorithm here
    'adaboost': AdaBoostClassifier(base_estimator=base_clf, n_estimators=50, algorithm='SAMME'),
    'gbdt': GradientBoostingClassifier(),
    'xgb': xgb.XGBClassifier(max_depth=6, learning_rate=0.1, n_estimators=100, silent=True,
                             objective='multi:softmax', nthread=1, gamma=0, min_child_weight=1,
                             max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1,
                             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5,
                             seed=0, missing=None),
    'lgb': lgb.LGBMClassifier(boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1,
                              n_estimators=250, max_bin=255, subsample_for_bin=200000, objective=None,
                              min_split_gain=0.0, min_child_weight=0.001, min_child_samples=20,
                              subsample=1.0, subsample_freq=1, colsample_bytree=1.0, reg_alpha=0.0,
                              reg_lambda=0.5, random_state=None, n_jobs=-1, silent=True)
}
clf = clfs[clf_name]
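A quick offline check with the selected clf, as a sketch: x_train_tfidf comes from the feature sketch above, and y = train_df['class'] is an assumption about the label column.

from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

y = train_df['class']  # assumed label column
x_tr, x_val, y_tr, y_val = train_test_split(x_train_tfidf, y, test_size=0.2, random_state=1)
clf.fit(x_tr, y_tr)
val_pred = clf.predict(x_val)
print('validation macro-F1:', f1_score(y_val, val_pred, average='macro'))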

StratifiedKFold, GridSearchCV
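A minimal sketch of how the two could be wired together on the features above; the parameter grid and the macro-F1 scoring choice are illustrative assumptions, not the settings actually used.

from sklearn.svm import LinearSVC
from sklearn.model_selection import StratifiedKFold, GridSearchCV

# illustrative grid for the LinearSVC branch of the clfs dictionary
param_grid = {'C': [0.1, 0.5, 1.0, 5.0]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
search = GridSearchCV(LinearSVC(penalty='l2', dual=True), param_grid,
                      scoring='f1_macro', cv=cv, n_jobs=-1, verbose=1)
search.fit(x_train_tfidf, y)
print(search.best_params_, search.best_score_)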

A few small tips

Personal reflections: