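Sentiment Analysis

To my surprise we did reasonably well, and I got to know some new friends along the way~ Preliminary round: Rank 15 on leaderboard A, Rank 11 on leaderboard B.

The semi-final was one mess after another: a teammate turned out to be a fraud who was also on another team, our account got banned twice, and the machines were fully occupied so we could not use them. Result: Rank 17 on leaderboard A, Rank 15 on leaderboard B.

Sigh, if the semi-final had gone smoothly I think we would have ranked higher TUT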

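0x00 Things we tried

Various models: BERT, BERT-wwm (second best), RoBERTa (best), [TODO] XLNet
Data cleaning (no improvement)
EDA
Rules
Blending
Further pre-training (no improvement though)
My own Keras and PyTorch versions, compared against other people's, sigh
Changing the input: ① BERT's [seq] ② adding hint text: "标题：" ("Title:"), "正文：" ("Body:")
[TODO] Add a metric: mean_absolute_error (mean absolute error)
[TODO] Stacking
[TODO] Tune the probability threshold
[TODO] Concatenate the [CLS] outputs of the last 4 layers for classification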

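0x01 Learning from a better model: Guo Da's solution

Below is an excerpt from Guo Da's own description of how he rewrote the open-source code:

1. The model cuts the text into k segments, feeds each one into the language model separately, and then joins them at the top with a GRU. The benefit is that a small max_length with a larger k lowers GPU memory usage, because memory grows quadratically with sequence length but only linearly with k.

2. Multi-GPU training is supported; the effective batch size = per_gpu_train_batch_size * number of GPUs.

3. Gradient accumulation is supported. If GPU memory is too small, set the gradient_accumulation_steps parameter. For example, with gradient_accumulation_steps=2 and batch size=4, each update runs 2 passes of batch size 2 and accumulates the gradients before updating. This is equivalent to batch size=4 but twice as slow. The number of iterations also has to be doubled accordingly, i.e. train_steps is set to 10000.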
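
Points 1 and 3 above can be checked with a little arithmetic. The sketch below is my own illustration, not Guo Da's code: `attention_cost` only counts the size of the self-attention score matrices (heads, layers and all other activations are left out), and `mse_grad` / `accumulated_grad` use a hypothetical one-parameter model `y = w * x` to show that two accumulated micro-batches of size 2 give the same gradient as one batch of size 4.

```python
def attention_cost(seq_len, num_chunks=1):
    """Relative size of the attention score matrices:
    one seq_len x seq_len matrix per chunk."""
    return num_chunks * seq_len * seq_len


def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, accumulation_steps):
    """Split the batch into equal micro-batches and accumulate scaled gradients."""
    micro = len(xs) // accumulation_steps
    total = 0.0
    for i in range(accumulation_steps):
        lo, hi = i * micro, (i + 1) * micro
        # Scaling each micro-batch by 1 / accumulation_steps plays the role of
        # dividing the loss by gradient_accumulation_steps before backward().
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps
    return total


# Same 512 tokens, either as one sequence or as 4 chunks of 128.
full_cost = attention_cost(512)
chunked_cost = attention_cost(128, num_chunks=4)
print(full_cost / chunked_cost)  # 4.0

xs, ys, w = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], 0.5
full_grad = mse_grad(w, xs, ys)
acc_grad = accumulated_grad(w, xs, ys, accumulation_steps=2)
print(full_grad, acc_grad)  # both -22.5
```

The gradient is identical, but each update now needs two forward/backward passes, which is where the slowdown comes from.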

Below is my understanding of Guo Da's code changes… Guo Da is awesome o(￣▽￣)ｄ

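The from_pretrained method in modeling_utils.py

1. Removed force_download and proxies from the arguments, so that nothing is downloaded any more when config is None
2. Added parameters to the config that is passed to the model's __init__:
        config.lstm_hidden_size = args.lstm_hidden_size
        config.lstm_layers = args.lstm_layers
        config.lstm_dropout = args.lstm_dropout
        ...
        # Instantiate model.
        model = cls(config)
3. When loading the model (the archive_file argument): if the model name is not in cls.pretrained_model_archive_map, only the TensorFlow checkpoint is loaded; loading of TensorFlow 2.0 and PyTorch checkpoints was removed
4. In the "Instantiate model" part,
    model = cls(config, *model_args, **model_kwargs)   was changed to   model = cls(config)
5. For the PyTorch state_dict, some extra key replacements were added (to match the key-value pairs in model.bin)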

The BertForSequenceClassification class in modeling_bert.py

In the original model, the forward method takes BERT's outputs and then does:
...
        pooled_output = outputs[1]
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here
...

The model's forward method is now:
1. First flatten each of BERT's inputs; the reason is explained in the run_bert.py notes below

    flat_input_ids = input_ids.view(-1, input_ids.size(-1))

2. Take BERT's outputs

    pooled_output = outputs[1]
    # Key step: reshape the flattened result back up to [n, split_num, hidden_size].
    # I think of split_num as the GRU's time dimension, haha... Guo Da is so good!
    output = pooled_output.reshape(input_ids.size(0), input_ids.size(1), -1).contiguous()

    for w, gru in zip(self.W, self.gru):
        gru.flatten_parameters()
        output, hidden = gru(output)
        output = self.dropout(output)
    # hidden is [num_layers * num_directions, n, hidden_size]: move the batch
    # dimension n to the front, then flatten the rest to get [n, -1].
    hidden = hidden.permute(1, 0, 2).reshape(input_ids.size(0), -1).contiguous()
    #hidden=output.mean(1)
    #hidden=nn.functional.tanh(self.pooling(hidden))
    #hidden=self.dropout(hidden)
    logits = self.classifier(hidden)
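
To make the flatten / regroup bookkeeping concrete, here is a small pure-Python sketch using nested lists instead of tensors. It is only an illustration under my own assumptions: `toy_encoder` is a hypothetical stand-in for BERT's pooled output, and there is no GRU here, only the shape handling around it.

```python
def flatten(batch):
    """[n, split_num, seq_len] -> [n * split_num, seq_len],
    the list version of input_ids.view(-1, input_ids.size(-1))."""
    return [chunk for example in batch for chunk in example]


def regroup(rows, n, split_num):
    """[n * split_num, hidden] -> [n, split_num, hidden],
    the list version of pooled_output.reshape(n, split_num, -1)."""
    return [rows[i * split_num:(i + 1) * split_num] for i in range(n)]


def toy_encoder(chunk):
    """Hypothetical encoder: returns one small vector per chunk."""
    return [sum(chunk), len(chunk)]


batch = [
    [[1, 2], [3, 4], [5, 6]],      # example 0: 3 chunks of 2 token ids
    [[7, 8], [9, 10], [11, 12]],   # example 1: 3 chunks of 2 token ids
]
flat = flatten(batch)                        # 6 rows, encoded independently
pooled = [toy_encoder(chunk) for chunk in flat]
sequence = regroup(pooled, n=2, split_num=3)
print(sequence[0])  # [[3, 2], [7, 2], [11, 2]]
```

Each example ends up with split_num vectors in their original order, which is the sequence the GRU then reads.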

run_bert.py

A lot changed here...
Start from the data input, together with the flatten step before BERT described above.

1. Take one text as an example: it is first split into split_num pieces, then each piece is converted to BERT's input format and appended to choices_features:
    choices_features.append((tokens, input_ids, input_mask, segment_ids))
2. The InputFeatures class holds the model input. Since each text now contains several pieces, it becomes:
class InputFeatures(object):
    def __init__(self,
                 example_id,
                 choices_features,
                 label
    ):
        self.example_id = example_id
        self.choices_features = [
            {
                'input_ids': input_ids,
                'input_mask': input_mask,
                'segment_ids': segment_ids
            }
            for _, input_ids, input_mask, segment_ids in choices_features
        ]
        self.label = label

3. As a result, the shape of input_ids fed to BERT changes from [n, seq_len] to [n, split_num, seq_len].
  Originally the input Tensor was built from [f.input_ids for f in features].
  Now it is built with:
    def select_field(features, field):
        return [
            [
                choice[field]
                for choice in feature.choices_features
            ]
            for feature in features
        ]

  and the input Tensor is then built from select_field(train_features, 'input_ids').
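
The data path above can be sketched end to end without any library. This is a simplified illustration, not the code from run_bert.py: `split_text` and `to_feature` are hypothetical helpers, the "tokenizer" maps each character to its code point, [CLS]/[SEP] are omitted, and each feature is a plain list of dicts rather than an InputFeatures object.

```python
def split_text(text, split_num, max_len):
    """Cut text into split_num consecutive pieces of at most max_len characters."""
    return [text[i * max_len:(i + 1) * max_len] for i in range(split_num)]


def to_feature(piece, max_len, pad_id=0):
    """Hypothetical tokenizer: one id per character, padded to max_len."""
    ids = [ord(ch) for ch in piece]
    padding = max_len - len(ids)
    return {
        'input_ids': ids + [pad_id] * padding,
        'input_mask': [1] * len(ids) + [0] * padding,
        'segment_ids': [0] * max_len,
    }


def select_field(features, field):
    """Collect one field as a nested list of shape [n, split_num, max_len]."""
    return [[choice[field] for choice in feature] for feature in features]


texts = ["abcdefg", "xy"]
features = [
    [to_feature(piece, max_len=3) for piece in split_text(t, split_num=3, max_len=3)]
    for t in texts
]
input_ids = select_field(features, 'input_ids')
input_mask = select_field(features, 'input_mask')
print(len(input_ids), len(input_ids[0]), len(input_ids[0][0]))  # 2 3 3
```

Short texts simply produce all-padding pieces, so every example has exactly split_num pieces and the nested list can be turned into a rectangular tensor.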

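0x03 Some experiments

My teammate did these~ The gains were small; recording them here.

[Model structure] Concatenation: bert_dense
[Model structure] Concatenation: bert_gru
[Model structure] Concatenation: bert_cnn

[Model structure] Output: concatenate the [CLS] outputs of the last 4 layers for classification

# Core implementation:
"""
        BertModel returns: sequence_output, pooled_output, (hidden_states), (attentions)
            sequence_output: output of the last layer for the whole sequence
            pooled_output  : sequence_output after BertPooler. BertPooler takes
                             sequence_output[:, 0], the representation of the first token [CLS],
                             and applies dense(hidden_size, hidden_size) + Tanh;
            hidden_states  : hidden states of all BERT layers; only returned when output_hidden_states=True is set in BertConfig
            attentions     : attention weights of all BERT layers; only returned when output_attentions=True is set in BertConfig
        """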
    def forward(self, input_ids, token_type_ids=None, attention_mask=None,
                labels=None, position_ids=None, head_mask=None):
        # Flatten here because the input was split the same way as in Guo Da's approach.
        flat_input_ids = input_ids.view(-1, input_ids.size(-1))
        flat_position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None
        flat_token_type_ids = token_type_ids.view(-1, token_type_ids.size(
            -1)) if token_type_ids is not None else None
        flat_attention_mask = attention_mask.view(-1, attention_mask.size(
            -1)) if attention_mask is not None else None

        outputs = self.bert(input_ids=flat_input_ids,
                            position_ids=flat_position_ids,
                            token_type_ids=flat_token_type_ids,
                            attention_mask=flat_attention_mask,
                            head_mask=head_mask)

        # BertModel : sequence_output, pooled_output, (hidden_states), (attentions)
        # print('outputs size: ', outputs[2][11].shape)      # [8, 300, 768] [batch_size, max_seq_len, hidden_size]
        # print('outputs size: ', outputs[2][11][0].shape)   # [300, 768]    [max len, hidden_size]
        all_hidden_states = outputs[2]  # tuple of per-layer tensors, each [batch size, max len, hidden size]

        # layer = [batch size, max len, hidden size] --> [max len, batch size, hidden size] --> [batch size,hidden size]
        # last_4_layers = [4, batch_size, hidden_size]
        last_4_layers = torch.cat([layer.permute(1, 0, 2)[0].unsqueeze(0) for layer in all_hidden_states[-4:]], dim=0)

        # last_4_layers = [4, batch_size, hidden_size] --> [batch_size, 4, hidden_size]
        last_4_layers = last_4_layers.permute(1, 0, 2)

        # last_4_layers = [batch_size, 4, hidden_size] --> [batch size, 4 * hidden_size]
        last_4_layers = last_4_layers.reshape(last_4_layers.size(0), -1).contiguous()

        pooled_output = self.pooler(last_4_layers)
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)

        if labels is not None:
            if self.num_labels == 1:
                # we are doing regression
                loss_fct = MSELoss()
                loss = loss_fct(logits.view(-1), labels.view(-1))
            else:
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
                logits = F.softmax(logits, -1)
            outputs = (loss, logits)
        else:
            outputs = F.softmax(logits, -1)
        return outputs  # (loss, softmax probabilities) when labels are given, softmax probabilities only at prediction time

[Model structure] Output: max-pool the [CLS] outputs of the last 4 layers, then classify

# Core code: same as above, except that after last_4_layers = last_4_layers.permute(1, 0, 2):

        # max_pooling_out = [batch_size, 4, hidden_size] --> [batch_size, hidden_size]
        max_pooling_out = F.max_pool2d(last_4_layers, kernel_size=(last_4_layers.shape[1], 1)).squeeze(1)
        pooled_output = self.pooler(max_pooling_out)
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)

[Model structure] Output: mean-pool the [CLS] outputs of the last 4 layers, then classify

# Core code: same as above, except that after last_4_layers = last_4_layers.permute(1, 0, 2):

        # avg_pooling_out = [batch_size, 4, hidden_size] --> [batch_size, hidden_size]
        avg_pooling_out = F.avg_pool2d(last_4_layers, kernel_size=(last_4_layers.shape[1], 1)).squeeze(1)

        pooled_output = self.pooler(avg_pooling_out)
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
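
The three variants above (concatenation, max-pooling, mean-pooling) differ only in how the four [CLS] vectors of one example are combined. A minimal pure-Python sketch of that one step, with made-up 2-dimensional vectors standing in for the real hidden_size-dimensional ones:

```python
def concat(cls_vectors):
    """[4, hidden] -> [4 * hidden]: lay the layer vectors end to end."""
    return [value for vec in cls_vectors for value in vec]


def max_pool(cls_vectors):
    """[4, hidden] -> [hidden]: element-wise maximum over the layers."""
    return [max(column) for column in zip(*cls_vectors)]


def mean_pool(cls_vectors):
    """[4, hidden] -> [hidden]: element-wise mean over the layers."""
    return [sum(column) / len(column) for column in zip(*cls_vectors)]


# Made-up [CLS] vectors from the last 4 layers (hidden size 2 for readability).
cls_last4 = [[1.0, -2.0], [3.0, 0.0], [-1.0, 4.0], [1.0, 2.0]]

print(concat(cls_last4))     # 8 values
print(max_pool(cls_last4))   # [3.0, 4.0]
print(mean_pool(cls_last4))  # [1.0, 1.0]
```

Note the consequence for the layers that follow: concatenation feeds 4 * hidden_size features into the pooler, while both pooling variants keep the width at hidden_size. On a [batch_size, 4, hidden_size] tensor, last_4_layers.max(dim=1).values and last_4_layers.mean(dim=1) should give the same result as the pool2d calls.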

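[Model structure] Output: classify directly on the [CLS] output of the last layer

# Core code:
    def forward(self, input_ids, token_type_ids=None, attention_mask=None,
                labels=None, position_ids=None, head_mask=None):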
        flat_input_ids = input_ids.view(-1, input_ids.size(-1))
        flat_position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None
        flat_token_type_ids = token_type_ids.view(-1, token_type_ids.size(
            -1)) if token_type_ids is not None else None
        flat_attention_mask = attention_mask.view(-1, attention_mask.size(
            -1)) if attention_mask is not None else None

        # BertModel returns: sequence_output, pooled_output, (hidden_states), (attentions)
        outputs = self.bert(input_ids=flat_input_ids,
                            position_ids=flat_position_ids,
                            token_type_ids=flat_token_type_ids,
                            attention_mask=flat_attention_mask,
                            head_mask=head_mask)
        pooled_output = self.dropout(outputs[1])
        logits = self.classifier(pooled_output)

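[Model output] Stacking

# Method ①: average the previous level's results directly, then predict
# Method ②: first predict from each fold's results of the previous level, then vote on those predictions (voting)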
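
The two methods can give different answers on the same inputs. The sketch below is my reading of them, under the assumption that each fold outputs a probability vector for one test example; the function names and the numbers are made up for illustration.

```python
from collections import Counter


def average_then_predict(fold_probs):
    """Method 1: average the folds' probability vectors, then take the argmax."""
    n_classes = len(fold_probs[0])
    mean = [sum(p[c] for p in fold_probs) / len(fold_probs) for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)


def predict_then_vote(fold_probs):
    """Method 2: take each fold's argmax first, then pick the majority label."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in fold_probs]
    return Counter(votes).most_common(1)[0][0]


# Three folds, three classes; one fold is very confident in class 0.
fold_probs = [
    [0.90, 0.10, 0.00],
    [0.40, 0.60, 0.00],
    [0.45, 0.55, 0.00],
]
print(average_then_predict(fold_probs))  # 0
print(predict_then_vote(fold_probs))     # 1
```

Averaging keeps each fold's confidence, so one very sure fold can outweigh two hesitant ones; voting throws the confidence away and counts heads.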

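[Tuning tricks]

# Seen on Zhihu, link above~
- Simple fine-tuning
- Snapshot Ensemble (paper, code): the idea is that during training a network may converge to different local optima, and the models at these local optima are ensembled for prediction. I did something similar in an earlier competition: save the model weights of the last few epochs (in the first few epochs the model may not have learned much yet), then use each snapshot's validation score as its weight and take a weighted sum of the test-set predictions;
- Stochastic Weight Averaging (paper, code): builds a single model by combining the weights of the same network architecture from different training stages, then predicts with it. It is reported to outperform Snapshot Ensemble.
- Use the features of [CLS] from the last four layers: average fusion, weighted fusion, pooling
- Multi-Sample Dropout: improves the model's generalization
- Add LSTM char-level model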
Papers on fine-tuning BERT:
How to Fine-Tune BERT for Text Classification? (https://arxiv.org/abs/1905.05583)

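0x04 Summary

During the preliminary round of the CCF competition I ran a lot of BERT-related code. BERT really is strong in NLP competitions.

The code I wrote from my own understanding still falls short of other people's; more to learn.

Learn more from other people's ideas and code. My teammate taught me a lot this time~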