Paper_CV_4: Semantic Segmentation and Instance Segmentation


Segmentation Model

  • semantic segmentation
  • instance segmentation
Model Submitted
ResNet ENet, FRRN, AdapNet…
U-Net 2015.5
Mask R-CNN 2017.3
Mask Scoring R-CNN
Fast R-CNN
FPN 2017.4 Feature Pyramid Network. Feature pyramids are a basic component of recognition systems that detect objects at different scales.
Human Part Segmentation
FCN 2014.11
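SegNet 2015.11 A deep network for semantic image segmentation proposed by Cambridge, aimed at autonomous driving and intelligent robots. SegNet builds on FCN and follows a very similar idea.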
DeepLab v3(Vgg-16 / resnet-101)
LIP-JPPNet 2018.4
SS-JPPNet 2018.4
PGN 2018 “Instance-level Human Parsing via Part Grouping Network”, ECCV 2018 (Oral).
Parsing R-CNN 2018.11
Self-Correction for Human Parsing CVPR 2019. Ranked 1st in the CVPR 2019 LIP Challenge.

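FCN proposes replacing the last few fully connected layers with convolutions. This yields a 2D feature map, and a softmax applied afterwards gives the class of every pixel, which solves the segmentation problem.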
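
The idea of swapping a fully connected layer for a convolution can be sketched in NumPy. This is a minimal illustration, not the FCN implementation: the sizes, the random weights, and the helper names `fc_as_conv1x1` and `softmax` are assumptions made for the example. It covers only the case where the fully connected layer sees one feature vector, which corresponds to a 1x1 convolution; in the VGG-based FCN the first fully connected layer covers a larger window, which is not modeled here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 8 input channels, 3 classes, a 4x5 feature map.
C_IN, N_CLASSES, H, W = 8, 3, 4, 5
features = rng.standard_normal((H, W, C_IN))
fc_weight = rng.standard_normal((C_IN, N_CLASSES))
fc_bias = rng.standard_normal(N_CLASSES)


def fc_as_conv1x1(x, weight, bias):
    """Apply fully connected weights at every spatial position (a 1x1 convolution)."""
    # (H, W, C_IN) @ (C_IN, N_CLASSES) -> (H, W, N_CLASSES)
    return x @ weight + bias


def softmax(logits, axis=-1):
    """Numerically stable softmax along the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


score_map = fc_as_conv1x1(features, fc_weight, fc_bias)  # 2D map of class scores
probs = softmax(score_map)                               # per-pixel class probabilities

# The classifier applied to one pixel alone matches that position in the score map.
single_pixel = features[2, 3] @ fc_weight + fc_bias
```

Because the same weights slide over every position, the network accepts inputs of any spatial size and returns one class distribution per location instead of one per image.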



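  1. FCN-32s: upsample the pool5 feature directly by 32x to obtain the 32x upsampled feature, then apply a softmax prediction at every point of it to obtain the 32x upsampled prediction (the segmentation map).
  2. FCN-16s: first upsample the pool5 feature by 2x to obtain the 2x upsampled feature, add it element-wise to the pool4 feature, then upsample the sum by 16x and apply a softmax prediction to obtain the 16x upsampled prediction.
  3. FCN-8s: first add pool4 and the 2x upsampled feature element-wise, then upsample that sum by 2x and add it element-wise to pool3, i.e. perform one more round of feature fusion. The rest mirrors FCN-16s, with a final 8x upsampling.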
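
The three variants above can be sketched as follows. This is a shape-level illustration under stated assumptions: `upsample` uses nearest-neighbour repetition as a stand-in for the learned or bilinear upsampling a real network would use, and `pool3`, `pool4`, `pool5` are assumed to already be class-score maps with the same number of channels at strides 8, 16 and 32. The function names are hypothetical.

```python
import numpy as np


def upsample(x, factor):
    """Nearest-neighbour upsampling of an (H, W, C) map by an integer factor."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)


def fcn_heads(pool3, pool4, pool5):
    """Return FCN-32s, FCN-16s and FCN-8s score maps at input resolution."""
    fcn32 = upsample(pool5, 32)          # no fusion, one 32x step

    fuse4 = pool4 + upsample(pool5, 2)   # element-wise sum at stride 16
    fcn16 = upsample(fuse4, 16)

    fuse3 = pool3 + upsample(fuse4, 2)   # second fusion at stride 8
    fcn8 = upsample(fuse3, 8)
    return fcn32, fcn16, fcn8


# Hypothetical 64x96 input with 21 classes: strides 8, 16, 32 give these shapes.
pool3 = np.ones((8, 12, 21))
pool4 = np.ones((4, 6, 21))
pool5 = np.ones((2, 3, 21))
fcn32, fcn16, fcn8 = fcn_heads(pool3, pool4, pool5)
print(fcn32.shape, fcn16.shape, fcn8.shape)  # (64, 96, 21) (64, 96, 21) (64, 96, 21)
```

All three outputs reach the input resolution; they differ only in how many shallower, finer-grained maps contribute before the final upsampling.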

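The authors compare the three networks in the original paper, and the results clearly show FCN-32s < FCN-16s < FCN-8s; that is, fusing features from multiple layers improves segmentation accuracy.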



  1. FCN-style element-wise addition, corresponding to Caffe's EltwiseLayer and TensorFlow's tf.add()
  2. U-Net-style concatenation along the channel dimension, corresponding to Caffe's ConcatLayer and TensorFlow's tf.concat()
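
The difference between the two fusion styles shows up in the output shape. The sketch below uses NumPy's `np.add` and `np.concatenate` as stand-ins for the Caffe and TensorFlow operations named above; the NHWC layout and the tensor sizes are assumptions chosen for the example.

```python
import numpy as np

# Two hypothetical feature maps in NHWC layout: batch 1, 16x16, 64 channels.
a = np.ones((1, 16, 16, 64), dtype=np.float32)
b = np.full((1, 16, 16, 64), 2.0, dtype=np.float32)

added = np.add(a, b)                            # FCN-style: shapes must match
concatenated = np.concatenate([a, b], axis=-1)  # U-Net-style: channels stack up

print(added.shape)         # (1, 16, 16, 64)
print(concatenated.shape)  # (1, 16, 16, 128)
```

Addition keeps the channel count fixed and requires both inputs to have identical shapes, while concatenation only requires matching spatial dimensions and leaves it to the following convolution to mix the two sources.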


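  1. Downsampling + upsampling: Convolution + Deconvolution/Resize
  2. Multi-scale feature fusion: element-wise feature addition / feature concatenation along the channel dimension
  3. Obtaining a pixel-level segmentation map: classify every pixel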
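
The last step, classifying every pixel, can be sketched as an argmax over the class axis. The score values and the function name `segmentation_map` are made up for the example. Since softmax preserves the ordering of scores, taking the argmax of the raw scores gives the same labels as taking it after softmax.

```python
import numpy as np


def segmentation_map(scores):
    """Pick the highest-scoring class for each pixel of an (H, W, num_classes) map."""
    return scores.argmax(axis=-1)


# Hypothetical 2x2 image with 3 classes.
scores = np.array([
    [[2.0, 0.1, 0.3], [0.2, 1.5, 0.1]],
    [[0.0, 0.4, 3.0], [1.0, 0.9, 0.8]],
])
print(segmentation_map(scores))
# [[0 1]
#  [2 0]]
```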

DeepLab v3




Bayesian SegNet


“Fully Convolutional Networks for Semantic Segmentation”, Jonathan Long*, Evan Shelhamer*, Trevor Darrell; UC Berkeley

Official Caffe implementation (redirects to GitHub)



label_colours = [(0, 0, 0),
                 # 0=Background
                 # 1=Hat, 2=Hair, 3=Glove, 4=Sunglasses, 5=UpperClothes
                 # 6=Dress, 7=Coat, 8=Socks, 9=Pants, 10=Jumpsuits
                 # 11=Scarf, 12=Skirt, 13=Face, 14=LeftArm, 15=RightArm
                 # 16=LeftLeg, 17=RightLeg, 18=LeftShoe, 19=RightShoe
                 ]
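
A colour list like the one above is typically used to turn a map of class indices into an RGB image for viewing. The sketch below shows one way to do that with NumPy indexing. The three-entry `palette` is hypothetical: only the background colour (0, 0, 0) comes from these notes, and the other two colours and the name `decode_labels` are invented for illustration.

```python
import numpy as np

# Hypothetical 3-class palette; only the background colour is taken from the notes.
palette = np.array([(0, 0, 0), (255, 0, 0), (0, 255, 0)], dtype=np.uint8)


def decode_labels(mask, colours):
    """Map an (H, W) array of class indices to an (H, W, 3) RGB image."""
    return colours[mask]  # integer-array indexing looks up one colour per pixel


mask = np.array([[0, 1],
                 [2, 0]])
rgb = decode_labels(mask, palette)
print(rgb.shape)  # (2, 2, 3)
```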

PGN 2018



BiSeNet v1/v2 2018

Megvii, ECCV 2018



Mask R-CNN 2017


RoIAlign: a realigned version of RoIPool that makes it more accurate
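
The gain in accuracy comes from reading the feature map at fractional coordinates instead of snapping them to the grid. The sketch below contrasts the two on a single sample point. It is a simplified illustration, not the Mask R-CNN implementation: RoIAlign samples several points per output bin and aggregates them, which is omitted here, and the truncation used for the quantized case is an assumption. The helper name `bilinear_sample` is hypothetical.

```python
import numpy as np


def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2D feature map at a fractional (y, x) location."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feature.shape[0] - 1)  # clamp at the border
    x1 = min(x0 + 1, feature.shape[1] - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * feature[y0, x0] + dx * feature[y0, x1]
    bottom = (1 - dx) * feature[y1, x0] + dx * feature[y1, x1]
    return (1 - dy) * top + dy * bottom


# Toy 4x4 feature map whose value at (row, col) is 4 * row + col.
feature = np.arange(16, dtype=float).reshape(4, 4)
y, x = 1.5, 2.25

quantized = feature[int(y), int(x)]       # RoIPool-style: coordinates snapped to the grid
aligned = bilinear_sample(feature, y, x)  # RoIAlign-style: value at the exact location

print(quantized, aligned)  # 6.0 8.25
```

On this linear toy map the interpolated value equals 4 * 1.5 + 2.25 exactly, while the quantized lookup is off by the amount the coordinates were shifted.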



Searching for MobileNetV3 ICCV 2019

Naive-Student: Leveraging Semi-Supervised Learning in Video Sequences for Urban Scene Segmentation ECCV 2020

Swin Transformer (Microsoft)

