Mask R-CNN

Link to the paper: https://arxiv.org/abs/1703.06870

First, the introduction

Mask R-CNN was published in 2017 by Kaiming He and colleagues. It performs instance segmentation together with object detection and achieves excellent results: without any tricks, it outperformed the winners of the COCO 2016 challenge. The network design is also relatively simple. On top of Faster R-CNN, a third branch that predicts a segmentation mask for each RoI is added to the two existing branches (classification and bounding-box regression), as shown below:

Second, Mask R-CNN details

So why does the network work so well, and what are its details? The following sections explain.

Before introducing Mask R-CNN, it helps to be clear about what segmentation means, since that is the task Mask R-CNN addresses. The figure below shows several kinds of segmentation; Mask R-CNN performs one of them, instance segmentation.

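Semantic segmentation: classifies every pixel in the image, without distinguishing between different objects of the same class.
Instance segmentation: detects each object in the image and segments it, so that different objects of the same class receive separate masks.
Panoptic segmentation: describes everything in the image, giving every pixel a class label and every object its own instance identity.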
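The difference between the three tasks can be made concrete with a toy label map. The following is only an illustrative sketch: the 1x6 "image", the class id 1, and all variable names are made up for this example and do not come from the paper.

```python
import numpy as np

# Hypothetical 1x6 image strip: background, then two touching objects of one class.
semantic = np.array([0, 1, 1, 1, 1, 0])   # class id per pixel (1 = some object class)
instance = np.array([0, 1, 1, 2, 2, 0])   # instance id per pixel (0 = no object)

# Semantic segmentation sees one class and cannot tell the two objects apart.
num_classes_present = len(np.unique(semantic[semantic > 0]))

# Instance segmentation separates the two objects.
num_instances = len(np.unique(instance[instance > 0]))

# Panoptic segmentation keeps both: a (class, instance) pair for every pixel.
panoptic = np.stack([semantic, instance], axis=-1)

print(num_classes_present, num_instances, panoptic.shape)
```

Here the semantic map contains a single foreground class while the instance map contains two objects, which is exactly the information Mask R-CNN has to recover.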

The following figure illustrates the difference between these kinds of segmentation. Panoptic segmentation is the most difficult:

How does Mask R-CNN get good results?

The first difficulty of instance segmentation is that the location of each target must be detected and the target must be segmented at the same time. This requires combining object detection (drawing a box around each target) with semantic segmentation (classifying pixels to separate the target from its surroundings). Before Mask R-CNN, Faster R-CNN performed well in object detection and FCN performed well in semantic segmentation, so the natural approach is to combine Faster R-CNN with FCN. The authors do exactly that, but they combine the two in a clever way and achieve impressive results.

Earlier instance segmentation methods often segmented first and recognized afterwards, which tends to be slow and less accurate. For example, Dai et al. [cited in the paper] use a multi-stage cascade that first predicts segment proposals from bounding-box proposals and then classifies them.

So how does Mask R-CNN do it?

Mask R-CNN is based on Faster R-CNN, so let us first review Faster R-CNN. It is a typical two-stage object detection method: first the RPN generates candidate regions, and then each candidate region passes through RoI Pooling for detection (classification and bounding-box regression). Classification and regression share the preceding network.
What does Mask R-CNN change? It is also a two-stage method, and its first stage, the RPN, is the same as in Faster R-CNN. In the second stage, Mask R-CNN adds a third branch to Faster R-CNN that outputs a mask for each RoI. (This is the biggest difference from earlier methods, which first generate masks with a separate algorithm and then classify them.)

Naturally, this becomes a multi-task learning problem.

The network structure is as follows

The following figure shows two typical Mask R-CNN network structures. The authors designed two variants: the one on the left uses ResNet or ResNeXt as the backbone for feature extraction, and the one on the right uses FPN as the backbone. The authors point out that using FPN as the backbone gives the best results.

The design of the loss function is the essence of the network.

The loss function of Mask R-CNN is: $L=L_{cls}+L_{box}+L_{mask}$

Here is a brief introduction to $L_{mask}$. $L_{mask}$ classifies each pixel. For every RoI the mask branch has a $K \times m \times m$-dimensional output, where $K$ is the number of categories and $m \times m$ is the resolution of the mask predicted for the RoI. $L_{mask}$ is defined as the average binary cross-entropy loss, computed as follows. The mask branch outputs $K$ masks, one per category, and a per-pixel sigmoid performs a binary classification of whether the pixel belongs to that category. When computing the loss, if the ground-truth category of the RoI is $k$, only the loss of the $k$-th mask is computed; the other masks do not contribute. With $y_{ij} \in \{0, 1\}$ the ground-truth label of pixel $(i, j)$ and $\hat{y}^{k}_{ij}$ the sigmoid output of the $k$-th mask at that pixel, the binary cross-entropy is $L_{mask} = -\frac{1}{m^2} \sum_{i,j} \left[ y_{ij} \log \hat{y}^{k}_{ij} + (1 - y_{ij}) \log (1 - \hat{y}^{k}_{ij}) \right]$. This differs from FCN, which applies a per-pixel softmax over the $K$ categories and then computes a softmax loss, so that the categories compete with each other. Which mask is selected as the final output during inference? The one for the category predicted by the classification branch. The authors report that this works better than the softmax formulation. [I think this makes sense: the loss is simpler and the classification result is reused, so an improvement is to be expected.]
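To make the mask loss more concrete, here is a minimal NumPy sketch for a single RoI. It assumes the mask branch produces raw logits of shape (K, m, m) and that the ground-truth mask is already resized to m*m; the function name `mask_loss` and its arguments are illustrative and not taken from any official implementation.

```python
import numpy as np

def mask_loss(mask_logits, gt_mask, gt_class):
    """Average binary cross-entropy on the ground-truth class channel only.

    mask_logits: (K, m, m) raw outputs of the mask branch for one RoI
    gt_mask:     (m, m) binary ground-truth mask (values 0 or 1)
    gt_class:    integer index k of the RoI's ground-truth category
    """
    logits = mask_logits[gt_class]           # the other K-1 channels are ignored
    probs = 1.0 / (1.0 + np.exp(-logits))    # per-pixel sigmoid
    eps = 1e-12                              # guards against log(0)
    bce = -(gt_mask * np.log(probs + eps)
            + (1.0 - gt_mask) * np.log(1.0 - probs + eps))
    return bce.mean()                        # average over the m*m pixels

# With all-zero logits every pixel has probability 0.5, so the loss is log(2).
logits = np.zeros((3, 4, 4))
target = np.ones((4, 4))
print(mask_loss(logits, target, gt_class=1))
```

Because only the channel of the ground-truth category is indexed, changing the logits of the other categories leaves the loss unchanged, which is the decoupling between categories described above.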

Another innovation: ROI Align

In addition, segmentation requires more accurate pixel positions. In Faster R-CNN, two quantization operations are performed in RoI Pooling. The first happens when the target in the original image is mapped onto the conv5 feature map: with a downscaling factor of 32 and a target size of 600, for example, the result (600 / 32 = 18.75) is not an integer and has to be quantized. The second happens when the region is divided into bins: if the target on the feature map is 5*5 and the RoI Pooling output is 2*2, then 5 is not a multiple of 2 and the bin boundaries have to be quantized again. After RoI Pooling the features can therefore be noticeably misaligned with the original image positions. For this reason the authors replaced RoI Pooling with RoIAlign, which keeps the features aligned with the input pixels during downsampling and makes the result more accurate.
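The two roundings can be traced with a few lines of arithmetic. This is a simplified sketch: it assumes both quantization steps use floor, whereas real RoI Pooling implementations round in slightly different ways, and the helper name `quantized_extent` is made up for this example.

```python
import math

def quantized_extent(size_px, stride, bins):
    """Follow one side of an RoI through the two roundings of RoI Pooling."""
    exact = size_px / stride             # exact extent on the feature map
    on_grid = math.floor(exact)          # first quantization: snap to the grid
    bin_exact = on_grid / bins           # exact bin width after the first rounding
    bin_on_grid = math.floor(bin_exact)  # second quantization: integer bin width
    return exact, on_grid, bin_exact, bin_on_grid

print(quantized_extent(600, 32, 2))   # (18.75, 18, 9.0, 9)
print(quantized_extent(160, 32, 2))   # (5.0, 5, 2.5, 2)
```

In the first case the 0.75 feature-map cells that are dropped correspond to 0.75 * 32 = 24 pixels in the original image. In the second case a 5-cell extent split into 2 bins loses half a cell per bin, which is the second source of misalignment.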

How is ROI Align done?

RoIAlign removes all quantization operations and no longer rounds coordinates. In the following figure, the dashed grid represents the feature map and the black box represents the position of the object, whose coordinates are no longer integers and may fall between grid points. A 2*2 align-pooling is then performed, so there are 4 bins. With 4 sampling points per bin, as in the figure, the value at each sampling point is computed by bilinear interpolation from the 4 nearest grid points of the feature map, and the values of the sampling points in a bin are then aggregated (by max or average). How is the number of sampling points chosen? It is configurable; the default is 4 points per bin.
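A rough single-channel sketch of this procedure is given below. It is written under several assumptions: feature values are treated as located at integer coordinates, each sampling point is evaluated by bilinear interpolation, the sampling points are spaced regularly inside each bin, and the bin value is their average. Library implementations differ in coordinate conventions and other details, and the names `bilinear` and `roi_align` are illustrative only.

```python
import numpy as np

def bilinear(feat, y, x):
    """Bilinearly interpolate a 2-D feature map at a real-valued point (y, x)."""
    h, w = feat.shape
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx) + feat[y0, x1] * (1 - dy) * dx
            + feat[y1, x0] * dy * (1 - dx) + feat[y1, x1] * dy * dx)

def roi_align(feat, box, out_size=2, samples=2):
    """box = (y1, x1, y2, x2) in feature-map coordinates, never rounded."""
    y1, x1, y2, x2 = box
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            vals = []
            for a in range(samples):
                for b in range(samples):
                    # samples * samples regularly spaced points inside bin (i, j)
                    y = y1 + (i + (a + 0.5) / samples) * bin_h
                    x = x1 + (j + (b + 0.5) / samples) * bin_w
                    vals.append(bilinear(feat, y, x))
            out[i, j] = np.mean(vals)   # average the sampled values
    return out

feat = np.add.outer(10.0 * np.arange(5), np.arange(5))  # feat[y, x] = 10*y + x
print(roi_align(feat, (0.5, 0.5, 3.5, 3.5)))
```

With `out_size=2` and `samples=2` this gives 4 bins with 4 sampling points each, matching the configuration described above. On the linear test map each output equals the map value at the bin centre, for example 13.75 for the top-left bin.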

[Additional knowledge] RoIWarp: adds a layer in front of RoI Pooling that rescales the RoI region to a fixed size before pooling, which reduces the amount of quantization.

Network training

Training is basically the same as for Faster R-CNN. An RoI with IoU of at least 0.5 with a ground-truth box is a positive sample, and $L_{mask}$ is computed only on positive samples. Images are resized so that the short side is 800 pixels, the ratio of positive to negative samples is 1:3, and the RPN anchors span 5 scales and 3 aspect ratios.
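The positive-sample rule relies on intersection over union (IoU) between an RoI and the ground-truth boxes. Below is a small sketch, assuming boxes are given as (x1, y1, x2, y2) and that a threshold of 0.5 is inclusive; the helper names `box_iou` and `is_positive` are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_positive(roi, gt_boxes, thresh=0.5):
    """An RoI is positive if it overlaps any ground-truth box enough."""
    return any(box_iou(roi, gt) >= thresh for gt in gt_boxes)

gt = [(0, 0, 10, 10)]
print(is_positive((2, 0, 12, 10), gt))   # IoU = 80 / 120, positive
print(is_positive((5, 0, 15, 10), gt))   # IoU = 50 / 150, negative
```

Only RoIs for which `is_positive` holds would contribute to the mask loss; the others are used for the classification loss only.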

Inference details

At inference, Mask R-CNN with the ResNet backbone generates 300 candidate regions for classification and regression, and the FPN variant generates 1000, after which non-maximum suppression is applied. The mask branch is then run on the 100 highest-scoring detection boxes. This is not the parallel computation used during training; the authors explain that it improves both accuracy and speed. The mask branch predicts masks for all K categories, but only the mask of the category k chosen by the classification branch is kept. This mask is resized to the size of the RoI and binarized with a threshold of 0.5. (The mask output is not the size of the RoI but a relatively small m*m map, so a resize is required; see the figure above for the value of m. Because resizing involves interpolation, the result is no longer binary, which is why the threshold is applied.)
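The mask post-processing step can be sketched as follows. This is an illustration only: it assumes the mask branch output has already passed through a sigmoid, that the box has integer pixel coordinates (x1, y1, x2, y2), and that a simple bilinear resize is acceptable. The functions `resize_bilinear` and `roi_mask_to_binary` are made-up names, not part of any released code.

```python
import numpy as np

def resize_bilinear(img, out_h, out_w):
    """Bilinear resize of a 2-D array, sampling at pixel centres."""
    in_h, in_w = img.shape
    ys = np.clip((np.arange(out_h) + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]          # vertical interpolation weights
    wx = (xs - x0)[None, :]          # horizontal interpolation weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def roi_mask_to_binary(mask_probs, pred_class, box, threshold=0.5):
    """mask_probs: (K, m, m) sigmoid outputs; box: integer (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    prob = mask_probs[pred_class]                       # keep the predicted category only
    resized = resize_bilinear(prob, y2 - y1, x2 - x1)   # soft mask at RoI size
    return resized >= threshold                         # binarize after interpolation

probs = np.zeros((2, 4, 4))
probs[1, :, :2] = 0.9            # left half of the class-1 mask is foreground
probs[1, :, 2:] = 0.1
print(roi_mask_to_binary(probs, 1, (10, 20, 18, 28)).astype(int))
```

The interpolated values near the object boundary lie between 0.1 and 0.9, which shows why the threshold has to be applied after the resize rather than before it.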

Third, the experimental results:

The experimental results are strong: Mask R-CNN easily beats the previous champion, FCIS (which uses multi-scale training, horizontal flip testing, OHEM, etc.).

Ablation experiments:

The following table covers essentially all of the ablation comparisons:

Table (a): the deeper the network, the better the results, and FPN improves them further.
Table (b): sigmoid is better than softmax.
Table (c, d): RoIAlign improves the results, most noticeably on AP75, which shows that it helps localization accuracy.
Table (e): an FCN mask branch is better than a fully connected one (because FCN does not destroy spatial relationships).
In addition, the authors compare two designs for the mask branch: predicting one mask per category, and predicting a single mask shared by all categories. Predicting one mask per category is slightly better, 30.3 vs 29.7.

Object detection results:

The following table shows that the box detection results are very accurate even when the mask output is not used at prediction time. In the table, 'Faster R-CNN, RoIAlign' is the result of replacing RoI Pooling with RoIAlign. It is roughly one point higher than with RoI Pooling, but still 0.9 points lower than Mask R-CNN. The authors attribute this remaining gap to multi-task training: adding the mask branch changes the loss, which indirectly improves the features learned by the backbone network.

In terms of inference time, Mask R-CNN with the FPN backbone takes about 195 ms per image, faster than the roughly 400 ms of the ResNet backbone.

Human keypoint detection:

How does keypoint detection differ from mask prediction in Mask R-CNN?

For human keypoint detection, the training target is a one-hot m*m mask: only one pixel in the mask is foreground and all others are background.
The final output is an m^2-way softmax rather than a sigmoid; the authors explain that this encourages the detection of a single point.
The final output resolution is 56 * 56 rather than 28 * 28; the authors explain that a higher resolution helps localize keypoints accurately.
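The m^2-way softmax for a single keypoint can be sketched as below. This is a minimal illustration assuming one (m, m) logit map per keypoint type and a target given as the row and column of the single foreground pixel; `keypoint_loss` and `predict_keypoint` are hypothetical names.

```python
import numpy as np

def keypoint_loss(logits, gt_y, gt_x):
    """Cross-entropy over an m*m-way softmax for one keypoint.

    logits:       (m, m) raw output map for one keypoint type
    (gt_y, gt_x): the single foreground pixel of the one-hot target
    """
    width = logits.shape[1]
    flat = logits.reshape(-1)
    flat = flat - flat.max()                        # numerical stability
    log_probs = flat - np.log(np.exp(flat).sum())   # log-softmax over all m*m cells
    return -log_probs[gt_y * width + gt_x]

def predict_keypoint(logits):
    """Return the (y, x) cell with the highest score."""
    y, x = np.unravel_index(np.argmax(logits), logits.shape)
    return int(y), int(x)

logits = np.zeros((56, 56))
print(keypoint_loss(logits, 10, 20))   # uniform prediction: log(56 * 56)
logits[10, 20] = 8.0
print(predict_keypoint(logits))        # (10, 20)
```

Unlike the per-pixel sigmoid of the mask branch, the softmax makes all m*m cells compete, so raising the score of one location lowers the probability of every other location.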
