Mask R-CNN learning

tags: Target Detection artificial intelligence Deep learning

At last! I've started learning Mask R-CNN! Hats off to the masters!
First, a brief introduction to the Faster R-CNN model:

It is a two-stage detector. The first stage is the Region Proposal Network (RPN), which proposes candidate bounding boxes. The second stage is the core: it uses RoIPool to extract features from each candidate box and performs classification and bounding-box regression. The two stages share the same features.
Faster R-CNN uses RoIPool to extract a small feature map from each RoI. RoIPool first quantizes the floating-point RoI to the discrete granularity of the feature map, then subdivides the quantized RoI into spatial bins, which are themselves quantized, and finally aggregates the feature values covered by each bin (usually by max pooling).
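
The two quantization steps in RoIPool can be sketched in a few lines of plain Python. This is only a minimal illustration, not the original implementation: the function name `roi_pool`, the stride of 16, the 2 x 2 output size, and the inclusive handling of the RoI's far edge are all assumptions chosen for the example.

```python
import math

def roi_pool(feature, roi, stride=16, out_size=2):
    """Simplified RoIPool (max pooling) on a 2-D feature map given as a list of rows.

    roi is (x1, y1, x2, y2) in image pixels. stride and out_size are
    illustrative assumptions, not values taken from the notes above.
    """
    fh, fw = len(feature), len(feature[0])
    # Quantization 1: snap the floating-point RoI onto the feature-map grid.
    x1, y1, x2, y2 = (int(math.floor(c / stride + 0.5)) for c in roi)
    w = max(x2 - x1 + 1, 1)
    h = max(y2 - y1 + 1, 1)
    pooled = []
    for by in range(out_size):
        # Quantization 2: bin boundaries are rounded to whole cells.
        ys = min(max(y1 + math.floor(by * h / out_size), 0), fh)
        ye = min(max(y1 + math.ceil((by + 1) * h / out_size), 0), fh)
        row = []
        for bx in range(out_size):
            xs = min(max(x1 + math.floor(bx * w / out_size), 0), fw)
            xe = min(max(x1 + math.ceil((bx + 1) * w / out_size), 0), fw)
            cells = [feature[y][x] for y in range(ys, ye) for x in range(xs, xe)]
            # Aggregate the cells covered by the bin with max pooling.
            row.append(max(cells) if cells else 0.0)
        pooled.append(row)
    return pooled

feature = [[y * 6 + x for x in range(6)] for y in range(6)]
print(roi_pool(feature, (20.0, 20.0, 70.0, 70.0)))  # [[14, 16], [26, 28]]
print(roi_pool(feature, (17.0, 17.0, 71.0, 71.0)))  # same output: the shift is rounded away
```

The two calls use RoIs that differ by a few pixels yet produce identical output, because the sub-cell position is lost in the rounding.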

Now for the main topic: Mask R-CNN
4. Mask R-CNN does two things at once: it efficiently detects the objects in an image and generates a high-quality segmentation mask for each instance.
5. The main application of the paper is instance segmentation: correctly detecting all objects in an image while precisely segmenting each instance. (This amounts to combining object detection and semantic segmentation.)
6. As the figure shows, Mask R-CNN adds a branch to the original Faster R-CNN that outputs a mask for each region of interest (RoI). This branch is a small fully convolutional network (FCN) that predicts the segmentation mask pixel by pixel. The mask output requires a much finer spatial layout of the object than the class and box outputs do.
7. Faster R-CNN was not designed for pixel-to-pixel alignment between network inputs and outputs. Mask R-CNN therefore introduces the RoIAlign layer, which preserves exact spatial locations. This small change improves mask accuracy by a relative 10% to 50%, with larger gains under stricter localization metrics. In addition, mask prediction and class prediction are decoupled: a binary mask is predicted for each class independently, without competition among classes, and the network's RoI classification branch is relied on to predict the category.
8. The loss is L = L_cls + L_box + L_mask. For each RoI, the mask branch produces a K·m^2-dimensional output, that is, K binary masks of resolution m x m, one per class, where K is the total number of classes. A per-pixel sigmoid is applied to this output, and L_mask is defined as the average binary cross-entropy loss. For an RoI whose ground-truth class is k, L_mask is defined only on the k-th mask; the other mask outputs do not contribute to the loss.
9. From the definition of the mask loss above, the network generates a mask for every class, with no competition among classes. The dedicated classification branch predicts the class label, which is then used to select the corresponding output mask.
10. Mask representation: a mask encodes the spatial layout of the input object. Class labels and box offsets are inevitably collapsed into short output vectors by fully connected layers, whereas the spatial structure of a mask can be extracted naturally through the pixel-to-pixel correspondence provided by convolutions. Specifically, an FCN predicts an m x m mask for each RoI, which lets every layer of the mask branch keep an explicit m x m object spatial layout instead of collapsing it into a vector representation, which would lose the spatial dimensions.
11. This pixel-to-pixel behavior requires the RoI features, each of which is itself a small feature map, to be well aligned so that the per-pixel spatial correspondence is preserved.
12. Why RoIAlign is introduced: as the description of RoIPool above shows, RoIPool performs quantization, and this quantization introduces misalignments between the RoI and the extracted features. The misalignment may not affect classification, which is robust to small translations, but it has a large negative effect on predicting pixel-accurate masks.
13. What RoIAlign does: since the misalignment in RoIPool comes from quantization, the natural fix is to avoid any quantization of the RoI boundaries or bins. Four regularly sampled locations are placed in each RoI bin, bilinear interpolation computes the exact values of the input features at these four locations, and the values are then aggregated (using max or average).

The blue dashed grid represents the feature map. The value of each sampling point in an RoI bin is computed by bilinear interpolation from the nearby grid points of the feature map, so no quantization is needed.
The paper also points out that, as long as no quantization is performed, the results are not sensitive to the exact sampling locations or to the number of sampling points.
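
The RoIAlign procedure in item 13 can be sketched as follows. This is a rough plain-Python illustration under stated assumptions, not the paper's code: the names `bilinear` and `roi_align`, the stride of 16, the 2 x 2 output, the average aggregation, and the convention that `feature[y][x]` is the value at grid point (x, y) are all choices made for the example.

```python
def bilinear(feature, y, x):
    """Bilinearly interpolate a 2-D feature map (list of rows) at real-valued (y, x)."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def roi_align(feature, roi, stride=16, out_size=2, samples=2):
    """Simplified RoIAlign: the RoI, the bins, and the sample points are never rounded."""
    x1, y1, x2, y2 = (c / stride for c in roi)  # stays floating point
    bin_w = (x2 - x1) / out_size
    bin_h = (y2 - y1) / out_size
    pooled = []
    for by in range(out_size):
        row = []
        for bx in range(out_size):
            vals = []
            for sy in range(samples):
                for sx in range(samples):
                    # samples x samples regularly spaced points per bin (4 by default).
                    y = y1 + (by + (sy + 0.5) / samples) * bin_h
                    x = x1 + (bx + (sx + 0.5) / samples) * bin_w
                    vals.append(bilinear(feature, y, x))
            row.append(sum(vals) / len(vals))  # average; max is the other option
        pooled.append(row)
    return pooled

ramp = [[float(x) for x in range(6)] for _ in range(6)]  # value equals the x coordinate
print(roi_align(ramp, (16.0, 16.0, 48.0, 48.0))[0])  # [1.5, 2.5]
print(roi_align(ramp, (20.0, 16.0, 52.0, 48.0))[0])  # [1.75, 2.75]
```

On the ramp feature map, moving the RoI 4 pixels to the right (a quarter of a cell at stride 16) shifts the pooled values by exactly 0.25, which is the sub-cell sensitivity that the quantized RoIPool lacks.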
14. The paper divides the network into two parts: the backbone (for feature extraction over the entire image) and the head (for bounding-box regression and mask prediction), which is applied to each RoI separately. Backbone architectures are named using the pattern network-depth-features (for example, ResNet-50-C4).
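
Returning to the mask loss in item 8 and the mask selection in item 9, the idea can be sketched in plain Python. This is a hedged illustration only: the function names `mask_loss` and `select_mask`, the nested-list layout of the K masks, and the 0.5 binarization threshold are assumptions made for the example, not details taken from these notes.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def mask_loss(logits, gt_class, gt_mask):
    """Average per-pixel binary cross-entropy, computed on the ground-truth class's mask only.

    logits:   K masks, each an m x m list of raw scores (the K*m^2 output).
    gt_class: index k of the RoI's ground-truth class.
    gt_mask:  m x m list of 0/1 targets.
    """
    pred = logits[gt_class]  # the other K-1 masks do not enter the loss
    total, count = 0.0, 0
    for pred_row, gt_row in zip(pred, gt_mask):
        for z, t in zip(pred_row, gt_row):
            # Stable form of -[t*log(s) + (1-t)*log(1-s)] with s = sigmoid(z).
            total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count

def select_mask(logits, pred_class, threshold=0.5):
    """Pick the mask of the class chosen by the classification branch and binarize it."""
    return [[1 if sigmoid(z) >= threshold else 0 for z in row]
            for row in logits[pred_class]]

logits = [
    [[5.0, -5.0], [3.0, 2.0]],   # class 0 mask scores
    [[2.0, -1.0], [-0.5, 3.0]],  # class 1 mask scores
]
target = [[1, 0], [0, 1]]
print(mask_loss(logits, 1, target))  # depends only on the class 1 scores
print(select_mask(logits, 1))        # [[1, 0], [0, 1]]
```

Because each pixel uses an independent sigmoid rather than a softmax across classes, changing the scores of class 0 leaves the loss for an RoI of class 1 untouched, which is what "no competition among classes" means in practice.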
