Mask R-CNN summary

tags: segmentation

Paper: https://arxiv.org/abs/1703.06870

Contents:

  • Paper summary

  • Key points of the algorithm

  • Bilinear interpolation

Paper Summary:

Mask R-CNN adds a branch to Faster R-CNN that predicts a segmentation mask for each RoI. This branch runs in parallel with the existing classification and bounding-box regression branches. The main improvement is replacing RoI Pooling with RoIAlign, which improves the accuracy of the mask.

Key points of the algorithm:

Mask R-CNN has two stages: in the first stage, the RPN proposes candidate boxes; in the second stage, classification, bounding-box regression, and mask segmentation are performed on each candidate.

Loss:

Multi-task loss function:

$$L = L_{cls} + L_{box} + L_{mask}$$

The first two terms are the same as in Faster R-CNN. L_mask is the average binary cross-entropy loss. For each RoI the mask branch has a K * m * m-dimensional output: K binary masks at m * m resolution, one per class. A sigmoid is applied to every pixel. The benefit is that a mask is generated for each class, so there is no competition between classes and they do not interfere with one another.

Note that when the loss is computed, binary cross-entropy is not evaluated on the sigmoid outputs of all classes. Only the sigmoid output of the class the RoI actually belongs to contributes to the loss. At test time, the mask is selected using the class predicted by the classification branch. In this way, mask prediction and class prediction are decoupled.
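The per-RoI mask loss described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the paper's implementation: the names `mask_loss`, `mask_logits`, `gt_mask` and `gt_class` are hypothetical, the inputs are nested lists, and there are no numerical safeguards for extreme logits.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mask_loss(mask_logits, gt_mask, gt_class):
    """Average per-pixel binary cross-entropy on one class channel.

    mask_logits: K masks of m x m raw outputs (one per class) for one RoI.
    gt_mask:     m x m ground-truth mask with entries 0 or 1.
    gt_class:    index of the class the RoI belongs to (assumed known).
    """
    channel = mask_logits[gt_class]  # the other K - 1 masks do not contribute
    total, count = 0.0, 0
    for logit_row, gt_row in zip(channel, gt_mask):
        for z, t in zip(logit_row, gt_row):
            p = sigmoid(z)  # per-pixel sigmoid, independent of other classes
            total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
            count += 1
    return total / count

# Two classes, 2 x 2 masks; the RoI belongs to class 1.
logits = [[[9.0, 9.0], [9.0, 9.0]],      # class 0: ignored by the loss
          [[2.0, -2.0], [-2.0, 2.0]]]    # class 1: ground-truth class
gt = [[1, 0], [0, 1]]
loss = mask_loss(logits, gt, gt_class=1)
```

Changing the class 0 channel leaves `loss` unchanged, which is the "no competition between classes" property in miniature.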


RoIAlign:

Faster R-CNN has a problem: the feature map is misaligned with the original image, which affects detection accuracy. Mask R-CNN proposes RoIAlign as a replacement for RoI Pooling. RoIAlign preserves spatial positions more accurately.

First of all, why use RoIAlign?
RoIAlign is a region feature aggregation method proposed in the Mask R-CNN paper. It solves the misalignment caused by the two quantization operations in RoI Pooling.

The two quantization steps are:

  • The region proposal's x, y, w, h are usually fractional, but they are rounded to integers for convenience.
  • The rounded region is divided evenly into k × k cells, and the boundaries of each cell are rounded to integers again.

[Figure: the two quantization steps of RoI Pooling (image from a Zhihu article)]

After these two quantization steps, the candidate box has drifted away from the position that was originally regressed, and this deviation can affect the accuracy of detection or segmentation. In the paper, the authors describe this as "misalignment".

To solve this problem, RoIAlign removes the rounding, keeps the fractional coordinates, and uses bilinear interpolation to obtain values at floating-point coordinates. In practice, however, RoIAlign does not simply fill in the coordinate points on the boundary of the candidate region and then pool them. The sampling was redesigned. Two examples illustrate this:

Example 1
In the figure for this example, the dashed grid is the feature map and the solid box is the RoI, which is divided into 2x2 cells. If the number of sampling points is 4, each cell is first split evenly into four smaller squares (red lines), and the center of each small square is a sampling point. The coordinates of these sampling points are usually floating-point, so the value at each sampling point is obtained by bilinear interpolation from its four neighboring feature-map pixels (shown by the arrows). Max pooling over the four sampling points in each cell then gives the final RoIAlign result.

Note that in the authors' experiments, 4 sampling points gave the best performance, but even a single sampling point performed almost as well. RoIAlign visits far fewer points than RoI Pooling, yet it performs better, largely because it solves the misalignment problem.

Example 2
A more concrete example makes the misalignment described above easier to see.

The figure for this example shows a Faster R-CNN detection. The input is an 800 * 800 image containing a 665 * 665 bounding box (around a dog). After the backbone network extracts features, the feature map has a stride of 32, so the side lengths of both the image and the bounding box become 1/32 of their input values. 800 is exactly divisible by 32 and becomes 25. However, 665 divided by 32 is 20.78, which is fractional, so RoI Pooling quantizes it directly to 20. Next, the features inside the box must be pooled to a size of 7 * 7, so the box is divided evenly into 7 * 7 rectangular cells. Each cell has a side length of 20 / 7 = 2.86, which is again fractional, and RoI Pooling quantizes it to 2. After these two quantizations, the candidate region has visibly drifted (the green part of the figure). More importantly, a deviation of 0.1 pixels on this feature map corresponds to 3.2 pixels on the original image, so a deviation of 0.8 corresponds to about 25.6 pixels on the original image. This difference cannot be ignored.
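The arithmetic in this example can be reproduced with a few lines of Python. The function name `quantization_example` and its dictionary keys are hypothetical, and the sketch follows the simplified rounding model used in the text rather than any particular RoI Pooling implementation.

```python
import math

def quantization_example(image_side=800, box_side=665, stride=32, bins=7):
    """Reproduce the two quantization steps of the worked example."""
    feat_side = image_side / stride          # 800 / 32, divides exactly
    box_on_feat = box_side / stride          # 665 / 32, fractional
    box_quantized = math.floor(box_on_feat)  # first quantization
    bin_side = box_quantized / bins          # side of one of the 7 x 7 cells
    bin_quantized = math.floor(bin_side)     # second quantization
    # Error of the first quantization, mapped back to input-image pixels.
    first_error_px = (box_on_feat - box_quantized) * stride
    return {
        "feat_side": feat_side,
        "box_on_feat": box_on_feat,
        "box_quantized": box_quantized,
        "bin_side": bin_side,
        "bin_quantized": bin_quantized,
        "first_error_px": first_error_px,
    }

result = quantization_example()
```

With the default arguments, the first quantization alone discards 0.78125 feature-map pixels, which is 25 pixels on the input image.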

The idea behind RoIAlign is simple: remove the quantization and use bilinear interpolation to obtain values at floating-point coordinates, so that the whole feature aggregation process becomes a continuous operation.

The RoIAlign procedure is as follows:

  • Iterate over each candidate region, keeping its floating-point boundaries unquantized.
  • Divide the candidate region into k × k cells, leaving the boundaries of each cell unquantized.
  • In each cell, compute four fixed sampling positions, obtain the values at these four positions by bilinear interpolation, and then apply max pooling.

A note on the third step: the sampling positions are determined inside each rectangular cell (bin) by a fixed rule. If the number of sampling points is 1, the sampling point is the center of the cell. If the number of sampling points is 4, the cell is divided evenly into four smaller squares and their centers are used. The coordinates of these sampling points are usually floating-point, so interpolation is needed to obtain their values.
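The three steps can be sketched for a single RoI on a single-channel feature map. This is an illustrative sketch under assumptions, not the paper's or any library's implementation: `bilinear` and `roi_align` are hypothetical names, the feature map is a nested list, the value of pixel (row, column) is assumed to sit exactly at that integer coordinate, and sampling points are clamped to the map.

```python
def bilinear(feat, y, x):
    """Value of a 2-D feature map (list of rows) at fractional (y, x)."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)   # clamp to the map (a simplification)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    v, u = y - y0, x - x0
    return ((1 - u) * (1 - v) * feat[y0][x0] + u * (1 - v) * feat[y0][x1]
            + (1 - u) * v * feat[y1][x0] + u * v * feat[y1][x1])

def roi_align(feat, roi, k=2, samples=2, pool=max):
    """Pool one RoI (y1, x1, y2, x2 in feature-map units) to k x k values.

    samples=2 gives 2 x 2 = 4 sampling points per cell; samples=1 gives
    the cell center only.
    """
    ry1, rx1, ry2, rx2 = roi
    bin_h, bin_w = (ry2 - ry1) / k, (rx2 - rx1) / k   # no rounding anywhere
    out = []
    for by in range(k):
        row = []
        for bx in range(k):
            vals = []
            for sy in range(samples):
                for sx in range(samples):
                    # center of sub-square (sy, sx) inside cell (by, bx)
                    y = ry1 + (by + (sy + 0.5) / samples) * bin_h
                    x = rx1 + (bx + (sx + 0.5) / samples) * bin_w
                    vals.append(bilinear(feat, y, x))
            row.append(pool(vals))
        out.append(row)
    return out

# A 5 x 5 map whose value is x + 10 * y, so interpolated values are easy to check.
feat = [[x + 10 * y for x in range(5)] for y in range(5)]
pooled = roi_align(feat, (0.5, 0.5, 3.5, 3.5), k=2, samples=2)
```

The `pool` argument defaults to `max` to match the description above. Passing a different aggregation function is a design choice of this sketch.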

RoIAlign backpropagation

Backpropagation for conventional RoI Pooling is given by:

$$\frac{\partial L}{\partial x_i} = \sum_r \sum_j \left[ i = i^*(r, j) \right] \frac{\partial L}{\partial y_{rj}}$$

Here x_i is a pixel on the feature map before pooling, and y_rj is the j-th point of the r-th candidate region after pooling. i*(r, j) is the source of the value y_rj, that is, the coordinate of the pixel that max pooling selected. As the equation shows, the gradient is passed back to x_i only if a pooled point actually used the value of x_i during pooling (that is, only if i = i*(r, j)).

By analogy with RoI Pooling, backpropagation for RoIAlign needs a small modification. In RoIAlign, x_{i*(r, j)} is a sampling point at a floating-point coordinate (computed during forward propagation). On the feature map before pooling, every point x_i whose horizontal and vertical distances to x_{i*(r, j)} are both less than 1 should receive gradient from the corresponding y_rj. RoIAlign backpropagation is therefore:

$$\frac{\partial L}{\partial x_i} = \sum_r \sum_j \left[ d(i, i^*(r, j)) < 1 \right] (1 - \Delta h)(1 - \Delta w) \frac{\partial L}{\partial y_{rj}}$$

In this equation, d(.) is the distance between two points, and Δh and Δw are the differences between the vertical and horizontal coordinates of x_i and x_{i*(r, j)}. They act as bilinear interpolation coefficients that multiply the original gradient.
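As a small illustration of the factor (1 - Δh)(1 - Δw), the sketch below computes the share of the upstream gradient that each of the four integer neighbors of one sampling point would receive. The function name `roi_align_grad_weights` is hypothetical, and boundary handling is ignored.

```python
import math

def roi_align_grad_weights(y, x):
    """Map each integer neighbor (yi, xi) of the sampling point (y, x)
    to the weight (1 - dh) * (1 - dw) applied to the upstream gradient."""
    y0, x0 = math.floor(y), math.floor(x)
    weights = {}
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            dh, dw = abs(y - yi), abs(x - xi)   # vertical / horizontal offsets
            weights[(yi, xi)] = (1 - dh) * (1 - dw)
    return weights

weights = roi_align_grad_weights(3.4, 1.2)
```

The four weights sum to 1, so the gradient of y_rj is split among the neighbors rather than sent to a single pixel as in RoI Pooling.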

Bilinear interpolation:

Why use bilinear interpolation

When an image is enlarged or shrunk, the position of each pixel of the new image must be computed in the original image. If the computed position is not an integer, interpolation is needed. The simplest method is nearest-neighbor interpolation, which assigns to the new pixel the value of the closest pixel in the original image. It is easy to understand and implement, but not very practical: it introduces distortion and produces a checkerboard effect. Bilinear interpolation is the more practical approach.

One-dimensional linear interpolation

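Suppose the points (x0, y0) and (x1, y1) are known, x is known, and y is the value we want. From junior high school geometry:

$$\frac{y - y_0}{x - x_0} = \frac{y_1 - y_0}{x_1 - x_0}$$

We can get:

$$y = y_0 + (x - x_0) \frac{y_1 - y_0}{x_1 - x_0}$$

Let:

$$\alpha = \frac{x - x_0}{x_1 - x_0}$$

then:

$$y = (1 - \alpha) y_0 + \alpha y_1$$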
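A minimal sketch of one-dimensional linear interpolation, assuming x0 and x1 are distinct. The function name `lerp` and the variable `alpha` (the fraction of the way from x0 to x1) are names chosen here for illustration.

```python
def lerp(x0, y0, x1, y1, x):
    """Linearly interpolate y at x from the known points (x0, y0) and (x1, y1)."""
    alpha = (x - x0) / (x1 - x0)   # 0 at x0, 1 at x1
    return (1 - alpha) * y0 + alpha * y1

# One quarter of the way from (0, 10) to (4, 30).
y = lerp(0.0, 10.0, 4.0, 30.0, 1.0)
```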

 

Derivation of bilinear interpolation

Bilinear interpolation applies one-dimensional linear interpolation in two directions, using the four nearest neighbors to estimate the gray value at a given point.

Calculation


First, linear interpolation is performed twice in the X direction, and then once in the Y direction.
 
In image processing, the position of a target pixel in the source image is computed as
  srcX = dstX * (srcWidth/dstWidth),
  srcY = dstY * (srcHeight/dstHeight)
The srcX and srcY computed here are usually floating-point. For example, f(1.2, 3.4) is a virtual pixel that does not actually exist, so first find the four neighboring pixels that do exist:
  (1,3) (2,3)
  (1,4) (2,4)
Written in the form f(i + u, j + v), this gives u = 0.2, v = 0.4, i = 1, j = 3.
Interpolating along the X direction gives f(R1) = u (f(Q21) - f(Q11)) + f(Q11), where Q11 = (i, j) and Q21 = (i + 1, j).
The Y direction is computed in the same way.
Alternatively, combining the steps into one expression:
f(i+u,j+v) = (1-u)(1-v)f(i,j) + (1-u)vf(i,j+1) + u(1-v)f(i+1,j) + uvf(i+1,j+1)
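The final expression translates almost directly into code. The sketch below is an illustration only: `src_coords` and `bilinear` are hypothetical names, `f` is assumed to be a function returning the pixel value at integer coordinates, the gray values in `pixels` are made up, and pixels outside the image are not handled.

```python
import math

def src_coords(dst_x, dst_y, src_w, src_h, dst_w, dst_h):
    """Position of a target pixel in the source image (usually fractional)."""
    return dst_x * (src_w / dst_w), dst_y * (src_h / dst_h)

def bilinear(f, x, y):
    """Interpolate at fractional (x, y); f(i, j) is the pixel at integer (i, j)."""
    i, j = math.floor(x), math.floor(y)
    u, v = x - i, y - j
    return ((1 - u) * (1 - v) * f(i, j) + (1 - u) * v * f(i, j + 1)
            + u * (1 - v) * f(i + 1, j) + u * v * f(i + 1, j + 1))

# Made-up gray values at the four neighbors of the virtual pixel (1.2, 3.4).
pixels = {(1, 3): 10.0, (2, 3): 20.0, (1, 4): 30.0, (2, 4): 40.0}
value = bilinear(lambda i, j: pixels[(i, j)], 1.2, 3.4)
```

Here u = 0.2 and v = 0.4, so the four weights are 0.48, 0.32, 0.12 and 0.08, and `value` comes out close to 20.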


