"TextBoxes++: A Single-Shot Oriented Scene Text Detector" paper notes

1. Overview

The method in this article addresses the detection of oriented (rotated) text, so TextBoxes++ can detect slanted text. Each detected text instance is represented by either a rotated rectangle or a general quadrilateral. Because the method is derived from SSD, the network is a single-shot (one-stage) detector rather than a two-stage network like Faster R-CNN, so it is naturally fast. The authors report 11.6 FPS with F-measure = 0.817 on the ICDAR 2015 dataset at a resolution of $1024 \times 1024$, and 19.8 FPS with F-measure = 0.5591 on the COCO-Text dataset at $768 \times 768$.

Before this work, the authors published an earlier version called TextBoxes. Compared with TextBoxes, this paper makes four improvements:

1) The original TextBoxes only supported horizontal text detection; TextBoxes++ also detects text at arbitrary orientations;
2) The network structure and training process are optimized, which further improves performance;
3) More comparative experiments were carried out to show that TextBoxes++ performs better on arbitrarily oriented text in natural scenes;
4) Detection and recognition information are combined to improve both text detection and text recognition.

The relationship between SSD and TextBoxes++:
TextBoxes++ is derived from SSD. SSD performs poorly on text with extreme aspect ratios. TextBoxes++ addresses this with a specially designed text-box layer, which improves detection performance over SSD.
SSD can only generate horizontal boxes, whereas TextBoxes++ can generate either a rotated rectangle or a general quadrilateral, so the detection box fits rotated text.

The basic idea of the regression:
TextBoxes++ still uses horizontal rectangles when matching default boxes against the GT boxes. Each default box (anchor) is regressed both to the horizontal rectangle enclosing the GT and to the GT quadrilateral. The benefits are a simple optimization strategy and few candidate boxes per region.

The main contributions of the article:

1) It proposes TextBoxes++, a new detector for arbitrarily oriented text that is fast, accurate, and end-to-end trainable;
2) It studies the effects of the bounding box representation, the model configuration, and different evaluation methods;
3) It uses recognition results to refine the text detection results, which, according to the article, previous research had not done.

2. Implementation method

2.1 Network structure

As the network structure diagram shows, the architecture is very similar to the original SSD. The authors focus mainly on optimizing detection for rotated text and on adapting the detection boxes to extreme aspect ratios. Because the network consists only of convolution and pooling layers, it can accept input images of any size.

2.2 Bounding box representation and regression

First, three kinds of boxes are defined: $q$, $r$, and $b$. They denote the quadrilateral box, the rotated rectangle, and the minimum horizontal bounding rectangle, respectively; the minimum horizontal bounding rectangle is regressed from the default box. Note that the rotated rectangle and the quadrilateral are regressed separately, and each is regressed together with the horizontal rectangle. With the subscript 0 marking the default box, the three representations are:

1) $b_0=(x_0,y_0,w_0,h_0)$, where $(x_0,y_0)$ is the center of the box and $w_0, h_0$ are its width and height.
2) $q_0=(x_{01}^q,y_{01}^q,x_{02}^q,y_{02}^q,x_{03}^q,y_{03}^q,x_{04}^q,y_{04}^q)$
3) $r_0=(x_{01}^r,y_{01}^r,x_{02}^r,y_{02}^r,h_0^r)$
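The three representations describe the same horizontal default box, so $q_0$ and $r_0$ can be derived from $b_0$. The sketch below shows one way to do this, assuming the vertices are ordered clockwise starting from the top-left corner and that $r_0$ stores the top-left and top-right vertices; the names `DefaultBox`, `to_quad`, and `to_rotated_rect` are illustrative and not taken from the authors' code.

```python
from typing import NamedTuple, Tuple


class DefaultBox(NamedTuple):
    """Horizontal default box b_0 = (x0, y0, w0, h0), center plus size."""
    x0: float
    y0: float
    w0: float
    h0: float


def to_quad(b: DefaultBox) -> Tuple[float, ...]:
    """q_0: the four vertices, clockwise from top-left (assumed ordering)."""
    left, right = b.x0 - b.w0 / 2, b.x0 + b.w0 / 2
    top, bottom = b.y0 - b.h0 / 2, b.y0 + b.h0 / 2
    return (left, top, right, top, right, bottom, left, bottom)


def to_rotated_rect(b: DefaultBox) -> Tuple[float, ...]:
    """r_0: top-left vertex, top-right vertex, and the box height."""
    left, right = b.x0 - b.w0 / 2, b.x0 + b.w0 / 2
    top = b.y0 - b.h0 / 2
    return (left, top, right, top, b.h0)


b0 = DefaultBox(x0=10.0, y0=20.0, w0=8.0, h0=4.0)
q0 = to_quad(b0)
r0 = to_rotated_rect(b0)
```

For a default box that is not rotated, $r_0$ is simply the top edge of $q_0$ together with $h_0$.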

2.2.1 Processing of quadrilateral boxes

For the quadrilateral box, the network predicts $(\Delta_{x},\Delta_{y},\Delta_{w},\Delta_{h},\Delta_{x_1},\Delta_{y_1},\Delta_{x_2},\Delta_{y_2},\Delta_{x_3},\Delta_{y_3},\Delta_{x_4},\Delta_{y_4},c)$, where $c$ is the confidence. The horizontal rectangle $b=(x,y,w,h)$ and the quadrilateral $q=(x_{1}^q,y_{1}^q,x_{2}^q,y_{2}^q,x_{3}^q,y_{3}^q,x_{4}^q,y_{4}^q)$ are then obtained from the default box as:

$x = x_0 + w_0\Delta_{x},\quad y = y_0 + h_0\Delta_{y},\quad w = w_0\exp(\Delta_{w}),\quad h = h_0\exp(\Delta_{h})$
$x_n^q = x_{0n}^q + w_0\Delta_{x_n},\quad y_n^q = y_{0n}^q + h_0\Delta_{y_n},\quad n=1,2,3,4$
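The decoding step can be sketched in code. This is a minimal sketch assuming the SSD-style parameterization: center and vertex offsets are scaled by the default box's width and height, and width and height are predicted in log space. The function name `decode_quad` and the flat layout of `deltas` are assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple


def decode_quad(
    b0: Sequence[float],      # default box (x0, y0, w0, h0)
    q0: Sequence[float],      # default quadrilateral, 8 values
    deltas: Sequence[float],  # (dx, dy, dw, dh, dx1, dy1, ..., dx4, dy4)
) -> Tuple[Tuple[float, ...], Tuple[float, ...]]:
    """Return the decoded horizontal rectangle and quadrilateral."""
    x0, y0, w0, h0 = b0
    dx, dy, dw, dh = deltas[:4]

    # Horizontal rectangle: shifted center, exponentially scaled size.
    rect = (x0 + w0 * dx, y0 + h0 * dy, w0 * math.exp(dw), h0 * math.exp(dh))

    # Each quadrilateral vertex moves relative to the matching default vertex.
    quad = []
    for n in range(4):
        quad.append(q0[2 * n] + w0 * deltas[4 + 2 * n])
        quad.append(q0[2 * n + 1] + h0 * deltas[5 + 2 * n])
    return rect, tuple(quad)


b0 = (10.0, 20.0, 8.0, 4.0)
q0 = (6.0, 18.0, 14.0, 18.0, 14.0, 22.0, 6.0, 22.0)
deltas = [0.0] * 12
deltas[4], deltas[5] = 0.25, -0.5  # move only the first vertex
rect, quad = decode_quad(b0, q0, deltas)
```

With all offsets at zero the prediction reproduces the default box, which is why a default box close to the GT needs only small corrections.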

2.2.2 Processing of rotated rectangles

A rotated rectangle is represented by two points, which fix its top two vertices, plus a height; the remaining two vertices follow from these five values.

For the rotated rectangle, the network predicts $(\Delta_{x},\Delta_{y},\Delta_{w},\Delta_{h},\Delta_{x_1},\Delta_{y_1},\Delta_{x_2},\Delta_{y_2},\Delta_{h^r},c)$, where $c$ is the confidence.
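The sketch below decodes a rotated rectangle and recovers its four vertices. It assumes the same offset scaling as for the horizontal box (vertex offsets scaled by $w_0$ and $h_0$, height predicted in log space), that the height is measured perpendicular to the edge between the two stored vertices, and that image coordinates have $y$ pointing down. The function names are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def decode_rotated_rect(
    b0: Sequence[float],      # default box (x0, y0, w0, h0)
    r0: Sequence[float],      # default rotated rect (x1, y1, x2, y2, h)
    deltas: Sequence[float],  # (dx1, dy1, dx2, dy2, dh_r)
) -> Tuple[float, ...]:
    """Decode the rotated-rectangle part of the prediction."""
    _, _, w0, h0 = b0
    return (
        r0[0] + w0 * deltas[0],
        r0[1] + h0 * deltas[1],
        r0[2] + w0 * deltas[2],
        r0[3] + h0 * deltas[3],
        r0[4] * math.exp(deltas[4]),
    )


def rotated_rect_corners(r: Sequence[float]) -> List[Tuple[float, float]]:
    """Four vertices, starting with the two stored ones."""
    x1, y1, x2, y2, hr = r
    ex, ey = x2 - x1, y2 - y1
    length = math.hypot(ex, ey)
    if length == 0:
        raise ValueError("the two stored vertices must be distinct")
    # Unit normal of the top edge; for a horizontal edge it points to +y.
    nx, ny = -ey / length, ex / length
    return [
        (x1, y1),
        (x2, y2),
        (x2 + hr * nx, y2 + hr * ny),
        (x1 + hr * nx, y1 + hr * ny),
    ]


b0 = (10.0, 20.0, 8.0, 4.0)
r0 = (6.0, 18.0, 14.0, 18.0, 4.0)
r = decode_rotated_rect(b0, r0, [0.0, 0.0, 0.0, 0.0, 0.0])
corners = rotated_rect_corners(r)
```

Storing two vertices and a height needs only five numbers, compared with eight for a general quadrilateral.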

2.2.3 Handling of dense text regions

The default boxes use the aspect ratios $1,2,3,5,\frac{1}{2},\frac{1}{3},\frac{1}{5}$. In dense cases (see Figure 4), these default boxes alone do not cover the text regions well, so the article additionally applies a vertical offset to the default boxes to cover them better.
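The default box layout can be illustrated with a short sketch. It assumes SSD-style box sizes ($w = s\sqrt{a}$, $h = s/\sqrt{a}$ for scale $s$ and aspect ratio $a$), normalized image coordinates, and an offset of half a cell height; the scale value and the offset size are assumptions for illustration, not values taken from these notes.

```python
import math
from typing import List, Tuple

ASPECT_RATIOS = (1, 2, 3, 5, 1 / 2, 1 / 3, 1 / 5)


def default_boxes(
    fmap_h: int,
    fmap_w: int,
    scale: float,
    vertical_offset: bool = True,
) -> List[Tuple[float, float, float, float]]:
    """Return (cx, cy, w, h) default boxes in normalized [0, 1] coordinates."""
    cell_h, cell_w = 1.0 / fmap_h, 1.0 / fmap_w
    boxes = []
    for i in range(fmap_h):
        for j in range(fmap_w):
            cx = (j + 0.5) * cell_w
            cy = (i + 0.5) * cell_h
            # Second set of centers, shifted down by half a cell (assumed).
            centers_y = [cy, cy + 0.5 * cell_h] if vertical_offset else [cy]
            for y in centers_y:
                for ar in ASPECT_RATIOS:
                    w = scale * math.sqrt(ar)
                    h = scale / math.sqrt(ar)
                    boxes.append((cx, y, w, h))
    return boxes


boxes = default_boxes(fmap_h=2, fmap_w=3, scale=0.2)
```

With the offset enabled, every feature map location carries two sets of seven boxes, so vertically stacked text lines that fall between cell centers still have a nearby default box.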

2.2.4 Choice of convolution kernel shape

For horizontal boxes the convolution kernel shape is $1 \times 5$, but for rotated text this article uses $3 \times 5$.

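2.3 Network training

Loss function: The loss is the weighted sum of the classification (confidence) loss and the localization loss:

$L(x,c,l,g)=\frac{1}{N}\left(L_{conf}(x,c)+\alpha L_{loc}(x,l,g)\right)$

Here $x$ is the matching indicator, $c$ the confidence, $l$ the predicted location, $g$ the GT location, and $N$ the number of default boxes matched to GT boxes. For quick convergence, $\alpha=0.2$.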
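A minimal pure-Python sketch of such a combined loss follows. It assumes a 2-class softmax loss for classification and a smooth L1 loss for localization, normalized by the number of matched (positive) default boxes, as in SSD. Hard negative mining is omitted for brevity, and all function names are illustrative.

```python
import math
from typing import Sequence


def smooth_l1(x: float) -> float:
    """Quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def softmax_cross_entropy(logits: Sequence[float], label: int) -> float:
    """Numerically stable -log softmax(logits)[label]."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def detection_loss(
    cls_logits: Sequence[Sequence[float]],  # per box: (background, text)
    labels: Sequence[int],                  # per box: 0 or 1
    loc_pred: Sequence[Sequence[float]],    # per box: predicted offsets
    loc_target: Sequence[Sequence[float]],  # per box: target offsets
    alpha: float = 0.2,
) -> float:
    """(L_conf + alpha * L_loc) / N, with N the number of positive boxes."""
    positives = [i for i, y in enumerate(labels) if y == 1]
    n = len(positives)
    if n == 0:
        return 0.0
    l_conf = sum(softmax_cross_entropy(z, y) for z, y in zip(cls_logits, labels))
    # Localization loss is computed on matched (positive) boxes only.
    l_loc = sum(
        smooth_l1(p - t)
        for i in positives
        for p, t in zip(loc_pred[i], loc_target[i])
    )
    return (l_conf + alpha * l_loc) / n


loss = detection_loss([[0.0, 0.0]], [1], [[2.0]], [[0.0]])
```

A small weight on the localization term keeps the many regressed coordinates per box from dominating the classification term.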
Data augmentation: Applying the original random-cropping strategy produces crops like those in Figure 5 (a, b). Such crops hardly resemble text as it appears in real scenes, so the article improves the strategy.

Let $B$ denote the cropped bounding box, $G$ the GT box, $J$ the Jaccard overlap, and $C$ the object coverage. They are defined as:

$J=\frac{|B\cap G|}{|B\cup G|},\quad C=\frac{|B\cap G|}{|G|}$
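The difference between the two measures is easy to see numerically. The sketch below computes both for axis-aligned boxes given as `(x1, y1, x2, y2)`, using the standard definitions: Jaccard overlap is intersection over union, and object coverage is intersection over the GT area. The box format and function names are assumptions for illustration.

```python
from typing import Sequence


def area(box: Sequence[float]) -> float:
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def intersection(b: Sequence[float], g: Sequence[float]) -> float:
    """Area of the overlap between two axis-aligned boxes."""
    overlap = (max(b[0], g[0]), max(b[1], g[1]), min(b[2], g[2]), min(b[3], g[3]))
    return area(overlap)


def jaccard_overlap(b: Sequence[float], g: Sequence[float]) -> float:
    """J = |B n G| / |B u G|."""
    inter = intersection(b, g)
    union = area(b) + area(g) - inter
    return inter / union if union > 0 else 0.0


def object_coverage(b: Sequence[float], g: Sequence[float]) -> float:
    """C = |B n G| / |G|."""
    g_area = area(g)
    return intersection(b, g) / g_area if g_area > 0 else 0.0


crop = (0.0, 0.0, 100.0, 100.0)  # cropped region B
word = (10.0, 10.0, 30.0, 20.0)  # small GT word box G, fully inside the crop
j = jaccard_overlap(crop, word)
c = object_coverage(crop, word)
```

Here the word lies entirely inside the crop, so the coverage is 1 while the Jaccard overlap is only 0.02; a constraint on coverage is therefore better suited to small text than a constraint on Jaccard overlap alone.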

Multi-scale training: As mentioned above, because the proposed framework contains only convolution and pooling operations, it can be trained on inputs of multiple scales. This balances the proportion of large and small targets in the training data and therefore adapts better to small targets.
Using recognition information to refine bounding boxes: Boxes (a) and (b) in Figure 6 have the same IoU, but their recognition results differ. The article therefore uses the recognition results to further refine the detection results, finally obtaining a result like (c).

3. Experimental results

3.1 Performance of the network on several datasets

Localization performance:

Running time:

F-measures:

3.2 Performance when combined with recognition
