tags: Text detection and recognition Deep learning
The EAST text detection model introduced in this article simplifies the intermediate steps and performs text detection end to end. It is elegant and concise, and it further improves both the accuracy and the speed of detection. As shown below:

In the figure, (a), (b), (c), and (d) are several common text detection pipelines. A typical pipeline includes candidate box extraction, candidate box filtering, bounding box regression, and candidate box merging, so the intermediate process is relatively lengthy. (e) is the EAST detection pipeline introduced in this article. As the figure shows, it is reduced to only two stages: an FCN (fully convolutional network) stage and an NMS (non-maximum suppression) stage. The intermediate steps are greatly reduced, and the output supports multi-angle detection of both text lines and words, so the model is efficient, accurate, and adaptable to a variety of natural scenes. (d) is the CTPN model. Although its pipeline is similar to that of EAST in (e), it only supports horizontal text detection, so it applies to fewer scenes than EAST. As shown below:

The network structure of the EAST model is as follows:

The network structure of the EAST model is divided into three parts: a feature extraction layer, a feature fusion layer, and an output layer.
Each part is described below:
Feature extraction layer: PVANet (an object detection model) serves as the backbone of the network, and feature maps are extracted from the convolutional layers of stage1, stage2, stage3, and stage4. The feature map size is halved from one stage to the next while the number of convolution kernels is doubled, which follows the idea of a feature pyramid network (FPN). In this way, feature maps of different scales are extracted, so text lines of different scales can be detected (large feature maps are good at detecting small objects, and small feature maps are good at detecting large objects). This idea is very similar to the SegLink model.
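Feature fusion layer: the feature maps extracted above are merged according to a fixed rule that follows the U-Net approach. Starting from the smallest feature map, each merged map is upsampled (unpooled), concatenated with the next larger feature map, and then passed through 1x1 and 3x3 convolutions; this is repeated until the largest feature map has been merged.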
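The merging rule can also be written compactly. The block below follows the notation of the EAST paper and is given as a reference sketch rather than as this article's own formula: f_i is the feature map from the extraction layer (f_1 is the smallest), g_i is the merge base, h_i is the merged feature map, and [ · ; · ] is concatenation along the channel axis.

```latex
g_i =
\begin{cases}
  \mathrm{unpool}(h_i)              & \text{if } i \le 3 \\
  \mathrm{conv}_{3 \times 3}(h_i)   & \text{if } i = 4
\end{cases}
\qquad
h_i =
\begin{cases}
  f_i & \text{if } i = 1 \\
  \mathrm{conv}_{3 \times 3}\bigl(\mathrm{conv}_{1 \times 1}([\,g_{i-1};\, f_i\,])\bigr) & \text{otherwise}
\end{cases}
```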
Output layer: the network finally outputs the following 5 pieces of information:



(a) Text rectangle (yellow dotted line) and shrunk rectangle (solid green line); (b) text score map; (c) RBOX geometry map; (d) 4 channels giving the distance from each pixel to the borders of the rectangle; (e) rotation angle.
For training, the annotated box is generally shrunk by a ratio of 0.3 to reduce the effect of labeling error, as shown in (a). Consider a quadrangle Q = {p_i | i ∈ {1, 2, 3, 4}}, where the p_i are its vertices in clockwise order. To shrink Q, a reference length r_i is first computed for each vertex p_i:

r_i = min(D(p_i, p_{(i mod 4)+1}), D(p_i, p_{((i+2) mod 4)+1}))

where D(p_i, p_j) is the L2 distance between p_i and p_j. The two longer sides of the quadrangle are shrunk first, then the two shorter sides. Which pair of opposite sides is the "longer" one is determined by comparing the average lengths of the two pairs. Each edge is shrunk by moving its two endpoints inward along the edge by 0.3 r_i and 0.3 r_{(i mod 4)+1}, respectively.
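The shrinking procedure can be sketched in a few lines of plain Python. This is a minimal illustration, not the reference implementation: the function names `reference_lengths` and `shrink_quad` are hypothetical, vertices are assumed to be given in clockwise order, and the reference lengths are taken from the original (unshrunk) quadrangle.

```python
import math


def dist(p, q):
    """L2 distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def reference_lengths(quad):
    """r_i: the shorter of the two edges that meet at vertex i."""
    return [
        min(dist(quad[i], quad[(i + 1) % 4]), dist(quad[i], quad[(i - 1) % 4]))
        for i in range(4)
    ]


def shrink_quad(quad, ratio=0.3):
    """Shrink a quadrangle by moving edge endpoints inward along each edge."""
    pts = [[float(x), float(y)] for x, y in quad]
    r = reference_lengths(pts)
    # Edge k joins vertex k and vertex (k + 1) % 4; edges (0, 2) and (1, 3) are opposite.
    edge_len = [dist(pts[k], pts[(k + 1) % 4]) for k in range(4)]
    # Shrink the longer pair of opposite edges first, then the shorter pair.
    if edge_len[0] + edge_len[2] >= edge_len[1] + edge_len[3]:
        order = (0, 2, 1, 3)
    else:
        order = (1, 3, 0, 2)
    for k in order:
        i, j = k, (k + 1) % 4
        dx, dy = pts[j][0] - pts[i][0], pts[j][1] - pts[i][1]
        length = math.hypot(dx, dy)
        if length == 0:
            continue  # degenerate edge, nothing to shrink
        ux, uy = dx / length, dy / length
        # Move endpoint i toward j by ratio * r_i, and endpoint j toward i by ratio * r_j.
        pts[i][0] += ratio * r[i] * ux
        pts[i][1] += ratio * r[i] * uy
        pts[j][0] -= ratio * r[j] * ux
        pts[j][1] -= ratio * r[j] * uy
    return [tuple(p) for p in pts]


shrunk = shrink_quad([(0, 0), (10, 0), (10, 4), (0, 4)])
```

For an axis-aligned 10 x 4 rectangle every r_i equals 4, so each side moves inward by 0.3 x 4 = 1.2.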
First, generate the rotated rectangle of minimum area that covers the text region. Then, for RBOX labels, compute the distance from each pixel with a positive score to the 4 borders of the text box; for QUAD labels, compute the coordinate offsets from each pixel with a positive score to the 4 vertices of the text box.
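As a small illustration of what is stored per pixel, the sketch below computes the 4 RBOX distances and the 8 QUAD offsets for one pixel. It makes simplifying assumptions: the box is axis-aligned (rotation is ignored), image coordinates have y growing downward, the QUAD offset is taken as vertex minus pixel, and both function names are hypothetical.

```python
def rbox_distances(px, py, box):
    """Distances from pixel (px, py) to the top, right, bottom, and left borders.

    box is (x_min, y_min, x_max, y_max). Returns None for pixels outside
    the box, since only positive-score pixels carry geometry.
    """
    x_min, y_min, x_max, y_max = box
    if not (x_min <= px <= x_max and y_min <= py <= y_max):
        return None
    return (py - y_min, x_max - px, y_max - py, px - x_min)


def quad_offsets(px, py, quad):
    """Coordinate offsets from pixel (px, py) to each of the 4 vertices."""
    return [(vx - px, vy - py) for vx, vy in quad]


d = rbox_distances(3, 2, (0, 0, 10, 4))
offsets = quad_offsets(3, 2, [(0, 0), (10, 0), (10, 4), (0, 4)])
```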
The loss function is:

L = L_s + λ_g L_g

Here L_s and L_g are the losses for the score map and the geometry map, respectively, and λ_g weighs the importance of the two losses (λ_g = 1 in this experiment).
Most current methods apply balanced sampling and hard negative mining to the training images to deal with the imbalanced distribution of targets. Doing so may improve network performance. However, these techniques inevitably introduce an extra stage and more parameters to tune, which contradicts the design principle of this article. To simplify training, this article uses class-balanced cross-entropy, which handles class imbalance directly:

L_s = balanced-xent(Y^, Y*) = −β Y* log Y^ − (1 − β)(1 − Y*) log(1 − Y^)

Here Y^ = F_s is the predicted score map and Y* is the label. The parameter β is the balancing factor between positive and negative samples, equal to the number of negative samples divided by the total number of samples:

β = 1 − (Σ_{y* ∈ Y*} y*) / |Y*|
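The class-balanced cross-entropy is short enough to sketch directly. The version below is an illustration in plain Python over flattened score maps; averaging over pixels and clamping predictions with `eps` are choices made here for numerical convenience, not details stated in the article.

```python
import math


def class_balanced_cross_entropy(pred, label, eps=1e-7):
    """Class-balanced cross-entropy over flattened score maps.

    pred: predicted scores in (0, 1); label: 0/1 ground truth.
    beta is the fraction of negative pixels, so the rarer positive
    pixels receive the larger weight.
    """
    n = len(label)
    beta = 1.0 - sum(label) / n
    total = 0.0
    for p, y in zip(pred, label):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -beta * y * math.log(p) - (1.0 - beta) * (1 - y) * math.log(1.0 - p)
    return total / n


loss = class_balanced_cross_entropy([0.9, 0.1, 0.1, 0.1], [1, 0, 0, 0])
```

With one positive pixel out of four, β = 0.75, so the single positive pixel carries as much weight as the three negative pixels together.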

One challenge of text detection is that the size of text in natural scene images varies greatly, and using L1 or L2 loss directly for regression would bias the loss toward larger and longer text regions. Therefore, for RBOX regression, IoU loss is used on the AABB part; for QUAD regression, a scale-normalized smoothed-L1 loss is used.
RBOX loss:
For the AABB part, RBOX regression uses IoU loss, because it is invariant to object size:

L_AABB = −log IoU(R^, R*) = −log(|R^ ∩ R*| / |R^ ∪ R*|)

where R^ is the predicted AABB geometry and R* is its corresponding ground-truth box. The width and height of the intersection rectangle are:

w_i = min(d^_2, d*_2) + min(d^_4, d*_4)
h_i = min(d^_1, d*_1) + min(d^_3, d*_3)

where d_1, d_2, d_3, and d_4 are the distances from a pixel to the top, right, bottom, and left borders of its corresponding rectangle, respectively. The union area is:

|R^ ∪ R*| = |R^| + |R*| − |R^ ∩ R*|

Therefore, the IoU can be computed easily. Next, the loss of the rotation angle is:

L_θ(θ^, θ*) = 1 − cos(θ^ − θ*)

where θ^ is the predicted rotation angle and θ* is the label. Finally, the overall geometry loss is the weighted sum of the AABB loss and the angle loss:

L_g = L_AABB + λ_θ L_θ
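The RBOX geometry loss for a single pixel can be sketched as follows. This is an illustration under stated assumptions: all distances are positive, so the intersection is never empty; the function name is hypothetical; and the default weight `lambda_theta=10.0` is the value reported in the EAST paper as far as I recall, so treat it as an assumption.

```python
import math


def rbox_geometry_loss(pred_d, gt_d, pred_theta, gt_theta, lambda_theta=10.0):
    """IoU loss on the AABB part plus a weighted cosine loss on the angle.

    pred_d and gt_d are (d1, d2, d3, d4): distances from the pixel to the
    top, right, bottom, and left borders. Angles are in radians.
    """
    d1p, d2p, d3p, d4p = pred_d
    d1g, d2g, d3g, d4g = gt_d
    area_pred = (d1p + d3p) * (d2p + d4p)
    area_gt = (d1g + d3g) * (d2g + d4g)
    # Both boxes contain the pixel, so the overlap along each axis is a sum of minima.
    w = min(d2p, d2g) + min(d4p, d4g)
    h = min(d1p, d1g) + min(d3p, d3g)
    inter = w * h
    union = area_pred + area_gt - inter
    loss_aabb = -math.log(inter / union)
    loss_theta = 1.0 - math.cos(pred_theta - gt_theta)
    return loss_aabb + lambda_theta * loss_theta


loss = rbox_geometry_loss((1, 1, 1, 1), (2, 2, 2, 2), 0.0, 0.0)
```

Because the IoU is a ratio of areas, scaling both boxes by the same factor leaves the AABB term unchanged, which is the size invariance the text refers to.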

QUAD loss:
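A sketch of the scale-normalized smoothed-L1 loss for QUAD, written from the EAST paper's formulation as I recall it and given here as an assumption rather than as this article's own formula. Q^ is the predicted quadrangle, Q* the ground truth, C_Q the 8 coordinate values of a quadrangle, P_{Q*} the set of quadrangles equivalent to Q* under different vertex orderings, and N_{Q*} the length of the shortest edge of Q*, which provides the scale normalization.

```latex
L_g = L_{\mathrm{QUAD}}(\hat{Q}, Q^*)
    = \min_{\tilde{Q} \in P_{Q^*}}
      \sum_{\substack{c_i \in C_{\hat{Q}} \\ \tilde{c}_i \in C_{\tilde{Q}}}}
      \frac{\mathrm{smoothed}_{L1}(c_i - \tilde{c}_i)}{8 \times N_{Q^*}},
\qquad
N_{Q^*} = \min_{i = 1, \dots, 4} D\bigl(p_i, p_{(i \bmod 4) + 1}\bigr)
```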

Because the model predicts tens of thousands of geometries, and the time complexity of a naive NMS algorithm is O(n^2), where n is the number of candidate boxes, the cost of standard NMS is too high. Therefore, this article proposes merging the geometries row by row, under the assumption that geometries from nearby pixels tend to be highly correlated. While merging geometries in the same row, the geometry currently encountered is iteratively merged with the last merged one. In the best case, this improves the time complexity to O(n).

1. First, merge the output boxes row by row using an overlap threshold: a box is merged with the last merged box when their overlap is above the threshold, and kept separate otherwise. The confidence scores are used as weights when combining the coordinates. This yields the merged bbox set.
2. Perform standard NMS on the merged bbox set.
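The two steps can be sketched in plain Python. To keep the example short it makes a simplification that should be stated loudly: boxes are axis-aligned `(x1, y1, x2, y2)` tuples instead of rotated rectangles or quadrangles, so the IoU is the simple axis-aligned one. The function names and the default threshold of 0.3 are illustrative assumptions, and the merged coordinates are normalized by the summed score.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def weighted_merge(g, p):
    """Score-weighted average of the coordinates; the merged score is the sum."""
    (g_box, g_score), (p_box, p_score) = g, p
    total = g_score + p_score
    box = tuple((g_score * x + p_score * y) / total for x, y in zip(g_box, p_box))
    return box, total


def standard_nms(dets, thresh):
    """Greedy NMS: keep a box only if it does not overlap a higher-scored kept box."""
    kept = []
    for det in sorted(dets, key=lambda d: d[1], reverse=True):
        if all(iou(det[0], k[0]) <= thresh for k in kept):
            kept.append(det)
    return kept


def locality_aware_nms(dets, thresh=0.3):
    """dets: list of (box, score) in row-first scan order."""
    merged = []
    last = None
    for det in dets:
        if last is not None and iou(last[0], det[0]) > thresh:
            # Step 1: fold the current box into the last merged box.
            last = weighted_merge(last, det)
        else:
            if last is not None:
                merged.append(last)
            last = det
    if last is not None:
        merged.append(last)
    # Step 2: standard NMS on the much smaller merged set.
    return standard_nms(merged, thresh)


detections = [((0, 0, 10, 10), 0.9), ((1, 0, 11, 10), 0.8), ((50, 50, 60, 60), 0.7)]
result = locality_aware_nms(detections)
```

Because each incoming box is compared only with the last merged box, the first pass is linear in the number of candidates, and the quadratic standard NMS runs on far fewer boxes.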
The results of EAST text detection are shown in the figure below, including some detected text lines that have undergone affine transformation (such as on billboards).


The advantage of the EAST model lies in its concise detection pipeline, which is efficient and accurate and supports multi-angle text line detection. It also has shortcomings: (1) it performs relatively poorly on long text, mainly because the receptive field of the network is not large enough; (2) its results on curved text are not ideal.
To address EAST's weakness on long text, Advanced EAST has been proposed. It uses VGG16 as the backbone of the network and is likewise composed of a feature extraction layer, a feature merging layer, and an output layer. In experiments, Advanced EAST achieves better detection accuracy than EAST, especially on long text.
The network structure is as follows:

In the feature merging layer, feature maps of different scales are merged top-down according to the corresponding rules, so that text lines of different scales can be detected.
The model provides the orientation of the text and can detect text in all directions.
The method performs poorly on long text, which is mainly determined by the receptive field of the network (the receptive field is not large enough).
When detecting curved text, the results are not ideal.