PAN++: Towards Efficient and Accurate End-to-End Spotting of Arbitrarily-Shaped Text

tags: Text detection/recognition ocr text detection end2end text recog attention

PAN++: Towards Efficient and Accurate End-to-End Spotting of Arbitrarily-Shaped Text

Basic Information

Source: TPAMI
Author: Wenhai Wang, Enze Xie…
Tags: text detection, end2end, text spotter, curve text, recognition
Paper address:arxiv

Detailed explanation of the paper

core tasks

Propose an end-to-end network that is fast, effective, and supports regular text and curved text detection/recognition.

background motivation

Most scholars treat text detection and text recognition tasks differently, ignoring the complementarity between detection and recognition tasks, and increasing the amount of calculation (crop_roi...)
The current end-to-end model is more designed for horizontal or oriented rectangular text detection/recognition, and rarely considers the end-to-end detection/recognition scenario of curved text.
The current end-to-end model often adopts a relatively complex structural design, which is computationally inefficient.

model structure

The article has a lot of old-fashioned content, and the detection part is almost the same as PAN. For PANNet study notes, see:PANNet paper notes

In essence, it is a shared feature Head, and a multi-task learning network that detects and identifies dual branches is constructed from the Head.
detection branch
- The detection branch is almost the same as PANNet. It learns kernel representation, text representation and PA module at the same time, but the loss function is optimized, which will be discussed later.
- Added 3x3, bn, relu before the output module (PANNet does not have it)
- Loss part
  
  pixel clusteringThe loss is slightly different from PAN, and ln is added... But the overall idea is still the same, so I won't repeat it here.
  
  δ a g g \delta_agg δaggStill take 0.5, R ( ⋅ ) R(·) R(⋅)is the max function
  
  Inter-kernel exclusionloss and PAN are also different
  
  In PANNet, only the distance between the kernels should be as large as possible, and the background pixels are not considered. This time, the background pixels are also added to the loss for calculation. The purpose is to make the distance between the kernel and the background should also be as large as possible (Equation (4)). so L d i s L_{dis} LdisIt is the sum of the repulsion loss between the background and the kernel, and between the kernel and the kernel.
  
  So the total loss is:
  
  in α = 0.5 \alpha=0.5 α=0.5， β = 0.25 \beta=0.25 β=0.25, used to balance the losses of each part.
identify branch
- The role of the Masked ROI module: Turn text of different sizes and arbitrary shapes into a fixed-size feature patch (similar to the role of ROI Pooling and ROI Ailgn).
- The specific implementation steps of Masked ROI:
  - Calculate the bounding bbox of the text area (on the feature map, that is, the image that is 4 times the downsample, not the original image).
  - According to the bounging bbox, crop it on the feature map of the recognition head to get crop_patch.
  - According to the text area or gt (in training mode) calculated by the detection head, a mask with the same size as the bbox can be formed, and the pixel value corresponding to the text area is 1, and the pixel value not on the text area is 0. Make this mask and crop_path elementwise_mul, the purpose is to mask the noise pixels in the bbox, so that its contribution in the recognition process is 0.
  - Finally, it is simple and rude to resize the masked roi obtained in the previous step to a fixed size to obtain the final crop_patch (h=8, w=32 obtained in the article).
- identification head
  - The overall structure is recognized as a seq2seq structure with multi-head attention.
  - Recognition will first perform a 3x3 convolution operation on the Head, and the number of output channels is 128, which is recorded as f r f_r fr(Deepen feature extraction?). At this time, the complete mask text for detecting branch prediction will be used, in f r f_r frDo the Masked ROI operation to extract the corresponding features, and these extracted features (denoted as f r o i f_{roi} froi) for identification. Be careful not to put f r o i f_{roi} froiFlatten it into a vector, that is, length=8*32, and then do the recognition operation.
    
    The overall structure includes two substructures of "Starter" and "Decoder".
    
    Step1: "Starter" is used for feature mapping of "SOS". “SOS” onehot passed δ 1 \delta _1 δ1Do embedding (dim=128) to get S O S e m b SOS_{emb} SOSemb, then with f r o i f_{roi} froiDo multi head attention, here S O S e m b SOS_{emb} SOSembas query, f r o i f_{roi} froiActing as key and value is a regular cross attention operation to get a new vector with both visual information and semantic information f s f_s fs, and will f s f_s fsSend it into Decoder for decoding operation.
    
    Step2: The composition of the Decoder is also very simple, consisting of a double-layer LSTM layer (the author uses a one-way LSTM) and a multi head attention module. First in the LSTM cell h 0 h_0 h0， c 0 c_0 c0Adopt zero_init, then receive f s f_s fs, producing a new hidden layer state h 0 h_0 h0(In fact, it is to update h and c), and then it is the normal decoding operation of seq2seq, which receives the gt onehot of the character output in the previous time step (if it is the prediction stage, it is the onehot of the character result output in the previous time step), In the first time step, the input is still "SOS", and then the new embedding layer is obtained S O S e m b 1 SOS_{emb1} SOSemb1, it enters the lstm cell together with the previously updated h to generate output h 1 h_1 h1, get h 1 h_1 h1After that, instead of connecting to FC immediately, the h 1 h_1 h1As a query, after a key, value is f r o i f_{roi} froiThe cross attention module further integrates the visual and semantic features to get a new vector, and then use this vector to do a full connection, and the output of the full connection dim=size_of_chars, you can get the probability of the output character at the current time step distributed. The following time steps can be deduced in the same way. (In fact, the process is relatively simple. Hey, it is a bit laborious to explain it in words..)
- recognition loss
  
  Expressed as the cross entropy of each time step, it is very simple.

experiment

data set

Total-Text
ICDAR 2015
MSRA-TD500
RCTW-17
SynthText(pretrain)
COCO-Text(pretrain)
ICDAR 2017 MLT(pretrain)

Experimental results

after reading

Combine the previous work with a recognition module and put it on the top issue. . awesome
The speed is fast, and any module is designed with light weight in mind, and it has achieved SOTA indicators on multiple benchmarks.
The recognition module directly resizes the alignment of features to a fixed size, which directly determines that the recognition effect of this scheme on long text is definitely not good, but most end2end methods require a fixed size. This drawback is expected to be solved by new schemes in the future. It optimizes.
The recognition module is quite good, fully considering the visual information and semantic information, but this visual information is the visual information after flattening, and the spatial position relationship has been lost. Although it is not a big problem for self attention, it is more reasonable to learn it. Attention score map, but after all, there is a lack of spatial information. If the spatial position relationship can be maintained, will the index be higher?
Self attention is really awesome. It can not only recognize vertical text better, but also recognize the correct reading order... This is impossible for the method based on CTC recognition.

Intelligent Recommendation

An Efficient and Accurate Scene Text Detector

In this paper, we propose a fast and accurate scene text detection algorithm, only two steps. The full convolution algorithm uses the network model to generate predicted word or text line level direct...

End-to-end keywords spotting based on connectionist temporal classification for Mandarin

Conference: ISCSLP 2016 Paper:End-to-end keywords spotting based on connectionist temporal classification for Mandarin OF: Ye Bai; Jiangyan Yi; Hao Ni; Zhengqi Wen; Bin Liu; Ya Li; Jianhua Tao Abstrac...

The front -end interview questions are eight -shaped text DEFER and Async

3. DEFER and Async Reference blog post: https://blog.csdn.net/weixin_42561383/article/details/86564715 By default, the browser will load the JS script synchronously, that is, the rendering engine will...

EAST Interpretation - An Efficient and Accurate Scene Text Detector

Article directory Brief Existing work problem data set Network structure Feature extraction layer Feature merge layer Result output layer Tag generation Loss function Text segmentation Loss RBOX bound...

[ ]EAST: An Efficient and Accurate Scene Text Detector

1 Overview EASTThis is one of my favorite papers related to text detection this year, because a great god has reappeared[2]And easy to learn. The reason for liking is mainly in the results, there are ...