tags: Text detection/recognition ocr text detection end2end text recog attention

Propose an end-to-end network that is fast, effective, and supports regular text and curved text detection/recognition.

The article has a lot of old-fashioned content, and the detection part is almost the same as PAN. For PANNet study notes, see:PANNet paper notes
In essence, it is a shared feature Head, and a multi-task learning network that detects and identifies dual branches is constructed from the Head.
detection branch
The detection branch is almost the same as PANNet. It learns kernel representation, text representation and PA module at the same time, but the loss function is optimized, which will be discussed later.
Added 3x3, bn, relu before the output module (PANNet does not have it)

Loss part
pixel clusteringThe loss is slightly different from PAN, and ln is added... But the overall idea is still the same, so I won't repeat it here.

δ a g g \delta_agg δaggStill take 0.5, R ( ⋅ ) R(·) R(⋅)is the max function
Inter-kernel exclusionloss and PAN are also different

In PANNet, only the distance between the kernels should be as large as possible, and the background pixels are not considered. This time, the background pixels are also added to the loss for calculation. The purpose is to make the distance between the kernel and the background should also be as large as possible (Equation (4)). so L d i s L_{dis} LdisIt is the sum of the repulsion loss between the background and the kernel, and between the kernel and the kernel.
So the total loss is:

in
α
=
0.5
\alpha=0.5
α=0.5,
β
=
0.25
\beta=0.25
β=0.25, used to balance the losses of each part.
identify branch
The role of the Masked ROI module: Turn text of different sizes and arbitrary shapes into a fixed-size feature patch (similar to the role of ROI Pooling and ROI Ailgn).
The specific implementation steps of Masked ROI:

identification head
The overall structure is recognized as a seq2seq structure with multi-head attention.
Recognition will first perform a 3x3 convolution operation on the Head, and the number of output channels is 128, which is recorded as
f
r
f_r
fr(Deepen feature extraction?). At this time, the complete mask text for detecting branch prediction will be used, in
f
r
f_r
frDo the Masked ROI operation to extract the corresponding features, and these extracted features (denoted as
f
r
o
i
f_{roi}
froi) for identification. Be careful not to put
f
r
o
i
f_{roi}
froiFlatten it into a vector, that is, length=8*32, and then do the recognition operation.

The overall structure includes two substructures of "Starter" and "Decoder".
Step1: "Starter" is used for feature mapping of "SOS". “SOS” onehot passed
δ
1
\delta _1
δ1Do embedding (dim=128) to get
S
O
S
e
m
b
SOS_{emb}
SOSemb, then with
f
r
o
i
f_{roi}
froiDo multi head attention, here
S
O
S
e
m
b
SOS_{emb}
SOSembas query,
f
r
o
i
f_{roi}
froiActing as key and value is a regular cross attention operation to get a new vector with both visual information and semantic information
f
s
f_s
fs, and will
f
s
f_s
fsSend it into Decoder for decoding operation.

Step2: The composition of the Decoder is also very simple, consisting of a double-layer LSTM layer (the author uses a one-way LSTM) and a multi head attention module. First in the LSTM cell
h
0
h_0
h0,
c
0
c_0
c0Adopt zero_init, then receive
f
s
f_s
fs, producing a new hidden layer state
h
0
h_0
h0(In fact, it is to update h and c), and then it is the normal decoding operation of seq2seq, which receives the gt onehot of the character output in the previous time step (if it is the prediction stage, it is the onehot of the character result output in the previous time step), In the first time step, the input is still "SOS", and then the new embedding layer is obtained
S
O
S
e
m
b
1
SOS_{emb1}
SOSemb1, it enters the lstm cell together with the previously updated h to generate output
h
1
h_1
h1, get
h
1
h_1
h1After that, instead of connecting to FC immediately, the
h
1
h_1
h1As a query, after a key, value is
f
r
o
i
f_{roi}
froiThe cross attention module further integrates the visual and semantic features to get a new vector, and then use this vector to do a full connection, and the output of the full connection dim=size_of_chars, you can get the probability of the output character at the current time step distributed. The following time steps can be deduced in the same way. (In fact, the process is relatively simple. Hey, it is a bit laborious to explain it in words..)

recognition loss

Expressed as the cross entropy of each time step, it is very simple.




Combine the previous work with a recognition module and put it on the top issue. . awesome
The speed is fast, and any module is designed with light weight in mind, and it has achieved SOTA indicators on multiple benchmarks.
The recognition module directly resizes the alignment of features to a fixed size, which directly determines that the recognition effect of this scheme on long text is definitely not good, but most end2end methods require a fixed size. This drawback is expected to be solved by new schemes in the future. It optimizes.
The recognition module is quite good, fully considering the visual information and semantic information, but this visual information is the visual information after flattening, and the spatial position relationship has been lost. Although it is not a big problem for self attention, it is more reasonable to learn it. Attention score map, but after all, there is a lack of spatial information. If the spatial position relationship can be maintained, will the index be higher?
Self attention is really awesome. It can not only recognize vertical text better, but also recognize the correct reading order... This is impossible for the method based on CTC recognition.

In this paper, we propose a fast and accurate scene text detection algorithm, only two steps. The full convolution algorithm uses the network model to generate predicted word or text line level direct...
Conference: ISCSLP 2016 Paper:End-to-end keywords spotting based on connectionist temporal classification for Mandarin OF: Ye Bai; Jiangyan Yi; Hao Ni; Zhengqi Wen; Bin Liu; Ya Li; Jianhua Tao Abstrac...
3. DEFER and Async Reference blog post: https://blog.csdn.net/weixin_42561383/article/details/86564715 By default, the browser will load the JS script synchronously, that is, the rendering engine will...
Article directory Brief Existing work problem data set Network structure Feature extraction layer Feature merge layer Result output layer Tag generation Loss function Text segmentation Loss RBOX bound...
1 Overview EASTThis is one of my favorite papers related to text detection this year, because a great god has reappeared[2]And easy to learn. The reason for liking is mainly in the results, there are ...
Questyle's works at CVPR2017 Advantage: Provides direction information, can detect text in various directions Disadvantages: The detection effect is not good for long text, and the feeling field is no...
URL:https://github.com/argman/EAST Test with the trained model python eval.py --test_data_path=tmp/images/ --gpu_list=0 --checkpoint_path=tmp/east_icdar2015_resnet_v1_50_rbox/ –output_dir=tmp/ou...
Foreword Deque (double-ended queue, double queue) is a data structure with a queue and stack. The elements in the dual-end queue can pop up from both ends, which define the insertion and deletion oper...
This story about Adobe Flash may be a little different from what you often see. As we all know, Adobe's browser plug-in has completely fallen out of favor in the industry because it always has securit...
PC side Spacehold position Fool, but some places are useful CSS3 text alignment Increase the word spacing (the mobile side also applies) Pseudo-class (mobile terminal also applies) If there is only on...