Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network

tags: Kernel  convolution  Clustering  Machine Learning  Deep Learning

1. Introduction:

TextSnake and PSENet are designed to detect curved text instances, which are common in natural scenes. However, their complex pipelines and large number of convolutional operations usually slow down inference.

The paper proposes the Pixel Aggregation Network (PAN), which is equipped with a low-computational-cost segmentation head and a learnable post-processing step. More specifically, the segmentation head consists of a Feature Pyramid Enhancement Module (FPEM) and a Feature Fusion Module (FFM). FPEM is a cascadable U-shaped module that introduces multi-level information to guide better segmentation. It enhances features at different scales by fusing low-level and high-level information with minimal computational overhead. FFM gathers the features produced by FPEMs of different depths into a final feature for segmentation. The learnable post-processing is implemented by Pixel Aggregation (PA), which aggregates text pixels precisely using predicted similarity vectors.

To improve efficiency, the backbone of the segmentation network must be lightweight. However, a lightweight backbone usually has a smaller receptive field and weaker representational ability. To compensate, the paper proposes a segmentation head made of the Feature Pyramid Enhancement Module (FPEM) and the Feature Fusion Module (FFM). FFM fuses the features generated by FPEMs at different depths into the final feature used for segmentation. The network also predicts a similarity vector for each text pixel, so that the distance between the similarity vectors of a pixel and the kernel of the same text instance is small.

2. Overall pipeline

  1. Backbone: ResNet-18.

  2. Use 1×1 convolutions to reduce the channel number of each feature map to 128, which gives a thin feature pyramid Fr.

  3. Pass Fr through nc cascaded FPEMs to obtain nc enhanced feature pyramids F1, F2, ..., Fnc.

  4. FFM fuses the nc enhanced feature pyramids into a feature map Ff, whose stride is 4 pixels and whose channel number is 512.

  5. Ff is used to predict text regions, kernels and similarity vectors.

  6. PA post-processing reconstructs the complete text instances from the predicted kernels.
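
As a quick sanity check on the shapes in the steps above, the sketch below lists the size of each level of the thin feature pyramid. The 640×640 input size and the helper name `pyramid_shapes` are assumptions made for illustration; the strides and the 128-channel reduction come from the notes.

```python
def pyramid_shapes(height, width, strides=(4, 8, 16, 32), channels=128):
    """Return (channels, H, W) for each level of the thin feature pyramid Fr."""
    return [(channels, height // s, width // s) for s in strides]


# Hypothetical 640x640 input image (an assumption, not from the notes).
shapes = pyramid_shapes(640, 640)
for stride, shape in zip((4, 8, 16, 32), shapes):
    print(f"stride {stride:2d}: {shape}")

# FFM upsamples the four levels to stride 4 and concatenates them,
# so the fused map Ff has 4 * 128 channels.
fused_channels = 128 * len(shapes)
```

The fused channel count matches the 512 channels stated for Ff.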

3. Feature Pyramid Enhancement Module (FPEM)

  1. FPEM has two phases: up-scale enhancement and down-scale enhancement.

  2. Up-scale enhancement works iteratively on feature maps with strides of 32, 16, 8, 4 pixels; down-scale enhancement then works on its output with strides of 4, 8, 16, 32 pixels.

  3. FPEM enlarges the receptive field (3×3 depthwise convolution) and deepens the network (1×1 convolution) with a small computational overhead.

  4. The FLOPs of FPEM are about 1/5 of those of FPN.
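
To see why a 3×3 depthwise convolution followed by a 1×1 convolution is cheap, the sketch below counts multiply-accumulates for one such layer against one standard 3×3 convolution. The 160×160 feature-map size and the function names are assumptions for illustration. This is a per-layer comparison only and does not reproduce the whole-module 1/5 figure quoted above.

```python
def conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates of a standard k x k convolution (stride 1, same padding)."""
    return h * w * c_in * c_out * k * k


def separable_macs(h, w, c_in, c_out, k):
    """k x k depthwise convolution followed by a 1x1 pointwise convolution."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise


# Assumed stride-4 feature map of a 640x640 input, with 128 channels.
h, w, c = 160, 160, 128
standard = conv_macs(h, w, c, c, 3)
separable = separable_macs(h, w, c, c, 3)
ratio = standard / separable
print(f"standard: {standard:,}  separable: {separable:,}  ratio: {ratio:.2f}")
```

With 128 channels the separable layer needs roughly one eighth of the multiply-accumulates of the standard layer.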

4. Feature Fusion Module (FFM)

The feature pyramids F1, F2, ..., Fnc come from different depths, and both low-level and high-level semantic information matter for semantic segmentation. Upsampling every feature map and concatenating them all would produce 4 × 128 × nc channels and slow down the final prediction, so this option is dropped. Instead, FFM first combines the feature maps of the same scale by element-wise addition. It then upsamples the summed feature maps and concatenates them into a single feature map with 4 × 128 channels.
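
A minimal NumPy sketch of this fusion is given below. The nearest-neighbour upsampling, the small 40×40 map size, and the names `upsample_nearest` and `ffm` are assumptions for illustration, not the paper's implementation.

```python
import numpy as np


def upsample_nearest(x, factor):
    """Upsample a (C, H, W) array by an integer factor (nearest neighbour)."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)


def ffm(pyramids):
    """pyramids: list of nc pyramids, each a list of 4 arrays at strides 4, 8, 16, 32."""
    # Step 1: element-wise addition of same-scale maps across the nc pyramids.
    summed = [np.sum([p[level] for p in pyramids], axis=0) for level in range(4)]
    # Step 2: bring every level to the stride-4 resolution, then concatenate channels.
    target_h = summed[0].shape[1]
    upsampled = [upsample_nearest(m, target_h // m.shape[1]) for m in summed]
    return np.concatenate(upsampled, axis=0)


rng = np.random.default_rng(0)
sizes = (40, 20, 10, 5)  # toy spatial sizes for strides 4, 8, 16, 32
pyramids = [[rng.standard_normal((128, s, s)) for s in sizes] for _ in range(2)]
fused = ffm(pyramids)
print(fused.shape)
```

The output channel count stays at 4 × 128 regardless of nc, which is the point of adding before concatenating.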

5. Pixel Aggregation

PA borrows the idea of clustering to reconstruct complete text instances from their kernels. The kernel of a text instance is the cluster center, and the text pixels are the samples to be clustered. The distance between a text pixel and the kernel of the same text instance should therefore be small.

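1. Aggregation loss (L_agg): pulls text pixels toward the kernel of their own text instance

The distance term of L_agg compares two quantities: the similarity vector of pixel p and the similarity vector of kernel Ki. The kernel's similarity vector is the mean of the similarity vectors of the pixels inside Ki.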
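
For reference, the aggregation loss can be written as below. This is reconstructed from the PAN paper rather than from these notes, so the symbols are assumptions to check against the original: N is the number of text instances, T_i the i-th text instance, F(p) the similarity vector of pixel p, G(K_i) the similarity vector of kernel K_i, and δ_agg a margin (0.5 in the paper, as far as I recall).

```latex
\begin{aligned}
L_{agg} &= \frac{1}{N}\sum_{i=1}^{N}\frac{1}{|T_i|}\sum_{p \in T_i}\ln\bigl(D(p, K_i) + 1\bigr),\\
D(p, K_i) &= \max\bigl(\lVert F(p) - G(K_i)\rVert - \delta_{agg},\ 0\bigr)^{2},\\
G(K_i) &= \frac{1}{|K_i|}\sum_{q \in K_i} F(q).
\end{aligned}
```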

2. The kernels of different text instances should maintain sufficient distance (L_dis)

L_dis is computed over every pair of distinct kernels, from the distance between their similarity vectors.
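
Written out, the discrimination loss takes the following form. As with any formula reconstructed from the PAN paper, the notation is an assumption to verify: N is the number of text instances, G(K_i) is the similarity vector of kernel K_i, and δ_dis is the minimum distance enforced between kernels.

```latex
\begin{aligned}
L_{dis} &= \frac{1}{N(N-1)}\sum_{i=1}^{N}\sum_{\substack{j=1\\ j \neq i}}^{N}\ln\bigl(D(K_i, K_j) + 1\bigr),\\
D(K_i, K_j) &= \max\bigl(\delta_{dis} - \lVert G(K_i) - G(K_j)\rVert,\ 0\bigr)^{2}.
\end{aligned}
```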

L_dis keeps the distance between any two kernels from falling below 3.
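
The two losses can be sketched in NumPy as follows. This is a toy, non-differentiable illustration under stated assumptions: instance labels run from 1 to n, there are at least two instances, the margin 0.5 for aggregation is taken from the paper as I recall it, and the function names are hypothetical.

```python
import numpy as np


def kernel_vectors(sim, kernel_labels, n):
    """Mean similarity vector of each kernel; sim is (H, W, C), labels are 1..n."""
    return np.stack([sim[kernel_labels == i].mean(axis=0) for i in range(1, n + 1)])


def aggregation_loss(sim, text_labels, kernel_labels, n, delta_agg=0.5):
    """Penalise text pixels that lie farther than delta_agg from their own kernel."""
    g = kernel_vectors(sim, kernel_labels, n)
    per_instance = []
    for i in range(1, n + 1):
        dist = np.linalg.norm(sim[text_labels == i] - g[i - 1], axis=1)
        d = np.maximum(dist - delta_agg, 0.0) ** 2
        per_instance.append(np.log(d + 1.0).mean())
    return float(np.mean(per_instance))


def discrimination_loss(sim, kernel_labels, n, delta_dis=3.0):
    """Penalise pairs of kernels closer than delta_dis (assumes n >= 2)."""
    g = kernel_vectors(sim, kernel_labels, n)
    terms = []
    for i in range(n):
        for j in range(n):
            if i != j:
                dist = np.linalg.norm(g[i] - g[j])
                terms.append(np.log(max(delta_dis - dist, 0.0) ** 2 + 1.0))
    return float(np.mean(terms))


# Toy example: two text instances on a 2x4 grid with 4-dimensional vectors.
text_labels = np.array([[1, 1, 2, 2], [1, 1, 2, 2]])
kernel_labels = np.array([[1, 0, 0, 2], [1, 0, 0, 2]])
sim = np.zeros((2, 4, 4))
sim[text_labels == 2, 0] = 4.0  # instance 2 sits 4 units away from instance 1
loss_agg = aggregation_loss(sim, text_labels, kernel_labels, n=2)
loss_dis = discrimination_loss(sim, kernel_labels, n=2)
```

In this toy case every pixel coincides with its kernel vector and the two kernels are more than 3 apart, so neither loss is active.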

During testing, the predicted similarity vectors are used to guide the pixels in the text region to their corresponding kernels. The PA post-processing steps are as follows:

i) Find the connected components in the kernel segmentation result; each connected component is a single kernel.
ii) For each kernel Ki, conditionally merge its neighboring text pixels (4-way) in the predicted text region, provided the Euclidean distance between their similarity vectors is less than d.
iii) Repeat step ii) until there are no eligible neighboring text pixels.
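
Steps i) to iii) amount to a breadth-first region growing from the kernels, which can be sketched as below. The sketch makes several assumptions: each kernel's similarity vector is the mean over its pixels and stays fixed during growth, the default threshold d=6.0 is a placeholder value, and the function names are hypothetical.

```python
from collections import deque

import numpy as np

NEIGHBOURS = ((-1, 0), (1, 0), (0, -1), (0, 1))


def label_kernels(kernel_mask):
    """Step i: 4-connected component labelling of the binary kernel map."""
    labels = np.zeros(kernel_mask.shape, dtype=int)
    count = 0
    for start in zip(*np.nonzero(kernel_mask)):
        if labels[start]:
            continue
        count += 1
        labels[start] = count
        queue = deque([start])
        while queue:
            y, x = queue.popleft()
            for dy, dx in NEIGHBOURS:
                ny, nx = y + dy, x + dx
                if (0 <= ny < labels.shape[0] and 0 <= nx < labels.shape[1]
                        and kernel_mask[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = count
                    queue.append((ny, nx))
    return labels, count


def pixel_aggregation(text_mask, kernel_mask, sim, d=6.0):
    """Steps ii-iii: grow each kernel over neighbouring text pixels."""
    labels, count = label_kernels(kernel_mask)
    # Kernel similarity vector: mean over the kernel's own pixels.
    centres = {k: sim[labels == k].mean(axis=0) for k in range(1, count + 1)}
    queue = deque(zip(*np.nonzero(labels)))
    while queue:  # stops when no eligible neighbour is left
        y, x = queue.popleft()
        k = int(labels[y, x])
        for dy, dx in NEIGHBOURS:
            ny, nx = y + dy, x + dx
            if not (0 <= ny < labels.shape[0] and 0 <= nx < labels.shape[1]):
                continue
            if labels[ny, nx] or not text_mask[ny, nx]:
                continue
            if np.linalg.norm(sim[ny, nx] - centres[k]) < d:
                labels[ny, nx] = k
                queue.append((ny, nx))
    return labels


# Toy example: two touching text instances in a single row of 8 pixels.
text_mask = np.ones((1, 8), dtype=bool)
kernel_mask = np.array([[0, 1, 0, 0, 0, 0, 1, 0]], dtype=bool)
sim = np.zeros((1, 8, 2))
sim[0, 4:, 0] = 10.0  # right half belongs to a different instance
result = pixel_aggregation(text_mask, kernel_mask, sim)
```

Although the two instances touch in the text mask, the similarity vectors stop each kernel from growing into the other instance.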

6. Loss

The text regions and the kernels are supervised with dice loss.

  1. Following PSENet, the ground truth of the kernels is generated by shrinking the original ground-truth polygon by a ratio r.
  2. Online hard example mining (OHEM) is used to ignore simple non-text pixels when calculating L_tex; only the text pixels in the ground truth are taken into account for the kernel and aggregation losses.
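
A minimal NumPy sketch of a dice loss with an optional pixel mask is shown below. The squared-sum denominator is the common form used for this kind of segmentation loss; the `eps` term and the `mask` argument (standing in for the pixels kept by OHEM) are assumptions for illustration.

```python
import numpy as np


def dice_loss(pred, gt, mask=None, eps=1e-6):
    """1 - 2*sum(P*G) / (sum(P^2) + sum(G^2)), optionally restricted to mask."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if mask is not None:
        mask = np.asarray(mask, dtype=float)
        pred, gt = pred * mask, gt * mask  # ignored pixels contribute nothing
    intersection = (pred * gt).sum()
    denom = (pred ** 2).sum() + (gt ** 2).sum()
    return 1.0 - 2.0 * intersection / (denom + eps)


perfect = dice_loss([1, 0, 1], [1, 0, 1])         # prediction equals ground truth
worst = dice_loss([1, 0], [0, 1])                 # no overlap at all
masked = dice_loss([1, 1], [1, 0], mask=[1, 0])   # the wrong pixel is masked out
```

A perfect prediction gives a loss near 0 and a prediction with no overlap gives 1; masking out a pixel removes its influence on the loss.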
