Label Enhancement for Scene Text Detection

Source: Featured Articles (专题范文)  Time: 2024-01-13 18:57:01

MEI Junjun, GUAN Tao, TONG Junwen

(1. State Key Laboratory of Mobile Network and Mobile Multimedia Technology, Shenzhen 518055, China;
2. ZTE Corporation, Shenzhen 518057, China)

Abstract: Segmentation-based scene text detection has drawn a great deal of attention, as it can describe text instances of arbitrary shapes through pixel-level prediction. However, most segmentation-based methods rely on complex post-processing to separate text instances that lie close to each other, resulting in considerable time consumption during inference. In this paper, a label enhancement method is proposed to construct two kinds of training labels for segmentation-based scene text detection. The label distribution learning (LDL) method is used to overcome the problem that pure shrunk text labels might result in sub-optimal detection performance. Experimental results on three benchmarks demonstrate that the proposed method consistently improves the performance without sacrificing inference speed.

Keywords: label enhancement; scene text detection; semantic segmentation

In recent years, with the wide application of scene text recognition, scene text detection, as a prerequisite step of scene text recognition, has drawn increasing attention from academia and industry. The task remains challenging because text instances vary widely in scale and shape (e.g., horizontal texts, multi-oriented texts and curved texts).

The purpose of scene text detection is to generate a proper region that locates the corresponding text instance. Early algorithms, usually inspired by general object detection, were designed to regress text bounding boxes in the form of rectangles or quadrangles with a certain orientation. However, most regression-based algorithms require complex anchor design for multi-oriented text detection and may fail when handling text instances of complex shapes, e.g., curved texts. To detect irregular text instances, segmentation-based detection, which describes a text instance pixel-wise, has drawn a great deal of attention. Existing efforts to improve segmentation-based detection can be roughly divided into two directions. On the one hand, several innovative models have been proposed for separating text instances lying close to each other, such as Differentiable Binarization (DB)[1] and the Progressive Scale Expansion Network (PSENet)[2]. Taking PSENet[2] as an example, WANG et al. designed a progressive scale expansion network to predict kernels of different scales for each text instance. A post-processing module was then employed to gradually expand the minimal-scale kernel to the complete shape of the text instance. On the other hand, much effort has been made to simplify the post-processing module of segmentation-based detection. Inspired by PSENet, the DB module introduced an approximate binarization function that is differentiable, which simplifies the post-processing pipeline.

From the perspective of the labels used for supervision, traditional segmentation-based detection algorithms, such as PSENet and DB, usually utilize shrunk text regions as training labels. Shrunk text regions produce large margins between different text instances, which makes it easier to separate instances lying close to each other and thus simplifies the post-processing module. However, we argue that pure shrunk text regions come with at least two problems. On the one hand, there is an apparent difference between a shrunk text instance and a real one, which might hinder the convergence of conventional segmentation-based models. On the other hand, shrunk text regions are unfriendly to small text instances, since text regions that are too small may be ignored during training. This reduces the number of positive labels and may result in sub-optimal detection performance.

Inspired by label enhancement (LE)[3], we utilize the idea of label distribution learning (LDL)[4] and propose a label enhancement method to overcome the problems brought by pure shrunk text labels. Moreover, following DB[1], we use removable training branches supervised by the enhanced labels, so that the model stays fast at inference. Compared with the baseline, the proposed method improves the performance significantly, achieving an F-measure of 87.3% on the MSRA Text Detection 500 Database (MSRA-TD500) and 85.6% on the Total-Text dataset. In summary, our work makes at least three contributions:

· The label distribution learning method is used for characterizing text regions.

· The label enhancement method is improved to generate labels for segmentation-based text detection.

· The performance of the proposed method is comparable to the state of the art without sacrificing the inference speed.

Scene text detection has made significant progress in recent years, and a large number of deep learning based methods have been proposed. These efforts can be divided into two categories: regression-based methods and segmentation-based methods.

2.1 Regression-Based Methods

Regression-based methods, initially inspired by general object detection, usually regard the regions of different text instances as bounding boxes with a specific orientation. The Connectionist Text Proposal Network (CTPN)[5] followed the anchor idea of general object detection to predict text slices, which are then connected by a recurrent neural network. Based on CTPN[5], TextBoxes[6] modified the anchor scales and the shapes of the convolutional kernels to predict text instances with various aspect ratios. Several representative regression-based works target multi-oriented text instances. On the basis of the Faster Region-based Convolutional Neural Network (Faster R-CNN)[7], the Rotation Region Proposal Network (RRPN)[8] developed rotation proposals in the region proposal network (RPN) part to detect text with various orientations. The Deep Matching Prior Network (DMPNet)[9] and TextBoxes++[10] instead applied quadrilateral regression to detect multi-oriented text instances. In addition, the Efficient and Accurate Scene Text Detector (EAST)[11] and DeepReg[12] are anchor-free methods that directly predict the rotation angles and quadrilateral text boxes of multi-oriented text instances. However, most regression-based networks fall short of producing accurate bounding boxes for text instances with irregular shapes, although accurate bounding boxes are the basic requirement of a regression-based scene text detector.

2.2 Segmentation-Based Methods

Compared with regression-based models, segmentation-based methods can predict proper regions for irregular text instances pixel-wise. The pipeline of segmentation-based methods usually consists of two key parts: the first makes pixel-level predictions with fully convolutional networks (FCN)[13], and the second converts them into proper text regions with pre-defined post-processing algorithms. ZHANG et al.[14] first utilized an FCN to extract text regions and then detected character candidates from these regions with MSER-based algorithms. Meanwhile, YAO et al.[15] utilized an FCN to predict text regions with respect to three classes: text/non-text, character classes and character linking orientations. Then, in order to correctly distinguish text instances lying close to each other, PixelLink[16] and PSENet[2] were proposed. The core idea of PixelLink is to replace traditional semantic segmentation with instance segmentation. PSENet predicts kernels of different scales for each text instance to find a proper kernel with the minimal scale, and then applies post-processing to recover the complete text instances. Recently, in order to speed up traditional segmentation-based methods for real-world applications, LIAO et al.[1] proposed the DB module to avoid complicated post-processing during inference. The DB module has two core advantages. On the one hand, it uses a branch that can be removed during inference for simplicity. On the other hand, its differentiability helps to find an adaptive threshold, which simplifies the post-processing procedure. However, the traditional DB module only utilizes shrunk text regions for supervision, which may cause sub-optimal detection performance. To improve the detection performance, we propose a framework that, besides shrunk text regions, exploits label ambiguity by adding a removable branch for LDL. The proposed framework achieves better performance than the traditional ones.

In this section, we describe the proposed model in detail. We first introduce the idea of label enhancement and how we extend the traditional label enhancement method. Then, we present the structure of our framework, including the overall pipeline and its components. After that, we describe how the labels for supervision are generated and how the whole network is optimized.

3.1 Label Enhancement

Compared with traditional supervised learning methods that learn only a single label or multiple logical labels, LDL[4] and deep label distribution learning (DLDL)[17] learn the distribution over all the labels. To overcome the challenge that most training sets contain only single or multiple logical labels instead of label distributions, the idea of label enhancement (LE)[3] was proposed. Formally, label enhancement can be described as follows:

Given a training set $S = \{(x_i, l_i) \mid 1 \le i \le n\}$, where $x_i \in \mathcal{X}$ and $l_i \in \{0, 1\}^c$, LE recovers the label distribution $d_i$ of $x_i$ from the logical label vector $l_i$, and thus transforms $S$ into an LDL training set $E = \{(x_i, d_i) \mid 1 \le i \le n\}$.
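The transformation from $S$ to $E$ can be illustrated with a minimal sketch. The enhancement rule used here (adding a small mass `eps` to every label and normalizing) is only a toy assumption chosen to show the data shapes; it is not the LE algorithm of [3], and the names `enhance_label` and `enhance_dataset` are hypothetical.

```python
from typing import List, Sequence, Tuple


def enhance_label(logical: Sequence[int], eps: float = 0.1) -> List[float]:
    """Toy enhancement rule (an assumption, not the method of [3]):
    add a small mass eps to every label, then normalize so the result sums to 1."""
    if not any(logical):
        raise ValueError("logical label vector needs at least one positive label")
    raw = [float(l) + eps for l in logical]
    total = sum(raw)
    return [v / total for v in raw]


def enhance_dataset(
    S: Sequence[Tuple[str, Sequence[int]]]
) -> List[Tuple[str, List[float]]]:
    """Transform S = {(x_i, l_i)} into E = {(x_i, d_i)}."""
    return [(x, enhance_label(l)) for x, l in S]


# Hypothetical samples with c = 3 logical labels each.
S = [("x1", [1, 0, 0]), ("x2", [0, 1, 1])]
E = enhance_dataset(S)
for x, d in E:
    print(x, [round(v, 3) for v in d])
```

Each recovered $d_i$ is a non-negative vector that sums to 1, so it can be read as the description degree of every label for $x_i$.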

Traditional label enhancement methods usually concentrate on a one-dimensional label such as age or emotion. However, the labels of semantic segmentation describe instances in two aspects: classification and localization. It is difficult for traditional LE to recover the label distribution in the aspect of localization from segmentation labels. To apply label distribution learning in our model, we extend traditional LE by constructing correlated labels with respect to localization and their semantic labels, so that multiple labels can be used to describe one text instance in segmentation-based scene text detection. The details of label distribution learning in our model are given in Section 3.2, and the way to generate multiple labels is described in Section 3.3.

3.2 Network Design

The overall pipeline of the proposed model (Fig. 1) includes two major submodules: the feature pyramid module and the LDL module. After an image is fed into our model, the feature pyramid module first processes it and outputs the fused feature map $F$, which consists of multiple feature maps with various scales. The fused feature map $F$ is then fed into the LDL module, where two independent FCN branches predict three score maps: the probability map $P$, the distribution map $D$ and the border map $B$. The border map $B$ is generated by one FCN branch, while the probability map $P$ and the distribution map $D$ are generated by the other branch owing to the similarity between them. To speed up inference, the parts that generate the distribution map $D$ and the border map $B$ are removable, so that only the probability map $P$ needs to be predicted during the inference phase.
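The removable-branch idea can be sketched as a head that returns all three maps during training but only the probability map at inference. This is a structural illustration only: the class `LDLHead`, its argument names and the `scale` stand-in layers are hypothetical, and real branches would be learned convolutional layers.

```python
from typing import Callable, Dict, List

FeatureMap = List[List[float]]
Layer = Callable[[FeatureMap], FeatureMap]


class LDLHead:
    """Two-branch head: one branch yields P (and D while training), the other yields B."""

    def __init__(self, shared: Layer, prob_out: Layer, dist_out: Layer, border: Layer) -> None:
        self.shared = shared        # layers shared by the P and D outputs
        self.prob_out = prob_out    # kept at inference
        self.dist_out = dist_out    # removable: used only for training supervision
        self.border = border        # removable: separate branch for B

    def __call__(self, fused: FeatureMap, training: bool) -> Dict[str, FeatureMap]:
        hidden = self.shared(fused)
        outputs = {"P": self.prob_out(hidden)}
        if training:
            outputs["D"] = self.dist_out(hidden)
            outputs["B"] = self.border(fused)
        return outputs


def scale(k: float) -> Layer:
    """Stand-in for a learned layer: multiplies every value by k."""
    return lambda fmap: [[k * v for v in row] for row in fmap]


head = LDLHead(shared=scale(1.0), prob_out=scale(0.5), dist_out=scale(0.25), border=scale(0.1))
F = [[1.0, 2.0], [3.0, 4.0]]
train_out = head(F, training=True)    # keys: P, D, B
infer_out = head(F, training=False)   # key: P only
```

Because the probability map does not depend on the removable parts, dropping them changes the inference cost but not the predicted $P$.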

The feature pyramid module, the first core component of our model, is constructed to produce feature maps at multiple scales. Inherited from DB, this module is built on the residual network (ResNet)[18], with stages conv3, conv4 and conv5 modulated by 3 × 3 deformable convolutional layers. In the implementation, once the feature maps are generated by conv3, conv4 and conv5 sequentially, each is upsampled and added to the feature map of the previous stage. Then, all the added feature maps are scaled to 1/4 of the original image size and concatenated to obtain the fused feature map $F$.

LDL[4] and DLDL[17] utilize constrained weights as the description degrees of correlated labels to deal with label ambiguity, which can also be viewed as weight normalization over correlated labels. During training, the labels are used to calculate the training losses, so weighted labels effectively act as a weighted loss. Based on this idea, the LDL module in our model has only two FCN branches. Each FCN branch stacks one 3 × 3 convolutional layer and two transposed convolutional layers, with BatchNorm and rectified linear unit (ReLU) layers inserted between the convolutional layers and a Softmax layer at the end of the branch to generate the score maps.

3.3 Label Generation

▲Figure 1. Overall pipeline of the proposed network

▲Figure 2. Label generation for text regions: (a) training image; (b) probability map; (c) border map; (d) distribution map

To perform label distribution learning in our model, at least three correlated labels are utilized during the training phase for the three kinds of score maps, namely the probability map $P$, the distribution map $D$ and the border map $B$, as shown in Fig. 2. The label generation for the probability map is inspired by DB[1]. Given an original text polygon label $G$, the shrunk text region $G_s$ is generated by shrinking $G$ with the Vatti clipping algorithm. The shrinking offset $D$ is calculated from the perimeter $L$ and the area $A$ of the polygon:

$$D = \frac{A(1 - r^2)}{L} \tag{1}$$

where $r$ is the shrink ratio, set to 0.4 empirically. To generate the labels for the distribution map, inspired by DLDL[17], a distance map is generated by Eq. 2, in which each value represents the description degree for the pixel to be positive. The distribution map is then generated by adding the distance map to the probability map pixel-wise.

In Eq. 2, $G(i,j)$ and $G_s(i,j)$ represent the pixel values at row $i$ and column $j$ of the maps $G$ and $G_s$ respectively, and $\mathrm{dist}(i,j,G)$ represents the minimal distance between pixel $(i,j)$ and the border of $G$. Accordingly, $\mathrm{dist}(G,G_s)$ represents the distance between the borders of $G$ and $G_s$ measured along the direction in which $\mathrm{dist}(i,j,G)$ is obtained. For the border map, we first dilate $G$ to $G_d$ with the same offset $D$. Two distance maps are then produced from $G_s$ to $G$ and from $G_d$ to $G$. Finally, the border map is obtained by adding the two distance maps pixel by pixel.
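The label generation for the probability and distribution maps can be sketched for the simplest case of an axis-aligned rectangular text box. Two assumptions are made here and are not taken from the text: the offset follows the DB[1] form $D = A(1 - r^2)/L$, and the distance map equals $\mathrm{dist}(i,j,G)/\mathrm{dist}(G,G_s)$ for pixels inside $G$ but outside $G_s$ and 0 elsewhere. The function names are hypothetical, and the border map is left out.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1): axis-aligned text box G
Grid = List[List[float]]


def shrink_offset(rect: Rect, r: float = 0.4) -> float:
    """Offset D = A * (1 - r^2) / L computed from the area A and perimeter L of G."""
    x0, y0, x1, y1 = rect
    w, h = x1 - x0, y1 - y0
    return (w * h) * (1.0 - r * r) / (2.0 * (w + h))


def inner_distance(x: float, y: float, rect: Rect) -> float:
    """Distance from (x, y) to the nearest edge of rect; negative outside rect."""
    x0, y0, x1, y1 = rect
    return min(x - x0, x1 - x, y - y0, y1 - y)


def make_labels(height: int, width: int, rect: Rect, r: float = 0.4) -> Tuple[Grid, Grid, Grid]:
    """Return (probability map, distance map, distribution map) for one text box."""
    d_off = shrink_offset(rect, r)
    prob = [[0.0] * width for _ in range(height)]
    dist_map = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            d = inner_distance(j + 0.5, i + 0.5, rect)  # use the pixel center
            if d >= d_off:
                prob[i][j] = 1.0                 # inside the shrunk region G_s
            elif d > 0.0:
                dist_map[i][j] = d / d_off       # assumed ramp between G and G_s
    distribution = [[p + q for p, q in zip(p_row, q_row)]
                    for p_row, q_row in zip(prob, dist_map)]
    return prob, dist_map, distribution


rect = (2.0, 2.0, 22.0, 12.0)             # a 20 x 10 box, so D is about 2.8 pixels
prob, dist_map, distribution = make_labels(14, 24, rect)
```

Under these assumptions the distribution map equals 1 inside $G_s$ and decays linearly to 0 at the border of $G$, so pixels between the two borders still contribute soft positive supervision.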

3.4 Optimization

The loss function $L$ of our model is formulated as a weighted sum of the losses for the three score maps, namely the loss for the probability map $L_s$, the loss for the distribution map $L_d$ and the loss for the border map $L_b$.

The weight $\alpha$ is empirically set to 2.0 to give more prominence to the probability map, whose ground truth is generated directly from the logical ground truth. A similar weight has been used in PSENet[2] for the probability map and its original logical ground truth. Following DB[1], we use the binary cross-entropy (BCE) loss for both $L_s$ and $L_d$. To overcome the imbalance between positives and negatives, hard negative mining is applied to the BCE loss. To deal with the imbalance between text and non-text regions, the loss is computed only inside the dilated text polygon $G_d$. $L_s$ and $L_d$ share the same BCE form.

Here $R_s$ denotes the region selected by hard negative mining, in which the ratio of positives to negatives is 1:3. We apply the L1 loss for $L_b$, which is computed over the region $R_d$ inside the dilated text polygon $G_d$, with $y^*$ denoting the label of the border map.
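The optimization described above can be sketched on flattened pixel lists. Several details are assumptions rather than statements from the text: the total loss is taken as $L = \alpha L_s + L_d + L_b$ with unit weights on the last two terms, a pixel counts as positive when its target is greater than 0, and each loss is averaged over its selected pixels. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def bce(pred: float, target: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one pixel; target may be a soft label in [0, 1]."""
    p = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def bce_hard_negative(preds: Sequence[float], targets: Sequence[float],
                      neg_ratio: int = 3) -> float:
    """Mean BCE over all positives plus the hardest negatives (neg_ratio per positive)."""
    losses = [bce(p, t) for p, t in zip(preds, targets)]
    pos = [l for l, t in zip(losses, targets) if t > 0.0]
    neg = sorted((l for l, t in zip(losses, targets) if t <= 0.0), reverse=True)
    selected: List[float] = pos + neg[: neg_ratio * len(pos)]
    return sum(selected) / max(len(selected), 1)


def l1_loss(preds: Sequence[float], targets: Sequence[float],
            mask: Sequence[bool]) -> float:
    """Mean absolute error over the pixels where mask is True (inside G_d)."""
    vals = [abs(p - t) for p, t, m in zip(preds, targets, mask) if m]
    return sum(vals) / max(len(vals), 1)


def total_loss(l_s: float, l_d: float, l_b: float, alpha: float = 2.0) -> float:
    """Assumed weighting: alpha emphasizes the probability-map loss."""
    return alpha * l_s + l_d + l_b


# Hypothetical predictions and labels for five pixels.
preds = [0.9, 0.1, 0.8, 0.2, 0.3]
targets = [1.0, 0.0, 0.0, 0.0, 0.0]
l_s = bce_hard_negative(preds, targets)   # keeps the 3 hardest of the 4 negatives
l_b = l1_loss([0.5, 0.0], [1.0, 1.0], [True, False])
loss = total_loss(l_s, l_d=0.0, l_b=l_b)
```

With one positive pixel and a 1:3 ratio, only the three negatives with the largest BCE values enter $L_s$, which keeps easy background pixels from dominating the loss.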

In this section, we demonstrate the effectiveness of our framework through extensive experiments. We first briefly introduce prevalent datasets for scene text detection and then present the implementation of the proposed model. Ablation studies are also conducted to verify the effectiveness of our architecture. Finally, we compare our model with existing state-of-the-art methods on several benchmarks, such as the ICDAR MLT-2017, Total-Text, MSRA-TD500 and ICDAR-2015 datasets.

4.1 Datasets

The ICDAR MLT-2017 dataset[19] is a multi-language dataset consisting of 7 200 training images, 1 800 validation images and 9 000 testing images. Following DB[1], we utilize the training and validation sets to pre-train our model.

Total-Text[20] is a classical dataset that contains abundant text instances of arbitrary shapes, such as curved text. It provides 1 255 training images and 300 testing images, with each text instance labelled by a word-level polygon annotation.

The MSRA-TD500 dataset[21] is another multi-language dataset with English and Chinese scripts. It consists of 300 training images and 200 testing images, in which the text instances are multi-oriented and labelled at the text-line level. For a better comparison with previous methods, 400 extra training images from HUST-TR400 are included in our experiments.

The ICDAR-2015 dataset[22] is the most commonly used benchmark, featuring abundant text instances with various orientations. Since all the images were collected by Google Glass without regard to image quality, position or viewpoint, the text instances are challenging to detect owing to their varying brightness, scales and viewpoints. The dataset contains 1 000 training images and 500 testing images, with each text instance labelled by a word-level annotation.

4.2 Implementation Details

4.3 Ablation Study

We conduct an ablation study on the MSRA-TD500 dataset to verify the effectiveness of the proposed method. The model supervised only by the probability map label is taken as our baseline. The results are shown in Table 1.

As shown in the table, the proposed method enhances the performance considerably for both the ResNet-18 and ResNet-50 backbones on the MSRA-TD500 dataset. For the ResNet-18 backbone, training with the additional distribution map (Dis) and training with the additional distribution and border maps (Dis+Bor) achieve F-measure gains of 2.9% and 3.2% respectively. For the ResNet-50 backbone, they bring improvements of 1.7% and 2.2% respectively. Both enhanced labels boost the detection performance significantly.

4.4 Comparisons with Previous Methods

In this section, we verify the proposed method on three standard benchmarks, namely the MSRA-TD500, Total-Text and ICDAR-2015 datasets, by comparing the results with those of previous state-of-the-art methods.

▼Table 1. Ablation study results with different settings on the MSRA-TD500 dataset

1) Curved text detection

Following DB, we first pre-train our model on the MLT-2017 dataset for 100 000 iterations and then fine-tune it on Total-Text for 1 200 epochs. The comparison between our model and previous methods is listed in Table 2, from which we can draw at least three highlights of our method:

· Compared with the baseline DB, our model consistently achieves a better F-measure, which verifies the effectiveness of our model.

· The precision of our model outperforms current state-of-the-art methods by a large margin. Combining the results in Tables 1 and 2, we attribute the improvement to the fact that label ambiguity is exploited through label distribution learning based on the enhanced labels, especially the border map, during the training phase.

· Our model achieves performance comparable to the existing state-of-the-art methods in terms of F-measure.

2) Multi-language text detection

We also evaluate the proposed method on the MSRA-TD500 dataset to test its ability in multi-language text detection. As shown in Table 3, our method with the ResNet-50 backbone achieves an F-measure of 87.3%, surpassing the state-of-the-art methods by more than 1.2%.

3) Multi-oriented text detection

To verify the generalization ability of the proposed method on multi-oriented scene text detection, we evaluate our network on the ICDAR-2015 dataset, a traditional benchmark featuring multi-oriented text instances. The comparison results are listed in Table 4. As shown in the table, our model achieves higher precision than the previous state-of-the-art methods except Corner[23], whose recall is too low for real-world applications; this verifies that our model can suppress false positives effectively. In terms of F-measure, our performance is still comparable to the previous state-of-the-art methods, although the relatively lower recall drags down our F-measure. This indicates the effectiveness of the proposed method for multi-oriented scene text detection.
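The effect of a lower recall on the F-measure follows from its definition as the harmonic mean of precision and recall. The short sketch below uses made-up precision and recall values for two hypothetical detectors; they are not results from the tables of this paper.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (returns 0.0 when both are 0)."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Hypothetical detectors: A is more precise, B is balanced.
f_a = f_measure(precision=0.90, recall=0.80)
f_b = f_measure(precision=0.86, recall=0.86)
print(f"A: {f_a:.4f}  B: {f_b:.4f}")
```

Although detector A has the higher precision, its F-measure is lower than that of detector B, because the harmonic mean is pulled toward the smaller of the two values.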

▼Table 2. Detection results on the Total-Text dataset

▼Table 3. Detection results on the MSRA-TD500 dataset

In this paper, we propose a label distribution learning method for text region detection. Label enhancement is used to construct two kinds of training labels for segmentation-based scene text detection. Experimental results on the benchmarks demonstrate that the proposed method consistently improves model performance without sacrificing inference speed. In the future, we will try to construct enhanced labels for different applications in text detection.

▼Table 4. Detection results on the ICDAR-2015 dataset
