Multi-script Text Extraction from Natural Scenes


Scene text extraction methodologies are usually based on the classification of individual regions or patches, using a priori knowledge of a given script or language. Human perception of text, on the other hand, is based on perceptual organisation, through which text emerges as a perceptually significant group of atomic objects. Humans are therefore able to detect text even in scripts and languages never seen before. In this paper, we argue that the text extraction problem can be posed as the detection of meaningful groups of regions. We present a method built around a perceptual organisation framework [1] that exploits the collaboration of proximity and similarity laws to create text-group hypotheses. Experiments demonstrate that our algorithm is competitive with state-of-the-art approaches on a standard dataset covering text in variable orientations and in two languages.
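
To make the grouping idea concrete, the following Python sketch clusters candidate regions agglomeratively using a distance that combines proximity (centroid position) with similarity (grey level, size). It conveys the flavour of the approach rather than the actual framework of [1]: the region features, their normalisation, and the cut-off threshold are all placeholder assumptions.

    # Illustrative sketch: group candidate regions by combining proximity
    # and similarity cues via agglomerative clustering. This is NOT the
    # exact framework of [1]; features and threshold are placeholders.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Each region described by (x, y) centroid, mean grey level, height.
    regions = np.array([
        [10, 50, 120, 20],   # hypothetical character regions of one word
        [28, 51, 118, 21],
        [46, 49, 122, 19],
        [200, 300, 40, 55],  # an unrelated, isolated region
    ])

    # Standardise each cue so proximity and similarity contribute comparably.
    feats = (regions - regions.mean(axis=0)) / (regions.std(axis=0) + 1e-9)

    # Single linkage: a region joins a group as soon as it is close AND
    # similar to any of the group's members.
    Z = linkage(feats, method="single", metric="euclidean")
    labels = fcluster(Z, t=1.5, criterion="distance")  # arbitrary threshold
    print(labels)  # e.g. the first three regions share one group label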

Our method is inspired by the human perception of textual content, which is largely based on perceptual organisation. The proposed method requires practically no training, as the perceptual-organisation-based analysis is parameter free. It is totally independent of the language and script in which text appears, it deals efficiently with any type of font and text size, and it makes no assumptions about the orientation of the text. Qualitative results demonstrate competitive performance at a lower computational cost than comparable approaches.


L. Gomez and D. Karatzas, "Multi-script Text Extraction from Natural Scenes," in Proc. 12th International Conference on Document Analysis and Recognition (ICDAR), 2013.

Source Code

The source code implementing the method described in the paper can be found at .

On-line text extractor

You can test how our method performs at localizing text in an image provided by you. You can upload images of up to 600 KB (JPEG or PNG) using the form below. A few words of caution in case the output is not what you expect: our method assumes that characters are non-overlapping connected components of the image, with a constant colour and a noticeable contrast with their immediate background. Moreover, some parameters of the region decomposition (the MSER algorithm; a minimal sketch of this step is given after the form below) have been validated using the ICDAR2003 and MSRA-TD500 training sets, and may not be appropriate for your image. To get an idea of the kinds of images on which our method performs well (and on which it does not), you can take a look at the gallery of qualitative results on the KAIST dataset [3] for the task of text segmentation, or on the MSRA-TD500 [4] and ICDAR2003 [5] datasets for the task of text localization.

Send an image file:
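
For readers curious about the region decomposition step mentioned above, here is a minimal Python sketch using OpenCV's MSER detector. The input filename and the parameter values (delta, minimum and maximum area) are illustrative placeholders, not the values validated on the ICDAR2003 and MSRA-TD500 training sets.

    # Minimal sketch of an MSER-based region decomposition with OpenCV.
    # Parameter values are illustrative placeholders, not the ones
    # validated on the ICDAR2003 / MSRA-TD500 training sets.
    import cv2

    img = cv2.imread("scene.jpg")                 # hypothetical input image
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Positional arguments: delta, min_area, max_area. delta controls the
    # stability criterion; the area bounds discard regions too small or
    # too large to be characters.
    mser = cv2.MSER_create(5, 60, 14400)
    regions, bboxes = mser.detectRegions(gray)

    # Draw the bounding box of every candidate character region.
    for x, y, w, h in bboxes:
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 1)
    cv2.imwrite("regions.jpg", img)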


[1] A. Desolneux, L. Moisan, and J.-M. Morel, "A grouping principle and four applications," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, pp. 508-513, 2003.

[2] A. Fred and A. Jain, "Combining multiple clusterings using evidence accumulation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 6, pp. 835-850, 2005.

[3] S. Lee, M. S. Cho, K. Jung, and J. H. Kim, "Scene text extraction with edge constraint and text collinearity," in Proc. ICPR, 2010.

[4] C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu, "Detecting texts of arbitrary orientations in natural images," in Proc. CVPR, 2012.

[5] S. M. Lucas et al., "ICDAR 2003 robust reading competitions: entries, results, and future directions," IJDAR, vol. 7, no. 2-3, pp. 105-122, 2005.

Lluís Gómez and Dimosthenis Karatzas
Computer Vision Center
Edifici O, Campus UAB, 08193 Bellaterra (Cerdanyola), Barcelona, Spain