Semantic text segmentation from synthetic images of full-text documents

Bureš, Lukáš; Gruber, Ivan; Neduchal, Petr; Hlaváč, Miroslav; Hrúz, Marek

Full metadata record

DC field	Value	Language
dc.contributor.author	Bureš, Lukáš
dc.contributor.author	Gruber, Ivan
dc.contributor.author	Neduchal, Petr
dc.contributor.author	Hlaváč, Miroslav
dc.contributor.author	Hrúz, Marek
dc.date.accessioned	2020-04-13T10:00:16Z	-
dc.date.available	2020-04-13T10:00:16Z	-
dc.date.issued	2019
dc.identifier.citation	BUREŠ, L., GRUBER, I., NEDUCHAL, P., HLAVÁČ, M., HRÚZ, M. Semantic text segmentation from synthetic images of full-text documents. SPIIRAS Proceedings, 2019, vol. 18, no. 6, pp. 1380-1405. ISSN 2078-9181.	en
dc.identifier.issn	2078-9181
dc.identifier.uri	2-s2.0-85078454715
dc.identifier.uri	http://hdl.handle.net/11025/36868
dc.description.abstract	Je prezentován algoritmus (rozdělený do více modulů) pro generování obrázků fulltextových dokumentů. Tyto obrázky lze použít k trénování, testování a vyhodnocování modelů pro optické rozpoznávání znaků (OCR). Algoritmus je modulární, jednotlivé části lze měnit a vylepšovat tak, aby se vytvářely požadované obrázky. Je popsán způsob získávání obrázků pozadí papíru z již digitalizovaných dokumentů. K tomu byl použit nový přístup založený na variačním autoenkoderu (VAE) pro trénink generativního modelu. Tato pozadí umožňují generování podobných obrázků pozadí jako byly využity pro trénování a to za běhu algoritmu. Modul pro tisk textu používá velké textové korpusy, písmo a vhodný znakový šum a jas pro získání věrohodných výsledků (pro přirozeně vypadající staré dokumenty). Podporováno je několik typů rozvržení stránky. Systém generuje detailní strukturovanou anotaci syntetizovaného obrazu. Používá se Tesseract OCR k porovnání skutečných obrazů s generovanými obrázky. Míra rozpoznávání je velmi podobná, což ukazuje na správný vzhled syntetických obrazů. Kromě toho jsou chyby, které systém OCR udělal v obou případech, velmi podobné. Z generovaných obrazů byla natrénována plně konvoluční architektura neuronové sítě encoder-decoder pro sémantickou segmentaci jednotlivých znaků. S touto architekturou je dosaženo přesnosti rozpoznávání 99,28% na testovací sadě syntetických dokumentů.	cs
dc.description.abstract	An algorithm for generating images of full-text documents, divided into multiple modules, is presented. These images can be used to train, test, and evaluate models for Optical Character Recognition (OCR). The algorithm is modular: individual parts can be changed and tweaked to generate the desired images. A method for obtaining background images of paper from already digitized documents is described. For this, a novel approach that uses a Variational AutoEncoder (VAE) to train a generative model was used. The trained model can generate, on the fly, background images similar to the training ones. The module for printing the text uses large text corpora, a font, and suitable positional and brightness character noise to obtain believable results (natural-looking aged documents). A few types of page layout are supported. The system generates a detailed, structured annotation of the synthesized image. Tesseract OCR is used to compare real-world images with generated images. The recognition rates are very similar, indicating that the synthetic images have a realistic appearance. Moreover, the errors made by the OCR system in both cases are very similar. A fully convolutional encoder-decoder neural network for semantic segmentation of individual characters was trained on the generated images. This architecture reaches a recognition accuracy of 99.28% on a test set of synthetic documents.	en
dc.format	26 s.	cs
dc.format.mimetype	application/pdf
dc.language.iso	en	en
dc.publisher	St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences	en
dc.relation.ispartofseries	SPIIRAS Proceedings	en
dc.rights	© St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences	en
dc.subject	Generování syntetických obrázků, segmentace sémantického textu, variační autoenkoder, VAE, optické rozpoznávání znaků, OCR, generování staře vypadajícího textu	cs
dc.title	Semantic text segmentation from synthetic images of full-text documents	en
dc.title.alternative	Sémantická segmentace textu ze syntetických obrazů fulltextových dokumentů	cs
dc.type	článek	cs
dc.type	article	en
dc.rights.access	openAccess	en
dc.type.version	publishedVersion	en
dc.subject.translated	Generation of Synthetic Images, Semantic Text Segmentation, Variational Autoencoder, VAE, Optical Character Recognition, OCR, Aged-Looking Text Generation	en
dc.identifier.doi	10.15622/sp.2019.18.6.1381-1406
dc.type.status	Peer-reviewed	en
dc.identifier.obd	43929266
dc.project.ID	LO1506/PUNTIS - Podpora udržitelnosti centra NTIS - Nové technologie pro informační společnost	cs
dc.project.ID	SGS-2019-027/Inteligentní metody strojového vnímání a porozumění 4	cs
Appears in collections:	Články / Articles (KKY) OBD

Files attached to this record:

File	Size	Format
Bures_SemanticTextSegmentation_2019.pdf	9.33 MB	Adobe PDF


Use this identifier to cite or link to this record: http://hdl.handle.net/11025/36868

All records in DSpace are protected by copyright, all rights reserved.
