Building an efficient OCR system for historical documents with little training data

Martínek, Jiří; Lenc, Ladislav; Král, Pavel

Title:	Building an efficient OCR system for historical documents with little training data
Other Titles:	Vytvoření efektivního OCR systému pro historické dokumenty s malým množstvím trénovacích dat
Authors:	Martínek, Jiří; Lenc, Ladislav; Král, Pavel
Citation:	MARTÍNEK, J., LENC, L., KRÁL, P. Building an efficient OCR system for historical documents with little training data. Neural Computing and Applications, 2020, vol. 32, no. 23, pp. 17209-17227. ISSN 1433-3058.
Issue Date:	2020
Publisher:	Springer
Document type:	article
URI:	2-s2.0-85084519412 http://hdl.handle.net/11025/42814
ISSN:	1433-3058
Keywords:	CNN; FCN; Historical documents; LSTM; Neural network; OCR; Porta fontium; Synthetic data
Abstract:	As the number of digitized historical documents has increased rapidly, efficient methods of information retrieval and knowledge extraction are needed to make the data accessible. Such methods depend on optical character recognition (OCR), which converts document images into textual representations. This paper introduces a set of methods that allow OCR to be performed on historical document images using only a small amount of real, manually annotated training data. The presented OCR system covers two main tasks: page layout analysis, including text block and line segmentation, and OCR itself. Our segmentation methods are based on fully convolutional networks, and the OCR approach utilizes recurrent neural networks. We show that both the segmentation and OCR tasks are feasible with only a few annotated real data samples. The experiments aim to determine the best way to achieve good performance with the given small set of data. We also demonstrate that the obtained scores are comparable to or even better than those of several state-of-the-art systems.
Rights:	© Springer
Appears in Collections:	Articles (NTIS); Articles (KIV); OBD

Files in This Item:

File	Size	Format
Martínek2020_Article_BuildingAnEfficientOCRSystemFo.pdf	4.63 MB	Adobe PDF


Please use this identifier to cite or link to this item: http://hdl.handle.net/11025/42814
