MADCAT (Multilingual Automatic Document Classification Analysis and Translation)
Arabic Corpus is an LDC dataset (LDC2012T15, LDC2013T09, LDC2013T15) for handwriting recognition.
The dataset contains abstracts from news-related passages and blogs. The XML file for each page
provides line and word segmentation information, as well as the writing conditions
(writing style, speed, carefulness) of the page. It is a large dataset, with a total of
42k page images, 750k line images (600k training, 75k dev, 75k eval), and 305 writers.
The text is mainly Arabic, but it also contains English letters and numerals. The dataset contains
about 95k unique words and 160 unique characters. It was used in the NIST 2010 and 2013 OpenHaRT
evaluations (an Arabic large-vocabulary unconstrained handwritten text recognition competition),
possibly with different splits, for the line-level recognition task. A WER of 16.1% was obtained
for line-level recognition in that competition.
More info: https://catalog.ldc.upenn.edu/LDC2012T15,
https://catalog.ldc.upenn.edu/LDC2013T09/,
https://catalog.ldc.upenn.edu/LDC2013T15/.
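
For readers unfamiliar with the WER (word error rate) figure quoted above, the following is a
minimal sketch of how such a score is commonly computed: the word-level edit distance between
each reference line and its recognized line, summed over all lines and divided by the total
number of reference words. The function names and the whitespace tokenization are assumptions
made for illustration; this is not the scoring tool used in the OpenHaRT evaluation, which may
normalize or tokenize text differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(references, hypotheses):
    """Corpus-level WER over paired lines (assumes whitespace-separated words)."""
    total_errors = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = ref.split()
        hyp_words = hyp.split()
        total_errors += edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_errors / total_words
```

For example, scoring the hypothesis "a x c" against the reference "a b c d" counts one
substitution and one deletion over four reference words, giving a WER of 0.5.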