A HIGH LEVEL APPROACH TO ARABIC SENTENCE
RECOGNITION
ABDULWAHAB GEBRIL KRAYEM
A thesis submitted in partial fulfilment of the
requirements of Nottingham Trent University for the
degree of Doctor of Philosophy
September 2013
This work is the intellectual property of the author, Abdulwahab Gebril
Krayem. You may copy up to 5% of this work for private study, or personal,
non-commercial research. Any re-use of the information contained within this
document should be fully referenced, quoting the author, title, university, degree
level and pagination. Queries or requests for any other use, or if a more
substantial copy is required, should be directed to the owner of the Intellectual
Property Rights.
ABSTRACT
The aim of this work is to develop a sentence recognition system inspired
by the human reading process. Cognitive studies have observed that humans
tend to read a word as a whole: a reader considers the global word shape
and uses contextual knowledge to infer and discriminate a word among other
possible words. The sentence recognition system is a fully integrated system: a
word-level recogniser (the baseline system) combined with a linguistic-knowledge
post-processing module. The baseline system is a holistic, word-based
recognition approach characterised as a probabilistic ranking task; its output
is a set of multiple recognition hypotheses (an N-best word lattice). The basic
unit is the word rather than the character, so the system does not rely on any
segmentation, nor does it require baseline detection. The linguistic knowledge
used to re-rank the output of the baseline system is standard n-gram Statistical
Language Models (SLMs); the candidates are re-ranked using a phrase
perplexity score. The system is an OCR system based on HMMs built with the
HTK Toolkit. The baseline system is supported by global transformation
features extracted from binary word images. The adopted feature extraction
technique is the block-based Discrete Cosine Transform (DCT) applied to the
whole word image; feature vectors are extracted using block-based DCT with
non-overlapping sub-blocks of 8x8 pixels. The HMMs applied to the task are
mono-model discrete one-dimensional HMMs (the Bakis model).
A balanced database of actual scanned and synthetic word images has been
constructed to ensure an even distribution of word samples. The Arabic
words are typewritten in five fonts at a size of 14 points in plain style. The
statistical language models and the lexicon words are extracted from The Holy
Qur'an. The systems are applied to word images with no overlap between the
training and testing datasets. The actual scanned database is used to evaluate
the word recogniser, while the synthetic database provides the large amount of
data required for reliable training of the sentence recognition systems. The
word recogniser is evaluated in mono-font and multi-font contexts, achieving
final recognition accuracies of 99.30% and 73.47%, respectively. The average
accuracy achieved by the sentence recogniser is 67.24%, improved to 78.35%
on average when using 5-gram post-processing. The complexity and accuracy
of the post-processing module are evaluated, and it is found that 4-gram is
more suitable than 5-gram: it is much faster, at an average accuracy of
76.89%.
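As an aside (not part of the thesis text), the block-based DCT feature extraction summarised above can be sketched roughly as follows. This is an illustrative sketch only: the function names, the row-major coefficient selection (the thesis uses a zigzag mask), and the number of retained coefficients are all assumptions.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II transform matrix."""
    k = np.arange(n)
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    C[0] /= np.sqrt(2.0)
    return C

def block_dct_features(image, block=8, coeffs=10):
    """Split the image into non-overlapping 8x8 sub-blocks, apply the 2-D DCT
    to each block, and keep the first `coeffs` coefficients of each block
    (row-major here for simplicity; a zigzag mask would favour the
    low-frequency corner more precisely)."""
    C = dct_matrix(block)
    h, w = (d - d % block for d in image.shape)  # drop ragged edges
    feats = []
    for r in range(0, h, block):
        for c in range(0, w, block):
            d = C @ image[r:r + block, c:c + block] @ C.T  # 2-D DCT, separable
            feats.append(d.flatten()[:coeffs])
    return np.array(feats)

word = np.ones((16, 24))            # stand-in for a binary word image
vectors = block_dct_features(word)  # one feature vector per 8x8 block
print(vectors.shape)                # (6, 10)
```

In an HMM/VQ pipeline such as the one described, each per-block feature vector would then be quantized against a codebook to produce the discrete observation sequence.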
ACKNOWLEDGEMENTS
First of all, I am thankful to Allah, who gave me the strength and
the ability to complete my thesis; without His divine help, I would
never have been able to complete it.
In fact, there are many people to whom I must express my
gratitude for their contributions to the successful completion of this
dissertation.
I would like to express my sincere appreciation and thanks to my
main supervisor, Prof. Nasser Sherkat, who greatly supported me in
completing my work. I will be grateful for his assistance and guidance
forever.
I would also like to thank my co-supervisors, Dr. Lindsay Evett and
Dr. Taha Osman. I appreciate their comments and encouragement
throughout this research work.
Thanks to all my friends and colleagues who have provided
invaluable help and exchanged fruitful views from time to time, which
has been vital to my progress.
My heartiest thanks are due to my highly respected Mother and
my dearest thanks go to my wife, without whose constant support and
encouragement, particularly in difficult times, the whole endeavour of
writing my thesis would not have been fruitful.
Moreover, I am also thankful to my sons and daughters, and other
relatives, for their continuous support and encouragement during my
PhD study.
Last but certainly not least, I wish to express my special gratitude
to my daughter Hajer for her regular insistence that I keep up my hard
work.
TABLE OF CONTENTS
ABSTRACT ............................................................................................... IV
ACKNOWLEDGEMENTS ............................................................................. V
TABLE OF CONTENTS .............................................................................. VI
LIST OF FIGURES .................................................................................... XI
LIST OF TABLES ................................................................................... XIV
LIST OF EQUATIONS ......................................................................... XVIII
LIST OF ACRONYMS ............................................................................. XIX
CHAPTER 1 ......................................................................................... 1
INTRODUCTION ....................................................................................... 1
1.1 Introduction ................................................................................... 1
1.2 Features of Arabic script ................................................................ 2
1.3 Arabic Language Challenges ........................................................... 5
1.4 Problem Definition ......................................................................... 6
1.5 Research Objectives ....................................................................... 9
1.6 Structure of the Thesis ................................................................. 11
CHAPTER 2 ....................................................................................... 13
RELATED CONCEPTS AND LITERATURE REVIEW .................................... 13
2.1 Introduction ................................................................................. 13
2.2 Printed Arabic Text Databases ..................................................... 15
2.2.1 DARPA ...................................................................................... 16
2.2.2 APTI ......................................................................................... 16
2.2.3 MMAC ....................................................................................... 16
2.3 Feature Extraction and 2D-DCT .................................................... 18
2.4 Classification and DHMM .............................................................. 23
2.5 HMM-based Printed Text Recogniser ............................................ 27
2.6 The Re-Ranking Post-Processing Techniques ............................... 31
2.7 Summary ...................................................................................... 36
CHAPTER 3 ....................................................................................... 38
ARABIC WORDS DATABASE PREPARATION ............................................ 38
3.1 Introduction ................................................................................. 38
3.2 Considerations in Building OCR Text Database ............................. 39
3.3 Related Works .............................................................................. 40
3.4 Corpus Acquisition........................................................................ 45
3.5 The Holy Qur'an Arabic ................................................................. 47
3.6 Database Implementation Methodology ....................................... 48
3.6.1 Data collection form ................................................................... 50
3.6.2 Data Preparation and Acquiring ................................................... 53
3.6.3 Pre-processing and Word Segmentation ........................................ 55
3.6.4 Ground Truth ............................................................................ 62
3.7 Summary ...................................................................................... 68
CHAPTER 4 ....................................................................................... 70
HMM/DCT HOLISTIC WHOLE WORD RECOGNISER ................................. 70
4.1 Introduction ................................................................................. 70
4.2 Thesis Methodology ...................................................................... 71
4.3 Arabic Printed Datasets ................................................................ 73
4.3.1 Word Set Selection Criteria: ........................................................ 73
4.3.2 Samples Distribution: ................................................................. 75
4.3.3 Statistical analysis of the Arabic Qur'an text corpus: ...................... 76
4.3.4 Database implementation Process ................................................ 79
4.4 Proposed Recogniser Methodology ............................................... 85
4.5 Overfitting Problem ...................................................................... 89
4.6 Experiment Parameters ................................................................ 91
4.7 Experimental Results.................................................................... 94
4.7.1 Performance Estimation .............................................................. 95
4.7.2 Classification accuracy ................................................................ 98
4.7.3 Classifier stability test .............................................................. 106
4.8 Comparison with other authors .................................................. 110
4.8.1 Mono-font comparison .............................................................. 117
4.8.2 Multi-font comparison ............................................................... 117
4.9 Summary and Conclusion ........................................................... 120
CHAPTER 5 ..................................................................................... 122
LINGUISTIC POST-PROCESSING FOR SENTENCE RECOGNITION .......... 122
5.1 Introduction ............................................................................... 122
5.2 Proposed Re-ranking Post-Processing System ........................... 124
5.2.1 Word-based n-gram models ...................................................... 125
5.2.2 Perplexity ............................................................................... 127
5.3 Sentence recognition system...................................................... 128
5.4 Linguistic post-processing system .............................................. 129
5.5 Lexicon Dataset and Language Model Creation ........................... 129
5.5.1 Quranic-2279UW Database ....................................................... 130
5.5.2 Language Model Creation .......................................................... 132
5.6 Experiments and results ............................................................. 134
5.6.1 Word recogniser ...................................................................... 134
5.6.2 Linguistic Post-processing Evaluation.......................................... 135
5.6.3 Linguistic post-processing Complexity ........................................ 143
5.7 Comparison with Literature Review............................................ 147
5.8 Summary and Conclusion ........................................................... 149
CHAPTER 6 ..................................................................................... 151
CONCLUSION AND FUTURE WORK ....................................................... 151
6.1 Introduction ............................................................................... 151
6.2 Conclusion .................................................................................. 152
6.2.1 Databases ............................................................................... 153
6.2.2 SLMs ...................................................................................... 155
6.2.3 HMM/DCT Holistic Word Recogniser ............................................ 155
6.2.4 Integrated Linguistic Post-processing ......................................... 157
6.2.5 Sentence Recognition System .................................................... 157
6.3 Limitations ................................................................................. 158
6.4 Thesis Contributions ................................................................... 161
6.5 Future Work ............................................................................... 163
BIBLIOGRAPHY .................................................................................... 166
LIST OF FIGURES
Figure 1.1: Arabic text: Printed and Handwritten sentences ............................ 4
Figure 1.2: The characteristics of Arabic text ................................................. 5
Figure 1.3: The word shape of cursive words alone contains sufficient
information to classify the image as one of the lexicon words [1] ............... 8
Figure 1.4: The Sentence recognition system .............................................. 10
Figure 2.1: DCT coefficients with the zigzag pattern ..................................... 21
Figure 2.2: DCT coefficients selection (a) zigzag mask (b) zonal mask............ 21
Figure 2.3: The Markov Generation Model [28] ............................................ 24
Figure 2.4: Simple Left-Right HMM [28] ..................................................... 25
Figure 2.5: A 4-state ergodic model with start and end states ....................... 25
Figure 3.1: The IAM database filled form [44] ............................................. 42
Figure 3.2: The HIT-MW database collection form layout [45] ....................... 43
Figure 3.3: The FHT database form layout details [46] ................................. 45
Figure 3.4: Dependency graph for verse (67:1) [51] .................................... 47
Figure 3.5: Form layout ............................................................................ 51
Figure 3.6: Group and Page reference files.................................................. 54
Figure 3.7: The representation of a line in the (x, y) space using (ρ, θ) .......... 58
Figure 3.8: Text section ............................................................................ 60
Figure 3.9: Line segment .......................................................................... 60
Figure 3.10: Segmented word using crop rectangle ...................................... 61
Figure 3.11: Tidy segmentation of the whole word body ............................... 61
Figure 3.12: GT in XML file format for the word  ................................. 64
Figure 3.13: GT in XML file format for subset folder ..................................... 66
Figure 3.14: Database directory structure ................................................... 67
Figure 4.1: Upper and lower baselines of a text line (Simplified Arabic style of
writing) ............................................................................................. 74
Figure 4.2: A block diagram of the proposed holistic whole word recogniser .... 85
Figure 4.3: Block diagram of the applied word-image normalization ............... 86
Figure 4.4: DCT coefficients selection using zigzag masking .......................... 87
Figure 4.5: Left-to-right HMM model with 5-state (Bakis model) .................... 93
Figure 4.6: Discrete HMM description file (prototype) ................................... 93
Figure 4.7: Multi-font-252UW-32CB Performance....................................... 103
Figure 4.8: Multi-font-252UW for Andalus top-1, effect of varying the size of
codebook on recognition accuracy ...................................................... 104
Figure 5.1: The word lattice produced for the sentence “ 
..................................................................................................... 125
Figure 5.2: Word image sample from true scanned and synthetic word images
printed in Thuluth (Deco Type) font .................................................... 131
Figure 5.3: Andalus font recognition accuracy before and after the post-
processing ....................................................................................... 137
Figure 5.4: Simplified Arabic font recognition accuracy before and after the post-
processing ....................................................................................... 138
Figure 5.5: Tahoma font recognition accuracy before and after the post-
processing ....................................................................................... 139
Figure 5.6: Thuluth font recognition accuracy before and after the post-
processing ....................................................................................... 140
Figure 5.7: Traditional Arabic font recognition accuracy before and after the
post-processing ............................................................................... 141
Figure 5.8: Fonts recognition accuracy improvement to top-1 after the post-
processing ....................................................................................... 142
Figure 5.9: Font recognition accuracy worsening to out of the top-10 list after
the post-processing .......................................................................... 143
Figure 5.10: Sentence length frequency distribution in The Holy Qur'an ........ 146
Figure 6.1: The translation problem due to word segmentation error ............ 159
LIST OF TABLES
Table 1.1: The complete Arabic character set ................................................ 4
Table 3.1: Data types ............................................................................... 52
Table 3.2: Used Arabic fonts ..................................................................... 52
Table 3.3: Font Style ................................................................................ 53
Table 4.1: Qur'an frequency distribution of the characters and PAWs ............. 77
Table 4.2: Qur'an unique words frequency distribution ................................. 78
Table 4.3: Samples of all used fonts typewritten in plain style 14 pt ............... 79
Table 4.4: Word-image samples for the five used fonts ................................ 81
Table 4.5: Quranic-252UW database frequency distribution of the characters and
PAWs ................................................................................................ 82
Table 4.6: Quranic-343UW database frequency distribution of the characters and
PAWs ................................................................................................ 84
Table 4.7: Font mask size ......................................................................... 92
Table 4.8: Calculated parameters for validating mono-font Quranic-252UW
database ........................................................................................... 96
Table 4.9: Fivefold cross-validation WRR obtained for Andalus font ................ 96
Table 4.10: Fivefold cross-validation WRR obtained for Tahoma font .............. 97
Table 4.11: Fivefold cross-validation WRR obtained for Simplified Arabic font .. 97
Table 4.12: Fivefold cross-validation WRR obtained for Traditional Arabic font . 97
Table 4.13: Fivefold cross-validation WRR obtained for Deco type Thuluth font 98
Table 4.14: Cross-validation average WRR obtained for all font-types ............ 98
Table 4.15: Calculated parameters for Mono-font-252UW ........................... 100
Table 4.16: Mono-font-252UW-32CB experiment results (WRRs) ................. 100
Table 4.17: Validation and test set performance (WRR%) ........................... 101
Table 4.18: Multi-font-252UW-32CB experiment results (WRRs) .................. 102
Table 4.19: Multi-font-252UW WRRs for Andalus using different codebook .... 104
Table 4.20: Multi-font-252UW-64CB experiment results (WRRs) .................. 105
Table 4.21: Comparing top-1 WRR% for Multi-font-252UW-32CB and Multi-font-
252UW-64CB ................................................................................... 106
Table 4.22: Calculated parameters for Mono-font-343UW ........................... 107
Table 4.23: Mono-font-343UW-32CB experiment results (WRRs) ................. 107
Table 4.24: Comparing top-1 WRR% for Mono-font-252UW-32CB and Mono-
font-343UW-32CB ............................................................................ 108
Table 4.25: Multi-font-343UW-32CB experiment results (WRRs) .................. 109
Table 4.26: Multi-font-343UW-64CB experiment results (WRRs) .................. 110
Table 4.27: Comparing WRR% for Multi-font-252UW-64CB and Multi-font-
343UW-64CB ................................................................................... 110
Table 4.28: Summary of the used databases in the literature review and this
thesis ............................................................................................. 114
Table 4.29: Mono-font comparative summary with the literature review ....... 115
Table 4.30: Multi-font comparative summary with the literature review ........ 116
Table 5.1: Accuracy of the Word Recognition Systems without Post-Processing
..................................................................................................... 135
Table 5.2: Andalus Font Recognition Accuracy with Post-Processing ............. 137
Table 5.3: Simplified Arabic Font Recognition Accuracy with Post-Processing . 138
Table 5.4: Tahoma Font Recognition Accuracy with Post-Processing ............. 139
Table 5.5: Thuluth Font Recognition Accuracy with Post-Processing ............. 140
Table 5.6: Traditional Arabic Font Recognition Accuracy with Post-Processing 141
Table 5.7: Font Recognition Accuracy (from Top-1 to Top-10) Improvement to
Top-1 after the Post-Processing ......................................................... 142
Table 5.8: Fonts Recognition Accuracy Worsening to Out of the Top-10 List after
the Post-Processing .......................................................................... 143
Table 5.9: Top-1 Choice Accuracies Improvement Using N-Gram ................. 145
Table 5.10: n-gram Top-1 Average Accuracies and Complexities for All Fonts
Using OCR Top-10 Outputs ................................................................ 145
Table 5.11: Top-1 Accuracy Improvement between the selected n-grams ..... 145
Table 5.12: Accuracy and Complexity Comparison between the selected n-grams
..................................................................................................... 146
Table 5.13: Comparison Summary between Works Reported in Literature Review
..................................................................................................... 147
LIST OF EQUATIONS
Equation 2.1: 2D DCT coefficients .............................................................. 20
Equation 3.1: Straight line equation ........................................................... 57
Equation 3.2: Straight line polar equation ................................................... 57
Equation 4.1: Word recognition rate (WRR) ................................................ 94
Equation 5.1: Word-based n-gram model [28] .......................................... 126
Equation 5.2: maximum likelihood estimate [28] ....................................... 127
Equation 5.3: The cross-entropy H ........................................................... 128
Equation 5.4: The perplexity PP ............................................................... 128
LIST OF ACRONYMS
2D-DCT  Two-Dimensional Discrete Cosine Transform
ANN     Artificial Neural Network
APTI    Arabic Printed Text Image
BBDCT   Block-Based Discrete Cosine Transform
DAMP    DARPA Arabic Machine Print
DARPA   Defence Advanced Research Projects Agency
DCT     Discrete Cosine Transform
DHMM    Discrete Hidden Markov Model
FL      Fuzzy Logic
GT      Ground Truth
HLM     HTK Language Modelling Toolkit
HMM     Hidden Markov Model
HTK     Hidden Markov Model Toolkit
k-NN    k-Nearest Neighbours
MLP     Multilayer Perceptron
MMAC    Multi-Modal Arabic Corpus
NLP     Natural Language Processing
OCR     Optical Character Recognition
PAW     Piece of Arabic Word
RR      Recognition Rate
SAIC    Scientific Application International Company
SLM     Statistical Language Model
SVM     Support Vector Machines
VQ      Vector Quantization
WER     Word Error Rate
WRR     Word Recognition Rate
CHAPTER 1
INTRODUCTION
1.1 Introduction
Transferring written text into a computer-readable form has long
been one of the attractive research challenges. The objective of text
recognition is to develop a robust and accurate system capable of
achieving the level of human performance in reading. Off-line text
recognition applications can improve input speed and decrease the
possibility of human error by avoiding the retyping of captured
documents. Such a system takes as input a raster image of text, captured by a
scanner or from a screen, and converts it into machine-editable text.
Consequently, several computing disciplines are involved in text recognition,
including image and signal processing, pattern recognition, natural
language processing, and information systems and databases. Although
researchers have investigated the field intensively, existing
systems have not yet achieved human reading capabilities. Automatic
off-line text recognition is inherently difficult due to the great variability
of writing and printing styles: letters appear in different sizes and skews,
and strokes vary in width and shape. The research presented in this
thesis is a contribution toward a reliable recognition system for Arabic
text.
1.2 Features of Arabic script
Arabic is the revealed language of The Holy Qur'an, the
holy book of Islam. It is the liturgical language of Islam used all over the
Muslim world, and it is the official language of nearly twenty Arab
countries located in the Middle East and North Africa. This makes Arabic
script one of the most widely used in the world, and Arabic one of the official
languages of the United Nations. Moreover, a number
of languages use the Arabic alphabet and an Arabic-like writing style, such as
Persian, Urdu, Kurdish, and Jawi.
The Arabic language has three forms. Classical Arabic, the language of
The Holy Qur'an and the dialect of the Arab people in pre-Islamic times,
was standardised in the early Islamic period by the addition of dots and vowel
signs; it is still used as a written language in limited formal circumstances.
The second form is Modern Standard Arabic, an adapted form of Classical
Arabic; it is the language of today's publications and media. Spoken, or
Colloquial, Arabic is the third form, used (only spoken) in daily
interactions, and every area of the Arab countries has its own spoken
dialects.
Arabic writing and English writing are similar in a number of
aspects: both use letters, spaces, numerals, punctuation marks, and special
symbols, and both are written in horizontal lines. However, there are essential
differences between the two languages. Unlike English, Arabic script is
written from right to left and its words are printed in a semi-cursive
style. Both in printed and handwritten form, an Arabic word
can be composed of one or more sub-words, called PAWs, as depicted in
Figure 1.1. In reality, Arabic is a calligraphic, not a syllabic, writing
system. Arabic has 28 main letters in a single case, whose shapes are
context sensitive, as shown in Table 1.1; each letter can have two
to four shapes: isolated, initial, medial, or final. Moreover, there
are groups of letters that share the same main body shape and are
distinguished only by diacritical dots. A character in its different positions
and PAWs is called a glyph. Another characteristic is that Arabic letters can
be vowelized; Figure 1.2 shows more details. These vowels (diacritics) are
usually omitted from writing, but they may be present to resolve
context ambiguity or in formal documents. Arabic handwritten style and
some printing fonts comprise vertical arrangements of letters
called ligatures, and Arabic characters can take a special form known as
elongation. These letter shapes and vowels, together with some other
secondaries, expand the number of Arabic letter shapes to over
120. Furthermore, from the syntax, inflection, and
vocabulary points of view, factors such as feminine and masculine,
singular, dual, and plural forms, and their respective grammatical positions,
need to be considered.
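As a small aside (not part of the thesis), the context-sensitive letter shapes described above are visible in Unicode, whose Arabic Presentation Forms-B block encodes each positional variant explicitly. A minimal sketch for the letter Beh:

```python
import unicodedata

# The Arabic letter Beh takes a distinct glyph in each of its four
# contextual positions; Unicode assigns each form its own code point.
forms = {
    "isolated": "\uFE8F",
    "initial":  "\uFE91",
    "medial":   "\uFE92",
    "final":    "\uFE90",
}
for position, ch in forms.items():
    print(position, unicodedata.name(ch))
# e.g. prints: isolated ARABIC LETTER BEH ISOLATED FORM
```

In normal Arabic text only the base letter (U+0628) is stored; the rendering engine selects the contextual glyph, which is one reason OCR must cope with several shapes per letter class.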
Figure 1.1: Arabic text: Printed and Handwritten sentences
Table 1.1: The complete Arabic character set
Figure 1.2: The characteristics of Arabic text
1.3 Arabic Language Challenges
The characteristics mentioned above demonstrate the complexity
of the Arabic language. These complexities, including the writing style,
character shape similarity, diacritics, and inflection, pose unique
challenges for text recognition and natural language processing. The
main challenges for text recognition are segmentation and
classification. The segmentation difficulties lie in segmenting the semi-
cursive script into words, and words into letters, letter parts, or primitives.
In order to segment correctly and extract structural primitives, it is vital
to detect the imaginary line(s) called the baseline(s). Baseline detection
is considered a crucial problem, especially in Arabic writing. The presence of
noise and diacritics greatly increases the difficulty of word segmentation
and baseline detection.
Diacritics are a valuable and important feature of the Arabic
language. They provide information about the inflectional features of words
within the sentence and resolve word-meaning ambiguity by adding
semantic information to words. Furthermore, the syntactic position of a word
within the sentence is determined by its diacritics. The morphological
richness of the Arabic language is a major concern for the development of
Natural Language Processing (NLP) systems: Arabic is classified as a highly
inflectional language in which the word stem is affixed with prefixes, infixes,
and suffixes to indicate case, gender, number, tense, etc.
1.4 Problem Definition
Research in text recognition systems is mainly based on two
approaches, the analytical and the holistic. With a few scattered
exceptions, most recognition algorithms use the analytical technique.
This first approach segments a word into smaller units such as characters
or graphemes (sub-parts of characters). This strategy is adequate for
open vocabularies, as it identifies smaller units that are
concatenated to form word hypotheses, which may be valid
words or may not belong to the language. Due to the cursive written
form of Arabic script and the high variability of character forms, and
especially when the writing or the image is poor, segmentation is a
hard and error-prone process. The segmentation issue is the
major drawback of this approach, as segmentation errors mislead the
classifier during character recognition.
In holistic approaches, on the other hand, the word is treated as
a whole, without segmentation, and the system attempts to recognise it using
its global shape features. Holistic approaches eliminate the segmentation
problem that is the primary concern of analytical approaches, and they
may succeed on poorly written or digitised words. The recognition
result depends on the lexicon: the recognised word is the lexicon word
with the highest probability in the recognition result. The price for this
method's advantages, namely speed and the avoidance of problems associated
with segmentation, is that the recogniser is constrained to limited-lexicon
applications. This is because holistic approaches consider every word a
different class, so as the lexicon size increases, the number of classes
increases, which in turn increases the probability of misclassification. The
direct source of this misclassification is the limited ability of the feature
set to discriminate between word classes.
These two approaches reflect two different interpretations from psychological studies of human visual word recognition in alphabetic languages. Analytical models support the theory of reading words letter by letter; the opposing view is the holistic model, in which words are recognised on the basis of their global shapes.
One of the cognitive studies observed that humans tend to read a word as a whole, with contextual information used to deduce and discriminate a word among other possible words. Holistic theories of reading propose that the visual recognition of words is affected by their global shape, as shown in Figure 1.3 [1] [2] [3], while segmentation takes place in the case of unfamiliar words [4]. The use of context can be seen when a misspelled word is read correctly at the passage or sentence level. When reading a text and facing a distorted written word, a human has the ability to delay reading some words until more contextual knowledge is gathered to determine the most probable right word [5].
All of the above are important motivations for building an automatic text recognition system based on a holistic Arabic whole-word approach enhanced by contextual knowledge.
The aim of this research is to improve the accuracy of the holistic approach for offline printed text by designing a sentence recognition system to deal with written words. The fundamental idea of using the sentence level is the ability to apply contextual information to the sentence (word hypotheses) to overcome the main weakness of the holistic approach: the vocabulary growth problem. The task of sentence recognition can be decomposed into two subtasks: holistic Arabic whole-word recognition and sentence-level post-processing using contextual information. The recognition phase is completed using holistic word features, and the output is an ordered word-hypothesis lattice of the top-n matching words in the lexicon, re-ranked using a language model.
Figure 1.3: The word shape of cursive words alone contains sufficient information to classify the image as one of the lexicon words [1]
1.5 Research Objectives
To achieve the aim of this dissertation, as stated in the problem definition (Section 1.4), a word-based (holistic) recogniser needs to be constructed; it has tuneable parameters, its input is word features representing the global word shape, and its output is a ranked list of word hypotheses. In many cases, the correct word may not be present in the top-1 choice, or even in the chosen top-n. To minimise the word error at or near the top-1, the candidates are re-ranked using one of the re-ranking post-processing techniques. Statistical language models (SLMs) are one of these techniques used to re-rank the candidates, especially in the case of sentence recognition (see Figure 1.4). To train and test the whole system it is required to have a word-image database related to an electronic corpus, since the corpus is the fundamental element in creating the SLM.
Figure 1.4: The Sentence recognition system
Taking the above into account, the following objectives are addressed:
- Define an electronic corpus: a closed and authentic Arabic corpus
- Develop benchmark data for printed Arabic words that satisfy the development needs: full Arabic sentences related to the corpus, printed in fonts that cover different printing complexities of character shapes and that have been investigated by other researchers
- Design and implement a trainable classifier that produces multiple recognition hypotheses and is language- and corpus-independent. A holistic word-based recogniser satisfies the main idea; its input is a word feature vector and its output is an N-best word lattice
- The system has to recognise word images printed in font sizes in the range from 10 to 16 pt, as regularly used in printed Arabic documents such as magazines [6]
- The feature extraction stage plays an important role in constructing a successful recognition system. The target technique is intended to be simple, to bypass word segmentation, and to be language independent
- Apply linguistic post-processing to improve the reliability of the holistic word-based recognition system so that it can handle larger corpora
1.6 Structure of the Thesis
This dissertation is organised as follows.
- Chapter 2 discusses previous work in the areas of text-image databases, text recognisers, and re-ranking post-processing techniques
- Chapter 3 discusses previous database implementations and demonstrates the methodology for word-image database creation
- In chapter 4, a discrete one-dimensional hidden Markov (Bakis) model is constructed based on the Hidden Markov Model Toolkit (HTK, http://htk.eng.cam.ac.uk/) using features in the block-based DCT domain. For system evaluation (validation/training, testing, and stability testing), two databases are established. The two databases contain true scanned typewritten Arabic word-images in five different fonts. Validation on the training set is performed using the k-fold cross-validation technique. Result discussion and comparison with other related works are presented
- Integrated linguistic post-processing is described in chapter 5, where linguistic knowledge is applied to improve the accuracy of the sentence recogniser by re-ranking its top-n hypotheses using n-gram SLMs of orders 1 to 5. The recognition system is evaluated using a synthetic word-image database, and The Holy Qur'an is used to enable the use of Statistical Language Models (SLMs) to improve the recognition rate. The achieved improvements are discussed and compared with other related works
- The last chapter, chapter 6, presents the conclusions, limitations, and future work. The infrastructure utilities and units of the integrated system are discussed. Results are reviewed, as well as the system limitations.
CHAPTER 2
RELATED CONCEPTS AND LITERATURE
REVIEW
2.1 Introduction
Each classical recogniser has two key components, namely feature extraction and pattern classification. Feature extraction is one of the most important processes in recognition systems; it has a strong influence on the classification stage. The major goal of feature extraction is to obtain the most significant information and represent the whole data in a compact set of features that maximises the recognition rate.
In conventional research, which constitutes the majority of the literature, the recognition system is trained on features of whole or partial characters and then recognises each character. Consecutive characters are combined to generate candidate word patterns. This technique is known as segmentation-based. Indeed, segmentation is one of the greatest challenges of cursive script, due to its variability. Arabic text segmentation still remains an unsolved problem, although many segmentation algorithms exist. One way of remedying this is to bypass segmentation and treat the word or text line as the major unit for training and recognition. Several researchers have proposed segmentation-free approaches in which the unit is scanned from right to left and, at each horizontal position, a set of features is extracted from a narrow vertical strip. These features are fed to a classifier that is able to achieve segmentation and character recognition jointly; this approach is analogous to Automatic Speech Recognition (ASR). Other researchers try to recognise the whole representation of a word instead of segmenting and recognising its characters or primitives individually. These systems are generally known as the holistic approach.
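The segmentation-free, sliding-window idea above can be sketched as follows. This is a minimal illustration, not any specific published system; the frame width, overlap, and the three per-strip features (ink count, dominant row, inked-row count) are assumptions chosen for the example:

```python
import numpy as np

def sliding_window_features(word_image, frame_width=3, overlap=1):
    """Scan a binary word image (H x W, 1 = ink) right-to-left and
    extract a small feature vector from each narrow vertical strip."""
    H, W = word_image.shape
    step = frame_width - overlap
    frames = []
    # start at the rightmost strip, matching the Arabic writing direction
    for end in range(W, frame_width - 1, -step):
        strip = word_image[:, end - frame_width:end]
        frames.append([
            strip.sum(),                    # total ink in the strip
            strip.sum(axis=1).argmax(),     # row with the most ink
            (strip.sum(axis=1) > 0).sum(),  # number of inked rows
        ])
    return np.array(frames, dtype=float)
```

The resulting sequence of frame vectors can then be fed to a sequential classifier such as an HMM, exactly as a speech recogniser consumes acoustic frames.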
The classification stage is the decision-making stage of any recognition system. It uses the features extracted in the feature extraction stage to classify the text segment according to predefined rules. The results of the classification stage are related to the information in the image data. As the classifier sometimes produces not a single candidate but a set of possible candidates, the post-processing stage comes into play to improve the selection of the right solution.
The text-image database plays an essential role in text recogniser development and evaluation. Therefore, its contents must meet the needs of the recogniser's nature. Moreover, relating the database to a corpus gives the recogniser the facility to employ linguistic knowledge in the post-processing stage. Text databases are classified, based on the recogniser's nature, into handwritten text and machine-printed documents; their units can be digits, characters, words or texts. In some cases a database serves a specific domain application, such as bank cheque and postal address reading. In the field of text recognition, having a standard database is vital for evaluating and comparing results achieved by different groups of researchers. It is worth mentioning that there is no generally accepted database, for handwritten or machine-printed text, that is freely available to researchers and can be considered a benchmark.
2.2 Printed Arabic Text Databases
A rich research literature exists for Arabic handwriting, and only a few articles deal with Arabic typewritten text. This explains the abundance of handwritten Arabic databases and the scarcity of machine-printed Arabic databases. A detailed discussion of handwritten Arabic text databases can be found in [7].
Datasets can consist of synthetic or real-world data. Synthetic images are somewhat similar to what can be seen in real-world documents. Real-world data can be simulated by applying modelled degradation to ideal input images. This is a possible way to automatically generate quite a large database, beneficial in terms of evaluation. The main weakness of this approach is that the generated database may not have the same variety as real-world data. Worse still, it may bias a recogniser towards unnatural image styles.
The DARPA Arabic Machine Print (DAMP) document corpus was the most popular large-scale typewritten Arabic database in the literature. The APTI database [8] is the most popular synthetically generated machine-printed database for screen-based OCR. In addition, there are other private databases developed to meet researchers' needs.
2.2.1 DARPA
The DARPA Arabic Machine Print (DAMP) corpus was collected by SAIC. It consists of 345 scanned pages (at 600 DPI) of Arabic text (nearly 670k characters) with ground truth [9]. The scanned pages are a collection of book chapters, magazine articles, newspapers and computer-printed text in 4 different fonts. Apparently the data is unavailable today [9].
2.2.2 APTI
The APTI database, developed by Slimane et al. [8], is suited for the evaluation of screen-based OCR systems. The database contains more than 45 million single-word images representing more than 250 million characters. Word images are synthetically rendered in low resolution (72 dpi) with a lexicon of 113,284 different Arabic words (648,280 characters) presented in 10 fonts, 10 font sizes and 4 font styles. Each word image in APTI has its ground-truth annotation provided in XML files. The dataset is divided into six equilibrated sets, where the frequency distribution of each letter is even across the sets. The first five sets are publicly available for research, while the sixth set is kept internal, used by the creators of APTI for evaluating submitted systems in blind mode.
2.2.3 MMAC
The MMAC corpus is presented by AbdelRaouf et al. [10]. The data is created from text and images of existing documents. It consists of 6 million Arabic words (282,593 unique words) selected from different sources covering old Arabic, religious texts, traditional and modern language, different specialisations, and online chat rooms. In addition to words, the corpus includes PAWs as well as naked words and naked PAWs. Diacritical markings are not considered in this corpus. The image data are presented in three different formats: single token, paragraph and full page.
The full-page real-image dataset comprises 19 different documents, scanned at 300 DPI in 24-bit RGB colour mode and stored in TIFF file format. Each document has an associated text file.
The paragraph image dataset comprises 552 paragraphs (around 8,000 words), presented in three different categories: real scanned (223 images), computer-generated (141 images) and computer-generated with artificial noise (188 images). The font types, sizes and styles are not specified. The ground-truth text is provided for all paragraph images.
The single-token image dataset comprises 1,779,582 rendered images; each token (word, naked word, PAW or naked PAW) is presented in three Arabic fonts: Simplified Arabic, Arabic Transparent, and Traditional Arabic. The font size is 14 point and the style appears regular. Each token is rendered to a bitmap at 300 DPI resolution and stored in grey-scale BMP file format. The total number of word images is 847,779, generated for the 282,593 unique words. Each token image has its ground-truth XML file. To simulate real-life scanned documents, the rendered images are degraded by skewing and adding artificial noise.
All the databases mentioned above are large-scale benchmarking databases. However, many research groups build small datasets of their own that fulfil their application needs, like those presented later in Section 2.5.
2.3 Feature Extraction and 2D-DCT
One of the important processes in pattern recognition is feature extraction: the process of extracting the most significant information to represent the whole raw data in a small number of new values suitable for the classification phase. It is used to reduce the dimension of the feature set, which improves prediction accuracy and minimises processing complexity. Well-selected features are those that efficiently discriminate between patterns of different classes but are invariant for patterns within the same class. Feature extraction methods can be classified, based on feature type, into three categories: structural features, statistical features, and global transformations [11].
Structural features represent the global and local properties of patterns through geometrical and topological features. Topological features represent a pattern by extracting and counting topological elements such as ascending, descending and middle strokes, openings to the right, left, up and down, loops, cross points, branch points, line ends, etc. [12] [13] [14]. With geometrical features, patterns are represented by measurements of geometrical quantities such as word length, aspect ratio, centre of gravity, mean grey value, stroke width, and more [15] [16] [17].
The statistical feature set is pixel-based information derived from the statistical distribution of pixels, calculated over images or regions of images. Such features include zoning, moments and projection histograms, among many others [13] [18] [19].
Global transformations convert the pixel representation of the pattern to a more compact form, which shortens the feature dimensionality. In general, transformation schemes can be easily applied and provide features invariant to global deformations like translation, dilation and rotation. Moreover, global transformation features are language-independent feature sets. Several mathematical transforms have been adopted in text recognition, including wavelets, the Hough transform, the Gabor transform, the Fast Fourier Transform, the DCT, and others [18] [20] [21] [22]. Good surveys of feature extraction methods for text recognition are presented in [23] and [7].
In this research, as is often the case, the term DCT refers to the 2D DCT of type II (DCT-II), which is the most commonly used type. The DCT is one of the most widely used transformations in the areas of speech and image storage and transmission, as well as in feature extraction for pattern recognition. It has the property that most of the signal information is concentrated in just a few low-frequency coefficients.
For an M x N image expressed by f(x, y), the DCT coefficients are calculated as follows [24]:

Equation 2.1: 2D DCT coefficients

C(u, v) = \alpha(u)\,\alpha(v) \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y) \cos\left[\frac{(2x+1)u\pi}{2M}\right] \cos\left[\frac{(2y+1)v\pi}{2N}\right]

for u = 0, 1, ..., M-1 and v = 0, 1, ..., N-1, where \alpha(u) = \sqrt{1/M} for u = 0 and \sqrt{2/M} otherwise, and \alpha(v) = \sqrt{1/N} for v = 0 and \sqrt{2/N} otherwise.
The resultant DCT coefficient matrix has the same size and dimensions as the input image; this implies that the DCT itself does not reduce the data dimension.
Figure 2.1 shows the properties of the DCT coefficients ordered in a zigzag pattern for an 8x8 block. The upper-left coefficient, called the DC coefficient, has frequency (0, 0) and holds most of the image energy, while the rest are AC coefficients of increasing frequency along the zigzag path, reaching the highest frequency at the lower-right coefficient. Low frequencies are more visually significant in an image than higher frequencies, which represent noise. This demonstrates that the low-frequency coefficients have a great effect on image reconstruction, and the high-frequency coefficients can be discarded without sacrificing too much image quality.
Figure 2.1: DCT coefficients with the zigzag pattern
For feature selection, after discarding some useless coefficients at low and high frequencies, low-dimensional feature vectors are usually selected with conventional methods such as zigzag scanning or zonal masking. The two approaches, zigzag and zonal masking, are shown in Figure 2.2 (a) and (b) respectively. The DCT can be applied to the entire image or in a block-based manner. In block-based operation the image is divided into blocks of m x n pixels, where each block is transformed as a single image. The widely used block size is 8 x 8, the same size used by the JPEG compression standard.
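As a sketch of how block-based DCT features might be computed, the following implements Equation 2.1 in separable form via orthonormal basis matrices, then keeps the first few coefficients of each 8x8 block in zigzag order. The number of retained coefficients per block (`keep`) is an assumption chosen purely for illustration:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal 1D DCT-II basis matrix (rows indexed by frequency u)."""
    u = np.arange(n)[:, None]
    x = np.arange(n)[None, :]
    mat = np.sqrt(2.0 / n) * np.cos((2 * x + 1) * u * np.pi / (2 * n))
    mat[0, :] /= np.sqrt(2.0)   # alpha(0) = sqrt(1/n)
    return mat

def dct2(block):
    """2D DCT-II of a block, the separable form of Equation 2.1."""
    return dct_matrix(block.shape[0]) @ block @ dct_matrix(block.shape[1]).T

def zigzag_indices(n):
    """JPEG-style zigzag ordering of the coefficients of an n x n block."""
    order = []
    for s in range(2 * n - 1):
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        order.extend(diag[::-1] if s % 2 == 0 else diag)
    return order

def block_dct_features(image, block=8, keep=10):
    """Tile the image into non-overlapping blocks and keep the first
    `keep` zigzag-ordered DCT coefficients of each block."""
    H, W = image.shape
    zz = zigzag_indices(block)[:keep]
    feats = []
    for r in range(0, H - block + 1, block):
        for c in range(0, W - block + 1, block):
            coeffs = dct2(image[r:r + block, c:c + block].astype(float))
            feats.append([coeffs[i, j] for i, j in zz])
    return np.array(feats)
```

Because the retained coefficients sit at the low-frequency end of the zigzag path, this directly exploits the energy-compaction property described above.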
Figure 2.2: DCT coefficients selection (a) zigzag mask (b) zonal mask
It is worth mentioning that the DCT has the advantage of fast computation of both the forward and inverse transforms, since they operate on real numbers. Because of its strong energy compaction and real-number calculation properties, it has been successfully used in pattern recognition applications, such as face recognition [25] and script recognition [26].
AlKhateeb et al. [27] apply the DCT to the entire image to obtain 100 DCT coefficients in zigzag order. The feature matrices are normalised into the range [-1, 1] before being fed to an ANN classifier, achieving a word-level recognition rate of 80.75%. The word images used in the experiments are from the well-known IFN/ENIT database of handwritten Arabic words.
Another system by AlKhateeb et al. [18] recognises Arabic handwritten texts using a k-NN classifier with block-based DCT (BBDCT) features. A word image is segmented into non-overlapping blocks/frames of 20 pixels width, and the DCT coefficients of each frame are computed as the feature vectors of the word image. Finally, from each block the first five DCT coefficients in zigzag order are used for classification. This system was tested on the IFN/ENIT database and achieved a 61.45% recognition rate.
Generally, once the features of a pattern are extracted, the feature extraction stage passes the selected feature information to the classifier in the form of one, or a combination, of these feature representations.
2.4 Classification and DHMM
The classification stage is the crucial and essential stage of the recognition process. It is the main decision-making part, classifying an unknown sample, based on the extracted features, into one of a finite set of predefined classes. Classification methods can produce a unique solution or a set of approximate solutions as output. A good OCR system has the ability to cope with variations in writing styles while being capable of distinguishing similar yet different characters or words. Several classification techniques and methods for Arabic OCR have been reported, intended to reduce processing time while improving the recognition rate. Most of these classification techniques are based on ANN, HMM, k-NN, FL, SVM, hybrid approaches and others. A comprehensive survey is presented in [7].
HMMs are doubly stochastic processes which have been widely and successfully used in the recognition of one-dimensional signals, especially in speech processing, where temporal information is available. This efficiency has motivated researchers to employ HMMs in character recognition. Online handwriting recognition, where the input is a one-dimensional function of time, is one example of the use of HMM approaches. The same approach has been applied to the two-dimensional text-image signal and has shown results comparable to other classification techniques. Interpreting off-line written text as a sequence of ink signals moving along the writing direction, analogous to the temporal sequence in speech, can be achieved by applying a fast segmentation process using a sliding-window technique. In addition, HMM classifiers have been successfully applied in off-line text recognition without consideration of temporal information, where the word image is either considered as a whole or segmented into its primitives.
An HMM is a finite state machine consisting of a set of hidden states connected to each other by conditional transition probabilities (see Figure 2.3). Each hidden state has an associated set of probabilities of emitting particular visible states according to some output probability density function (pdf). If no constraints are placed on the transition matrix, the unit model topology can be defined with an arbitrary number of model states and transitions between these states. Several types of model topologies have been proposed in the literature, ranging from left-right models to fully connected ergodic models. Figure 2.4 depicts the left-to-right (Bakis) model and Figure 2.5 an ergodic model.
Figure 2.3: The Markov Generation Model [28]
Figure 2.4: Simple Left-Right HMM [28]
Figure 2.5: A 4-state ergodic model with start and end states
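To make the topology distinction concrete, here is a minimal sketch of constructing a left-to-right (Bakis) transition matrix in which each state may only stay, advance, or skip a limited number of states forward. The uniform row initialisation is an assumption for illustration; in practice these probabilities are re-estimated during training:

```python
import numpy as np

def bakis_transitions(n_states, max_jump=2):
    """Left-to-right (Bakis) transition matrix: from state i, only states
    i..i+max_jump are reachable; each row starts as a uniform distribution."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        j_max = min(i + max_jump, n_states - 1)
        A[i, i:j_max + 1] = 1.0 / (j_max - i + 1)
    return A
```

An ergodic model, by contrast, would simply have every entry of the matrix non-zero.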
HMMs are mainly classified into three categories depending on the method of modelling output probabilities [28]: discrete HMMs [29], continuous HMMs [30], and semi-continuous HMMs [31]. However, discrete HMMs are attractive because of their low computation cost [28]. Another advantage of the DHMM is that it is able to model any distribution, especially for noisy signals (such as speech) where the assumption of a normal distribution is not satisfied [32]. Accordingly, DHMMs can represent more complicated shapes [29]. DHMMs are mainly suitable for modelling data which is naturally symbolic, such as letters and words, bitmap images, and DNA sequences. In general, using DHMMs demands that the feature vectors be encoded into discrete symbols using a vector quantisation (VQ) technique. DHMMs can also be used with continuous signals such as speech by mapping the continuous feature vectors to discrete symbols using VQ.
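The VQ encoding step just described can be sketched as a nearest-centroid mapping. The codebook here is assumed to have been trained beforehand (for example by k-means clustering of the training feature vectors):

```python
import numpy as np

def vq_encode(features, codebook):
    """Map continuous feature vectors to discrete symbols: each vector is
    replaced by the index of its nearest codebook centroid (Euclidean)."""
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return dists.argmin(axis=1)
```

The resulting symbol indices are exactly the discrete observations a DHMM consumes.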
As can be seen from Figure 2.3 above, the DHMM is defined by the following entities:
- a finite set of hidden states S = \{s_1, \ldots, s_N\}, with N being the number of states
- a finite set of observations V = \{v_1, \ldots, v_M\}, with M being the size of the symbol alphabet
- a transition matrix A = \{a_{ij}\}, where a_{ij} is the transition probability from state s_i to state s_j, with a_{ij} \geq 0 and \sum_{j=1}^{N} a_{ij} = 1
- an observation (output) matrix B = \{b_j(k)\}, where b_j(k) is the probability of generating observation (symbol) v_k given the state s_j
- an initial state distribution vector \pi = \{\pi_i\}, representing the probabilities of initial states, with \pi_i \geq 0 and \sum_{i=1}^{N} \pi_i = 1

Thus it can be seen from the above that a complete specification of a DHMM can be described compactly by the parameter set \lambda = (A, B, \pi).
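Given the parameter set \lambda = (A, B, \pi), the likelihood of a discrete observation sequence can be evaluated with the standard forward algorithm. The sketch below is a minimal illustration, not the HTK implementation, and the toy 2-state, 2-symbol model in the example is a hypothetical set of numbers:

```python
import numpy as np

def forward_likelihood(A, B, pi, obs):
    """P(obs | lambda) for a discrete HMM lambda = (A, B, pi) computed
    with the forward algorithm; obs is a sequence of symbol indices."""
    alpha = pi * B[:, obs[0]]          # initialisation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # induction step
    return alpha.sum()                 # termination

# A toy 2-state, 2-symbol model (hypothetical numbers for illustration)
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
```

In a whole-word recogniser, one such model is trained per lexicon word, and an input observation sequence is assigned to the word model giving the highest likelihood.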
2.5 HMM-based Printed Text Recogniser
A recent survey covering research in handwritten Arabic text recognition is presented in [7], and machine-print recognition up to 2006 is surveyed in [33]. Recently, most researchers have focused on handwritten rather than typewritten recognition. There are numerous publications on Arabic OCR. In this study, the most relevant works are exploited: those that use an HMM classifier applied to the same computer-printed Arabic fonts.
Khorsheed [34] presented a font-independent recogniser. The system is designed using multiple left-to-right HMMs, where each word is represented by a separate model. Features are represented by segmental vector quantisation extracted from half of the Fourier spectrum coefficients of the normalised polar word image. The system was designed using a left-to-right 6-state HMM in a serial model, where only one transition from a state to its successor is allowed. The obtained features are invariant to the Poincaré group of transformations: dilation, translation and rotation. The lexicon size used is 145 words, and more than 1,700 samples were used to evaluate the word recogniser's performance. Two different cases were applied to assess the performance of the system. In the first case, samples were rendered in four different fonts: Andalus, Simplified Arabic, Thuluth, and Traditional Arabic, with font sizes ranging between 18 and 48 pt, angles ranging from 0 to , and translation of up to twice the sampled word size. The font sizes, angles and translation values were generated randomly. In the second case, two fonts (Thuluth and Traditional Arabic) were replaced with two handwritten scripts. The experiments were performed using 44% of the data set to train the word models, while the rest was used for assessing the recogniser. Each word model was trained using randomly selected observation sequences from the four fonts. The recognition rates for the first case are not reported. The highest word recognition rates (WRR) achieved in the second case for Simplified Arabic and Traditional Arabic are 90% as the first choice and 98% within the top ten choices.
In later work, Khorsheed reported an HMM-based mono-font off-line recognition system for computer-generated cursive Arabic text lines [35]. The HMM classifier is built on HTK. A segmentation-free sliding-window technique is used on the text line. The features extracted from the text are statistical, representing the intensity, the intensity of the horizontal derivative, and the intensity of the vertical derivative. The system is open-vocabulary, depending on character models and grammars. Two recognisers were designed with two different model schemes: mono-models and tri-models. In the mono-model scheme, the different shapes of a single letter are represented by distinct HMMs, for a total of 60 models. In the tri-model scheme there are 9,393 models, each HMM representing a combination of three letters, and it is context-dependent. In all cases each HMM, regardless of font type, has eight states per model, and the codebook includes 128 clusters. The state transitions of the HMM are not stated. The recogniser's performance was evaluated using a typewritten Arabic text database (font sizes not specified) in six different fonts: Andalus, Naskh, Simplified Arabic, Tahoma, Thuluth, and Traditional Arabic. The database contains more than 15,000 line images. The performance was assessed for each font with 1,500 line images for training and 1,000 line images for testing. The line-image height is normalised to 60 pixels to eliminate font type and size dependency. The experiments were conducted separately for each font. Three different types of features were used: intensity, intensity of the horizontal derivative and intensity of the vertical derivative. The highest average recognition rates (RRs) were measured for a cell size (window width) of 3 x 3 with 1-pixel vertically overlapping windows. The highest achieved results were 88.7% and 92.4% for the Andalus font in the mono-model and tri-model respectively. Another set of experiments was carried out on a multi-font setup where training data were selected randomly, with the same sample size from all fonts, using a fixed test set of 200 lines from each font. The reported results did not discuss per-font results; it is simply mentioned that the system shows a flat performance of ≈95% when using the tri-model.
In Al-Muhtaseb et al. [36] a system is implemented with similar techniques to [35]: built on the HTK Toolkit, segmentation-free, sliding-window feature extraction, text-line unit and mono-font. The text-line images are all normalised to a height of 80 pixels. From each vertical strip, 16 features were extracted, all representing one type of feature (sums of black pixels). The HMM topology is the same for all fonts: transitions are allowed only to the current, the next, and the following states. However, the number of states and the codebook size are font-dependent; the HMM can have a five- or seven-state left-to-right topology and a codebook size of 128 or 256. Each character shape is considered a separate class, for a total of 126 classes per font. To evaluate the recogniser, a text-line database was implemented in eight fonts (Akhbar, Andalus, Arial, Naskh, Simplified Arabic, Tahoma, Thuluth, and Traditional Arabic), each font consisting of 2,766 lines of text. It appears that synthetic images were used in this database. In the training phase the first 2,500 lines were used, and for the testing phase the remaining 266 lines were used. To assure the presence of a sufficient number of all shapes of Arabic letters, the training sets were augmented with 5 copies of the 3 lines of the minimal Arabic script database [37]. With a unique code for each shape of each character, the classifier achieves its highest accuracy for the Arial font (99.85%), while Andalus registers the lowest accuracy (96.83%). Accuracy improves for all fonts when, after recognition, different character shapes are combined into one code; the highest is Arial (99.90%) and the lowest is Andalus (97.86%).
Slimane et al. [38] also applied an HMM recogniser with a sliding window to the recognition of Arabic printed word images. This work focuses on screen-based OCR. The recogniser is constructed using the HTK Toolkit and is characterised as word-unit, segmentation-free, multi-font and open-vocabulary. Two techniques are proposed: a global multi-font system (font-independent) and a cascading system (a font recogniser followed by a mono-font word recogniser). The word images are normalised in grey level to a height of 45 pixels, and then, with an 8-pixel window width, feature vectors of 102 components are computed. Concerning the HMM topology, there are 64 models of equal length (five states, with transitions not defined), each representing a group of similar character shapes. These similar character shapes are grouped according to predefined rules. At recognition time, all sub-models are used to compose an ergodic HMM, allowing potentially any word to be recognised in an open-vocabulary manner. To evaluate the system, some parts of the APTI database were used [8]. The APTI database consists of word images synthetically generated in low resolution (72 dpi). It is a freely available large-scale benchmark database suited for the evaluation of screen-based OCR systems. Word images are generated using 10 different fonts: AdvertisingBold, Andalus, Arabic Transparent, DecoType Naskh, DecoType Thuluth, Diwani Letter, M Unicode Sara, Simplified Arabic, Tahoma, and Traditional Arabic. The 10 fonts were selected in size 24 pt. Set-1 (18,897 words for each font) was used for training, and the unseen words of set-5 (18,868 words for each font) for testing. The highest character recognition rate is 99.20%, for the Andalus font using the mono-font recogniser, while the global multi-font system attains 98.60% for the Arabic Transparent font. For the word recognition rate, the highest performance is 94.20% for the Andalus font using the cascading-system mono-font recogniser, and 87.20% for the Arabic Transparent font using the global multi-font system.
2.6 The Re-Ranking Post-Processing Techniques
A text recogniser takes a text image and a lexicon as inputs, and its
output is the word hypothesis within the lexicon that best matches the
word image. In some cases, the output is an ordered lattice of the n
best-matching words in the lexicon. Among these hypotheses, the correct
word is not always placed at the first position, nor within any
selected first n positions of the ranked list. The recogniser can
employ a post-processing stage to improve accuracy; this can be
achieved by incorporating context and shape information at that stage.
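As an illustration of this idea, the following sketch re-ranks an N-best list by interpolating each recogniser score with a contextual language-model score. The words, scores, and weighting below are hypothetical placeholders, not taken from any specific published system:

```python
def rerank_nbest(nbest, lm_logprob, alpha=0.7):
    """Re-rank an N-best list of (word, recogniser_log_score) pairs by
    interpolating the recogniser score with a language-model score.
    `lm_logprob` maps a candidate word to its contextual log-probability;
    `alpha` weights the recogniser against the language model."""
    rescored = [(w, alpha * s + (1 - alpha) * lm_logprob[w]) for w, s in nbest]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

# Toy example: the recogniser slightly prefers the first candidate,
# but the surrounding context strongly favours the second.
nbest = [("قال", -1.0), ("قام", -1.2), ("قلم", -3.0)]
lm = {"قال": -4.0, "قام": -0.5, "قلم": -5.0}
ranked = rerank_nbest(nbest, lm)
```

After re-ranking, the contextually likelier candidate moves to the top of the list even though its shape score alone was lower.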
Using linguistic knowledge, such as co-occurrence patterns of
adjacent words, syntax, and semantics, is still in its early stages in
Arabic text recognition, and much more research effort is needed. The
successful application of these linguistic models in support of the
recognition process, especially for English text, demonstrates their
effectiveness. In the case of Arabic OCR, only a few publications
investigate recognition rate improvement at the post-processing stage.
Most of these approaches are based on combining different systems or
classifiers. To the author's best knowledge, only a small number of
studies investigate the use of contextual information to re-rank the
list of candidates.
AlKhateeb et al. [39] use structure-like features (the number of
connected regions: sub-words and diacritical marks) to re-rank the
top-10 results produced by a handwriting recognition system. They
implement a Hidden Markov Model (HMM) classifier using the HTK Toolkit,
evaluated on the IFN/ENIT database of handwritten Arabic words (Tunisian
town/village names). The top-10 candidates are re-ranked according to
the values of their refined probability, a function of the associated
probability (obtained from the recogniser) and a Gaussian-like function,
which in turn is a function of the structure-like features. In terms of
top-1 recognition rate, the highest improvement in accuracy is 2.51%,
using database version v1.0p2 with set d (6,735 words) for testing,
where the recognition rate is enhanced from 86.73% to 89.24%.
Prasad et al. [40] applied n-gram language models (LMs) to re-score
an n-best list. The kernel of the system is a glyph-based HMM
classifier. The data set is from the DARPA Arabic Machine Print (DAMP)
document corpus, collected by SAIC from books, magazines, newspapers,
etc. The corpus comprises 297 scanned text images used for developing,
training, and testing the OCR system (60 for development, 60 for
testing, and 177 for training), in addition to another 380 synthetically
generated images of newswire text. Three types of LM are used to improve
the recognition rate: character, word, and PAW. The LMs are created from
2.6 million Arabic words of newswire data, in addition to transcriptions
of the training set images. The lexicon sizes for the character, word,
and PAW n-grams are 162 characters, 65K words, and 9K PAWs,
respectively. The performance achieved by rescoring the N-best list (N
not specified) using the LMs is reported in terms of word error rate
(WER). The lowest WER, 10.1%, is achieved using the PAW trigram,
compared with 11.5% obtained by the character 5-gram and 15.9% for the
word trigram. The improvement over the system without LMs is not
specified.
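N-gram rescoring of this kind (and the phrase-perplexity re-ranking pursued later in this thesis) can be illustrated with a minimal add-one-smoothed bigram model. The corpus and candidate phrases below are toy placeholders, not data from [40]:

```python
import math
from collections import Counter

def bigram_perplexity(words, bigrams, unigrams, vocab_size):
    """Perplexity of a word sequence under an add-one-smoothed bigram
    model: P(w2|w1) = (count(w1, w2) + 1) / (count(w1) + V)."""
    log_prob = sum(
        math.log((bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size))
        for w1, w2 in zip(words, words[1:]))
    return math.exp(-log_prob / (len(words) - 1))

# Train the counts on a toy corpus, then compare two candidate phrases:
corpus = "a b a b a c".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)
ppl_seen = bigram_perplexity(["a", "b"], bigrams, unigrams, V)  # frequent bigram
ppl_rare = bigram_perplexity(["a", "c"], bigrams, unigrams, V)  # rarer bigram
```

A lower perplexity indicates a candidate phrase that the language model considers more plausible, which is the signal exploited when rescoring an N-best list.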
Devlin et al. [41] proposed an alternative approach, where
statistical machine translation (SMT) and an n-gram LM are used to
re-rank OCR-generated sentences. The HMM-based OCR system is an Arabic
handwritten character recogniser. The baseline system is trained on 2.3
million words of transcribed field text, such as legal filings. The
testing set comprises 1,397 sentences (22,261 words). The
Arabic-to-English SMT system (a hierarchical system) is trained on 45
million words of news data from an Arabic-to-English parallel corpus at
the Linguistic Data Consortium (LDC). The SMT uses an English n-gram LM
(n not specified) to re-score its outputs; this English LM is trained on
5 billion words from the LDC English GigaWord corpus. It should be noted
that there is a significant mismatch between the domains used to train
the SMT and OCR systems. The Arabic 5-gram LM is trained on three Arabic
corpora: GigaWord (500 million words), the in-domain baseline training
text (2.3 million words), and the Arabic sets used to train the SMT (45
million words). The task of the SMT and the LMs is to minimise the WER
by re-ranking the 20-best list produced by the baseline system. The
baseline system WER (24.88%) is improved to 24.21% and 24.25% using the
Arabic 5-gram and the SMT-LM, respectively. The best WER, 23.81%,
results when the SMT-LM is used in combination with the Arabic 5-gram
LM.
Al-Hajj et al. [42] introduced a re-ranking scheme via the fusion of
three homogeneous HMM-based classifiers. The classifiers have the same
topology and differ only in the orientation angle of the sliding window
(−θ, 0°, and +θ). The classifier outputs, lists of word candidates with
their scores, are fused at the decision level. Three decision makers are
employed: the sum rule, the majority vote rule, and a combining
classifier based on a neural network, namely a Multilayer Perceptron
(MLP). The MLP is trained to select a single HMM classifier and its
Top-1 candidate. The three combination strategies consider the Top-10
candidate words of each recogniser to produce a rescored word list.
Experiments were carried out on the IFN/ENIT database, with sets (a, b,
and c) for training and set (d) for testing. The reference classifier
(0°) is more accurate than the classifiers using oriented frames
(θ = ±20°). In general, the proposed combination schemes improved the
accuracy. The highest improvement at the Top-1 choice is achieved using
the MLP, where the reference classifier's Top-1 recognition rate
increases from 87.60% to 90.96%. The sum rule outperformed the other
two schemes at Top-2 and Top-3, with 94.89% and 95.87%, respectively.
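The sum and majority-vote rules used here are standard fixed combination rules. A minimal sketch follows, with toy candidate words and scores rather than the IFN/ENIT setup:

```python
from collections import Counter, defaultdict

def sum_rule(outputs):
    """Combine candidate lists from several classifiers by summing each
    word's scores; `outputs` is a list of {word: score} dicts, one per
    classifier. Returns candidates ranked by total score."""
    totals = defaultdict(float)
    for scores in outputs:
        for word, s in scores.items():
            totals[word] += s
    return sorted(totals.items(), key=lambda t: t[1], reverse=True)

def majority_vote(outputs):
    """Each classifier votes for its own Top-1 candidate; the candidate
    with the most votes wins."""
    votes = Counter(max(scores, key=scores.get) for scores in outputs)
    return votes.most_common(1)[0][0]

# Three classifiers scoring overlapping Top-n candidates:
outputs = [{"w1": 0.9, "w2": 0.1},
           {"w2": 0.6, "w1": 0.4},
           {"w1": 0.7, "w3": 0.3}]
```

Here the sum rule ranks "w1" first (total 2.0), and the majority vote agrees, since two of the three classifiers place it at Top-1.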
Farah et al. [43] applied a syntactical analyser at the
post-classification stage to select the best-suited word from the set
of candidates, i.e. the word that produces a syntactically correct
sentence. The syntactic analyser uses a grammar describing Arabic legal
amounts. The proposed system recognises handwritten Arabic legal
amounts on cheques. It is composed of three classifiers, namely an MLP
neural network, k-nearest neighbour (k-NN), and fuzzy k-NN, combined in
a parallel scheme at the combination stage. They take holistic
structural word features as their input and produce a top-3 ranked word
list as their output. The experiments were carried out on a database of
Arabic literal amounts built by the authors: 4,800 handwritten words
representing the 48 words of the lexicon, written by 100 different
writers. The three classifiers were trained on 1,200 words and tested
on 3,600 words; the testing set (48 words x 75 writers) was written by
writers not seen in the training phase. The classifier results are
combined to produce a list of up to nine ranked candidate words. The
combination schemes are statistical decision systems, including the
maximum, minimum, average, product, and sum rules. The fuzzy k-NN
classifier achieved the highest individual recognition result, 92.16%.
The combination techniques improve the recognition rate; the highest
combined recognition rate, 94%, is achieved by the sum-rule scheme.
Recognition accuracy is further improved to 96% when the syntactic
information is integrated at the end of the post-processing stage.
2.7 Summary
A typical pattern recognition system takes the normalised input
data, extracts the features, trains the classifier, evaluates the test
pattern, and finally improves recognition accuracy based on linguistic or
statistical data.
The background investigation illustrated many important facts
about the status of research in Arabic text recognition. Arabic printed
text has more or less been abandoned in favour of handwritten text. The
analytical approach, especially sliding windows, is widely used for
word recognition in preference to the holistic one. The database is an
integral part of any recogniser and must satisfy its requirements;
however, there are no benchmark text databases for testing the
performance of systems developed for Arabic text recognition.
Exploiting contextual information is effective for re-ranking the list
of candidates. Further research can be directed toward integrating
context-dependent word correction for Arabic, as these studies are
still at an early stage of development.
This study is motivated by these facts and aims to design a
recogniser that employs a holistic approach for word recognition and
applies passage-level post-processing for context-dependent word
correction (re-ranking the candidate lattice). Unfortunately, none of
the available databases satisfies the requirements of this project.
This raises the need to build a database representative of the
recognition task, composed of word samples derived from a corpus.
CHAPTER 3
ARABIC WORDS DATABASE PREPARATION
3.1 Introduction
The database provides very important infrastructure for the
development and comparison of different recognition systems for
handwritten or machine-printed scripts. Access to a word-images
database and statistical language models (SLMs) is essential during
system development. Hence, the database should meet the needs of the
method proposed in this study.
The goal of this chapter is to present a methodology for building a
database of written words/sentences for research in Arabic text
recognition in general, and machine-printed (multi-font) Arabic script
recognition in particular. Word segmentation is a challenge in Arabic
text because of the semi-cursive nature of Arabic script (PAWs), and
because word-shape length is font dependent. To provide efficiency,
accuracy, timeliness, security, and economy, a fully computerised
system has been developed for automatic form creation, word
segmentation, and ground truth generation.
3.2 Considerations in Building OCR Text Database
Different databases exist for different OCR systems. Each is
related to some application, e.g. handwritten numeral and literal
amounts for bank cheque applications, handwritten town names for postal
address reading and sorting systems, and low-resolution screen-rendered
text for reading text from screen images. The nature of the classifier
creates conditions that influence the design of the database.
Accordingly, there are a number of points to be considered when
designing an OCR system, which in turn influence the database
implementation. The most important points are:
(1) High-level knowledge integration: the integration and contribution
of linguistic knowledge to improve recognition throughput and
accuracy. This requires a corpus as the foundation of the database
rather than text collected from "random" sources
(2) Major unit specification: character, word, sentence, script, or
text line
(3) Style variation: different fonts might be used (machine-printed
and synthetic text), or many individual writing styles can occur
(handwritten text)
(4) Font size and style: font point size range and style (normal,
bold, italic, bold italic)
(5) Writing style: handwritten, machine-printed, or synthetic text
(6) Printing quality (pixels/inch)
(7) Scanning intensity: binary-scale, grey-scale, or colour (RGB space)
(8) Image resolution: DPI (dots per inch)
(9) Noise handling ability: the scanned image can have noise due to
various reasons
In this research context, some constraints should be satisfied in order
to build the word-images database:
• An authentic text corpus comprising a data set that captures the
semantics and syntax of the script
• Two types of databases related to the corpus, with enough samples
for training and testing a recogniser that uses holistic word
features:
o Real scanned word-images to evaluate the baseline recognition
system
o Rendered word-images representing full sentences to assess
the integrated system (baseline system integrated with
linguistic post-processing)
• Samples reflecting off-line, real-life Arabic typewriting
variability; printed in five font types, at 14 points in plain style
• Samples stored in binary scale using TIF format at a resolution
of 300 dpi
3.3 Related Works
Different databases for different OCR systems exist in different
languages, and they contribute to their related research. This section
presents an overview of database implementation techniques for several
scripts, including English, Chinese, Farsi, and Arabic.
Marti and Bunke [44] designed the IAM database: full English
handwritten sentences from the Lancaster-Oslo/Bergen (LOB) corpus,
distributed over 4,881 lines of text. The data collection form, as
depicted in Figure 3.1, is automatically generated. The form layout has
four parts separated by horizontal lines. The first part contains the
database title and a text code identifying its category, its number in
that category, and its starting sentence. The second part has a
machine-printed text which the writer should copy in the following
section. The last section is for the writer, who can voluntarily print
his name. The horizontal lines are used for document segmentation and
skew correction operations. To ease the image pre-processing, the
writer is asked to write on guide lines printed on another sheet of
paper placed under the form. The forms are scanned at 8-bit grey level
with a resolution of 300 dpi and saved in TIFF format. The aim of this
database is to be applied in segmentation-free recognition techniques
and in systems that use linguistic knowledge. For this reason, the
ground truth file contains the labels of the printed and handwritten
lines of text in ASCII format.
Figure 3.1: The IAM database filled form [44]
The HIT-MW sentence database for the Chinese language was implemented
by Su et al. [45]. It is built to suit segmentation-free recognition
systems and techniques that apply statistical language models, and is
based on the China Daily corpus. As can be seen from Figure 3.2, the
collection form layout is divided into three sections: writing
guidelines, typewritten text, and handwriting. The upper two sections
are separated by a horizontal line, while the third section is bounded
by a rectangular box. The horizontal lines help remedy document and
writing skew, since there are no writing rulers. Each form has a
four-pair-digit code identifier; the first two pairs show the
publishing year and month, and the other two pairs stand for the text
location in the corpus. The forms are digitised at 300 dpi resolution
and saved as grey-scale BMP-format image files. The ground truth file
format is not specified. In addition to the handwritten text lines, the
ground truth file contains the corresponding text lines from the
typewritten section.
Figure 3.2: The HIT-MW database collection form layout [45]
In addition to these databases, there are many others in the
handwriting recognition domain inspired by the former databases, in
both form design and the sample acquisition procedure.
The FHT Farsi handwritten script database was implemented by
Ziaratban et al. [46]. Texts are sampled from a corpus in different
categories, similar to the IAM database. The form layout and its
details are illustrated in Figure 3.3. All filled form pages were
scanned at 300 dpi in 256 grey levels. The GT files cover the entire
handwritten texts.
Mahmoud et al. [47] developed the AHTD database, which contains text
selected from an Arabic corpus on different topics. The forms are
scanned in grey scale at different resolutions: 200 dpi, 300 dpi, and
600 dpi. The page, paragraph, and line levels have ground truth files
in database format.
Another Arabic text images database written by multiple writers
(AHTID/MW) has been developed by Mezghani et al. [48]. It comprises
Arabic words and text lines written by different writers. Each text
line image and word image has an XML ground truth file describing its
contents, such as the sequence of PAWs and characters.
Figure 3.3: The FHT database form layout details [46]
A brief summary of related databases for different languages in the
period 1993 to 2008 can be found in [46], and Parvez and Mahmoud [7]
review the available Arabic text databases.
3.4 Corpus Acquisition
The utilization of contextual knowledge is one strategy for successful
text recognition. Since this work is directed toward this goal, it is
important to have an OCR-oriented database that contains full images of
Arabic sentences in addition to SLMs. The first phase of database and
SLM development is defining an electronically available authentic text
corpus. After a comprehensive investigation, the text corpus was
collected from The Holy Qur'an, which is available from the Tanzil
Quran project² in the public domain. It provides highly verified UTF-8
format Quran text written in Classical Arabic, and allows downloading
the Quranic text in various formats, including Simple and Uthmani
scripts with different numbers of diacritics and symbols.
It should be noted that linguists and grammarians use The Holy
Qur'an as a reference because it contains pure Arabic vocabulary as
well as Arabic grammatical rules [49] [50]. The Holy Qur'an therefore
forms a robust NLP basis for researchers in the Arabic language in
general, and in support of OCR in particular.
Given the importance of the Quran, the Quranic Arabic Corpus³, an
online linguistic resource organised by the University of Leeds,
provides grammar, syntax, and morphology for each word in the Holy
Quran [51]. This corpus project is part of the Arabic language
computing research group within the School of Computing at the
University of Leeds. In terms of data, an authentic copy of the Arabic
Qur'an text from the Tanzil project is used. Figure 3.4 below shows a
dependency graph that describes the syntax of verse (67:1). This set of
provided analyses can be used to facilitate and/or improve the OCR
process.
² http://tanzil.net/download/
³ http://corpus.quran.com/
Figure 3.4: Dependency graph for verse (67:1) [51]
3.5 The Holy Qur'an Arabic
The Holy Qur'an is the religious book of Islam, written in Quranic
Arabic dating from the seventh century, which forms a unique genre and
represents the purest and most authentic form of the classical Arabic
language. The version of The Holy Qur'an used here follows the Hafs
transmitted version (riwayah). The Holy Qur'an is organised in a
hierarchical model as follows: Document, Chapter/Surah, Verse/Ayah, and
Word/Kalimah. In addition to this hierarchy, there are pause symbols
indicating the ends of complete meaningful sentences. Counting the
starting Basmalah of 112 Surahs as verses, and the functional words as
well as the letters of the introductory verses of some chapters ( ) as
words, The Holy Qur'an comprises 114 Surahs, 6,348 Ayat, and 78,245
Kalimat; each Ayah can have from 1 to 129 words. It is considered a
closed corpus containing 14,870 unique words.
3.6 Database Implementation Methodology
This section presents the implementation details of a database of
isolated machine-written Arabic words for text recognition research.
The work requirements demand creating a word-images database related to
the task: isolated words printed in different fonts, with samples that
meet the constraints stated above (section 3.2). Moreover, the required
attributes of the database are as follows:
• Sample names are coded according to pre-defined rules
representing their attributes (character set group, font
attributes, position in the corpus, etc.)
• Word images are stored together with their ground-truth
information
• New fonts and font attributes can be added according to the
pre-defined rules
Once the collection form has been designed, as described in the next
subsection, the database implementation approach is summarised in the
following steps:
• Define the desired set of words from the corpus and the frequency
distribution of each
• Automatically generate Microsoft Word documents containing filled
forms; each document contains one form filled with words typed in
the same font type, size, and style. Each document is provided
with two reference files in text format for the segmentation
process: one for page groups and the other for the words on each
page
• Print out the Word documents at a predefined printing quality
(pixels/inch)
• Scan the printed forms at the desired dpi resolution
• Save the scanned images in binary scale using TIF format. Every
document image is saved in a file whose name is structured
according to a predefined format
• Eliminate edge distortions in the document image
• Detect and correct document image skew using the Hough transform
technique
• Segment the document, exploiting the vertex pixel coordinates of
the text section, lines, and words
• Tidily segment each word image (the whole word body shape) by
applying a smearing strategy using the Run-Length Smoothing
Algorithm
• Save the word binary images in their corresponding subset folders
at a resolution of 300 DPI in TIFF-format files, with file names
following a predefined format
• Generate ground truth annotations in XML file format for each
word image and subset folder
To build an OCR text database, it is necessary to fill, print, and save
a series of document images continuously. Therefore, a batch-mode
command is needed at every stage or group of steps for practical use.
The modules of this system are implemented using the MATLAB
(R2009a/64-bit) Image Processing Toolbox, as well as programming in
Delphi 7 and Microsoft Visual Basic 2010. It should be noted that in
this study it was chosen to use binary word images without applying any
pre-processing.
3.6.1 Data collection form
In reality, designing the form layout is a nontrivial task, since the
layout must be simple enough for automatic word segmentation and
labelling. Due to the difficulties of word segmentation, the form
layout is designed in a way that simplifies line and word segmentation,
which has to be computerised.
The structure and foundation of the collection form are inspired by
the ideas presented in section 3.3. The collection form layout
illustrated in Figure 3.5 emerged after several trials. The form is an
A4-size white plain paper divided into three sections separated by
horizontal lines: header, text section, and footer. The header and
footer sections are for documentation purposes. The separating lines
between sections are dark black lines, so that they can also be
employed as horizontal references. The header includes the database
title "Qur'an-MP Database" (MP denotes Machine Printed text) and a
code-number used for documentation and the automation process.
Figure 3.5: Form layout
Each form has a unique code-number with three zones separated by
dashes (e.g. 05140-07-004). This code-number comprises, from left to
right: a first zone of one data-type digit (Table 3.1) followed by four
digits of font information, namely the font type in the first digit
(Table 3.2), the font size in the next two digits, and the font style
in the last digit (Table 3.3). For example, the code 03140 refers to
real scanned data, printed in font type (3), Simplified Arabic, at a
size of 14 points in plain style (0). The second zone is a two-digit
group number, while the last zone is a three-digit page index within
the group.
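For illustration, the fixed-width code-number can be decoded as follows. This is a small sketch, not part of the database software; the function and field names are this author's illustration, with the digit values mapping onto Tables 3.1-3.3:

```python
def parse_code_number(code):
    """Split a form code-number such as '03140-07-004' into its fields.
    Zone 1 holds the data type (1 digit), font type (1), font size (2),
    and font style (1); zone 2 is the group number; zone 3 is the page
    index within the group."""
    zone1, group, page = code.split("-")
    return {
        "data_type": int(zone1[0]),    # Table 3.1 (0 = real scanned)
        "font_type": int(zone1[1]),    # Table 3.2 (3 = Simplified Arabic)
        "font_size": int(zone1[2:4]),  # point size
        "font_style": int(zone1[4]),   # Table 3.3 (0 = plain)
        "group_number": int(group),
        "page_index": int(page),
    }

fields = parse_code_number("03140-07-004")
```

Applied to the worked example in the text, the code 03140-07-004 decodes to real scanned data, font type 3, size 14 pt, plain style, group 7, page 4.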
Table 3.1: Data types
Table 3.2: Used Arabic fonts
53
Table 3.3: Font Style
To simplify the segmentation operations, the text section is organised
in a table format, which helps line and word segmentation based on
pixel location. The imaginary text table consists of five columns and
twelve lines. Bearing in mind that word shape is font dependent, the
table cells are sized to accommodate the largest word shape in the
corpus with minimal segmentation error. The text section can include up
to 60 words. The first word of the printed text is located at the top
right of the text section, following the reading and writing direction
of the Arabic language. The font information in the footer section
includes the font name, size, style, and printing date.
3.6.2 Data Preparation and Acquiring
Automatic generation of the filled forms is the first stage in the
process. A generated Microsoft Word document contains a structured form
filled with text from the corpus. Each Word document is identified by
its form code-number prefixed by the font name, separated from the
first zone by a dash. During the generation of the Word documents,
another two text files are created: the group and page reference list
files used for automatic word segmentation and labelling. Figure 3.6
shows samples of the structure of these two files. The group reference
file includes a list of all forms generated on the same occasion. This
list comprises the form names (the form code-number prefixed by the
database name "Quranic"), followed by their numbers of lines and words,
respectively. The form names are given to the scanned forms acquired by
a scanner, since the scanner assigns its own sequential file names. The
page reference file includes a list of word-codes indicating their
positions in the corpus. The word-codes are ordered according to their
occurrence in the forms, starting from the top right of the first form
to the bottom left of the last form.
Figure 3.6: Group and Page reference files
Form creation passes through three stages: selection of words or
sentences, document generation, and document-to-image conversion. The
following steps apply to both types of documents, for training and
testing purposes. Firstly, define the number of groups, which refers to
the frequency distribution of each word or sentence, and then select
the words or sentences to be included. For both purposes, the words are
scattered over the table in the form text section, because most scanned
documents can contain noise anywhere. The noise arises from inefficient
writing instruments, writing surfaces, or the scanning machine. The
binary images may have noise, including touching edges, broken strokes,
filled loops, salt-and-pepper noise, etc. The scattering process is
accompanied by the creation of the group and page reference text files.
For synthetic images, the documents are converted to images using a
dynamic programme that simulates the printing and scanning operations.
In both cases, scanned or synthetic, images are saved as binary (black
and white) TIFF-format files with a resolution of 300 DPI. To save
storage and time, binarisation of the scanned image is done with the
scanner's hardware threshold. The real scanned forms undergo two
primary stages of verification: the quality of printing and of
scanning. The final product is a scanned form named as stated above.
3.6.3 Pre-processing and Word Segmentation
Pre-processing covers all those functions carried out to prepare the
original image for the later recognition stages. For off-line systems,
pre-processing functions include binarisation, noise filtering, and
skew detection and correction. These situations and others make it
difficult to analyse and process document images. Here, the aim of the
pre-processing is mainly to eliminate distortions at the edges as a
first step, followed by skew detection and correction. Finally, the
document image is segmented into various zones: sections, text lines,
and words.
3.6.3.1 Skew Detection and Correction
In practice, scanned documents can contain a number of unavoidable
and crucial problems; they can be noisy, skewed, or deformed. In this
context, the presence of skew in scanned document images is a very
common problem. The document image is skewed if it is not fed straight
into the scanner, whether manually or automatically. The existence of a
few degrees of skew, within about three degrees, is unavoidable if the
document is fed by a human operator [52]. Automatic feeders may cause
the document to rotate by up to 20 degrees [52]. The skew of a document
image is called "global skew" when all text lines have the same
orientation, deviating from the true horizontal x-axis. Consequently,
correcting the skew, i.e. orienting the text lines to be horizontal, is
an important pre-processing step because it affects the efficiency of
subsequent processing stages, such as segmentation and classification.
Skew correction is generally carried out by calculating the skew angle
"θ" of the raw image and rotating it by "θ" in the opposite direction.
A number of methods have been proposed for skew detection, such as
projection profiles, cross-correlation, and the Hough transform. In
this work, the Hough transform, the most widely used approach, is the
adopted method for determining the skew angle.
The Hough transform technique detects lines, circles, and other
structures whenever their parametric equation is known. In the present
context, it is used to detect straight lines for skew angle
determination. The equation of any straight line in Cartesian space is:
Equation 3.1: Straight line equation
y = m·x + c
The polar (also called normal) representation of a straight line is:
Equation 3.2: Straight line polar equation
ρ = x·cos θ + y·sin θ
where ρ (rho) is the perpendicular distance of the line from the
origin, and θ (theta) is the angle of that perpendicular from the
horizontal; Figure 3.7 illustrates this concept. In the image analysis
context, the Hough transform maps each point in the Cartesian image
space (x, y) to the set of all straight lines going through that point
in the (ρ, θ) Hough space.
Figure 3.7: The representation of a line in the (x, y) space using (ρ, θ)
In general, to estimate the skew angle, the Hough transform is
applied to the image, and the longest straight line gives the most
accurate skew angle. The skew angle of the whole document image is
calculated from the slope of one of the two separating lines. Line
detection in a binary image can be summarised as follows:
1. Segment the area enclosing the pixels of one of the candidate
separating lines, reducing the input data to be processed for low
computational complexity
2. Apply an edge detection method to find all the edge points in
the segmented area; in this work, the Canny method is applied to
detect the boundaries of features within the image
3. Perform the Hough transform on the detected edges for line
detection; it maps all the data points in the image (x, y) into
the Hough space (ρ, θ)
4. The mapping result is an accumulator array whose element A(ρ, θ)
represents the number of points lying on the corresponding line
in the x-y plane
5. Detect the best line candidates as local maxima in the
accumulator cell array; the longest digital straight line is
found, and its angle "θ" is considered the actual skew angle
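The steps above can be sketched in compact form. The following illustration (not the thesis implementation, which uses MATLAB) votes over a discretised (ρ, θ) space and returns the strongest line's angle; for a near-horizontal separating line, the deviation of θ from 90° is the skew:

```python
import numpy as np

def estimate_line_angle(points, theta_range=(70.0, 110.0),
                        theta_step=0.5, rho_step=1.0):
    """Return the angle theta (degrees) of the strongest straight line
    through `points` (an N x 2 array of x, y edge coordinates), using
    the normal form rho = x*cos(theta) + y*sin(theta). For a
    near-horizontal line, skew = theta - 90 degrees."""
    thetas = np.deg2rad(np.arange(theta_range[0],
                                  theta_range[1] + theta_step, theta_step))
    # rho for every (point, theta) pair
    rhos = (np.outer(points[:, 0], np.cos(thetas)) +
            np.outer(points[:, 1], np.sin(thetas)))
    bins = np.arange(rhos.min(), rhos.max() + rho_step, rho_step)
    # For each theta, the rho-histogram peak counts the most colinear points
    peak_votes = [np.histogram(rhos[:, j], bins=bins)[0].max()
                  for j in range(len(thetas))]
    return np.rad2deg(thetas[int(np.argmax(peak_votes))])

# Synthetic separating line with a 2-degree skew from the horizontal:
xs = np.arange(0, 200, dtype=float)
ys = np.tan(np.deg2rad(2.0)) * xs
theta = estimate_line_angle(np.stack([xs, ys], axis=1))
skew = theta - 90.0
```

Once the skew angle is known, the document image is rotated by the same angle in the opposite direction, as described above.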
In addition to the skew problem, the scanner sometimes introduces
distortions at the edges, such as bounding boxes or lines. Before
performing skew angle detection, these edge distortions should be
eliminated. This can be done by pruning the image, specifying a crop
rectangle around the image with suitable margins that can be defined
experimentally.
3.6.3.2 Word Segmentation
The task subsequent to document-image skew correction is word
segmentation. The imaginary text table plays a very important role at
all levels of the segmentation process, providing the vertex pixel
coordinates of the text section, lines, and words. Accordingly, at each
level in the hierarchy, the crop rectangle around each component is
defined and the component is then clipped. The clipped text section is
depicted in Figure 3.8, while Figure 3.9 shows a segmented line. Figure
3.10 illustrates a word image, the prerequisite for the next step,
where each block should contain only one word.
Figure 3.8: Text section
Figure 3.9: Line segment
Figure 3.10: Segmented word using crop rectangle
The main objective of this stage is to find the "body" of the word in a
binary image. To achieve this, firstly, the original word image is
complemented to obtain a binary image with the writing stroke in white
pixels and the background in black pixels. Secondly, the picture is
"smeared" horizontally using the Run-Length Smoothing Algorithm (RLSA),
applied row by row to the image document. The principle of this
smoothing algorithm is to merge runs of consecutive white pixels
(represented by 1s) along the horizontal direction: i.e. the black space
(represented by 0s) between two white runs is filled with white pixels if
its length is within a predefined threshold.
Thirdly, the region boundaries of the largest white area in the
smeared image are found. Finally, the word body is cropped in the original
binary image based on the vertices of the smallest rectangle containing
the region. Figure 3.11 displays the "body" of the word image as the
result of the tidy segmentation.
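As a concrete illustration, a minimal row-wise RLSA can be written as follows. This is a NumPy sketch; the thesis implementation is in MATLAB, and the threshold value used there is determined separately:

```python
import numpy as np

def rlsa_horizontal(img, threshold):
    """Run-Length Smoothing Algorithm, applied row by row: any run of 0s
    (background) of length <= threshold that lies between two 1s
    (foreground) is filled with 1s."""
    out = img.copy()
    for row in out:                        # each row is a view into `out`
        ones = np.nonzero(row)[0]          # column indices of foreground pixels
        for a, b in zip(ones[:-1], ones[1:]):
            if 1 < b - a <= threshold + 1:  # gap of (b - a - 1) background pixels
                row[a:b] = 1
    return out

row = np.array([[1, 0, 0, 1, 0, 0, 0, 0, 1]])
print(rlsa_horizontal(row, threshold=2).tolist())  # → [[1, 1, 1, 1, 0, 0, 0, 0, 1]]
```

The gap of two background pixels is filled because it is within the threshold, while the gap of four pixels is left unchanged.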
Figure 3.11: Tidy segmentation of the whole word body
The tidy whole-word binary images are saved at a resolution of 300
DPI in TIFF-format files. During all processing stages, images are verified
at section, line, and word levels to ensure that there are no errors in the
segmentation process. This verification is crucial, because any random
noise or smearing may make word segmentation imperfect. In the event of
imperfect tidy word segmentation caused by background noise, the noise
is eliminated manually and the Run-Length Smoothing Algorithm is
reapplied.
The image name encodes the image's attributes and is formatted, from
left to right, as follows: a one-letter subset name, a data-type digit, a
font-information part, and the word position in the corpus. For example,
the image file name a05140-01-002-164-006 indicates that the word image
is from subset (a), scanned, written in font number 5 at size 14 pt in
plain style, related to group number 1, and located in the corpus at
chapter (2), verse (164), word (6). The word images are classified into
group subsets according to document type and font name.
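The naming scheme can be decoded mechanically. The sketch below is a hypothetical Python helper whose field layout is inferred solely from the single example above; in particular, the codings of the data-type and style digits are assumptions:

```python
def parse_image_name(name):
    """Split a word-image file name such as 'a05140-01-002-164-006' into
    its attribute fields. The field layout is inferred from the example in
    the text: subset letter, data-type digit, font number, font size,
    style digit, then group, chapter, verse, and word position."""
    head, group, chapter, verse, word = name.split("-")
    return {
        "subset": head[0],        # e.g. 'a'
        "data_type": head[1],     # '0' = scanned (assumed coding)
        "font": head[2],          # font number, e.g. '5'
        "size_pt": head[3:5],     # e.g. '14'
        "style": head[5],         # '0' = plain (assumed coding)
        "group": group,
        "chapter": chapter,
        "verse": verse,
        "word": word,
    }

attrs = parse_image_name("a05140-01-002-164-006")
print(attrs["font"], attrs["size_pt"], attrs["chapter"], attrs["verse"], attrs["word"])
# → 5 14 002 164 006
```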
3.6.4 Ground Truth
Ground truth (GT) refers to the set of information that describes the
attributes of each entry in the database. These ground-truth annotations
may include the number of words, PAWs, the character sequence, font
type, font size, and so on. The GT data plays a vital role in recognition
system development by providing the information about the written text
needed for text recognition. The availability of an electronic corpus
facilitates the automatic generation of GT files.
In this database, each word image is accompanied by GT data in XML
file format describing the image at word level. Figure 3.12 shows an
example of an XML file at word level. The following GT information is
available for each word image:
Database name Quranic_MP_Database
Lexicon word reference identifier
Arabic word
Number of PAWs
Number of letters
Word image file name
Corpus name "The Holy Qur'an"
Word identifier refers to its location in the corpus
Writing instrument
Binarisation
Resolution
Subset name
Font identifier
Font name
Font size
Font style
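Given an electronic corpus, such word-level GT records can be generated automatically. The following sketch shows one way this could be done with Python's ElementTree; the element names are illustrative and are not the database's actual XML schema:

```python
import xml.etree.ElementTree as ET

def word_gt_xml(info):
    """Serialise a word-level ground-truth record as an XML string.
    Element names here are illustrative placeholders, not the actual
    schema used by the Quranic_MP_Database GT files."""
    root = ET.Element("WordImage")
    for key, value in info.items():
        ET.SubElement(root, key).text = str(value)
    return ET.tostring(root, encoding="unicode")

xml = word_gt_xml({
    "DatabaseName": "Quranic_MP_Database",
    "ImageFileName": "a05140-01-002-164-006",
    "NumberOfPAWs": 2,
    "NumberOfLetters": 4,
    "FontName": "Tahoma",
    "FontSize": 14,
})
print(xml)
```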
Figure 3.12: GT in XML file format for the word 
Another GT data file is provided for each subset folder that contains
word-image samples. It comprises the main information about all words in
the subset and acts as a lookup table for fast searching by one of the
keywords: file name, word, or lexicon word-id. An example of an XML file
at subset level is given in Figure 3.13. The following GT information is
available at subset level:
Header information
o Database name "Quranic_MP_Database"
o Corpus name "The Holy Qur'an"
o Writing instrument
o Binarisation
o Resolution
o Subset name
o Font identifier
o Font name
o Font size
o Font style
For each word image
o Word image file name
o Arabic word
o Lexicon word reference identifier
The database files, images and GT, are stored in a subset directory
structure as depicted in Figure 3.14:
Figure 3.13: GT in XML file format for subset folder
Figure 3.14: Database directory structure
3.7 Summary
Database implementation is a nontrivial task; a database is not a
simple collection of text images. This chapter presented a complete
procedure for building a database that considers the word as the major
unit for text recognition. The procedure is used to build a machine-printed
text database in general, and for Arabic script in particular. The same
procedure can be applied to construct a handwritten-word database, with
some modifications to the collection form.
A number of points that influence database implementation were
discussed. The constraints that should be satisfied for the database to
fulfil the requirements of this study were described, and the
implementation methodologies of other related databases were reviewed.
The text database implementation methodology was described in
detail. The methodology presents an automated system to create the
database from a corpus; fully computerised systems provide efficiency,
accuracy, timeliness, security, and economy. The automated system starts
with form creation, followed by word segmentation and labelling, and then
ground-truth generation.
The form layout is designed in a way that simplifies line and word
segmentation based on pixel location. Forms are designed using Microsoft
Word and filled with words selected from The Holy Qur'an. The scanned
forms are pre-processed to eliminate distortions at the edges and to
detect and correct skew using the Hough transform approach. The
Run-Length Smoothing Algorithm is used to help crop the word body. File
names of the tidy word binary images follow a special format that encodes
the word attributes. Finally, ground-truth files in XML format are supplied
for each entry and subset folder in the database.
Note that all algorithms were developed using a variety of languages:
MATLAB (R2009a/64-bit), Delphi 7, and Microsoft Visual Basic 2010.
CHAPTER 4
HMM/DCT HOLISTIC WHOLE WORD
RECOGNISER
4.1 Introduction
Although other languages, such as Persian and Urdu, use Arabic
letters, Arabic character recognition has not reached the same level of
maturity as that of other languages, especially English. This is attributed
to a number of issues: a lack of fundamental interaction between
researchers in this field as well as a deficiency of supporting
infrastructure, including Arabic text databases, electronic language
corpora, and supporting staff; consequently, each researcher has his own
system and database, with rare exceptions. Accordingly, it is very difficult
to give comparative results for the proposed methods due to the absence
of standard benchmark databases. In addition to the former issues, the
complexity of Arabic script features presents an additional challenge to
building an Arabic character recogniser; more details can be found
in [11].
In this work, the segmentation problem is avoided by considering the
word as the major unit. The popular block-based DCT transform is applied
to extract word features. The features of the entire word are fed to the
recogniser, which identifies the word without segmentation. The system is
built on HMMs, where each word is represented by a separate model. The
HMMs applied to the task are discrete 1D-HMMs built using the Hidden
Markov Model Toolkit (HTK) [28]. Vector quantisation is used to generate
a discrete observation symbol density. In the recognition phase, the
recogniser produces an N-best word hypothesis lattice. A true scanned
typewritten Arabic word-image database for five different fonts was built
for this research.
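The block-based DCT feature extraction can be sketched as follows. This is an illustrative NumPy version, not the thesis's MATLAB code; the square low-frequency mask and the number of retained coefficients per block are assumptions:

```python
import numpy as np

def dct2(block):
    """Orthonormal 2D DCT-II of a square block, computed as C @ block @ C.T."""
    n = block.shape[0]
    k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)  # DC row normalisation
    return c @ block @ c.T

def block_dct_features(img, block=8, keep=4):
    """Split the image into block x block tiles, take the 2D-DCT of each
    tile, and keep the low-frequency keep x keep corner of every tile,
    concatenated into one feature vector. A simple square mask is used
    here; the thesis's actual mask and coefficient count may differ."""
    h, w = (d - d % block for d in img.shape)  # drop incomplete edge tiles
    feats = []
    for r in range(0, h, block):
        for c in range(0, w, block):
            coeffs = dct2(img[r:r + block, c:c + block].astype(float))
            feats.append(coeffs[:keep, :keep].ravel())
    return np.concatenate(feats)

img = np.random.default_rng(0).integers(0, 2, size=(16, 24))  # toy binary word image
print(block_dct_features(img).shape)  # 6 blocks x 16 coefficients each → (96,)
```

Keeping only the low-frequency corner of each block acts as the masking step: it discards high-frequency detail while retaining the global shape information of the word.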
4.2 Thesis Methodology
The research objective is to develop a multi-font Arabic printed word
recognition system for offline applications. The system accepts digital
images of words (a sentence) and computes the features of each image; a
word recognition process generates a top-n candidate list for each word
image (a word lattice); next, a procedure constructs all possible
sentences from the word lattice; finally, the constructed sentences are
passed to a linguistic post-processing stage that re-ranks them according
to scores calculated using a linguistic model. The output is the top-n
hypotheses from the scored sentences.
The technique to be applied for the design and implementation of
the Arabic text recognition system is as follows:
The corpus used for experimentation is The Holy Qur'an
Developing a benchmark database
a. Designing the data collection form layout
b. Creating, with Microsoft Word, document files that contain
the filled form with words written in the desired font
properties
c. For real scanned images, printing the created documents
and then scanning them; for synthetic images, generating
the images directly
d. Form image pre-processing and word segmentation
e. Word image labelling and ground truth
Word image global shape feature extraction
a. Calculate the 2D-DCT coefficients of the image blocks;
8-by-8 pixels is the most commonly used block size
b. After applying a mask to each block, a number of
coefficients are selected from each block and
concatenated to construct a feature vector for the whole
image
Training phase
a. Design of Discrete Hidden Markov Model (DHMM) models,
with an n-state left-to-right topology, for each individual
word from the training data set
b. The Hidden Markov Model Toolkit (HTK) training tools
estimate the parameters of the HMM models using training
words and their associated transcriptions [28]
The HTK recognition tools, also known as decoding tools,
calculate the most likely words for the unknown word image and
produce the top-n hypotheses (a word lattice) from the lexicon
for that word image
Construct an n-gram SLM for the whole text of the corpus using
the HLM toolkit provided with the HTK toolkit [28]
Post-processing phase
a. Construct all possible sentences from the word lattice
b. Calculate the perplexity of each sentence in the test text
using the n-gram models
c. Re-rank the sentences according to their perplexity
scores
The system output is the list of re-ranked sentences
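The post-processing phase can be illustrated with a toy bigram example. This sketch assumes a pre-computed bigram probability table; the thesis builds its n-gram SLM with the HLM toolkit instead, and the floor probability for unseen bigrams is an assumption:

```python
import itertools
import math

def bigram_perplexity(sentence, bigram_prob):
    """Perplexity of a word sequence under a bigram model given as a dict
    {(w1, w2): P(w2 | w1)}; unseen bigrams get a small floor probability."""
    log_sum = sum(math.log(bigram_prob.get((w1, w2), 1e-6))
                  for w1, w2 in zip(sentence, sentence[1:]))
    return math.exp(-log_sum / (len(sentence) - 1))

def rerank(word_lattice, bigram_prob):
    """Expand an N-best word lattice (one candidate list per word position)
    into all possible sentences and sort them by ascending perplexity."""
    sentences = itertools.product(*word_lattice)
    return sorted(sentences, key=lambda s: bigram_perplexity(s, bigram_prob))

# Toy lattice: two candidates for the second word; the model prefers "b".
lattice = [["a"], ["b", "c"], ["d"]]
probs = {("a", "b"): 0.9, ("b", "d"): 0.8, ("a", "c"): 0.1, ("c", "d"): 0.2}
print(rerank(lattice, probs)[0])  # → ('a', 'b', 'd')
```

Lower perplexity means the language model finds the sentence more plausible, so sorting in ascending order puts the best hypothesis first.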
4.3 Arabic Printed Datasets
One objective of this study is to establish a database to be used
throughout this research work and to make it freely available for
academic research. The required database should contain real scanned
binarised samples of machine-printed Arabic words, with word-image
samples representing the challenges that Arabic orthography poses to
OCR technology; these challenges include ligatures, overlaps, diacritics,
and style variation. All of these are well-known problems that pose
challenges to the recognition system.
Two databases were established:
Quranic-252UW database, for system validation and testing; it
comprises 252 unique words (UW)
Quranic-343UW database, to provide evidence that the developed
system is not predisposed to particular words; it contains
343 unique words not present in the Quranic-252UW database
4.3.1 Word Set Selection Criteria:
The Arabic alphabet consists of 28 basic letters. Arabic writing is
semi-cursive even when printed, and is written horizontally from right to
left. Each word consists of one or more PAWs. The characters inside a
PAW are normally connected to each other from the right side.
An Arabic text line can be considered as being composed of two
baselines: an upper baseline and a lower one (see Figure 4.1). These two
lines divide the word image into three zones: 1) the upper zone,
containing ascenders and upper diacritic points above the upper baseline;
2) the lower zone, containing descenders and lower diacritic points below
the lower baseline; and 3) the middle zone, containing the main content
of the word, the main bodies of the letters, between the two baselines.
Each letter has part or all of its body in the middle zone and,
accordingly, each text line has at least a middle zone.
Figure 4.1: Upper and lower baselines of a text line (Simplified Arabic
style of writing)
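Although the holistic recogniser in this work needs no baseline detection, the three-zone structure can be visualised with a simple horizontal projection profile. The half-peak threshold below is purely illustrative and is not part of the thesis pipeline:

```python
import numpy as np

def baseline_band(img):
    """Locate the dense middle-zone band of a binary text line from its
    horizontal projection profile: rows whose foreground-pixel count
    exceeds half the peak are taken as lying between the two baselines."""
    profile = img.sum(axis=1)                    # foreground pixels per row
    dense = np.nonzero(profile > profile.max() / 2)[0]
    return int(dense[0]), int(dense[-1])         # upper and lower baseline rows

# Synthetic line: a thick main body (rows 4-7) with a sparse ascender above.
img = np.zeros((12, 40), dtype=int)
img[4:8, :] = 1          # middle zone: full-width strokes
img[1:4, 5:7] = 1        # ascender: a narrow vertical stroke
print(baseline_band(img))  # → (4, 7)
```

The sparse ascender contributes little to the row sums, so only the dense middle zone crosses the threshold, which is exactly the band between the upper and lower baselines.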
The word set has been chosen in order to capture, in different
combinations, the presence of ascenders, descenders, and the main bodies
of the letters. Since more than half of the Arabic letters include dots in
their shapes, dots have been taken into account in this work. Dots play an
important role in the identification of characters when the characters'
bodies have identical or similar shapes. There can be one, two, or three
dots above or below the character's body. In addition to dots, the Madda
~ and Hamza shapes are considered as well. The Hamza may appear
above or below the letter (e.g., أ, إ, ؤ, ئ), enclosed within the letter's
main body (ك), or isolated on the line (ء). The Madda may come instead
of a Hamza on the Alif letter (آ).
In this research, the words do not contain vowels or the taTweel
"elongation" character ("_"), but they can have affixes, which are
considered part of the word. Arabic affixes can be prefixes at the
beginning, such as the article  and the connector , suffixes at the end,
such as the masculine plural , and infixes, which take their position
between two letters of the root to form the word radical, such as 
positioned between the root letters  to form a radical . Multiple
affixes can appear in a word as long as the combination is coherent.
Arabic writing uses many fonts and writing styles. Different Arabic
fonts pose dramatic changes in the characters' drawing and features.
Arabic scripts are printed in fonts that render characters with different
drawing and features, such as ligatures and overlapping, in addition to
variations of character shapes that are font- and position-dependent.
4.3.2 Samples Distribution:
The availability of a dataset for training and testing is a fundamental
prerequisite for building a pattern recognition system. This data is not a
simple collection of samples gathered without consideration of class
distribution; it is necessary to consider the imbalanced-class problem,
i.e., classes with uneven distributions. This issue occurs when at least
one of the classes has more samples than the other classes in the training
data [37] [39] [53]. In this case, the classifier would be expected to
perform best when classifying the frequently represented classes but very
poorly on the under-represented classes.
To verify the performance of a classifier, the data needs to be divided
into disjoint sets: validation/training and testing sets. This demands that
the database maintain a balanced representation of the different classes
for the training and testing procedures.
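A per-class (stratified) split keeps both sets balanced. The sketch below illustrates the idea on toy data; it is not the thesis's actual partitioning procedure:

```python
import random

def balanced_split(samples_by_class, test_fraction=0.2, seed=0):
    """Split each class's samples into train/test with the same per-class
    proportion, so both sets preserve the balanced class distribution."""
    rng = random.Random(seed)
    train, test = {}, {}
    for label, samples in samples_by_class.items():
        shuffled = samples[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * test_fraction)
        test[label], train[label] = shuffled[:cut], shuffled[cut:]
    return train, test

# 25 samples per word class, as in the Quranic-252UW design: a 20% test
# fraction leaves every class with exactly 20 training and 5 test samples.
data = {f"word_{i}": [f"word_{i}_s{j}" for j in range(25)] for i in range(3)}
train, test = balanced_split(data)
print({k: (len(train[k]), len(test[k])) for k in data})
```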
4.3.3 Statistical analysis of the Arabic Qur'an text corpus:
To select the word samples to be included in this database, an
authentic copy of the Arabic Qur'an text was taken from
(http://tanzil.info). The total number of words in the corpus text is
78,245, with 14,870 unique words. Table 4.1 shows the frequency
distribution of the characters and PAWs. The unique-word frequency
distribution is shown in Table 4.2.
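Statistics of this kind can be computed directly from the electronic corpus text. The sketch below counts total words, unique words, and the unique-word length distribution for a toy English string (the real corpus is the Arabic Qur'an text, and PAW counting would additionally require knowledge of the non-connecting Arabic letters):

```python
from collections import Counter

def corpus_stats(text):
    """Total and unique word counts, plus the distribution of unique words
    by character length, for a whitespace-tokenised corpus."""
    words = text.split()
    unique = set(words)
    length_dist = Counter(len(w) for w in unique)
    return len(words), len(unique), dict(length_dist)

total, unique, dist = corpus_stats(
    "in the name of the most merciful the most gracious")
print(total, unique, dist)
```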
Table 4.1: Qur’an frequency distribution of the characters and PAWs
Characters                               PAWs                               Total
                 1        2        3       4       5      6      7      8
     1           3        0        0       0       0      0      0      0      3
     2        9514     2414        0       0       0      0      0      0  11928
     3        6984     8571      915       0       0      0      0      0  16470
     4        4830     9513     3094     257       0      0      0      0  17694
     5        2241     4927     6182     993      61      0      0      0  14404
     6         455     3064     4819    2057     349     11      0      0  10755
     7          90      705     1498    1452     403     34      3      0   4185
     8          16      324      864     679     428     51      0      1   2363
     9           2       14      114     152      80     26      3      0    391
    10           1        4        8      15      14      6      0      0     48
    11           0        0        2       2       0      0      0      0      4
Total        24136    29536    17496    5607    1335    128      6      1  78245
Table 4.2: Qur’an unique words frequency distribution
Characters                           Unique PAWs                           Total
                 1        2        3       4       5      6      7      8
     1           3        0        0       0       0      0      0      0      3
     2          80       21        0       0       0      0      0      0    101
     3         504      444       64       0       0      0      0      0   1012
     4        1022     1478      616      47       0      0      0      0   3163
     5         634     1719     1256     303      19      0      0      0   3931
     6         268     1098     1472     597     108      7      0      0   3550
     7          57      438      729     566     165     17      2      0   1974
     8          13      150      278     254     133     31      0      1    860
     9           2       14       66      91      48     17      3      0    241
    10           1        3        6      12       7      3      0      0     32
    11           0        0        1       2       0      0      0      0      3
Total         2584     5365     4488    1872     480     75      5      1  14870
4.3.4 Database implementation Process
The database used in this work is built up from words contained in
the Qur'an text corpus. The Arabic words are printed in five fonts that are
commonly used by OCR researchers. Each token (word) is printed in the
five Arabic fonts: Andalus, Simplified Arabic, Tahoma, Thuluth (Deco
Type), and Traditional Arabic. The font size is 14 point in plain style.
Table 4.3 lists samples of all used fonts.
Table 4.3: Samples of all used fonts typewritten in plain style 14 pt
Heuristics are applied to select the number of words. The combination
of PAWs, ascenders, descenders, and dots, in addition to the printing
fonts, guarantees a wide variability of images in the database. This
variability also provides samples that pose challenges to
segmentation-based methods. These challenges include ligatures,
overlapping, and the Arabic writing fonts. Another issue is the presence
of dots, which pose a challenge to the recognition system, since there are
letters that have the same body (primary part) and are distinguished only
by the number and/or position of dots.
A comprehensive analysis and characterisation of isolated handwritten
Arabic letters is presented in [54]. Results for on-line and off-line
handwriting recognition of isolated Arabic letters are reported,
respectively, in references [54] and [55].
Word selection is done according to the selection criteria stated
above, considering two main standards that define word length: the
character count and the PAW count.
Word samples of the databases were printed on plain white A4 paper
at the specified resolution supported by the printer. Then, they were
scanned as binary (black-and-white) images, as required by the feature
extractor, with the resolution set to 300 dpi. The image binarisation was
performed via the scanner's hardware threshold to save storage and time.
The scanned sheets are stored in TIFF file format. Sheet image
pre-processing and word segmentation are done using MATLAB
procedures. Table 4.4 illustrates samples of word images printed in the
five used fonts.
Table 4.4: Word-image samples for the five used fonts
Andalus
Tahoma
Simplified
Arabic
Traditional
Arabic
Thuluth
4.3.4.1 Quranic-252UW database:
It was decided to select a set of word lengths (numbers of characters)
having the same distribution; this leads to not considering the
1-character and 11-character word lengths. It is noteworthy that the
number of samples per word character length is influenced by the 28
standard Arabic letters. Except for the 10-character length, where a
sample limitation exists, the PAW samples for each character length are
selected to be equally (or nearly equally) sized. Table 4.5 shows the
distribution of the selected words according to the character and PAW
lengths.
Table 4.5: Quranic-252UW database frequency distribution of the
characters and PAWs
Characters                     PAWs                     Total
               1     2     3     4     5     6     7     8
     2        14    14     0     0     0     0     0     0     28
     3         9    10     9     0     0     0     0     0     28
     4         7     7     7     7     0     0     0     0     28
     5         5     6     6     6     5     0     0     0     28
     6         4     5     5     5     5     4     0     0     28
     7         4     5     5     5     5     4     0     0     28
     8         4     5     5     5     5     4     0     0     28
     9         0     5     6     6     6     5     0     0     28
    10         0     3     6     9     7     3     0     0     28
Total         47    60    49    43    33    20     0     0    252
The database is created using a lexicon that includes 252 different
words. For balancing requirements, each word has 25 samples in each
font type. Therefore, for each font type there is a total of 252 x 25 =
6,300 word samples. The database has 6,300 x 5 = 31,500 samples
collected from real scanned documents.
Word samples of this database were printed at FastRes 1200
resolution using an HP LaserJet M5035 Series MFP laser printer. The
images were scanned at 300 dpi resolution using the same machine's
scanner.
4.3.4.2 Quranic-343UW database:
Considering the unavailability of benchmark databases for printed
Arabic words that satisfy the requirements of this thesis, the
Quranic-343UW database was developed. This database is important for
evaluating the developed system on an unseen dataset, which provides
evidence that the system is not predisposed to a particular lexicon.
As can be seen in Table 4.6, the number of words at each combination
of word lengths (characters and PAWs) is equal to 9 unless there are not
enough word samples in the corpus data. The words are randomly
selected in order to obtain reliable results.
Table 4.6: Quranic-343UW database frequency distribution of the
characters and PAWs
Characters                     PAWs                     Total
               1     2     3     4     5     6     7     8
     1         3     0     0     0     0     0     0     0      3
     2         9     7     0     0     0     0     0     0     16
     3         9     9     9     0     0     0     0     0     27
     4         9     9     9     9     0     0     0     0     36
     5         9     9     9     9     9     0     0     0     45
     6         9     9     9     9     9     3     0     0     48
     7         9     9     9     9     9     9     2     0     56
     8         9     9     9     9     9     9     0     1     55
     9         2     9     9     9     9     9     3     0     50
    10         1     0     0     3     0     0     0     0      4
    11         0     0     1     2     0     0     0     0      3
Total         69    70    64    59    45    30     5     1    343
The database contained 25,725 word instances for