.........................................................................................................................
आभ्यंतर (Aabhyantar)
SCONLI-12
विशेषांक ISSN : 2348-7771
.........................................................................................................................
23. VEA Model in Word Formation Process of Maithili
Saroj Kumar
Jha1, Hariom Ji Shivam2 : ILE, MGAHV,Wardha
Abstract:
Morphological analysis is the most remarkable stage
for the development of Maithili-English-Hindi MT system under NLP. This paper
is motivated to design a morph analyzer for Maithili language and as add-on of
Maithili MT system for appropriate analysis at the morphological level. The
research is contributing through derivational process of analyzing word
attached with affixes. This paper has reviewed most of the methods of MA at
different level. Among all the development of MA, suffix striping and FSA are
commonly pacticed. Some of them are using lemma based approach for analyzing
morph (Nikhil et.al. 2012). There are several linguists and computer scientists
have discussed the statistical approach and Hybrid approach with FSA,
probability based model (Rinju et. al. 2013) and several other approaches of
machine learning to develop Maithili Morphological Analyzer (MMA). But rule
based approach is friendly enough to use with all the machine learning models.
VEA is one of the emerging modal in word formation process which is somehow
adequate with Maithili language. Overall linguistics approach is core of MA in
Maithili for developing MT system. This research is introducing linguistically
a friendly and bit new with machine learning and proficient model for analyzing
words and generating multiple words on the basis. The discussion also covered
the concatenation with root word to suffix and prefix. Maithili MA is
demonstrating a small concept with rule based model and we are designing it
with hybrid modal including corpus based approach. This design is incorporating
the lexicon tables, suffix list, prefix list and the Vowel Ending Approach
(VEA) to justify that how does concatenation take place. In the above table POS
category of Noun, Adjective and Verb are shifting to another category of POS
after concatenation of suffixes. And it’s also focused that how the words end
with their vowels and how does suffix connects on the basis of its vowel ending
mechanism.
Keywords: NLP, Morphology,
PoS, MA, MT.
State of Arts:
[1] The research paper on “Unsupervised Improvement
of Morphological Analyzer for Inflectionally Rich Languages” written by
(Bharti, Akshar. et. al. 2001). This paper presented an algorithm for
unsupervised learning of morphological analysis and generation of
inflectionally rich language like Hindi, given a low coverage morph and a
corpus of raw text. The result of the algorithm are encouraging with the
coverage of primitive morph going up from 32% to about 63% and that of an
advanced morph going up from 96% to about 97%.
[2]The research paper on “Hindi Derivational
Morphological Analyzer” is written by (Kanuparthi, Nikhil et.al. 2012). In
this paper the authors present their Hindi derivational morphological analyzer.
Their algorithm upgrades an existing inflectional analyzer to a derivational
analyzer & primarily achieves two goals. First it successfully incorporates
derivational analysis in the inflectional analyzer. Second, it also increases
the coverage of the inflectional analysis of the existing inflectional
analyzer. The authors pursued the five steps approach for building their derivational
analyzer – studying Hindi derivations, derivational rules, finding majority of
properties, using Wikipedia data for confirming genuineness, Develop an
algorithm for derivational analysis. The algorithm uses the principle of
Porter’s stemmer & Krovetz stemmer.
[3] The paper on “Morphological Analyzer for
Malayalam: Probabilistic Vs Rule Based Method” is written by (Rinju, O.R.
et. al. 2013). This paper presents a Morphological Analyzer for
Malayalam, by considering the noun and verb categories of a word. The proposed
morph analyzer returns the morpheme along with the grammatical information such
as Gender, Number and case information of noun and tense aspect for verb. A
probabilistic and rule based method is used for analysis using inflection and
suffix list, which is created using look up tables. The result shows that the
rule based method is more accurate than the others.
[4] The research paper on “Hindi Morphology Analyzer
& Generator” written by (Vishal et. al. 2008). This paper presents the
morphological analysis and generator tools for Hindi language using paradigm
approach for windows platform having GUI. Hindi is very rich in inflectional
morphology can be witnessed from the fact that is English usually there are
maximum 7-8 inflected word forms of noun but in Hindi it can be up to 40 and
even more than that. This morphological analyzer gives preference to the time
taken to search for a word in the database to know its grammatical information
& also accuracy of returned results. In the database used by this tool all
the possible word forms of all root words are stored. Though it takes a bit
more space but the search time is very less.
[5] The paper on “Developing
Morphological Analyzer for South Asian Languages: Experimenting with Hindi and Gujarati
Languages” is written by (Ashwani, Niraj. et. al. 2010). This paper is described on morphological analyzer for the Hindi and
Gujarati language. In order to demonstrate our approach’s portability to other
similar languages, we present our experiments for Gujarati language. The paper
presents a rule-based morphological analyzer where the rules are acquired
semi-automatically from corpora. The experiment proposes an approach that takes
both prefixes as well suffixes into account. Given an inflected Hindi word, our
system returns its root form. It uses a dictionary, and a monolingual corpus to
obtain suffix-replacement rules. The improvement is partially depending on the
GRFL list which causes variation in result.
Introduction:
Morphological analysis is a significant process
covered under Machine Translation development for Text-to-Text (T2T) and
Speech-to-Speech (S2S) model. The phase of Morph Analysis is a complex task due
to autographic variation of text and speech areal variation where its challenge
to decide the word within semantic context. Therefore MA is called a noteworthy
process of MT under Computational Linguistics. This experiment is categorized
in the two main category, the one is Fixed (रूढ़) and the other is Unfixed (यौगिक). We can simply take it as the one
can be breakable with its individual meaning and the other without meaning. This paper has preferably focused on the POS category of Noun, Adjective
and Verb. And how all these POS category contain some suffix with them. So what
are the possible conditions, where the suffixes could be attached with words
and the important description of Sandhi rules, is the point of
discussion. In first step, we have taken three category of POS like Noun,
Adjective and Adverb and picked up some most used suffix with root
having (“अ” “आ” “इ”
“ई”“उ” “ऊ” “ए” “ऐ” “ओ” “औ”)vowel endings. Second step, we confirm the ending status and
assured the most used suffix with it having same vowel initials or final. Third
step we analyze the function of concatenation are according to Sandhi
rules are followed or violated. If, Sandhi rules are followed then with
or without modification and the last step, if violated then find the cause of
violation.Here we have picked up some prominent example to
justify the conditions.
Corpus Design:
For meeting research purpose we have made the corpus
in following order which could be explained in detail. First of all the root
list of the noun, adjective and verb have been prepared. Then next we have made
the suffix and prefix list with all their features and their switching POS and
Sex category from one to another like NN-NN, NN-ADJ, NN-VB etc.
Analysis:
This
analysis is an effort of the discussion on assembling a Morph Analyzer for
Maithili. While analyzing
the morphological aspects in Maithili Language, several unrevealed concepts
have been disclosed to us. The discussion is also leading us towards the
concerned aspects encountered, which incorporate significant role in analysis
and pattern which would lead us to appropriate result. The paper have classified in various segments to
understand the most relevant for analysis to justify illustrations. At the
first, we have filtered the root list of noun, adjective and verb. Second we
have prepared the list of affixes (prefix and suffix). Further we have checked
the possible pattern available in Maithili corpus, fetched from the web. Moving
to further we have filtered the all the possible words having prefix or suffix.
Moving to next level we have assign the possible features of suffix and prefix
that, does the word class shift its POS categories with majority or
exceptionally? Initially we have categorised mainly three POS category for
detail description like Noun, Adjective and Verb. The few illustrations are
discussed over here to look for.
S. No.
|
Root
|
VE
|
Suffix/Prefix
|
POS Shift
|
Maithili Word
|
Hindi
|
Remark
|
|||||||
Noun
|
|
|||||||||||||
1.
|
गुण
|
--
|
अब
|
NN-NN
|
अबगुण
|
अवगुण
|
--
|
|||||||
2.
|
मान
|
--
|
अभि
|
NN-NN
|
अभिमान/गौरब
|
अभिमान
|
--
|
|||||||
3.
|
मुट्ठि
|
“इ”
|
आ
|
NN-NN
|
मुठिया
|
मुठिया
|
हाथ से पकड़ने वाला
|
|||||||
4.
|
नून
|
“अ”
|
गर
|
NN-JJ
|
नुनगर
|
अत्यधिक नमकीन
|
अत्यधिक नमक
|
|||||||
5.
|
छौंकी
|
“ई”
|
आएब
|
NN-VB
|
छौंकियाएब
|
लठियाना
|
बांस के पतले सीक से
|
|||||||
Table 1.1
Example: 1
Maithili–
ओकरा अपना पर अभिमान छैक।
Hindi –उसे खुद पर अभिमान है।
English – He has proud on himself.
In the above example the word “अभिमान”contains the noun POS category
attached with “अभि” prefix to “मान” root incorporate the meaning of “respect” but in Sandhi process
the word have split in the “अभि+मान” format. Here the attachment
takes place on the basis of “इ+आ” vowel concatenation between
prefix and root. The concatenations are based on the Sandhi rules.
|
||||||||||||||
Adjective
|
|
|||||||||||||
6.
|
पाकल
|
“अ”
|
“आहा”
|
JJ-JJ
|
पकलाहा
|
पका हुआ
|
कोई फल या सब्जी
|
|||||||
7.
|
सरल
|
“अ”
|
“आहा”
|
JJ-JJ
|
सरलाहा
|
सड़ा हुआ
|
कोई फल या सब्जी
|
|||||||
8.
|
मरन
|
--
|
“आ”
|
NN-JJ
|
आमरन
|
आमरण
|
मरने तक
|
|||||||
9.
|
नेन
|
“अ”
|
“बा”
|
JJ-JJ
|
नेनबा
|
--
|
छोटा सा
|
|||||||
10.
|
पातर
|
“अ”
|
“ई”
|
JJ-JJ
|
पतरकी
|
पतली
|
पतला सा
|
|||||||
Table 1.2
Example: 2
Maithili–
ई सरलाहा भट्टा किया अनलहूँ?
Hindi –ये सड़ा बैगन क्यों लाये है।
English – Why did you bring this rotten
brinjal?
In the above example the word are shifting from
adjective to adjective and adjective to other. The word “सरलाहा” has been broke up in “सरल+आहा” which follows the Sandhi rules of “अ+आ” ending where root ending contains “अ” and suffix initial containing “आ” vowel ending.
|
||||||||||||||
Verb
|
||||||||||||||
11.
|
बान्ह
|
“अ”
|
“अल”
|
VB-NN
|
बानहल
|
बंधा हुआ
|
रस्सी या किसी तार से
|
|||||||
12.
|
दौग
|
“अ”
|
“ब”
|
VB-NN
|
दौगब
|
दौड़ना
|
दौड़ना
|
|||||||
13.
|
झांप
|
“अ”
|
“ई”
|
VB-NN
|
झंपनी
|
ढकनी
|
कोई वस्तु ढकने हेतु
|
|||||||
14.
|
लेब
|
“अ”
|
“हार”
|
VB-NN
|
लेनिहार
|
लेने वाला
|
लेने वाला व्यक्ति
|
|||||||
15.
|
परह
|
“अ”
|
“औनी”
|
VB-NN
|
परहौनी
|
पढौनी
|
पढ़ाया गया
|
|||||||
Table 1.3
Example: 3
Maithili–
ओकरा दौगब निक लगे छैक।
Hindi –उसे दौड़ना
अच्छा लगता है।
English – He loves to run.
In the above examples have POS category VB as Main Verb
which shift to another POS like NN. The one of the above example “दौगब” can be split like “दौग+ब”. In this the vowel ending
process would contain “अ” ending of “अ” initial of suffix resulted the obvious sequence. At some of the place we
found such words which snag us to take appropriate decision but we are in
process to overcome such issues. The above analyses have been divided in two prominent categories, the one
fixed and the other is unfixed. The above analysis is based on unfixed part and
fixed are counted as an exceptional till date. The exceptions are kept apart
due to meaningless breakup entities and the work is under progress.
Application Flow:
Input
|
Normalize
|
Split to Word
|
Root Finding
|
Affix Features
|
Vowel End
|
Output
|
Statistical Graph:
Graph 1.1
Graph 1.2
Result & Discussion:
The above analysis has stated the prominent aspects of
Maithili morph analyzer currently under development for Maithili-Hindi-English
Machine Translation System. The above discussed aspects are prominent enough to
keep under consideration in support of VEA concatenation process. The graphical
representation has indicated the maximum availability of suffixes concatenation
among all the four domains. Therefore the research is closure to achieve
satisfactory output from the initially covered of 400k words corpus and the
process will continue in this direction to enhance the present current accuracy
71-74%, and the target to enhance it up to 95%.
References:
·
Bharati Akshar, Chaitanya Vineet, Sangal Rajeev.
Natural Language Processing: A Paninian Perspective. Prentice-Hall of India.
(1995).
·
KanuparthiNikhil, Inumella Abhilash, Misra Sharma Dipti. Hindi Derivational
Morphological Analyzer (2012).
·
R Rinju, R R Rajeev, “Morphological Analyzer for
Malayalam: Probabilistic Method Vs Rule Based Method”, IJCNLP (2013).
·
Vishal Goyal, Gurpreet Singh Lehal, “Hindi Morphological
Analyzer and Generator”, First International Conference on Emerging Trends in
Engineering and Technology, USA, pp.1156– 1159, 2008.
·
Niraj Aswani, Robert Gaizauskas, “Developing
Morphological Analyzers for South Asian Languages: Experimenting with the Hindi
and Gujrati Languages”, Proceedings of the Seventh International Conference on
Language Resources and Evaluation, Valleta, Malta pp.811-815, May, 2010.
·
Rastogi Mayuri, Khanna Pooja. Development of Morphological
Analyzer for Hindi. IJCA, 2014.
·
Malladi Deepak Kumar, Mannem Prashanth,. Statistical morphological analyzer
for hindi. In Proceedings of 6th International Joint Conference on Natural
Language Processing, 2013.
·
Muley Aditi, et al., “Morphological Analysis for a given
text In Marathi language”, International Journal of Computer Science &
Communication Network, Vol-4 (1), 13-17, 2014.
·
Deepak Kumar Malladi and Prashanth Mannem. Context based
statistical morphological analyzer and its effect on hindi dependency parsing.
In Fourth Workshop on Statistical Parsing of Morphologically Rich Languages,
volume 12, page 119, 2013.
·
Agarwal Ankita, Pramila, Singh Shashi Pal, Kumar Ajai,
Darbari Hemant,. Morphological Analyser for Hindi – A Rule Based Implementation.
In proceeding of International Journal of Advanced Computer Research (ISSN
(print): 2249-7277 ISSN (online): 2277-7970) Volume-4 Number-1 Issue-14
March-2014.
No comments:
Post a Comment