Putting together a quick report of how I spent my last 3 months on improving varnam, an awesome transliteration project. My task was to implement a stemmer to improve the learning in varnam.
A stemmer is an algorithm that, upon giving a word as the input, gives the base word as the output.
For example, giving മരത്തിലൂടെ as the input would give you മരത്തിൽ and മരം as outputs. മരം is the final output of the stemmer and മരത്തിൽ is an intermediate output. The algorithm is described here. The stemmer is similar to the SILPA stemmer created by Santhosh Thottingal, except that my version makes use of an exceptions table and produces meaningful intermediate words.
A screencast that explains my work is posted above. Make sure you watch it in 720p to clearly see the words being typed.
As far as statistics go, see this thread to know how much the learning has improved. This is not the final result, as the number of words learned is of no consequence if the stemmer does not improve transliteration accuracy. Thorough transliteration accuracy tests, before and after adding the stemmer, are yet to be done. Judging by the number of new words in the word corpus alone, varnam saw an improvement of 63% in learning when tested with 408 words. See the above thread for the exact results and the word corpus used.
This week I’ve been busy rewriting the stemmer and debugging a heap corruption. My first implementation of the stemmer used to crash ibus whenever certain words, like “ദൂരെയാണ്” and “വിദൂരമായ”, were typed. I could not locate the problem, and the only error message I got when ibus crashed was “free() – invalid next size”. Some searching revealed that this usually points to heap corruption, so I used valgrind memcheck to debug it. It was difficult to make sense of valgrind’s output, and that eventually led me to ask a question at stackoverflow.
However, before all this, I was convinced that I had made some serious mistake somewhere along the development path and decided to sit down and rewrite the whole project. I thought that my mistake was not testing with ibus early on. I discovered what I was doing wrong to cause the memory corruption soon after (even before the guy came in and gave his answer at stackoverflow.com), but I realised that a rewrite would still do the project much good. To start with, I could run valgrind as I went along with the rewrite to make sure that I plugged all the possible memory leaks. I was also able to look into some unnecessary function calls, among other things. In short, I cleaned up the code, and it is ready for a code review.
Here’s a changelog:
1. Tried implementing the “improvement scheme”, as I had suggested in this thread. The results were far worse than expected: 60% of the words produced by suffix appending were not meaningful. Any further attempts along this path would require much more careful planning and research into the Malayalam language.
2. Located and avoided [did not stonewall it] an annoying memory corruption. Filed it under issue 51.
3. Removed the level hierarchy. All stemrules are now grouped into one. Splitting the stemrules into 3 levels serves no real purpose, and complicates stemming by needing to check each level separately. Also, removal of the level system has improved the code readability a lot.
4. Replaced some function calls with inline expansions. Made all the functions more defensive and freed memory wherever valgrind reported memory leaks.
5. libvarnam-ibus requires a clean build every time libvarnam.so changes. It seems that libvarnam-ibus has its own version of libvarnam or something. Should look into this. Ibus not reflecting the changes I made to libvarnam was a real headache – no amount of debugging could solve the issue. Things started to work only after I recompiled libvarnam-ibus.
6. Eliminated recursive calls to varnam_learn(). In the first implementation, varnam_learn() would call varnam_stem() which calls varnam_learn_internal(). This was bad design. Now varnam_stem() returns a varray to varnam_learn(), and varnam_learn() iterates over this varray to learn all the stemmed words.
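The control flow from item 6 can be sketched roughly as follows. This is a Python illustration, not the C code in libvarnam: `stem`, `learn`, `learn_internal`, the `store` list and the rule “ളുടെ” => “ൾ” are hypothetical stand-ins. The point is only that the stemmer returns a list and never calls back into the learning code.

```python
# Hypothetical rules for illustration; the real rules live in the scheme file.
RULES = {"ളുടെ": "ൾ", "ങ്ങൾ": "ം"}


def stem(word, rules, max_passes=10):
    """Return every stem produced from word, without learning anything."""
    stems = []
    for _ in range(max_passes):  # bounded, so a bad rule cannot loop forever
        for suffix, replacement in rules.items():
            if word.endswith(suffix) and len(word) > len(suffix):
                word = word[: -len(suffix)] + replacement
                stems.append(word)
                break
        else:
            break  # no rule matched in this pass, so stemming is done
    return stems


def learn_internal(word, store):
    """Stand-in for the real learning step: just record the word."""
    store.append(word)


def learn(word, rules, store):
    """Learn the word itself, then each stem; stem() never calls learn()."""
    learn_internal(word, store)
    for stemmed in stem(word, rules):
        learn_internal(stemmed, store)
```

Calling `learn("കാലങ്ങളുടെ", RULES, store)` with an empty list records the original word followed by its two stems, and the call graph stays one-directional.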
These changes are not final. Some of it, like doing away with the level system, was done without consulting my mentor and would be reintroduced if he thinks that removing it was a bad decision. You can see all my changes here and make suggestions.
To do :
1. More tests
2. Make sure stemmer works well with other languages
3. Enable varnam to stem from the command line interface
Progress has been slow the past week, thanks to some non-academic preoccupations and a trip home. However, had I been a bit more organized, I would have been more successful at the rather mundane task of testing out the stemming accuracy.
There are some design changes. Some stem rules did not give the desired results in all cases. That is, there were exceptions. One particular stem rule that was giving me considerable headache was “ന്” => “ൻ”. For example, ആദിത്യന് should stem to ആദിത്യൻ. But while this worked wonderfully, പിറ്റേന്ന് would be incorrectly stemmed to പിറ്റേന്ൻ. This is because ന്ന is actually a combination of ന് and ന. So ന്ന് is actually ന്+ന്, and my algorithm stems one of these ന് to ൻ (see previous post).
This problem can be solved by using a look ahead. A proper, fully scalable look ahead (that is, one that can look ahead any number of characters) could turn out to be too much, so I decided to test the idea with a single look ahead. Along with the stem rules, I added another table to the database, “stem_exceptions”, that contains exceptions for each stem rule. For example, the exception rule for ന് is “ന്” => “ന്”. This tells varnam NOT to stem ന് to ൻ if the syllable preceding ന് is another ന്. This ensures that varnam will stem ന് to ൻ in all cases except when it occurs as part of ന്ന്.
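Here is a minimal Python sketch of the exception check, assuming the rules and exceptions have already been loaded from the database into dictionaries. The names `STEM_RULES`, `STEM_EXCEPTIONS` and `apply_rule` are made up for illustration, and the check is a plain string comparison on the text just before the suffix, which is only an approximation of a syllable-level look ahead.

```python
# The one rule and its one exception discussed above.
STEM_RULES = {"ന്": "ൻ"}
# suffix -> endings that, when found right before the suffix, block the rule
STEM_EXCEPTIONS = {"ന്": ["ന്"]}


def apply_rule(word, suffix, rules=STEM_RULES, exceptions=STEM_EXCEPTIONS):
    """Apply the stem rule for suffix unless an exception blocks it."""
    if suffix not in rules or not word.endswith(suffix):
        return word
    rest = word[: -len(suffix)]
    # Single look ahead: inspect only what immediately precedes the suffix.
    for blocker in exceptions.get(suffix, []):
        if rest.endswith(blocker):
            return word
    return rest + rules[suffix]
```

With this, `apply_rule("ആദിത്യന്", "ന്")` yields ആദിത്യൻ, while `apply_rule("പിറ്റേന്ന്", "ന്")` returns the word unchanged because the ന് is preceded by another ന്.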
Lucky for me, the exceptions table proved useful with many other stem rules. A look ahead of a single syllable seems to satisfy varnam’s needs, at least with Malayalam. I had to implement some helper functions: one that returns the last syllable of a word (eg: in ആദിത്യന്, the last syllable would be “ന്”, and the last unicode character would be “്”) and another that counts the number of syllables in a word. The syllable count is useful for skipping the stemming of very short words. For example, varnam does not apply a stem rule if syllables_in_original_word – syllables_in_suffix is less than 2. The number 2 is arbitrary, but solves some common problems such as മകൾ. As a happy consequence, varnam will now not stem മകൾ at all but will stem പേനകൾ to പേന. Though this is neither a permanent nor a complete solution, it is enough to prevent some common stemming mistakes.
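The two helpers and the length guard could look something like the Python sketch below. It approximates a syllable as one base character plus any combining signs that follow it, which is an assumption; varnam's own tokenizer may split words differently. The function names and the rule “കൾ” => “” are also hypothetical.

```python
import unicodedata

MIN_REMAINING = 2  # the arbitrary threshold mentioned above


def _is_mark(ch):
    """True for vowel signs, virama, anusvara and other combining marks."""
    return unicodedata.category(ch).startswith("M")


def count_syllables(text):
    """Approximate syllable count: every non-mark character starts one."""
    return sum(1 for ch in text if not _is_mark(ch))


def last_syllable(word):
    """Return the last base character together with its trailing marks."""
    for i in range(len(word) - 1, -1, -1):
        if not _is_mark(word[i]):
            return word[i:]
    return word


def stem_if_long_enough(word, suffix, replacement):
    """Apply suffix => replacement only if enough of the word remains."""
    if not word.endswith(suffix):
        return word
    if count_syllables(word) - count_syllables(suffix) < MIN_REMAINING:
        return word
    return word[: -len(suffix)] + replacement
```

Under this approximation മകൾ has three syllables and the suffix കൾ has two, so the rule is skipped, whereas പേനകൾ has four and is stemmed to പേന.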
I’ve been able to test the accuracy of the algorithm on some malayalam wikipedia articles. I made 3 sets of about 1000 words each. Contents of each set belonged to a particular category. My rather small test data is hosted at this repository. Here are the results for each set:
Give or take 2% from each set, though I’ve been quite liberal in flagging results as errors. The fact to be noted is that if a word that should not be stemmed is not stemmed, it counts as a correct result. I do not know if this is how other stemming algorithms are tested. If 3000 definitely stemmable words were given as input, there is a considerable chance that the accuracy would be lower.
I would have loved to test the data on some more recent corpus such as the Mathrubhumi newspaper archives. But there was some issue with the font, especially the chillus, which represented the Malayalam letters quite differently on the konsole than how they were rendered in the browser. For example, words ending with ൽ in the browser were seen to end with ല് when I copied them to the konsole. Hence, the stem rules did not match many suffixes, and the result was a lot of incorrect stemming or no stemming at all.
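One possible cause, which I have not confirmed for this corpus, is that the text encodes chillus the older way: consonant + virama + zero width joiner, instead of the single chillu code points added in Unicode 5.1. If that is the case, a small normalisation pass before stemming might help. The Python sketch below is written under that assumption, and `normalize_chillus` is a made-up name. It does nothing for text where the joiner was lost during copying.

```python
ZWJ = "\u200d"     # zero width joiner
VIRAMA = "\u0d4d"  # Malayalam sign virama (chandrakkala)

# Base consonant -> atomic chillu code point.
LEGACY_TO_ATOMIC = {
    "\u0d23": "\u0d7a",  # NNA -> chillu NN
    "\u0d28": "\u0d7b",  # NA  -> chillu N
    "\u0d30": "\u0d7c",  # RA  -> chillu RR
    "\u0d32": "\u0d7d",  # LA  -> chillu L
    "\u0d33": "\u0d7e",  # LLA -> chillu LL
}


def normalize_chillus(text):
    """Rewrite consonant + virama + ZWJ sequences as atomic chillus."""
    for base, atomic in LEGACY_TO_ATOMIC.items():
        text = text.replace(base + VIRAMA + ZWJ, atomic)
    return text
```

Running the corpus through such a function before stemming would let stem rules written with atomic chillus match both encodings.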
One thing I’m happy to observe is that, given a word, the stemmer produces multiple words that varnam can learn in different stages. For example, കാലങ്ങളുടെ would first stem to കാലങ്ങൾ (which varnam learns) and then to കാലം (learns again). If all goes well, I will be able to test and tweak the algorithm extensively this week and hopefully start estimating how much the suggestions have improved. Then, I hope, it will be time for some code reviewing with my mentor.
Very productive ten days. Libvarnam is finally stemming the words. I might not be so wrong in stating that the project is almost half complete. I’ve come up with a multi-pass stemming algorithm (although no flow-chart drawing was required – maybe I’ll draw one for clarity later) that has the *potential* to stem with a reasonable accuracy. Since the algorithm is intended to serve as a platform for many Indian languages, proper documentation is quite important. As a first step, I’ve made a separate github repository putting together the thought process that went into designing the algorithm. The first draft of the algorithm is in the file “03algorithm” here. Please note that the files 01classification and 02implementation are not updated – those are just things I jotted down.
A rather quick explanation :
The varnam stemmer removes suffixes from malayalam words to obtain the base word. For example,
വേദനാജനകമായ : വേദനാജനകം
The algorithm does this by using a set of rules, called stem rules. The stem rule that was used in the above example is :
“മായ” => “ം”
These rules can be classified into 3 levels: level1, level2, and level3. Level1 contains the shortest and most basic rules. Level2 contains the most common rules, which are often 2 or more syllables long. Level3 contains the longest suffixes, like “യിരിക്കുന്നു” => “”. (More on levels at “01classification” here). For now, the classification into levels is a convenience rather than a necessity. I decided that one long list of stem rules is ugly and dividing them into 3 would be nicer. So there you go.
These stem rules reside in a database.
1. Compile the scheme file and insert values into stemrules table
2. buffer = empty_string_buffer
3. Do not stem if size of word is less than 10 bytes. (Min_stem_size)
4. While (termination_condition() is not met)
    4.1 Get last letter of the word and insert it at the beginning of the buffer
    4.2 if buffer is in level1
        4.2.1 apply stemrule from level1, word is modified
        4.2.2 learn word
        4.2.3 clear buffer
    4.3 else if buffer is in level2
        4.3.1 apply stemrule from level2, word is modified
        4.3.2 learn word
        4.3.3 clear buffer
    4.4 else if buffer is in level3
        4.4.1 apply stemrule from level3, word is modified
        4.4.2 learn word
        4.4.3 clear buffer
5. Learn the stemmed word
termination_condition() returns true if :
a) The word ends with ം.
b) If the word ends with a consonant and there is no added swara eg : പരീക്ഷ (pareeksha)
c) If the buffer contains the rest of the word (or whole of it).
There is something wrong with code indentation in wordpress. Click on the screen shot to see the neatly indented version on sublime text editor.
For example, consider the stemming of the word എന്നിവിടങ്ങളിൽ. Initially, the buffer is empty.
1. Shift word ending to buffer. Buffer now contains ൽ
2. Buffer contents (ൽ) does not correspond to a stem rule in any level.
3. Shift word ending to buffer, buffer now contains ളിൽ
4. There exists a stem rule in level 2 “ളിൽ” => “ൾ”. Apply this stem rule to the word. Word now becomes എന്നിവിടങ്ങൾ
5. Clear the buffer
6. എന്നിവിടങ്ങൾ is independent. That is, it is a meaningful word. Hence learn it. This step (learning) is not necessary in stemming, but is crucial to improve varnam’s predictions.
7. Shift word ending to buffer. Buffer now contains ൾ. Not part of a stem rule.
8. Shift the next ending to buffer. Buffer now contains ങ്ങൾ. There is a stem rule “ങ്ങൾ” => “ം” in level 2. Apply stem rule, and the word becomes എന്നിവിടം. (Varnam learns this word too)
9. The algorithm continues by shifting the endings of എന്നിവിടം. Since the contents of the buffer will not correspond to a stem rule at any point of time, the algorithm eventually terminates.
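The walkthrough above can be reproduced with a short Python sketch. It is a simplification, not the libvarnam code: the three levels are merged into one hypothetical `STEM_RULES` dictionary holding only the two rules from the example, the buffer grows one code point at a time rather than one syllable at a time, and the loop stops when nothing is left to shift into the buffer.

```python
MIN_STEM_SIZE = 10  # bytes, as in step 3 of the outline

# Only the two rules used in the walkthrough.
STEM_RULES = {"ളിൽ": "ൾ", "ങ്ങൾ": "ം"}


def stem_all(word, rules=STEM_RULES):
    """Return every intermediate stem of word, in the order produced."""
    learned = []
    if len(word.encode("utf-8")) < MIN_STEM_SIZE:
        return learned
    pos = len(word)  # the buffer is word[pos:]; empty to begin with
    while pos > 0:   # stop when there is nothing left to shift
        pos -= 1     # shift one more code point into the buffer
        buffer = word[pos:]
        if buffer in rules:
            # Apply the rule, record the new word, and clear the buffer.
            # Each sample rule shortens the word, so the loop terminates.
            word = word[:pos] + rules[buffer]
            learned.append(word)
            pos = len(word)
    return learned
```

For എന്നിവിടങ്ങളിൽ this returns എന്നിവിടങ്ങൾ followed by എന്നിവിടം, the same two words that the steps above learn.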
The termination condition needs some refinement. Conditions a) and b) are not being used right now. Stemming terminates when there is no more element left in the word to shift to the buffer. This seems to work fine right now, and if it continues to work, I will drop conditions a) and b) altogether.
The accuracy of the stemmer is ultimately determined by how well the stem rules are designed. This requires a lot of trial and error, and some of the stem rules are in the mlstemmer repository. With a careful choice of stem rules, I expect an accuracy of more than 80%.
I’ve implemented a stemmer.c program under the examples directory that can read words separated by blank spaces from a text file and stem them. This is the sample input :
വിവിധതരം വധശിക്ഷകളിൽ ഒന്നാണ് കുരിശിലേറ്റിയുള്ള വധശിക്ഷ. ഈ ശിക്ഷാരീതിയിൽ പ്രതിയെ ഒരു മരക്കുരിശിൽ ആണിയടിച്ച് തളയ്ക്കുകയാണ് ചെയ്യുക വേദനാജനകമായ വധശിക്ഷ നടപ്പാക്കണം എന്ന ഉദ്ദേശത്തോടുകൂടി രൂപപ്പെടുത്തിയ പുരാതനമായ ഒരു ശിക്ഷാരീതിയാണിത് സെല്യൂസിഡ് സാമ്രാജ്യം കാർത്തേജ് റോമാ സാമ്രാജ്യം എന്നിവിടങ്ങളിൽ ക്രിസ്തുവിന് മുൻപ് നാലാം ശതകം മുതൽ ക്രിസ്തുവിനു ശേഷം നാലാം ശതകം വരെ കുരിശിലേറ്റൽ താരതമ്യേന കൂടിയ തോതിൽ നടപ്പാക്കപ്പെട്ടിരുന്നു യേശുക്രിസ്തുവിനെ കുരിശിലേറ്റി വധിച്ചുവെന്നാണ് ക്രൈസ്തവ വിശ്വാസം. ക്രിസ്തുവിനോടുള്ള ബഹുമാനത്താൽ കോൺസ്റ്റന്റൈൻ ചക്രവർത്തി എ.ഡി. 337-ൽ ഈ ശിക്ഷാരീതി നിർത്തലാക്കുകയുണ്ടായി ജപ്പാനിലും ഒരു ശിക്ഷാരീതിയായി ഇത് ഉപയോഗത്തിലുണ്ടായിരുന്നു മരണശേഷം മൃതശരീരങ്ങൾ മറ്റുള്ളവർക്കുള്ള ഒരു താക്കീത് എന്ന നിലയ്ക്ക് പ്രദർശിപ്പിക്കപ്പെട്ടിരുന്നു കാഴ്ചക്കാരെ ഹീനമായ കുറ്റങ്ങൾ ചെയ്യുന്നതിൽ നിന്നും തടയുക എന്ന ഉദ്ദേശത്തോടെയാണ് കുരിശിലേറ്റൽ സാധാരണഗതിയിൽ നടത്തിയിരുന്നത്
I’ve removed almost all the punctuation so that they won’t interfere with the stemming. I’ve taken 2 screen shots showing the results. The results are far from perfect, and that is certainly because I haven’t added that many stem rules to the database. Things should improve significantly in the next few days.
I’ve referred to two papers for designing this stemmer. The first one, LALITHA, uses a longest suffix stripping method and was of little use for varnam. The second one, STHREE, uses an algorithm similar to mine but confines the number of iterations to 3. However, neither paper contained any links to stem rules or programs that could be reused. Hence I’d be relying on the SILPA stemmer, the first stemmer in Malayalam, for the invaluable stem rules.
But I would be looking for a more exhaustive set of rules (hopefully) and will have to do quite some Malayalam reading. Apart from the Mathrubhoomi newspaper, which will definitely be soaked in curry and tea by the time I could carry it away from the mess hall, Malayalam reading materials are actually hard to come by. But wait, I saw a few SFI magazines in the other guy’s room. Gathi kettal puli pullum thinnum (a desperate tiger will even eat grass)! :p
Another week, and I’m finally working on what I signed up to do – implement a malayalam stemmer. The algorithm itself is still a haze, and I will be sitting down and drawing flowcharts soon. Despite being harassed by university practical exams, I managed to squeeze in enough time to lay down a basic framework. Varnam now has the *potential* to stem.
Something wonderful happened during my last conversation with my mentor. The scheme file, which looked like an ordinary text file full of rules to convert manglish (a blend of Malayalam and English) into malayalam, turned out to be a ruby file. I’m telling you, this is a ruby program! It’s actually called the scheme file. The titles “consonants” and “vowels” and the like are actually function calls. Ruby does not need parentheses to call a function. Beautiful.
Yes, I learned a bit of ruby to add the functionality I needed. I added a few stem rules to the scheme file which gets added to an sqlite3 table when I compile the scheme file. I learned how to call c functions from ruby using FFI and also added a “–stem” option to the list of arguments accepted by varnamc.
varnamc --symbol ml --stem പരീക്ഷയാ
gives the following output:
Doesn’t make much sense, I know. But under the hood, varnam checked if there is a stem rule for the ending “ാ” in the database and, seeing that there is, substituted the ending of the supplied word with the ending specified in the stem rule (“്”). The above stemrule doesn’t serve any purpose, and will be conveniently removed after I draft the algorithm.
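The lookup described above can be imitated in a few lines of Python with the standard sqlite3 module. This only illustrates the idea and is not what varnamc actually runs: the table layout (columns `old_ending` and `new_ending`) and the function names are assumptions, and the database is an in-memory one holding just the throwaway rule “ാ” => “്”.

```python
import sqlite3


def open_rules_db():
    """Create an in-memory database holding the single sample stem rule."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE stemrules (old_ending TEXT PRIMARY KEY, new_ending TEXT)"
    )
    # U+0D3E is the vowel sign AA, U+0D4D is the virama.
    conn.execute("INSERT INTO stemrules VALUES (?, ?)", ("\u0d3e", "\u0d4d"))
    return conn


def stem_ending(conn, word):
    """Replace the last character of word if a stem rule exists for it."""
    row = conn.execute(
        "SELECT new_ending FROM stemrules WHERE old_ending = ?", (word[-1:],)
    ).fetchone()
    if row is None:
        return word  # no rule for this ending, leave the word alone
    return word[:-1] + row[0]
```

Here `stem_ending(open_rules_db(), "പരീക്ഷയാ")` swaps the trailing “ാ” for “്”, and a word with any other ending comes back untouched.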
Now I have to write tests for all the functions I wrote. I wonder how much of the codebase I broke already.
I’m excited to announce that I’ve been selected to this year’s google summer of code. My mentoring organization is SMC – Swathantra Malayalam Computing and I will be working on the varnam project.
Varnam means ‘colors’. Varnam is a transliterator for indic languages. My task is to improve the learning capability of varnam by coming up with a stemmer algorithm for indic languages. A stemmer algorithm returns a base word when it is supplied a complex word. In english, supplying ‘retirement’ to the porter stemmer algorithm will trim it down to ‘retire’ and subsequently return ‘retir’. I have to do the same thing with malayalam words. The trick is to design the whole thing in such a way that stemming support for other languages can be easily added. The stemming rules will differ from language to language. Though I will be laying down the rules for malayalam, I should provide room for someone else if she decides to add support for another language. In short, my algorithm should be designed to read a ‘rule file’.
The varnam project can be found here. Why use varnam when you have, say, google input tools? For one, google input tools work only in windows. Two, I’m not sure if you can use it in your own programs. I guess not. Three, it is not open source which means google won’t let you take a peek inside. Four, varnam can render the whole linux shell in malayalam if need be (and if you are willing to put in the effort)! To be frank, seeing small round malayalam alphabets on my desktop konsole was quite unexpected!
I’m so grateful to SMC for letting me work on this and even more grateful to google for the upcoming paycheck ;). SMC requires us to keep the blog updated on a weekly basis, so I guess everyone will be hearing an awful lot from me 😀