GSoC : Memory heap corruption and code rewrite

This week I’ve been busy rewriting the stemmer and debugging a heap corruption. My first implementation of the stemmer crashed ibus whenever certain words, like “ദൂരെയാണ്” and “വിദൂരമായ”, were typed. I could not locate the problem, and the only error message I got when ibus crashed was “free() – invalid next size”. Some searching revealed that this might be due to heap corruption, so I used valgrind memcheck to debug it. It was difficult to make sense of valgrind’s output, and that eventually led me to ask a question at stackoverflow. However, before all this, I was convinced that I had made some serious mistake somewhere along the development path, probably by not testing with ibus early on, and decided to sit down and rewrite the whole project. I discovered what I was doing wrong to cause the memory corruption soon after (even before anyone answered my question at stackoverflow.com). Even so, I realised that a rewrite would do the project much good. To start with, I could run valgrind as I went along with the rewrite to make sure that I plugged all the possible memory leaks. I was also able to remove some unnecessary function calls, among other things. In short, I cleaned up the code, and it is now ready for a code review.

Here’s a changelog:

1. Tried implementing the “improvement scheme”, as I had suggested in this thread. The results were far worse than expected: 60% of the words produced by suffix appending were not meaningful. Any further attempts along this path would require much more careful planning and research into the Malayalam language.

2. Located and avoided [did not stonewall it] an annoying memory corruption. Filed it under issue 51.

3. Removed the level hierarchy. All stemrules are now grouped into one. Splitting the stemrules into 3 levels serves no real purpose, and complicates stemming because each level has to be checked separately. Removing the level system has also improved the code readability a lot.

4. Replaced some function calls with inline expansions. Made all the functions more defensive and freed memory wherever valgrind reported memory leaks.

5. libvarnam-ibus requires a clean build every time libvarnam.so changes. It seems that libvarnam-ibus has its own version of libvarnam or something. Should look into this. Ibus not reflecting the changes I made to libvarnam was a real headache – no amount of debugging could solve the issue. Things started to work only after I recompiled libvarnam-ibus.

6. Eliminated recursive calls to varnam_learn(). In the first implementation, varnam_learn() would call varnam_stem() which calls varnam_learn_internal(). This was bad design. Now varnam_stem() returns a varray to varnam_learn(), and varnam_learn() iterates over this varray to learn all the stemmed words.
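
The control flow described in item 6 can be sketched roughly as follows. This is only an illustration in Python, not the actual C code in libvarnam: the `stem()` and `learn()` functions and the toy English suffix rules are hypothetical stand-ins for varnam_stem(), varnam_learn() and the real stem rules. The point is that the stemmer only returns a list of stems, and the learning function is the only place that learns.

```python
# Hypothetical suffix rules (suffix -> replacement); NOT varnam's real rule table.
STEM_RULES = {"ings": "ing", "ing": ""}


def stem(word, rules=STEM_RULES):
    """Return every successive stem of `word` without learning anything."""
    stems = []
    current = word
    changed = True
    while changed:
        changed = False
        # Try longer suffixes first so the most specific rule wins.
        for suffix in sorted(rules, key=len, reverse=True):
            if current.endswith(suffix) and len(current) > len(suffix):
                current = current[: len(current) - len(suffix)] + rules[suffix]
                stems.append(current)
                changed = True
                break
    return stems


def learn(word, learned):
    """Learn the word itself, then iterate over the stems returned by stem()."""
    learned.add(word)
    for stemmed in stem(word):
        learned.add(stemmed)
    return learned
```

With these toy rules, `learn("walkings", set())` stores the original word plus each stem returned by `stem()`, and neither function ever calls back into the other.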

These changes are not final. Some of them, like doing away with the level system, were made without consulting my mentor and will be reverted if he thinks they were a bad decision. You can see all my changes here and make suggestions.

To do :

1. More tests
2. Make sure stemmer works well with other languages
3. Enable varnam to stem from the command line interface

GSoC : Code review 1, almost.

Before more thorough testing of the stemming algorithm and its effect on varnam’s learning, my mentor and I decided that it would be a good idea to do some code review. So this week I fixed some problems with the stemming, tested how the stemming works with the IBus input method, checked whether learning is improving at all, and wrote some unit tests.

Stemming with IBus works, though with some bugs. Let us consider a case that works. The learnings database is empty and we are starting with a blank slate. Varnam does not know anything other than the symbols specified in the scheme file.
The video below demonstrates varnam learning a word with IBus as the input method. The next time the user starts to type the same word, you can see that its stemmed forms are available in the suggestions.


Right now the only cause of concern with the suggestions is that incomplete words are suggested first, and the user has to go through the suggestions list to find the intended word. Each time varnam learns a stemmed word, all its prefixes are learned as well, so over time the incomplete prefixes will come up first in the suggestions list.

There are some bugs, like some words disappearing when I choose them from the suggestions. The varnam_stem() function is possibly modifying some things that it isn’t supposed to. I’m also getting “free() – invalid next size (fast)” errors. Maybe the upcoming code review will expose my mistakes.

GSoC : Exceptions table and some testing

Progress has been slow the past week, thanks to some non-academic preoccupations and a trip home. However, had I been a bit more organized, I would have been more successful at the rather mundane task of testing out the stemming accuracy.
There are some design changes. Some stem rules did not give the desired results in all cases. That is, there were exceptions. One particular stem rule that was giving me considerable headache was “ന്” => “ൻ”. For example, ആദിത്യന് should stem to ആദിത്യൻ. But while this worked wonderfully, പിറ്റേന്ന് would be incorrectly stemmed to പിറ്റേന്ൻ. This is because ന്ന is actually a combination of ന് and ന. So ന്ന് is actually ന്+ന്, and my algorithm stems the last ന് to ൻ (see previous post).

This problem can be solved by using a look ahead. A proper, fully scalable look ahead (that is, an algorithm that can look ahead any number of characters) could turn out to be too much, so I decided to test the idea with a single look ahead. Along with the stem rules, I added another table to the database, “stem_exceptions”, that contains exceptions for each stem rule. For example, the exception rule for ന് is “ന്” => “ന്”. This tells varnam NOT to stem ന് to ൻ if the syllable preceding ന് is another ന്. This ensures that varnam will stem ന് to ൻ in all cases except when it occurs as a part of ന്ന്.
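
A minimal sketch of the exception check, in Python rather than varnam’s C. The dictionaries below are hypothetical in-memory stand-ins for the stem rules and the “stem_exceptions” table, and `split_syllables()` is a simplified splitter (each base character plus the combining marks that follow it) that I am assuming is close enough for this example. Characters are written as escapes so that the exact code points are unambiguous.

```python
import unicodedata

VIRAMA_N = "\u0d28\u0d4d"  # ന് (NA followed by the virama)
CHILLU_N = "\u0d7b"        # ൻ (chillu N)

# Hypothetical stand-ins for the stem rules and the "stem_exceptions" table.
STEM_RULES = {VIRAMA_N: CHILLU_N}
STEM_EXCEPTIONS = {VIRAMA_N: {VIRAMA_N}}


def split_syllables(word):
    """Simplified split: a base character plus any combining marks after it."""
    syllables = []
    for ch in word:
        if syllables and unicodedata.category(ch).startswith("M"):
            syllables[-1] += ch
        else:
            syllables.append(ch)
    return syllables


def stem_once(word):
    """Apply the first matching rule unless the preceding syllable is an exception."""
    syllables = split_syllables(word)
    for suffix, replacement in STEM_RULES.items():
        if not word.endswith(suffix):
            continue
        n = len(split_syllables(suffix))
        # The single look ahead: the one syllable just before the suffix.
        preceding = syllables[-n - 1] if len(syllables) > n else None
        if preceding in STEM_EXCEPTIONS.get(suffix, set()):
            return word  # exception matched, leave the word alone
        return word[: len(word) - len(suffix)] + replacement
    return word
```

Under these assumptions ആദിത്യന് becomes ആദിത്യൻ, while പിറ്റേന്ന് is returned unchanged because the syllable before the final ന് is another ന്.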

Lucky for me, the exceptions table proved useful with many other stem rules. A look ahead of a single syllable seems to satisfy varnam’s needs, at least with Malayalam. I had to implement some helper functions: one that returns the last syllable of a word (eg: in ആദിത്യന്, the last syllable would be “ന്”, and the last unicode character would be “്”) and another that counts the number of syllables in a word. The syllable count is useful for skipping the stemming of very short words. For example, varnam does not apply a stem rule if syllables_in_original_word – syllables_in_suffix is less than 2. The number 2 is arbitrary, but solves some common problems such as മകൾ. As a happy consequence, varnam will now not stem മകൾ at all but will stem പേനകൾ to പേന. Though this is neither a permanent nor a complete solution, it is enough to prevent some common stemming mistakes.
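
The helpers and the length check could look something like this in Python. It is only a sketch: the function names are made up for illustration, the syllable splitter is a simplification (a base character plus the combining marks that follow it), and the rule “കൾ” => “” is assumed from the പേനകൾ example rather than taken from the real rule table.

```python
import unicodedata

MIN_REMAINING_SYLLABLES = 2      # the arbitrary threshold mentioned above
PLURAL_SUFFIX = "\u0d15\u0d7e"   # കൾ, assumed rule: strip it from the end


def split_syllables(word):
    """Simplified split: a base character plus any combining marks after it."""
    syllables = []
    for ch in word:
        if syllables and unicodedata.category(ch).startswith("M"):
            syllables[-1] += ch
        else:
            syllables.append(ch)
    return syllables


def last_syllable(word):
    """Return the last syllable of the word, or an empty string."""
    syllables = split_syllables(word)
    return syllables[-1] if syllables else ""


def count_syllables(word):
    return len(split_syllables(word))


def should_apply_rule(word, suffix):
    """Skip the rule when too few syllables would remain after stemming."""
    remaining = count_syllables(word) - count_syllables(suffix)
    return word.endswith(suffix) and remaining >= MIN_REMAINING_SYLLABLES


def stem_plural(word):
    if should_apply_rule(word, PLURAL_SUFFIX):
        return word[: len(word) - len(PLURAL_SUFFIX)]
    return word
```

With this splitter മകൾ has three syllables and the suffix has two, so only one would remain and the rule is skipped. പേനകൾ has four syllables, so two remain and it is stemmed to പേന.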

I’ve been able to test the accuracy of the algorithm on some Malayalam wikipedia articles. I made 3 sets of about 1000 words each, with the contents of each set belonging to a particular category. My rather small test data is hosted at this repository. Here are the results for each set:

history_wikipedia – 94%
Technical_wikipedia – 89.7%
Art_wikipedia – 92.6%

Give or take 2% from each set, though I’ve been quite liberal in flagging results as errors. The fact to be noted is that if a word that should not be stemmed is not stemmed, it counts as a correct result. I do not know if this is how other stemming algorithms are tested. If 3000 definitely stemmable words were given as input, there is a considerable chance that the accuracy would be lower.
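
To make the difference between the two ways of counting concrete, here is a small Python sketch. The `accuracy()` function and the five English sample words are purely illustrative toy data, not my test sets or results. Each entry is (word, expected stem, stem produced), and a word that should not be stemmed has itself as the expected stem.

```python
def accuracy(results, stemmable_only=False):
    """Fraction of entries whose produced stem equals the expected stem.

    results: list of (word, expected_stem, actual_stem) tuples.
    stemmable_only: if True, ignore words that were never supposed to change.
    """
    if stemmable_only:
        results = [r for r in results if r[1] != r[0]]
    if not results:
        return 0.0
    correct = sum(1 for _, expected, actual in results if expected == actual)
    return correct / len(results)


# Toy data for illustration only.
SAMPLE = [
    ("walking", "walk", "walk"),   # stemmable, correct
    ("cats", "cat", "cat"),        # stemmable, correct
    ("sing", "sing", "sing"),      # not stemmable, left alone: counts as correct
    ("bus", "bus", "bus"),         # not stemmable, left alone: counts as correct
    ("running", "run", "runn"),    # stemmable, wrong
]

overall = accuracy(SAMPLE)
stemmable = accuracy(SAMPLE, stemmable_only=True)
```

On this toy sample the overall figure is 4 out of 5, but only 2 out of 3 when the words that should not be stemmed are left out, which is the kind of drop I would expect on a set of definitely stemmable words.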

I would have loved to test the data on a more recent corpus such as the Mathrubhumi newspaper archives. But there was some issue with the font, especially the chillus, that represented the Malayalam letters quite differently on the konsole than how they were rendered in the browser. For example, words ending with ൽ in the browser were seen to be ending with ല് when I copied them to the konsole. Hence, the stem rules did not match many suffixes and produced a lot of incorrect stemming or no stemming at all.
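
One possible explanation, which I have not verified against the archive, is that the text uses the older chillu encoding (consonant + virama + zero width joiner) instead of the atomic chillu characters, and the konsole simply does not render the joiner. If that assumption holds, a normalisation pass like the Python sketch below, run before stemming, might make the stem rules match again. The function name is hypothetical; the code point mapping follows the Unicode chillu characters U+0D7A to U+0D7E.

```python
ZWJ = "\u200d"     # zero width joiner
VIRAMA = "\u0d4d"  # ്

# Base consonant -> atomic chillu character.
LEGACY_TO_ATOMIC = {
    "\u0d23": "\u0d7a",  # ണ -> ൺ
    "\u0d28": "\u0d7b",  # ന -> ൻ
    "\u0d30": "\u0d7c",  # ര -> ർ
    "\u0d32": "\u0d7d",  # ല -> ൽ
    "\u0d33": "\u0d7e",  # ള -> ൾ
}


def normalize_chillus(text):
    """Rewrite consonant + virama + ZWJ sequences as atomic chillus."""
    for base, chillu in LEGACY_TO_ATOMIC.items():
        text = text.replace(base + VIRAMA + ZWJ, chillu)
    return text
```

A plain ല് without the joiner is left untouched, since that is a different, valid sequence.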

One thing I’m happy to observe is that, given a word, the stemmer produces multiple words that varnam can learn in different stages. For example, കാലങ്ങളുടെ would first stem to കാലങ്ങൾ (which varnam learns) and then to കാലം (which it learns again). If all goes well, I will be able to test and tweak the algorithm extensively this week and hopefully start estimating how much the suggestions have improved. Then, I hope, it will be time for some code reviewing with my mentor.