GSoC : Final report

Putting together a quick report of how I spent my last 3 months on improving varnam, an awesome transliteration project. My task was to implement a stemmer to improve the learning in varnam.
A stemmer is an algorithm that, upon giving a word as the input, gives the base word as the output.

For example, giving മരത്തിലൂടെ as the input would give you മരത്തിൽ and മരം as outputs. മരം is the final output of the stemmer and മരത്തിൽ is an intermmediate output of the stemmer. The algorithm is described here. The stemmer is similar to SILPA stemmer created by Santhosh Thottingal except that my version makes use of an exceptions table and produces meaningful intermmediate words.

A screencast that explains my work is posted above. Make sure you watch it in 720p to clearly see the words being typed.

As far as statistics go, see this thread to know how much the learning has improved. This is not the final result, as the number of words learned is of no consequence if the stemmer does not improve transliteration accuracy. Transliteration accuracy tests before and after the tests are yet to be done thoroughly. Judging by the number of new words in the word corpus alone, varnam saw an improvement of 63% in learning when tested with 408 words.See the above thread for the exact results and the word corpus used.

GSoC : Memory heap corruption and code rewrite

This week I’ve been busy rewriting the stemmer and debugging some memory heap corruption. My first implmentation of the stemmer used to crash ibus whenever certain words, like “ദൂരെയാണ്” and “വിദൂരമായ” were typed. I could not locate the problem, and the only error message I got was “free() – invalid next size” when ibus crashed. Some searching revealed that it might be due to a memory heap corruption. I used valgrind memcheck to debug the memory corruption. It was difficult to make sense of valgrind’s output, and that eventually lead me to ask a question at stackoverflow. However, before all this, I was convinced that I made some serious mistake somewhere along the development path and decided to sit down and rewrite the whole project. I thought that I made a mistake by not testing with ibus early on. I discovered what I was doing wrong to merit the memory corruption soon after (even before the guy came in and gave his answer at stackoverflow.com). However, I realised that a rewrite would do the project much good. To start with, I could then run valgrind as I went with the rewrite to make sure that I plugged all the possible memory leaks. Also, I was able to look into some unnecesary function calls among other things. In short, I cleaned the code and is ready for a code review.

Here’s a changelog:

1. Tried implementing the “improvement scheme”, as I had suggested in this thread. The results were far worse than expected. 60% of the words after suffix appending were not meaningful. Any further attempts along this path would require much more careful planning and reasearch of the malayalam language.

2. Located and avoided [did not stonewall it] an annoying memory corruption. Filed it under issue 51.

3. Removed the level hierarchy. All stemrules are now grouped into one. Splitting the stemrules into 3 levels serve no real purpose, and complicates stemming by needing to check each level seperately. Also, removal of the level system has improved the code readability a lot.

4. Replaced some function calls with inline expansions. Made all the functions more defensive and freed memory wherever valgrind reported memory leaks.

5. Libvarnam ibus requires a clean build every time libvarnam.so changes. It seems that libvarnam-ibus has its own version of libvarnam or something. Should look into this. Ibus not reflecting the changes I made to libvarnam was a real headache – no amount of debugging could solve the issue. Tried recompiling libvarnam-ibus and things started to work.

6. Eliminated recursive calls to varnam_learn(). In the first implementation, varnam_learn() would call varnam_stem() which calls varnam_learn_internal(). This was bad design. Now varnam_stem() returns a varray to varnam_learn(), and varnam_learn() iterates over this varray to learn all the stemmed words.

These changes are not final. Some of it, like doing away with the level system, was done without consulting my mentor and would be reintroduced if he thinks that removing it was a bad decision. You can see all my changes here and make suggestions.

To do :

1. More tests
2. Make sure stemmer works well with other languages
3. Enable varnam to stem from the command line interface

GSoC : Malayalam stemmer foundation

Another week, and I’m finally working on what I signed up to do – implement a malayalam stemmer. The algorithm itself is still a haze, and I will be sitting down and drawing flowcharts soon. Despite being harassed by university practical exams, I managed to squeeze in enough time to lay down a basic framework. Varnam now has the *potential* to stem.

Something wonderful happened during my last conversation with my mentor. The scheme file, which looked like an ordinary text file full of rules to convert manglish (a blend of Malayalam and English) into malayalam, turned out to be a ruby file. I’m telling you, this is a ruby program! Its actually called the scheme file. The titles “consonants” and “vowels” and the like are actually function calls. Ruby does not need paranthesis to call a function. Beautiful.

Yes, I learned a bit of ruby to add the functionality I needed. I added a few stem rules to the scheme file which gets added to an sqlite3 table when I compile the scheme file. I learned how to call c functions from ruby using FFI and also added a “–stem” option to the list of arguments accepted by varnamc.

varnamc --symbol ml --stem പരീക്ഷയാ

gives the following output:

പരീക്ഷയ്

Doesn’t make much sense, I know. But under the hoods, varnam checked if there is a stem rule for the ending “ാ” in the database and seeing that there is, substituted the ending of the supplied word with the ending specified in the stem rule (“്”). The above stemrule doesn’t serve any purpose, and will be conveniently removed after I draft the algorithm.

Now I have to write tests for all the functions I wrote. I wonder how much of the codebase I broke already.

GSoC : Unit tests, Merges and Travis

Its been two weeks since community bonding got over and I’ve been making slow but steady progress. A couple of new bugs has surfaced and I will be fixing them before moving on to my main task.

Also, varnam project has been moved to github owing to gitorious being too unstable/unreliable.

I walked (crawled would be more appropriate) into some new territory the past week :

1. Unit tests : Every software project need unit tests. Unit tests make sure that all the tiny little parts (units) of the software function the way they are supposed to function and that you did not accidentally break anything with your new commit. I had heard of them, but I never thought I’d have to do them in c. Varnam uses check for unit testing. I wrote a patch that checks for the validity of the suggestions file libvarnam receives and wrote a test case for it. I learned to do stat calls to a path, rather than using fopen() to check if the file exists. Being the beginner I am, I did break something because the test fails at 70%. Should look into it.

2. Merging : I finally understood what merging is in git, though a bit painfully. Somehow I couldn’t digest the fact that git could simply “merge” all the changes that I made and all the changes (possibly overlapping) that some one else made into something that works. The other guy could have completely removed the stuff I’ve been working on from his commit! I just found out that doing so results in conflicts, and git won’t merge the branches until the conflicts are resolved. Bloody mess. Thankfully, there are wonderful tools like meld that makes resolving conflicts a lot less painful. Besides, the conflicts I had was of a less “violent” nature. So there I go, making my very first (meaningful) merge and a pull request

3. Travis : Just when I thought that the day is done, Travis CI reported that the build failed. Travis? Travis is a continous integrations tool integrated into github. This means, every time I push into a repository, Travis builds (does cmake . , and make) the project after merging my pull request and then runs the test cases to see if anything is broken. And yes, I had broken quite a few things. My new patch completely fails a few assertions.

****************I called it a day and went to sleep*****************

I spent almost the entire day debugging and the tests continued to fail. Apparently some files that should have been generated are simply not there. I created a quick hack, and put together a new pull request. At least the tests are proceeding better now. Travis still fails. Perhaps my mentor can help me with that.

The pull request : https://github.com/varnamproject/libvarnam/pull/47

GSoC : Community bonding

The community bonding period of this year’s google summer of code is nearing an end. Its been a rather busy week, and I had to juggle time between exam preps and GsoC. I cannot say that I have made much progress. However, an IRC meeting with the mentor turned out to be very fruitful. It was about setting up the right development environment, and I did learn a lot!

1. ctags/etags : I was complaining how hard it is to find function definitions in the libvarnam codebase. There are a lot of header files. That’s when I heard about ctags. I had to install the ctags package from the ubuntu repositories, and configure it to catalogue the libvarnam folder. Then I got myself the sublime text editor and installed the plugin for ctags. Now all I have to do is press ctrl+t+t when I encounter a function call and sublime will open the the definition of that function in a separate tab! Productivity multiplied – ten fold!

Another convenient way (though not as convenient) would be to use grep -iR. The -iR argument makes grep list the files from which the pattern matches were found.

2. Nemiver : I have used the gnu debugger (gdb) in my lab before. The programs I wrote then were rather small and I could live without a debugger. But mentor says no. Nemiver is a rather neat front end to gdb and I don’t have to look up line numbers to insert break points anymore – I click on the line instead. Also, nemiver makes the print command in gdb quite obsolete. Nemiver shows the values of all the variables in the scope as a list.

3. Sample project : My first task. In order to get myself familiar with libvarnam and learn some debugging in the process, the mentor asked me to write a sample project. My sample program, found here, would convert all the string literals in a python program into their corresponding Malayalam equivalent. Simple and buggy. But I did learn how to make nemiver branch into the libvarnam API and do some transliteration.

Now that I’m getting a few days gap before the last exam, I must fix a bug or two. I hope I’ll be able to start working on the stemming algorithm starting May 20th.

Google Summer of Code!

I’m excited to announce that I’ve been selected to this year’s google summer of code. My mentoring organization is SMC – Swathantra Malayalam Computing and I will be working on the varnam project.

Varnam means ‘colors’. Varnam is a transliterator for indic languages. My task is to improve the learning capability of varnam by coming up with a stemmer algorithm for indic languages. A stemmer algorithm returns a base word when it is supplied a complex word. In english, supplying ‘retirement’ to the porter stemmer algorithm will trim it down to ‘retire’ and subsequently return ‘retir’. I have to do the same thing with malayalam words. The trick is to design the whole thing in such a way that stemming support for other languages can be easily added. The stemming rules will differ from language to language. Though I will be laying down the rules for malayalam, I should provide room for someone else if she decides to add support for another language. In short, my algorithm should be designed to read a ‘rule file’.

The varnam project can be found here. Why use varnam when you have, say, google input tools? For one, google input tools work only in windows. Two, I’m not sure if you can use it in your own programs. I guess not. Three, it is not open source which means google won’t let you take a peek inside. Four, varnam can render the whole linux shell in malayalam if need be (and if you are willing to put in the effort)! To be frank, seeing small round malayalam alphabets on my desktop konsole was quite unexpected!

I’m so grateful to SMC for letting me work on this and even more grateful to google for the upcoming paycheck ;). SMC requires us to keep the blog updated on a weekly basis, so I guess everyone will be hearing an awful lot from me 😀