GSoC : Libvarnam can now stem

Very productive ten days. Libvarnam is finally stemming the words. I might not be so wrong in stating that the project is almost half complete. I’ve come up with a multi-pass stemming algorithm (although no flow-chart drawing was required – maybe I’ll draw one for clarity later) that has the *potential* to stem with a reasonable accuracy. Since the algorithm is intended to serve as a platform for many Indian languages, proper documentation is quite important. As a first step, I’ve made a separate github repository putting together the thought process that went into designing the algorithm. The first draft of the algorithm is in the file “03algorithm” here. Please note that the files 01classification and 02implementation are not updated –  those are just things I jotted down.

A rather quick explanation :

The varnam stemmer removes suffixes from malayalam words to obtain the base word. For example,
വേദനാജനകമായ : വേദനാജനകം

The algorithm does this by using a set of rules, called stem rules. The stem rule that was used in the above example is :

“മായ” => “ം”

These rules can be classified into 3 : level 1, level2, and level3. Level1 contains the shortest rules, and the most basic ones. Level2 contains the most common rules and are often 2 syllables or more long. Level3 contains the longest suffixes, like “യിരിക്കുന്നു” => “”. (More on levels at “01classification” here). For now, the classification into levels is rather a convenience than a necessity. I decided that one long list of stem rules is ugly and dividing them into 3 would be nicer. So there you go.

These stem rules reside in a database.


1. Compile the scheme file and insert values into stemrules table
2. buffer = empty_string_bufer
3. Do not stem if size of word is less than 10 bytes. (Min_stem_size)
4. While (termination_condition() is not met)
4.1 Get last letter of the word and insert it at the beginning of the buffer
4.2 if buffer is in level1
4.2.1 apply stemrule from level1, word is modified
4.2.2 if(independent_existence)
4.2.2.1 learn word
4.2.3 clear buffer
4.3 else if buffer is in level2
4.3.1 apply stemrule from level2, word is modified
4.3.2 if(independent_existence)
4.3.2.1 learn word
4.3.3 clear buffer

4.4 else if buffer is in level3
4.4.1 apply stemrule from level3, word is modified
4.4.2 if(independent_existence)
4.4.2.1 learn word
4.4.3 clear buffer


5. Learn the stemmed word

TERMINATION CONDITION
1. Return true if :
a) The word ends with ം.
b) If the word ends with a consonant and there is no added swara eg : പരീക്ഷ (pareeksha)
c) If the buffer contains the rest of the word (or whole of it).

There is something wrong with code indentation in wordpress. Click on the screen shot to see the neatly indented version on sublime text editor.

The algorithm
The algorithm

For example, consider the stemming of the word എന്നിവിടങ്ങളിൽ. Initially, the buffer is empty.

1. Shift word ending to buffer. Buffer now contains ൽ

2. Buffer contents (ൽ) does not correspond to a stem rule in any level.

3. Shift word ending to buffer, buffer now contains ളിൽ

4. There exists a stem rule in level 2 “ളിൽ” => “ൾ”. Apply this stem rule to the word. Word now becomes എന്നിവിടങ്ങൾ

5. Clear the buffer

6. എന്നിവിടങ്ങൾ is independent. That is, it is a meaningful word. Hence learn it. This step (learning) is not necessary in stemming, but is crucial to improve varnam’s predictions.

7. Shift word ending to buffer. Buffer now contains ൾ. Not part of a stem rule.

8. Shift the next ending to buffer. Buffer now contains ങ്ങൾ. There is a stem rule “ങ്ങൾ” => “ം” in level 2. Apply stem rule, and the word becomes  എന്നിവിടം. (Varnam learns this word too)

9. The algorithm continues by shifting the endings of  എന്നിവിടം. Since the contents of the buffer will not correspond to a stem rule at any point of time, the algorithm eventually terminates.

 

The termination condition needs some refinement. Condition a) and b) is not being used right now. Stemming terminates when there is no more element left in the word to shift to the buffer. This seems to work fine right now, and if it continues to work, I will drop conditions a) and b) altogether.

The accuracy of the stemmer is ultimately determined by how good and accurate the design of stem rules are. This requires a lot of trial and error, and some of the stem rules are in the mlstemmer repository. By careful choice of the stem rules, an accuracy of more than 80% is expected.

 

Preliminary testing

I’ve implemented a stemmer.c program under the examples directory that can read words separated by blank spaces from a text file and stem them. This is the sample input :

വിവിധതരം വധശിക്ഷകളിൽ ഒന്നാണ് കുരിശിലേറ്റിയുള്ള വധശിക്ഷ. ഈ ശിക്ഷാരീതിയിൽ പ്രതിയെ ഒരു മരക്കുരിശിൽ ആണിയടിച്ച് തളയ്ക്കുകയാണ് ചെയ്യുക വേദനാജനകമായ വധശിക്ഷ നടപ്പാക്കണം എന്ന ഉദ്ദേശത്തോടുകൂടി രൂപപ്പെടുത്തിയ പുരാതനമായ ഒരു ശിക്ഷാരീതിയാണിത് സെല്യൂസിഡ് സാമ്രാജ്യം കാർത്തേജ് റോമാ സാമ്രാജ്യം എന്നിവിടങ്ങളിൽ ക്രിസ്തുവിന് മുൻപ് നാലാം ശതകം മുതൽ ക്രിസ്തുവിനു ശേഷം നാലാം ശതകം വരെ കുരിശിലേറ്റൽ താരതമ്യേന കൂടിയ തോതിൽ നടപ്പാക്കപ്പെട്ടിരുന്നു യേശുക്രിസ്തുവിനെ കുരിശിലേറ്റി വധിച്ചുവെന്നാണ് ക്രൈസ്തവ വിശ്വാസം. ക്രിസ്തുവിനോടുള്ള ബഹുമാനത്താൽ കോൺസ്റ്റന്റൈൻ ചക്രവർത്തി എ.ഡി. 337-ൽ ഈ ശിക്ഷാരീതി നിർത്തലാക്കുകയുണ്ടായി ജപ്പാനിലും ഒരു ശിക്ഷാരീതിയായി ഇത് ഉപയോഗത്തിലുണ്ടായിരുന്നു മരണശേഷം മൃതശരീരങ്ങൾ മറ്റുള്ളവർക്കുള്ള ഒരു താക്കീത് എന്ന നിലയ്ക്ക് പ്രദർശിപ്പിക്കപ്പെട്ടിരുന്നു കാഴ്ചക്കാരെ ഹീനമായ കുറ്റങ്ങൾ ചെയ്യുന്നതിൽ നിന്നും തടയുക എന്ന ഉദ്ദേശത്തോടെയാണ് കുരിശിലേറ്റൽ സാധാരണഗതിയിൽ നടത്തിയിരുന്നത്

I’ve removed almost all the punctuation so that they won’t interfere with the stemming. I’ve taken 2 screen shots showing the results. The results are far from perfect, and that is certainly because I haven’t added that many stem rules to the database. Things should improve significantly in the next few days.

 

Stem results page 1
Stem results page 1
Stem results page 2
Stem results page 2

References

I’ve referred two papers for designing this stemmer. The first one, LALITHA uses a longest suffix stripping method and was of little use for varnam. The second one, STHREE, uses a similar algorithm to mine but confines the number of iterations to 3. However, both the papers did not contain any links to stem rules or programs that could be reused. Hence I’d be relying on the SILPA stemmer, the first stemmer in Malayalam, for the invaluable stem rules.

But I would be looking for a more exhaustive set of rules (hopefully) and will have to do quite some Malayalam reading. Apart from the Mathrubhoomi newspaper which will definitely be soaked in curry and tea by the time I could carry it away from the mess hall, Malayalam reading materials are actually hard to come by. But wait, I saw a few SFI magazines on the other guy’s room.  Gathi kettal puli pullum thinnum! :p

GSoC : Malayalam stemmer foundation

Another week, and I’m finally working on what I signed up to do – implement a malayalam stemmer. The algorithm itself is still a haze, and I will be sitting down and drawing flowcharts soon. Despite being harassed by university practical exams, I managed to squeeze in enough time to lay down a basic framework. Varnam now has the *potential* to stem.

Something wonderful happened during my last conversation with my mentor. The scheme file, which looked like an ordinary text file full of rules to convert manglish (a blend of Malayalam and English) into malayalam, turned out to be a ruby file. I’m telling you, this is a ruby program! Its actually called the scheme file. The titles “consonants” and “vowels” and the like are actually function calls. Ruby does not need paranthesis to call a function. Beautiful.

Yes, I learned a bit of ruby to add the functionality I needed. I added a few stem rules to the scheme file which gets added to an sqlite3 table when I compile the scheme file. I learned how to call c functions from ruby using FFI and also added a “–stem” option to the list of arguments accepted by varnamc.

varnamc --symbol ml --stem പരീക്ഷയാ

gives the following output:

പരീക്ഷയ്

Doesn’t make much sense, I know. But under the hoods, varnam checked if there is a stem rule for the ending “ാ” in the database and seeing that there is, substituted the ending of the supplied word with the ending specified in the stem rule (“്”). The above stemrule doesn’t serve any purpose, and will be conveniently removed after I draft the algorithm.

Now I have to write tests for all the functions I wrote. I wonder how much of the codebase I broke already.

It’s black and its listening!

I’ve been longing for this day (/night) for almost an year now! I’ve finally, finally, got some data from an electret mic and finally got it into a beaglebone black. If you really want to understand the depth of this achievement, please go through what I’ve been doing for the past few months:

 

http://electronics.stackexchange.com/questions/111458/reading-sound-with-beaglebone-black

http://electronics.stackexchange.com/questions/111969/understanding-a-circuit-diagram

http://electronics.stackexchange.com/questions/106204/connect-electret-mics-to-computer

http://electronics.stackexchange.com/questions/77160/microphone-time-delay-estimation

http://dsp.stackexchange.com/questions/11399/signal-processing-using-panda-board

http://dsp.stackexchange.com/questions/11561/sound-card-for-recording-audio

http://dsp.stackexchange.com/questions/11472/audio-interface-or-external-sound-card

 

That’s a lot of questions. And I got a lot of answers. Though I’m no where near completing the actual project, at least getting it started is a relief. I’ve tried making my own amplifier, tried an audioboard from hrc, tried another opamp IC for amplifier, and finally hit upon this :

Click to access hbmic.pdf

The previous amplifiers I made might have been working fine – I just might not have read the values properly. Anyways, its working now and that’s all I care about.

Ain't she a (black) beauty?
Ain’t she a (black) beauty?

 

Components : Everything that was in the previous link containing the circuit diagram. I did not have 10 uF capacitors so I used 22 uF instead.

Beaglebone black : The INR 4200 beast. This board is being sold out faster than it is being produced. Lucky we could order one!

The process is pretty straightforward now that it works. The beaglebone black has a set of analog pins that can read analog inputs. I connect the mic (which produces voltage variations in the range of 20mv) to a preamplifier (built using the circuit diagram in the link and 741 opamp) and the output of the preamp (in range of 1 v) is given to the analog input of the BBB. When the sound hits the electret mic, it produces a voltage that is proportional to the amplitude of the sound. This is the signal we then amplify and finally measure.

Is this enough to do some signal processing and thus continue the project? Now that requires more experimenting and is a matter for another post.

I made the mistake of not connecting the ground of beaglebone black to the ground rail in the breadboard and cursed many perfectly working amplifier circuit because it BBB did not read.

Now we have to read this value using beaglebone.

First, enable the input output pins :

echo cape-bone-iio > /sys/devices/bone_capemgr.*/slots
If that step went well, you should not see any messages.

Then I wrote a small python script to read the analog pin 1 continously :


import os
import time

while 1:
os.system("cat /sys/devices/ocp.2/helper.14/AIN1")
time.sleep(0.1)

Replace AIN1 with your analog pin (for eg: AIN2 or AIN3) and the value in time.sleep with your desira
ble delay between reads. Saved it as test.py and python test.py does the job.

BBB reading values. Only positive values can be read. I should have given a DC bias.
BBB reading values (look hard. They’re there). Only positive values can be read. I should have given a DC bias.

The beaglebone black can read only voltages till 1.8 volts on its analog pins. Also, it can only read DC. That is why there are a lot of 0s in my screenshot – that’s where the negative voltages should go. The solution is to give a DC bias to the analog signal. Say we give 1v bias. Then the readings would be around 1V and will go below 1v when a negative peak occurs.

GSoC : Unit tests, Merges and Travis

Its been two weeks since community bonding got over and I’ve been making slow but steady progress. A couple of new bugs has surfaced and I will be fixing them before moving on to my main task.

Also, varnam project has been moved to github owing to gitorious being too unstable/unreliable.

I walked (crawled would be more appropriate) into some new territory the past week :

1. Unit tests : Every software project need unit tests. Unit tests make sure that all the tiny little parts (units) of the software function the way they are supposed to function and that you did not accidentally break anything with your new commit. I had heard of them, but I never thought I’d have to do them in c. Varnam uses check for unit testing. I wrote a patch that checks for the validity of the suggestions file libvarnam receives and wrote a test case for it. I learned to do stat calls to a path, rather than using fopen() to check if the file exists. Being the beginner I am, I did break something because the test fails at 70%. Should look into it.

2. Merging : I finally understood what merging is in git, though a bit painfully. Somehow I couldn’t digest the fact that git could simply “merge” all the changes that I made and all the changes (possibly overlapping) that some one else made into something that works. The other guy could have completely removed the stuff I’ve been working on from his commit! I just found out that doing so results in conflicts, and git won’t merge the branches until the conflicts are resolved. Bloody mess. Thankfully, there are wonderful tools like meld that makes resolving conflicts a lot less painful. Besides, the conflicts I had was of a less “violent” nature. So there I go, making my very first (meaningful) merge and a pull request

3. Travis : Just when I thought that the day is done, Travis CI reported that the build failed. Travis? Travis is a continous integrations tool integrated into github. This means, every time I push into a repository, Travis builds (does cmake . , and make) the project after merging my pull request and then runs the test cases to see if anything is broken. And yes, I had broken quite a few things. My new patch completely fails a few assertions.

****************I called it a day and went to sleep*****************

I spent almost the entire day debugging and the tests continued to fail. Apparently some files that should have been generated are simply not there. I created a quick hack, and put together a new pull request. At least the tests are proceeding better now. Travis still fails. Perhaps my mentor can help me with that.

The pull request : https://github.com/varnamproject/libvarnam/pull/47