As we want to introduce a cool russian search in conjectPM, we tried out the russian snowball stemmer that comes with Lucene. It’s a rule based stemmer, that cuts and changes a number of characters from the input to produce some kind of word stem. This is used to prepare the indexed files as well as the queries so that it doesn’t matter for the search whether words are plural or singular (same for other kinds of flection).

After some time of wondering why the output was always the same as the input, we found out that our input was Unicode, but the stemmer expected ISO 8859-5 (cyrillic). But how to translate one encoding to the other?  >> more…