Because we want to introduce a cool Russian search in conjectPM, we tried out the Russian Snowball stemmer that comes with Lucene. It’s a rule-based stemmer that cuts and changes a number of characters in the input to produce some kind of word stem. The stemmer is applied to the indexed files as well as to the queries, so that it doesn’t matter for the search whether words are plural or singular (the same goes for other kinds of inflection).
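To make the idea concrete, here is a toy suffix stripper in Python. It is only a sketch of the rule-based principle and is not the Snowball algorithm that Lucene ships: the suffix list and the `toy_stem` helper are made up for illustration, and the real stemmer has many more rules.

```python
# Hypothetical, deliberately tiny suffix list, longest suffixes first.
TOY_SUFFIXES = ("ами", "ов", "а", "и", "ы")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix; a toy stand-in for a real stemmer."""
    word = word.lower()
    for suffix in TOY_SUFFIXES:
        # Keep at least two characters so very short words are left intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


# The same function is applied at index time and at query time,
# so singular and plural forms end up as the same term.
indexed_term = toy_stem("книги")  # plural, as it might appear in a document
query_term = toy_stem("книга")    # singular, as a user might type it
print(indexed_term == query_term)
```

Because both sides go through the same reduction, a query for the singular form finds documents that only contain the plural.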
After some time spent wondering why the output was always the same as the input, we found out that our input was Unicode, but the stemmer expected ISO 8859-5 (Cyrillic). But how to translate one encoding to the other?
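A small Python sketch shows why such a mismatch makes a stemmer do nothing. It assumes, as the \u00xx characters in the stemmer source suggest, that the suffix tables hold ISO 8859-5 byte values stored as characters; the sample word is ours.

```python
# Hypothetical sample word; any Cyrillic text behaves the same way.
word = "книги"

# What a Unicode-aware program sees: code points in the Cyrillic block.
unicode_points = [hex(ord(ch)) for ch in word]

# The same word as ISO 8859-5 bytes, one byte per letter.
iso_bytes = word.encode("iso-8859-5")

# Read those bytes as if they were Latin-1 and each one becomes a
# character in the U+00xx range, which is not a Cyrillic letter at all.
misread = iso_bytes.decode("latin-1")
misread_points = [hex(ord(ch)) for ch in misread]

print(unicode_points)
print(misread_points)
print(word == misread)
```

The two strings share no characters, so suffix rules written against one representation never match input in the other, and the word comes back unchanged.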
We used native-to-ascii conversion to change the \u00xx characters in RussianStemmer.java into some readable (but wrongly interpreted) text, pasted that text into TextWrangler (the best editor for the Mac), chose “Reopen with other encoding -> ISO 8859-5”, so that the characters were interpreted correctly, copied it back to IDEA, did ascii-to-native and – Ta-Dah! – we had a Unicode version.
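The same round trip can be sketched in a few lines of Python instead of two editors. This is only an illustration under our assumptions: the escaped string is a made-up example rather than a line from RussianStemmer.java, and `reinterpret_as_cyrillic` is a hypothetical helper name.

```python
def reinterpret_as_cyrillic(misread: str) -> str:
    """Treat U+00xx characters as ISO 8859-5 bytes and decode them as Cyrillic."""
    raw = misread.encode("latin-1")   # each U+00xx character -> the same byte value
    return raw.decode("iso-8859-5")   # those bytes -> real Cyrillic code points


# Made-up example of \u00xx escapes as they might appear in a source file.
escaped = r"\u00da\u00dd\u00d8\u00d3\u00d8"

# Step 1: turn the escapes into actual (wrongly interpreted) characters.
misread = escaped.encode("ascii").decode("unicode_escape")

# Step 2: the "Reopen with other encoding" step.
fixed = reinterpret_as_cyrillic(misread)

# Step 3: back to escapes, now pointing at the Cyrillic block.
unicode_escaped = fixed.encode("unicode_escape").decode("ascii")

print(fixed)
print(unicode_escaped)
```

Going through Latin-1 works because it maps every byte value to the code point with the same number, so no information is lost on the way.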
Encoding stuff always makes my head ache.




Somehow we didn’t get the Unicode version working as well… instead we switched to the RussianStemmerFilter provided in the Lucene Analyzers.
This does a pretty good job of stemming Russian words (in Unicode).
Would this Python snippet satisfy your needs?
# read the unicode contents to a variable txt
txt = u"… some unicode text …"
out = txt.encode('ISO 8859-5')
# dump out to a file
Or did I get the problem wrong here?
Thanks for a reply.
In fact, that’s what TextWrangler does internally (or something very similar). But replacing the ISO-encoded characters with Unicode-encoded characters unfortunately didn’t help here. Thanks anyway!