Since we want to introduce a cool Russian search in conjectPM, we tried out the Russian Snowball stemmer that comes with Lucene. It’s a rule-based stemmer that cuts and changes a number of characters of the input to produce a word stem. Stemming is applied to the indexed files as well as to the queries, so that it doesn’t matter for the search whether a word is plural or singular (the same goes for other kinds of inflection).

After wondering for some time why the output was always the same as the input, we found out that our input was Unicode, but the stemmer expected ISO 8859-5 (Cyrillic). But how do you translate one encoding to the other?
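To illustrate the symptom, here is a minimal Python sketch of a suffix stripper whose rule table is stored in the legacy form, with ISO 8859-5 bytes written as characters in the range U+0080 to U+00FF. The suffix table, the helper names and the sample word are made up for illustration and are not taken from the Lucene source; only the encoding name comes from the post.

```python
# Hypothetical suffix table (longest first), stored as ISO 8859-5 byte values
# written as \u00xx characters. In Unicode these would be "ами", "ы", "а".
LEGACY_SUFFIXES = ["\u00d0\u00dc\u00d8", "\u00eb", "\u00d0"]


def strip_suffix(word, suffixes):
    """Remove the first matching suffix; return the word unchanged if none match."""
    for suffix in suffixes:
        if len(word) > len(suffix) and word.endswith(suffix):
            return word[:-len(suffix)]
    return word


def to_legacy(word, encoding="iso-8859-5"):
    """Encode Unicode text to bytes, then view each byte as one character."""
    return word.encode(encoding).decode("latin-1")


word = "книгами"  # real Unicode Cyrillic input

# Unicode input never matches the legacy table, so nothing is stripped.
unchanged = strip_suffix(word, LEGACY_SUFFIXES)

# Input converted to the legacy form does match the table.
stemmed_legacy = strip_suffix(to_legacy(word), LEGACY_SUFFIXES)
```

With Unicode input the function returns the word untouched, which is the behaviour we saw; only input in the same byte-as-character form as the table gets stemmed.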

We used native-to-ascii conversion to change the \u00xx escapes in RussianStemmer.java into readable (but wrongly interpreted) text, pasted that text into TextWrangler (the best editor for the Mac), chose “Reopen with other encoding -> ISO 8859-5” so that the characters were interpreted correctly, copied the result back to IDEA, did ascii-to-native and – ta-dah! – we had a Unicode version.
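The same round trip can be sketched in a few lines of Python, under the assumption that every \u00xx character in the source stands for one raw byte of the legacy encoding. The function name and the sample string are hypothetical; the encoding is a parameter because ISO 8859-5 is simply what we assumed the stemmer used.

```python
def reinterpret(text, source_encoding="iso-8859-5"):
    """Turn characters U+0000..U+00FF back into bytes, then decode them properly."""
    raw = text.encode("latin-1")        # one character becomes one byte
    return raw.decode(source_encoding)  # the bytes are now read as Cyrillic


mangled = "\u00d0\u00d1\u00d2"  # what the escapes look like when read as Latin-1
fixed = reinterpret(mangled)
print(fixed)  # абв
```

This only works for strings whose characters all lie below U+0100; anything else raises a UnicodeEncodeError in the first step.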

Encoding stuff always gives me a headache.

3 Comments »

  • Somehow we didn’t get the Unicode version working after all… Instead, we switched to the RussianStemmerFilter provided in the Lucene Analyzers.
    It does a pretty good job of stemming Russian words (in Unicode).

    Comment by Andrea Stubbe — February 24, 2009 @ 6:03 pm
  • Would this Python snippet satisfy your needs?

    # read the Unicode contents into a variable txt
    txt = u"... some unicode text ..."
    # convert the text to ISO 8859-5 encoded bytes
    out = txt.encode('iso-8859-5')
    # dump out to a file

    Or did I get the problem wrong here?
    Thanks for a reply.

    Comment by Shuja Parvez — May 25, 2009 @ 4:10 pm
  • In fact, that’s what TextWrangler does internally (or something very similar). But replacing the ISO-encoded characters with Unicode-encoded characters unfortunately didn’t help here. Thanks anyway!

    Comment by astro — May 25, 2009 @ 4:47 pm
