After finishing the work on stopwords and stemming, the only thing left to do was the conversion from arbitrary input texts to the encoding Java uses. Strangely, there seems to be no library that takes care of this. Well, there is chardet, but it didn’t meet our needs. Fortunately I faced the same problem some time ago and implemented a charset detector. It simply counts the occurrences of non-ASCII characters and makes an educated guess whether the input is UTF-8, windows-1250, ISO-8859-15 or x-MacRoman. Hours were spent evaluating code tables and finding out how each of the common special characters is encoded in the different charsets (and how this whole encoding stuff works).
So the only thing that needed to be done was looking up the bytes used to represent certain Cyrillic (windows-1251) letters in a hex editor and adding them to the guessing algorithm. To spare you the trouble (if you ever need this too), here’s the code snippet: >> more…
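
The counting idea described above can be illustrated with a minimal sketch. This is not the original detector: the class name `CharsetGuesser`, the reduced candidate list (UTF-8, windows-1251, ISO-8859-15) and the "more Cyrillic-range bytes than ASCII letters" rule are assumptions made for illustration only. It relies on the fact that windows-1251 stores the letters А to я in the byte range 0xC0 to 0xFF.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not the detector from the post.
final class CharsetGuesser {

    /** Returns true if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Guesses a charset name by counting bytes; a heuristic, so it can be wrong. */
    static String guess(byte[] data) {
        if (isValidUtf8(data)) {
            return "UTF-8"; // pure ASCII input also ends up here
        }
        int asciiLetters = 0;
        int cyrillicRange = 0;
        for (byte b : data) {
            int v = b & 0xFF;
            if ((v >= 'A' && v <= 'Z') || (v >= 'a' && v <= 'z')) {
                asciiLetters++;
            } else if (v >= 0xC0) {
                cyrillicRange++; // 0xC0..0xFF are the letters А..я in windows-1251
            }
        }
        // Assumption: Russian text is dominated by high bytes, while Western
        // European text is mostly ASCII letters with a few accented ones.
        return cyrillicRange > asciiLetters ? "windows-1251" : "ISO-8859-15";
    }

    public static void main(String[] args) {
        // "Привет" written with Unicode escapes to stay independent of the source encoding
        byte[] russian = "\u041F\u0440\u0438\u0432\u0435\u0442"
                .getBytes(Charset.forName("windows-1251"));
        System.out.println(guess(russian));
    }
}
```

Once a charset name has been guessed, `new String(data, Charset.forName(name))` turns the bytes into a Java string. A real detector would need more candidates and finer-grained counts than this two-way split.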

As we want to introduce a cool Russian search in conjectPM, we tried out the Russian Snowball stemmer that comes with Lucene. It’s a rule-based stemmer that cuts off and changes a number of characters of the input to produce some kind of word stem. This is used to prepare the indexed files as well as the queries, so that it doesn’t matter for the search whether words are plural or singular (the same goes for other kinds of inflection).

After some time of wondering why the output was always the same as the input, we found out that our input was Unicode, but the stemmer expected ISO 8859-5 (Cyrillic). But how to translate one encoding to the other? >> more…
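
For readers who just want to see what such a translation can look like in plain Java, here is a minimal sketch using only the standard `java.nio.charset` API. It is not necessarily the approach taken in the full post; the class name `Iso88595Transcoder` and its method names are made up for illustration, and how the resulting bytes have to be handed to the stemmer depends on the Lucene version in use.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration only.
final class Iso88595Transcoder {

    private static final Charset ISO_8859_5 = Charset.forName("ISO-8859-5");

    /** Encodes a Java (Unicode) string as ISO 8859-5 bytes. */
    static byte[] toIso88595(String text) {
        // Characters that ISO 8859-5 cannot represent become '?'
        return text.getBytes(ISO_8859_5);
    }

    /** Decodes ISO 8859-5 bytes back into a Java (Unicode) string. */
    static String fromIso88595(byte[] bytes) {
        return new String(bytes, ISO_8859_5);
    }

    public static void main(String[] args) {
        // "привет" written with Unicode escapes
        String word = "\u043F\u0440\u0438\u0432\u0435\u0442";
        byte[] encoded = toIso88595(word);
        System.out.println(encoded.length + " bytes, round trip ok: "
                + word.equals(fromIso88595(encoded)));
    }
}
```

Each Cyrillic letter occupies exactly one byte in ISO 8859-5, so the byte array has the same length as the string.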

Today we had to learn the hard way that things in your favorite IDE aren’t always what they seem to be.

We created a new file in IntelliJ IDEA, typed in some Cyrillic text, which was shown correctly on the screen, and relied on the editor to save the file the same way as we saw it. But unfortunately it only saved a series of question marks.

As we found out, the file encoding (shown at the lower left corner of the window) was set to ASCII. And so, on saving, all non-ASCII characters were replaced by question marks. What remains is deep frustration, 500 lines of question marks and the big question: Why the hell didn’t the editor show us that it couldn’t encode the typed-in characters?
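
The same effect is easy to reproduce in plain Java, which may help to spot the problem earlier. The following is a small sketch, not what the IDE does internally; the class name `AsciiLossDemo` and its method names are made up for illustration. `String.getBytes` silently substitutes `?` for every character the target charset cannot represent, while `CharsetEncoder.canEncode` lets you check beforehand.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
final class AsciiLossDemo {

    /** Simulates saving text as ASCII and reading it back. */
    static String saveAsAscii(String text) {
        // Unmappable characters are silently replaced by '?'
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    /** Returns true if the text survives an ASCII save unchanged. */
    static boolean fitsInAscii(String text) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // "привет" written with Unicode escapes
        String cyrillic = "\u043F\u0440\u0438\u0432\u0435\u0442";
        System.out.println(saveAsAscii(cyrillic));
        System.out.println(fitsInAscii(cyrillic));
    }
}
```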