One of the ingredients of a good (and fast) search is the stopwords list. It contains all the words that appear so often in a language that their relevance for search is almost zero. Take the word “and”: if a search listed every text that contains “and”, the list of results would be enormous! Of course this won’t harm the quality of a good search algorithm, but it does hurt its performance and the size of the index. And that’s why we wanted a stopwords list for the new Russian search.
We invited Olha Biletska and Tetyana Dekola, two students I met at an alumni meeting recently, to help us with this task. After losing everything because of this great IDEA bug, we finally came up with a list. It’s bigger than the German or English one because the Russian language has six cases instead of four.
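To make the idea concrete, here is a minimal sketch in Python of how a stopwords list is typically applied before indexing. The five words in `STOPWORDS` are only an illustrative sample, not the actual list, and `remove_stopwords` is a hypothetical helper name, not part of Lucene.

```python
# Illustrative sample only; a real stopwords list is much longer.
STOPWORDS = frozenset({"и", "в", "на", "не", "что"})


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Return the tokens that are not stopwords (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stopwords]


tokens = "Кошка и собака не спят на диване".split()
print(remove_stopwords(tokens))
```

Only the tokens that survive this filter would end up in the index, which is where the savings in index size and query time come from.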
Download (UTF-8)
As we want to introduce a cool Russian search in conjectPM, we tried out the Russian Snowball stemmer that comes with Lucene. It’s a rule-based stemmer that cuts off or changes a number of characters in the input to produce some kind of word stem. It is applied to the indexed files as well as to the queries, so that it doesn’t matter for the search whether words are plural or singular (the same goes for other kinds of inflection).
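The principle behind a rule-based stemmer can be sketched in a few lines of Python. This is a toy suffix stripper, not the Snowball algorithm and not Lucene's implementation: the suffix list, the `min_stem` threshold, and the name `toy_stem` are all assumptions chosen for illustration.

```python
# A handful of Russian noun endings, longer ones first (illustrative only).
SUFFIXES = ("ами", "ями", "ов", "ам", "ах", "ы", "и", "а", "у", "е")


def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


# Singular, plural and inflected forms collapse to the same stem.
for form in ("стол", "столы", "столов", "столами"):
    print(form, "->", toy_stem(form))
```

Because the same function runs over documents at index time and over the query at search time, a query for one form matches texts that contain any of the others.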
After some time of wondering why the output was always the same as the input, we found out that our input was Unicode, but the stemmer expected ISO 8859-5 (Cyrillic). But how to translate one encoding to the other?
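As a rough illustration of what such a translation involves (shown here in Python, and not necessarily the approach we took with Lucene), a Unicode string can be encoded to ISO 8859-5 bytes and decoded back with the standard codecs. The sample word is arbitrary.

```python
text = "поиск"  # a Unicode string

# Unicode -> ISO 8859-5: every Cyrillic letter becomes exactly one byte.
raw = text.encode("iso8859_5")
print(len(raw))

# The same text in UTF-8 needs two bytes per Cyrillic letter.
print(len(text.encode("utf-8")))

# ISO 8859-5 -> Unicode: decoding restores the original string.
restored = raw.decode("iso8859_5")
print(restored == text)
```

Characters that have no place in ISO 8859-5 raise a `UnicodeEncodeError` during encoding, so mixed-language input needs an explicit error strategy such as `errors="replace"`.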