After finishing the work on stopwords and stemming, the only thing left to do was converting arbitrary input texts to the encoding Java uses. Strangely, there seems to be no library that takes care of this. Well, there is chardet, but it didn’t suit our demands. Fortunately, I faced the same problem some time ago and implemented a charset detector. It simply counts the occurrences of non-ASCII characters and makes an educated guess whether the input is UTF-8, windows-1250, ISO-8859-15 or x-MacRoman. Hours were spent evaluating code tables and finding out how each of the common special characters is encoded in the different charsets (and how this whole encoding stuff works).
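To give an idea of what such a counting approach can look like, here is a minimal sketch. It is not the detector described above: the class name `CharsetGuessSketch`, the byte ranges and the tie-breaking rule are assumptions made for illustration. It relies on two facts: ISO-8859-15 keeps its accented letters in 0xc0..0xff and has only control codes in 0x80..0x9f, while x-MacRoman puts many accented letters in 0x80..0x9f.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not the original detector.
final class CharsetGuessSketch {

    private CharsetGuessSketch() {
    }

    /** Returns true if the input decodes as strict UTF-8 without errors. */
    static boolean isValidUtf8(final byte[] input) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(input));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Returns the name of the guessed charset (assumed heuristic). */
    static String guess(final byte[] input) {
        int nonAscii = 0;
        int macLetters = 0;   // 0x80..0x9f: accented letters in x-MacRoman
        int latinLetters = 0; // 0xc0..0xff: mostly letters in ISO-8859-15
        for (final byte raw : input) {
            final int b = raw & 0xff; // Java bytes are signed; mask to 0..255
            if (b < 0x80) {
                continue;
            }
            nonAscii++;
            if (b <= 0x9f) {
                macLetters++;
            } else if (b >= 0xc0) {
                latinLetters++;
            }
        }
        if (nonAscii == 0) {
            return "US-ASCII";
        }
        if (isValidUtf8(input)) {
            return "UTF-8";
        }
        // Assumed tie-break: prefer ISO-8859-15 unless the Mac range dominates.
        return macLetters > latinLetters ? "x-MacRoman" : "ISO-8859-15";
    }
}
```

Calling `CharsetGuessSketch.guess(bytes)` returns a charset name that can be passed to `Charset.forName`. A real detector would need more candidates and more careful counting; two byte ranges are far too coarse for anything but a demonstration.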
So the only thing that needed to be done was looking up the bytes used to represent certain Cyrillic (windows-1251) letters in a hex editor and adding them to the guessing algorithm. To spare you the trouble (if you ever need this too), here’s the code snippet:

    /**
     * Returns true if b is one of the single-byte codes of the Cyrillic
     * letters used for the guess. Expects an unsigned value (0..255), so
     * mask Java's signed bytes first, e.g. isCyrillic(bytes[i] & 0xff).
     */
    private static boolean isCyrillic(final int b) {
        return b == 0xe6 || b == 0xe3 || b == 0xe4 || b == 0xed || b == 0xe9
            || b == 0xec || b == 0xe8 || b == 0xe2 || b == 0xef || b == 0xea
            || b == 0xe1 || b == 0xfb || b == 0xf2 || b == 0xf6 || b == 0xf4
            || b == 0xf7 || b == 0xcf || b == 0xc3 || b == 0xc6 || b == 0xc4
            || b == 0xca || b == 0xc8 || b == 0xcb || b == 0xc7 || b == 0xd7
            || b == 0xd4 || b == 0xdf;
    }
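
How such a predicate might be wired into a guess is sketched below. This is an illustration under assumptions, not the actual guessing algorithm: the class `CyrillicGuessSketch`, the 0.5 cut-off and the ISO-8859-1 fallback are all made up for the example. The byte values are the same as in the snippet, stored in a lookup table, and are read as windows-1251.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not the original detector.
final class CyrillicGuessSketch {

    /** Byte values from the isCyrillic snippet, as a lookup table. */
    private static final boolean[] CYRILLIC = new boolean[256];

    static {
        final int[] values = {
            0xe6, 0xe3, 0xe4, 0xed, 0xe9, 0xec, 0xe8, 0xe2, 0xef, 0xea,
            0xe1, 0xfb, 0xf2, 0xf6, 0xf4, 0xf7, 0xcf, 0xc3, 0xc6, 0xc4,
            0xca, 0xc8, 0xcb, 0xc7, 0xd7, 0xd4, 0xdf
        };
        for (final int v : values) {
            CYRILLIC[v] = true;
        }
    }

    /** Assumed cut-off, picked arbitrarily for this sketch. */
    private static final double MIN_HIT_RATIO = 0.5;

    private CyrillicGuessSketch() {
    }

    /** True if more than MIN_HIT_RATIO of the non-ASCII bytes are in the table. */
    static boolean looksCyrillic(final byte[] input) {
        int nonAscii = 0;
        int hits = 0;
        for (final byte raw : input) {
            final int b = raw & 0xff; // Java bytes are signed; mask to 0..255
            if (b >= 0x80) {
                nonAscii++;
                if (CYRILLIC[b]) {
                    hits++;
                }
            }
        }
        return nonAscii > 0 && hits > MIN_HIT_RATIO * nonAscii;
    }

    /** Decodes as windows-1251 if the guess says so, else as ISO-8859-1 (assumed fallback). */
    static String decode(final byte[] input) {
        final Charset charset = looksCyrillic(input)
                ? Charset.forName("windows-1251")
                : StandardCharsets.ISO_8859_1;
        return new String(input, charset);
    }
}
```

Because the table leaves out several Cyrillic letters, a fixed ratio like this can misjudge short inputs, so the cut-off would need tuning against real texts.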

Even though it was fun to write code like this, I was happy to go back to plain application development.
