I don’t know since when it has been possible, but ever since I started working here, all I have heard is ranting that Java files can’t be saved in UTF-8. However, all is fine if you tell your compiler:
-encoding utf-8
Now we have no more annoying \u0420\u0443\u0441\u0441\u043a\u0438\u0439 in our code, but a pretty Русский. Nice!
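A minimal sketch of what that buys you, assuming the file is saved as UTF-8 and compiled with `javac -encoding utf-8`. The class and constant names are made up for illustration. Both constants hold the same seven characters; only the source notation differs.

```java
// Hypothetical demo class; the names are not from our code base.
class EscapeDemo {
    // The old way: ASCII-only source using Unicode escapes.
    static final String ESCAPED = "\u0420\u0443\u0441\u0441\u043a\u0438\u0439";

    // The readable way: needs the compiler to read the file as UTF-8.
    static final String LITERAL = "Русский";

    public static void main(String[] args) {
        // Reports whether both notations produced the same string.
        System.out.println(LITERAL.equals(ESCAPED));
    }
}
```

If the compiler reads the file with a different encoding, the literal ends up as different characters and the comparison fails, which makes this a quick check of the build setup.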
Today I nearly went mad because I had to determine the encoding of some html files.
I tried various command line tools without any success until a colleague showed me that Firefox can
guess the encoding. In my version (3 under Ubuntu) the setting is located under “View” > “Character Encoding”.
As I didn’t find that on the web, maybe this post will save somebody out there the trouble I had.
While developing a tool for easy handling of translation processes, we had to take care of the right encoding of our property files: translation agencies need UTF-8, and Java needs ASCII with escaped Unicode sequences. That’s easy if we use the native2ascii Ant task, but it would be nicer if we could use an appropriate Java class. Apparently, though, there is only the executable tool and the Ant task. Does anybody know why?
Another strange thing is that the Java IO methods don’t support Unicode-escaped ASCII. But we don’t worry and just rely on the Properties.store method: apart from the encoding, it also handles escaping of key characters ([#!] for comments and [:=] as delimiters). Very comfortable. Except for one thing: the property files come out unsorted. As you can imagine, checking the changes made by the tool becomes a pain, because a standard diff can’t find the changes in such scrambled files. >> more…
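One possible way around the sorting problem, as a rough sketch and not necessarily what our tool does: let `Properties.store` do the escaping into a buffer, then sort the resulting lines before writing them out. The class and method names are invented for this example. It relies on `store` writing each entry on a single line and escaping a leading `#` in keys, so dropping lines that start with `#` only removes the header comments.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Properties;
import java.util.stream.Collectors;

// Hypothetical helper; the names are assumptions for illustration.
class SortedPropertiesWriter {

    // Returns the escaped entries, one per line, in sorted order.
    static String toSortedEscapedText(Properties props) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // store() writes ISO-8859-1 and escapes everything outside that range.
        // Passing null skips the custom header; a date comment is still written.
        props.store(out, null);
        String raw = new String(out.toByteArray(), StandardCharsets.ISO_8859_1);
        return Arrays.stream(raw.split("\\R"))
                .filter(line -> !line.isEmpty())
                .filter(line -> !line.startsWith("#")) // drop the comment lines
                .sorted()
                .collect(Collectors.joining("\n", "", "\n"));
    }

    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        props.setProperty("b.key", "\u0420\u0443\u0441\u0441\u043a\u0438\u0439");
        props.setProperty("a.key", "x");
        System.out.print(toSortedEscapedText(props));
    }
}
```

Sorting whole lines is cruder than sorting by key, but it gives a stable order, which is all a diff needs.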
After finishing the work on stopwords and stemming, the only thing left to do was the conversion from random input texts to the encoding Java uses. Strangely, there seems to be no library that takes care of this (well, there is chardet, but it didn’t suit our demands). Fortunately, I faced the same problem some time ago and implemented a charset detector. It plainly counts the occurrences of non-ASCII characters and makes an educated guess whether the input is UTF-8, windows-1250, ISO-8859-15 or x-MacRoman. Hours were spent evaluating code tables and finding out what encoding each of the common special characters has in the different charsets (and how this whole encoding stuff works).
So the only thing that needed to be done was looking up the bytes used to represent certain Cyrillic (windows-1251) letters in a hex editor and adding them to the guessing algorithm. To spare you the trouble (if you ever need this too), here’s the code snippet: >> more…
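This is not the snippet from the post, just a simplified sketch of the counting idea. It decodes the bytes with each candidate charset and counts how many of the resulting characters belong to a set of special characters the text is expected to contain. The class name, the method signature and the choice of German umlauts as the expected set are all assumptions made for this example. It uses escapes so that it compiles regardless of the compiler's encoding setting.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified guesser; not the detector described in the post.
class CharsetGuesser {

    // Returns the candidate whose decoding contains the most expected characters.
    // On a tie (for example pure ASCII input) the earliest candidate wins.
    static Charset guess(byte[] data, List<Charset> candidates, String expected) {
        Charset best = candidates.get(0);
        int bestScore = -1;
        for (Charset candidate : candidates) {
            // Malformed input is replaced, not rejected, so decoding never throws.
            String decoded = new String(data, candidate);
            int score = 0;
            for (int i = 0; i < decoded.length(); i++) {
                if (expected.indexOf(decoded.charAt(i)) >= 0) {
                    score++;
                }
            }
            if (score > bestScore) {
                best = candidate;
                bestScore = score;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Charset latin9 = Charset.forName("ISO-8859-15");
        String umlauts = "\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc\u00df";
        byte[] data = "Gr\u00f6\u00dfe".getBytes(latin9);
        List<Charset> candidates = Arrays.asList(StandardCharsets.UTF_8, latin9);
        System.out.println(guess(data, candidates, umlauts));
    }
}
```

Charsets that put the expected characters at the same byte positions cannot be told apart this way, so the order of the candidate list acts as the tie-breaker.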
As we want to introduce a cool Russian search in conjectPM, we tried out the Russian Snowball stemmer that comes with Lucene. It’s a rule-based stemmer that cuts and changes a number of characters from the input to produce some kind of word stem. This is used to prepare the indexed files as well as the queries, so that it doesn’t matter for the search whether words are plural or singular (the same goes for other kinds of inflection).
After some time of wondering why the output was always the same as the input, we found out that our input was Unicode, but the stemmer expected ISO 8859-5 (Cyrillic). But how to translate one encoding to the other? >> more…
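For plain byte data, the generic answer in Java is to decode with the source charset and encode again with the target charset. The sketch below shows only that general technique; how the stemmer actually wants its input is not covered here, and the class and method names are invented.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing generic re-encoding, not the stemmer fix itself.
class Transcoder {

    // Decodes the bytes with one charset and encodes the text with another.
    // Characters the target charset cannot represent are replaced, not reported.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        return new String(input, from).getBytes(to);
    }

    public static void main(String[] args) {
        Charset cyrillic = Charset.forName("ISO-8859-5");
        byte[] utf8 = "\u0420\u0443\u0441\u0441\u043a\u0438\u0439".getBytes(StandardCharsets.UTF_8);
        byte[] converted = transcode(utf8, StandardCharsets.UTF_8, cyrillic);
        // UTF-8 needs two bytes per Cyrillic letter, ISO 8859-5 only one.
        System.out.println(utf8.length + " bytes in, " + converted.length + " bytes out");
    }
}
```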
Today we had to learn the hard way that things in your favorite IDE aren’t always what they seem to be.
We created a new file in IntelliJ IDEA, typed in some Cyrillic text, which was shown correctly on the screen, and relied on the editor to save the file the same way as we saw it. But unfortunately it only saved a series of question marks.
As we found out, the file encoding (shown at the lower left corner of the window) was set to ASCII. And so, eventually, all non-ASCII characters were replaced by question marks. What remains is deep frustration, 500 lines of question marks and the big question: why the hell didn’t the editor show us that it couldn’t encode the typed-in characters?
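The same silent replacement is easy to reproduce in plain Java, and so is the check we would have liked the editor to make. A small sketch with made-up class and method names: `getBytes` replaces every character the charset cannot represent with a question mark, while `CharsetEncoder.canEncode` tells you beforehand whether that would happen.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo; the names are assumptions for illustration.
class EncodingCheck {

    // True if every character of the text can be represented in the charset.
    static boolean canSave(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    // Simulates saving and reloading the text with the given charset.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String text = "\u0420\u0443\u0441\u0441\u043a\u0438\u0439";
        System.out.println(canSave(text, StandardCharsets.US_ASCII));
        System.out.println(roundTrip(text, StandardCharsets.US_ASCII));
        System.out.println(canSave(text, StandardCharsets.UTF_8));
    }
}
```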