We recently closed our offshore development office in St. Petersburg, Russia.
We had a great team there but weren’t able to integrate them well enough into the processes here in Munich. As we had no plans to grow the office in SPB significantly and we only had four guys there, we were stuck in the middle. The team was too small to be self-sustaining, especially with regard to offering everybody a clear track for personal development.
So after almost a year of agonizing and sometimes just deferring the issue, at the start of this year we decided, together with the team, to close the office. Everybody had almost nine months of warning to look for a new job. In addition, we created an attractive termination package to make sure that all the young families that had sprung up around our team there were provided for. >> more…
For a new feature recently added to conjectPM we needed to extract information like original sender, send date and subject from the “header” that is generated by mail clients for forwarded emails as shown below.
From: Susan Sunshine
Sent: Friday, November 04, 2009 6:13 PM
To: Daisy Daffodil
Subject: Flowers
And because conjectPM supports 18 languages, we needed to make this work with all of them. Therefore Maria compiled a list of all keywords found in Gmail, Outlook and Apple Mail. The basic idea is simple: write a regex that (for each language) matches the part before the colon, e.g. ^(Sent|Date): (.*)$ and take the value captured by the second pair of parentheses to analyze it further. But as we are dealing with the real world here, which has lots of exceptions, the task turned out to be trickier than we thought, especially for dates. Luckily conject employs people from many different countries, so we had native speakers for almost all languages at hand. (Is there anyone from Japan looking for a job..?) >> more…
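The basic idea can be sketched in a few lines of Java. This is only a minimal illustration, not the code used in conjectPM: the class name ForwardHeaderParser is made up, and the keyword list covers just English and German as an assumption (the full 18-language list is not shown here). Dates are returned as raw strings, because parsing them is exactly the tricky part.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class ForwardHeaderParser {

    // Assumed keyword list (English and German only); a real list would
    // contain the keywords of every supported language and mail client.
    private static final Pattern HEADER_LINE = Pattern.compile(
            "^(From|Von|Sent|Date|Gesendet|Datum|To|An|Subject|Betreff): (.*)$",
            Pattern.MULTILINE);

    /** Returns keyword -> value for every header line found, in order of appearance. */
    static Map<String, String> parse(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = HEADER_LINE.matcher(text);
        while (m.find()) {
            // Keep the first occurrence of each keyword only.
            fields.putIfAbsent(m.group(1), m.group(2).trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        String header = "From: Susan Sunshine\n"
                + "Sent: Friday, November 04, 2009 6:13 PM\n"
                + "To: Daisy Daffodil\n"
                + "Subject: Flowers";
        System.out.println(parse(header));
    }
}
```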
I don’t even know how long this has been possible, but ever since I started working here, all I ever heard was ranting that Java files can’t be saved in UTF-8. However, all is fine if you tell your compiler:
-encoding utf-8
Now we have no more annoying \u0420\u0443\u0441\u0441\u043a\u0438\u0439 in our code, but a pretty Русский. Nice!

While importing the translation files I got this message and was totally confused…
My files have been modified, and in the future at that??
The solution to the mystery:
Our translation agency is based in the United Arab Emirates, which is 2 hours ahead of our time zone, so the files they deliver come from the future.
While improving our translation process we ran into some inconsistencies in our Arabic properties files. Below is a small example (1). Suddenly the braces in the Unicode-escaped files were no longer balanced (2), while in the editor everything seemed to be OK (3).
(1)
message.title = Message {0}:
(2)
message.title = :{\u0627\u0644\u0631\u0633\u0627\u0644\u0629 {0
(3)
message.title = :{الرسالة {0
How could this happen? There’s no voodoo going on, just the bidi (bidirectional text) algorithm operating in the dark and someone editing the file without knowing about this algorithm. >> more…
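Because an editor shows the text in display order, damage like this is easy to miss by eye. A simple automated check on the stored (logical) character order could catch it. The following is a minimal sketch with a made-up class name, PlaceholderCheck; it only counts braces and ignores the apostrophe quoting rules of MessageFormat, so treat it as a rough sanity check rather than a full validator.

```java
// Hypothetical helper, for illustration only.
final class PlaceholderCheck {

    /** Returns true if every '{' is closed by a later '}' and no '}' comes too early. */
    static boolean bracesBalanced(String value) {
        int depth = 0;
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth < 0) {
                    return false; // closing brace without an opening one
                }
            }
        }
        return depth == 0;
    }

    public static void main(String[] args) {
        String english = "Message {0}:";
        // The broken Arabic value from the example, in stored (logical) order.
        String broken = ":{\u0627\u0644\u0631\u0633\u0627\u0644\u0629 {0";
        System.out.println(bracesBalanced(english));
        System.out.println(bracesBalanced(broken));
    }
}
```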
Not everybody understands English, and in any case native speakers find it much easier to work in their own language. conjectPM is available in 18 languages, but administering them wasn’t easy, which left our users not fully satisfied:

- In some places we had a mix of native language and English
- Some translations didn’t match the context anymore
- Wrong country-specific mail addresses and telephone numbers showed up
- Special characters were displayed incorrectly
- New functionalities weren’t translated until weeks later… >> more…
While developing a tool for easy handling of translation processes, we had to take care of the right encoding of our property files: translation agencies need UTF-8 and Java needs ASCII with escaped Unicode sequences. That’s easy if we use the native2ascii ant task, but it would be nicer if we could use an appropriate Java class. Apparently there’s only the exe tool and the ant task, though. Does anybody know why?
Another strange thing is that the Java IO methods don’t support Unicode-escaped ASCII. But we don’t worry and just rely on the Properties.store method: apart from encoding, it also handles escaping of key characters ([#!] for comments and [:=] as delimiters). Very comfortable. Except for one thing: the property files will be unsorted. As you can imagine, checking the changes made by the tool becomes a pain, because a standard diff can’t find the changes in such scrambled files. >> more…
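One possible workaround for the ordering problem, sketched here under the assumption that losing the date comment is acceptable: let Properties.store do the escaping into a buffer, then sort the resulting lines before writing them out. The class name SortedPropertiesWriter is made up for this illustration; this is not necessarily the solution the tool ended up using.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Properties;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only.
final class SortedPropertiesWriter {

    /** Returns the escaped key=value lines of props, sorted alphabetically. */
    static String storeSorted(Properties props) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            // store(OutputStream, ...) writes ISO-8859-1 and escapes everything
            // else as \\uXXXX; each entry ends up on exactly one line.
            props.store(out, null);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String raw = new String(out.toByteArray(), StandardCharsets.ISO_8859_1);
        return Arrays.stream(raw.split("\\R"))
                // Drop the date comment; keys starting with '#' are escaped
                // by store(), so real entries never start with '#'.
                .filter(line -> !line.isEmpty() && !line.startsWith("#"))
                .sorted()
                .collect(Collectors.joining("\n", "", "\n"));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("b.key", "second");
        props.setProperty("a.key", "first");
        System.out.print(storeSorted(props));
    }
}
```

Sorting the already escaped lines keeps the sketch short; the order is that of the escaped keys, which is stable between runs and therefore good enough for a diff.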
After finishing the work on stopwords and stemming, the only thing left to do was the conversion from random input texts to the encoding Java uses. Strangely there seems to be no library that takes care of this (well, there is chardet, but it didn’t suit our demands). Fortunately I faced the same problem some time ago and implemented a charset detector. It simply counts the occurrences of non-ASCII characters and makes an educated guess whether the input is UTF-8, windows-1250, ISO-8859-15 or x-MacRoman. Hours were spent evaluating code tables and finding out what encoding each of the common special characters has in the different charsets (and how this whole encoding stuff works).
So the only thing that needed to be done was looking up the bytes used to represent certain Cyrillic (windows-1251) letters in a hex editor and adding them to the guessing algorithm. To spare you the trouble (if you ever need this too), here’s the code snippet: >> more…
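The original snippet sits behind the link above and is not reproduced here. As a rough illustration of the counting idea only, here is a much simplified sketch. The class name CharsetGuesser and the heuristic are assumptions: input that decodes as strict UTF-8 is taken as UTF-8, and otherwise the share of high bytes decides between windows-1251 (Cyrillic text consists mostly of bytes in 0xC0 to 0xFF) and ISO-8859-15 (Western text is mostly ASCII with a few accented letters).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical, heavily simplified detector; for illustration only.
final class CharsetGuesser {

    static Charset guess(byte[] data) {
        if (isValidUtf8(data)) {
            return StandardCharsets.UTF_8;
        }
        int highBytes = 0;
        int asciiLetters = 0;
        for (byte b : data) {
            int v = b & 0xFF;
            if (v >= 0xC0) {
                highBytes++; // letter range in windows-1251 and ISO-8859-15
            } else if ((v >= 'a' && v <= 'z') || (v >= 'A' && v <= 'Z')) {
                asciiLetters++;
            }
        }
        // Assumed rule of thumb: mostly high bytes means Cyrillic text.
        return highBytes > asciiLetters
                ? Charset.forName("windows-1251")
                : Charset.forName("ISO-8859-15");
    }

    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A call like CharsetGuesser.guess(bytes) returns the guessed Charset, which can then be passed to new String(bytes, charset).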
One of the ingredients for a good (and fast) search is the stopwords list. It contains all words that appear so often in a language that their relevance for search is almost zero. Take the word “and”, for example: if a search listed all texts that contain “and”, the list of results would be enormous! Of course this doesn’t harm the result quality of a good search algorithm, but it does hurt its performance and the size of the index. And that’s why we wanted a stopwords list for the new Russian search.
We invited Olha Biletska and Tetyana Dekola, two students I met at an alumni meeting recently, to help us with this task. After losing everything because of this great IDEA bug we finally came up with a list. It’s bigger than the German or English one because the Russian language has six cases instead of four.
Download (UTF-8)
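To make the idea concrete, here is a minimal sketch of how a stopwords list is applied before indexing or querying. The class name StopwordFilter and the tiny English word list in the usage are purely illustrative; in a Lucene-based search the analyzer would normally do this step.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only.
final class StopwordFilter {

    /** Lower-cases the text, splits it on whitespace and drops all stopwords. */
    static List<String> removeStopwords(String text, Set<String> stopwords) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("\\s+"))
                .filter(token -> !token.isEmpty() && !stopwords.contains(token))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Illustrative list; a real one has hundreds of entries per language.
        Set<String> stopwords = new HashSet<>(Arrays.asList("and", "of", "the"));
        System.out.println(removeStopwords("Plans and drawings of the site", stopwords));
    }
}
```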
As we want to introduce a cool Russian search in conjectPM, we tried out the Russian snowball stemmer that comes with Lucene. It’s a rule-based stemmer that cuts and changes a number of characters from the input to produce some kind of word stem. This is used to prepare the indexed files as well as the queries, so that it doesn’t matter for the search whether words are plural or singular (the same goes for other kinds of inflection).
After wondering for a while why the output was always the same as the input, we found out that our input was Unicode, but the stemmer expected ISO 8859-5 (Cyrillic). But how to translate one encoding to the other? >> more…
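For reference, the generic way to move text between encodings in plain Java is to go through a String, which is always Unicode internally. The sketch below uses a made-up class name, CyrillicTranscoder, and is not necessarily what the rest of the post describes; how the result has to be handed to the stemmer depends on the Lucene version.

```java
import java.nio.charset.Charset;

// Hypothetical helper, for illustration only.
final class CyrillicTranscoder {

    private static final Charset ISO_8859_5 = Charset.forName("ISO-8859-5");

    /** Encodes a Unicode string as ISO 8859-5 bytes; unmappable characters become '?'. */
    static byte[] toIso88595(String text) {
        return text.getBytes(ISO_8859_5);
    }

    /** Decodes ISO 8859-5 bytes back into a Unicode string. */
    static String fromIso88595(byte[] bytes) {
        return new String(bytes, ISO_8859_5);
    }

    public static void main(String[] args) {
        String word = "\u0420\u0443\u0441\u0441\u043a\u0438\u0439"; // "Russian" in Cyrillic
        byte[] encoded = toIso88595(word);
        System.out.println(encoded.length + " bytes, round trip ok: "
                + word.equals(fromIso88595(encoded)));
    }
}
```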