in C++, Problems, Uncategorized

What language is that?

I spend some of my time reading liner notes of the music CDs that I buy. Like the Rosetta Stone, they are often written in three languages, although the languages are usually English, French, and German, rather than hieroglyphs, demotic script, and Ancient Greek. Once in a while, I open to the wrong language, and for a few seconds my eyes skim the German version of the text before it dawns on me that I am not looking at the English. 

Like most native speakers of a European language I can recognize a number of other European languages that I have no ability to read, write, or speak. How does this recognition work? Can it be put into a simple, not very sophisticated or difficult computer program?

Making an intelligent guess about the language in which a document is written is helpful in a number of situations, so how should we best go about it? One possibility involves the use of a dictionary. After a “white space parse,” we look up a few of the words and find out what dictionary contains them.

Unfortunately the dictionary based approach has several problems. First, dictionaries are large, arbitrarily large in fact, and to recognize each additional language you might need a good bit more storage than is convenient. Next, there is the issue of having to partially parse the source text, recognizing words and throwing out the punctuation, which necessitates a data structure in which to keep the parsed text.  Finally, there is the Sideshow Bob problem, which refers to the “Cape Feare” episode of The Simpsons in which the Bob character attempts to convince the court that his tattoo saying “die Bart die,” is a German translation of “the Bart the.” Never mind that Bart is masculine. Keeping track of gender would be an even more intensive use of dictionaries.

In polyGLOB, I opted for the approach of analyzing the relative frequency of the characters in the source text. For example, ‘e’ is the most common character in both English and German. It accounts for about 12% of English letters, but an overwhelming 17% of the letters in German. Other letters have similar differences in distribution: the ‘f’ is 2% of English but only 1% of Spanish.

Primitive substitution ciphers were the first attempts at character based cryptography. They were easily broken a long time ago by means of frequency analysis, and because of this history the distribution of letters in all the European languages is well known. Far from needing dictionaries of arbitrary complexity, we need only a table with one entry for each of the twenty-six letters in the Roman alphabet. Even with twenty or so languages, the number of tables and their finite size is quite manageable.

This approach is not perfect, and it has perils unique to its approach. The most obvious problem involves the use of diacritical marks in just about all the languages other than English. If we are counting e-s, we probably want to count e, é, è, and ê as belonging to the same bucket. Converting them requires a bit more work for the computer than the human because they do not “look alike” to the machine. This is different from E and e, which have a machine resemblance in the sense that the representation of the capital letter is 32 less than the corresponding lower case letter.

Another problem arises when the analysis incorporates an increasing number of languages with similar distributions. For example, using this method it is relatively easy to tell German from French from Italian from Spanish. However, once Portuguese is added to the mix, short Spanish texts may be mistaken for it.

Only in rare cases does the vocabulary itself cause a problem: an article about X-rays for zebras is unlikely to be correctly recognized.

The most effective method for settling the bulk of the disputes is the addition of information about shibboleths. For example, only French uses the ç, only German has the ß, and Spanish has no characters with any accent other than the acute, in other words: always á, never à.

In polyGLOB, I have incorporated a concept of “confidence,” measured on the interval 0 .. 1, into the analysis. Confidence is a composite statistic made in part by the length of the text to be analyzed (longer texts give better fits), and in part by the difference between the best fit and the second best fit.

I have experimentally determined that 500 “letters” [explanation: non-space, non-punctuation] is enough that adding another 500 is unlikely to make much difference. The reason for considering the difference between the best fitting language and the second best fit is more subtle. The concept of fit involves quite a bit of linear regression math, and not every text sample fits close enough to the list of languages that the choice is obvious (to the computer program). The better measurement is then whether the second choice is a close runner up. 

So, it can be done. In the near future we will have enough data to perhaps tweak and refine the algorithms, but for now the measurements correspond very closely to gut feel.