You will know, of course, that "Mae'n drwg gen i" means "I am sorry". If, that is, you speak Welsh. If you don't, you'd be hard put to guess; the phrases look quite different. And computers don't speak Welsh - or indeed anything at all. So how can they possibly translate from one language to another? The answer lies in abstraction. Abstract linguistic representations sit at the core of most machine translation (MT) systems. At present they operate in three phases: analysis, transfer and synthesis.
First, a program called the parser analyses the source sentence. It assigns grammatical categories to each word. Words are then grouped together into bigger semantic components. Think of the output of this phase as a tree. The leaves are the words, and the further you move from the leaves, the nearer you get to the root of the sentence's meaning.
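The tree the parser builds can be sketched in a few lines of Python. The labels and structure below are illustrative only, not the output of any real parser:

```python
# A toy parse tree for "Tom owns this car": each node is a tuple of
# a category label followed by its children; the leaves are the words.
tree = ("S",
        ("NP", ("N", "Tom")),                     # subject noun phrase
        ("VP",
         ("V", "owns"),
         ("NP", ("Det", "this"), ("N", "car"))))  # object noun phrase

def leaves(node):
    """Collect the words (the leaves) of the tree, left to right."""
    if isinstance(node, str):
        return [node]
    return [word for child in node[1:] for word in leaves(child)]

print(leaves(tree))  # ['Tom', 'owns', 'this', 'car']
```

Reading from the leaves towards the root "S" node retraces the analysis phase: words first, then phrases, then the sentence as a whole.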
The second stage, the transfer, is where all the hard work of translation takes place. The vital resource is a set of rules called the comparative grammar. This shows the abstract differences between the source and target languages. So, to go from the English "Tom owns this car" to the French "Cette voiture appartient à Tom", a grammar rule would be used to swap the sentence subjects ("Tom" in English, "voiture" in French) and the objects ("car", "Tom").
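A transfer rule of this kind can be mocked up in Python. The function name, the shape of the abstract representation, and the tiny built-in lexicon are all assumptions made for illustration:

```python
# A hypothetical transfer rule: English "X owns Y" maps to French
# "Y appartient à X", so the rule swaps the grammatical roles while
# a word-for-word lexicon handles the vocabulary.
def transfer_own_to_appartenir(source):
    """source is an abstract representation of an English sentence."""
    lexicon = {"own": "appartenir", "car": "voiture", "Tom": "Tom"}
    return {"verb": lexicon[source["verb"]],
            "subject": lexicon[source["object"]],           # "car" becomes the French subject
            "indirect_object": lexicon[source["subject"]]}  # "Tom" surfaces as "à Tom"

src = {"verb": "own", "subject": "Tom", "object": "car"}
print(transfer_own_to_appartenir(src))
# {'verb': 'appartenir', 'subject': 'voiture', 'indirect_object': 'Tom'}
```

Notice that the rule is specific to this one pair of languages and this one verb - which is exactly why transfer systems need so many of them.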
The final stage, synthesis, is like analysis but in reverse. The synthesiser turns the sentence representation into a full translation.
The transfer is the least elegant and weakest part of the chain. If you want to build a system that can translate between n languages you have to consider n(n-1) interrelationships. This is complex and time-consuming. Worse still, even when the system is built, every time you change any information about any of the languages, you have to think about the impact on all the other n-1 languages. Add just one word to your English dictionary and you have to write a rule that shows what that word translates to in each of the other n-1 languages.
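The arithmetic makes the point quickly - each language must be mapped to every other, in both directions:

```python
# Number of directed language pairs a transfer system must cover:
# each of n languages maps to each of the other n-1.
def transfer_pairs(n):
    return n * (n - 1)

for n in (2, 5, 10):
    print(n, "languages ->", transfer_pairs(n), "pairs")
# 2 languages -> 2 pairs
# 5 languages -> 20 pairs
# 10 languages -> 90 pairs
```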
All these problems have driven some researchers to look for ways to reduce - or skip - the transfer stage. The quest is for so-called interlinguas: sentence representations that are totally independent of language. In interlingual systems, translation occurs in only two phases. Analysis produces an interlingual form for the source sentence and this is then expressed by the synthesiser.
To get a better feel for the way interlinguas work, look at the most unwieldy component of a transfer system: the hundreds of thousands of rules that map the words of one language onto another. The interlingual solution is to throw those rules away and in their place develop a set of primitive concepts - building blocks of meaning - which are then used to define words.
Imagine we want to produce interlingual definitions for the words "boy", "girl", "man", "woman". We can do this using the concepts "human", "female" and "adult". "Girl" would be defined as [human+, female+, adult-], "man" as [human+, female-, adult+], et cetera.
To translate the word "boy" from English into, say, French, first we would use our English dictionary to find the interlingual form: boy [human+, female-, adult-]. Then, in the synthesis phase, we would find the word that had the same pattern of features in the French dictionary - garçon [human+, female-, adult-]. No explicit language-to-language rules are involved. So, if we add a new language - say, Welsh - we don't have to work out that "bachgen" maps to "boy" (in English) and to "garçon" (in French). We just define its meaning.
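The whole scheme fits in a few lines of Python. The feature set and dictionaries are the toy ones from the text; a real system would, of course, need far richer representations:

```python
# A toy interlingual lexicon: each word is defined by the same
# primitive features in every language, so translation is just a
# pattern match - no pairwise language-to-language rules needed.
# Feature order: (human, female, adult)
english = {"boy":   (True, False, False),
           "girl":  (True, True,  False),
           "man":   (True, False, True),
           "woman": (True, True,  True)}
french  = {"garçon": (True, False, False),
           "fille":  (True, True,  False),
           "homme":  (True, False, True),
           "femme":  (True, True,  True)}

def translate(word, source, target):
    """Find the target-language word with the same feature pattern."""
    pattern = source[word]
    for candidate, features in target.items():
        if features == pattern:
            return candidate

print(translate("boy", english, french))   # garçon
print(translate("femme", french, english)) # woman
```

Adding Welsh would mean writing one more dictionary of feature patterns - not two more sets of mapping rules.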
Another problem interlinguas can help with is tense. Compare French and English. The French present tense can be used to say things like "I eat", "I am eating", and "I have eaten" - things which English needs several tenses to express. A transfer system would require lots of complex rules to show which English tense is the equivalent of a French present tense verb form.
But step away from the idea of tenses and think in terms of time. S is the time when the sentence is spoken, E is the time of the event spoken about and R is a reference time. With these we can describe the temporal meaning of a sentence. Think about the sentence "By midday, everyone had left." Here, the event (E) occurred before the reference time (R), and before the sentence is spoken (S). Or, put in a form a computer could process, E<R, R<S.
Suddenly, we are liberated from having to work out the language-to-language tense mappings. Adding a new language to the system now involves the much simpler process of defining tenses in terms of these concepts.
So, we can differentiate between the simple past ("I ran") and the present perfect ("I have run") like this: ran: E=R, R<S; have run: E<R, R=S. As we now have a universal way of talking about time, translation involves just two phases - analysis of the temporal concepts, then realisation of these in the target language.
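These definitions are easy to turn into a lookup table. The first two entries come straight from the text; the third is an assumption extending the same scheme to the "had left" example above:

```python
# Toy English tense definitions in terms of event time (E),
# reference time (R) and speech time (S).
TENSES = {
    ("E=R", "R<S"): "simple past",      # "I ran"
    ("E<R", "R=S"): "present perfect",  # "I have run"
    ("E<R", "R<S"): "past perfect",     # "everyone had left" (assumed entry)
}

def english_tense(event_vs_reference, reference_vs_speech):
    """Map a pair of temporal relations to an English tense name."""
    return TENSES.get((event_vs_reference, reference_vs_speech), "unknown")

print(english_tense("E<R", "R<S"))  # past perfect
```

A French dictionary of tenses would map its own forms onto the same relations, and the system would pick whichever target form matched - no tense-to-tense rules required.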
But can it be that easy? It really is, once you have the primitive concepts for words, time or whatever. The biggest problem is that the way you and I see the world is greatly affected by the languages we speak. Where an Eskimo has many words for snow, I have but one. Choosing a set of shared concepts, then, is not trivial. For now, interlinguas will deal only with parts of the language that are conceptually simple and well-defined, such as technical domains.
MT systems have come a long way from their code-cracker origins five decades ago. The research is leading to viable automatic translation. Better still, our understanding of linguistics is showing that we are less divided than we thought. Let's talk.
Matt Jones is a researcher at the Computing Science Interaction Design Centre, Middlesex University, London.