Statistical machine translation to enable universal communication?
Electronics and communications have always been extremely close bedfellows, but technology could be about to provide one of the most remarkable advances yet in communication, something not long ago confined to science fiction: translation for everyone.
If machine translation (MT) and speech recognition (SR) achieve what some expect, it will only be a few years before we can converse with virtually anyone on the planet. The famous Babel Fish from The Hitchhiker's Guide to the Galaxy will have arrived. At the heart of this possibility is a relatively new translation technique, statistical machine translation (SMT), combined with the vast amounts of translated text, or corpora, that the Web and the wider online revolution have made available in electronic form.
One of the leading users of SMT is Google, and Google Translate engineer Anton Andryeyev explains SMT's essence. "SMT generates translations based on patterns found in large amounts of text. To teach someone a language, you might start by teaching them a vocabulary of words and grammatical rules that determine how to construct sentences. A computer can learn a foreign language in the same way by referring to a vocabulary and a set of rules. But languages are complicated and there are exceptions to almost any rule.
When you try to capture in a program all the exceptions and exceptions to the exceptions, translation quality starts to break down." Instead of trying to teach the machine all the rules of a language, SMT effectively lets computers discover the rules for themselves. It works by analysing millions of documents that have already been translated by humans, coming from books, organisations like the United Nations and EU, and worldwide websites.
"Computers scan texts looking for statistically significant patterns – patterns between the translation and the original text that are unlikely to occur by chance," Andryeyev says. "Once the computer finds a pattern, it can use it to translate similar texts in the future. Repeat this process billions of times and it becomes a very smart computer program." Key to SMT are the translation dictionaries, patterns and rules that the program develops. It does this by creating many different possible rules based on the previously translated documents and then ranking them probabilistically.
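The idea of proposing many candidate rules and ranking them probabilistically can be sketched in a few lines. The example below is a minimal illustration in the style of the word alignment models developed at IBM (IBM Model 1), not a description of Google's system: the three sentence corpus, the function names and the iteration count are all assumptions made for the sake of the sketch.

```python
from collections import defaultdict

# Toy parallel corpus (an assumption for illustration): German source, English target.
pairs = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]

def train_word_translation(pairs, iterations=10):
    """Estimate P(target word | source word) by expectation-maximisation."""
    source_vocab = {s for src, _ in pairs for s in src}
    target_vocab = {t for _, tgt in pairs for t in tgt}
    # Start with every candidate translation equally likely.
    prob = {(t, s): 1.0 / len(target_vocab)
            for s in source_vocab for t in target_vocab}
    for _ in range(iterations):
        count = defaultdict(float)   # expected co-occurrence counts
        total = defaultdict(float)   # expected counts per source word
        for src, tgt in pairs:
            for t in tgt:
                # Share the credit for t among the source words in the sentence.
                norm = sum(prob[(t, s)] for s in src)
                for s in src:
                    share = prob[(t, s)] / norm
                    count[(t, s)] += share
                    total[s] += share
        # Re-estimate the probabilities from the expected counts.
        for (t, s) in prob:
            prob[(t, s)] = count[(t, s)] / total[s]
    return prob

def best_translations(prob, source_word, top=3):
    """Rank candidate translations of one source word by probability."""
    ranked = sorted(((p, t) for (t, s), p in prob.items() if s == source_word),
                    reverse=True)
    return [(t, round(p, 3)) for p, t in ranked[:top]]

model = train_word_translation(pairs)
for word in ("das", "haus", "buch", "ein"):
    print(word, best_translations(model, word))
```

No dictionary is supplied: the pairing of "haus" with "house" emerges only because that pairing explains the corpus better than the alternatives. Production systems apply the same principle to vastly larger corpora and to richer units than single words.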
Google admits this approach to translation inevitably depends on the amount of text available in particular languages and for particular pairings of languages. It also admits translation quality will be poorer for rarer languages. Even so, as the number of scanned samples grows over time, quality should improve. Franz Josef Och, a pioneer of SMT, is now the head of Google's MT group.
He claims that creating an SMT system for a new pair of languages requires a bilingual text corpus with more than 1 million words and two monolingual corpora, each having more than 1 billion words. To obtain this, Google used documents from the UN, which publishes documents in all six official UN languages, creating a huge corpus.
Translation, a feature of Google since 2006, is now available in the form of a Toolbar, a browser add on for Internet Explorer and Firefox that automatically detects web pages written in foreign languages and translates the text of those pages into a user's native language. Last year, the company introduced Google Translate as a free downloadable application for Android OS users, which works like the browser version.
But Google's latest plan is its most ambitious yet – to move from text or voice translation to on the fly, two way conversation. This is particularly difficult because it demands not only very fast and accurate translation, but also excellent speech recognition, and achieving the latter has proved an extremely difficult nut to crack for the best part of two decades. Even so, a Conversation Mode of Google's Translate app is now in experimental mode, enabling English speaking Android smartphone owners to speak in English and have Spanish speech output.
Spanish speakers can reply and have it translated back into English. A mobile data connection is needed – Google's huge computing resources are doing the donkey work – but it says results should be virtually instantaneous. Other languages are in the pipeline. Indeed, Google first demonstrated the Conversation Mode last September using English and German but currently speech to speech conversion is supported only from English into German, French and Italian. Going from the three languages back to English has yet to become available.
Despite these limitations, does the new app suggest we will soon be throwing away our foreign dictionaries? Some translation experts are not so sure. "Things changed around 15 years ago, from a previous concentration on rule based, linguistic methods, looking at the syntax, morphology (patterns of word formation) and so on, to the statistical methods, which now dominate," says John Hutchins, who is on the committee of the European Association of Machine Translation and a former President. "Yet for a long time, statistical methods were not necessarily much better.
"It's cheaper to set up a statistical method than a rule based one, which is one of its attractions," Hutchins says. "Also, the availability of large bilingual corpora (collections of words and phrases) has grown hugely with the emergence of the Internet." Hutchins says the best MT performance today is probably around 80% as good as the best human would achieve, which explains why it is without doubt useful, particularly as a productivity tool for companies that have to translate millions of words a year.
Hutchins himself is doubtful that SMT will go much beyond 80% accuracy, but acknowledges that many statistical researchers disagree. One who does is Daniel Marcu, who codeveloped an SMT program in 2002 called Language Weaver, which was bought last year by the UK global information management company SDL. SDL's customers include Adobe, Dell, Intel, Siemens and TripAdvisor, which uses Language Weaver to provide instant translations of its hotel and restaurant reviews.
Marcu, now SDL's chief technology officer, compares the potential of SMT today with the semiconductor industry of 30 years ago and believes it can make the same kind of dramatic progress, given enough time and money. "In the 1980s, people would say to semiconductor makers 'we could play amazing games on computers, when are you going to give us the processors and memory we need?'.
Since 2000, SMT has seen steady, incremental progress and it has clearly got better. That is the result of innovations happening on virtually a daily basis, in many areas: better statistical modelling of linguistic phenomena, models more grounded in linguistic knowledge than previously and faster learning and search algorithms. "This is the coming of age of SMT. SMT was pioneered at IBM in the 1990s, before the invention of the Web.
Then, the world was not ready for building machine translation systems for, say, 50 languages, or accessing huge amounts of data online, it just wasn't there. Now electronic data is everywhere." Marcu acknowledges there will probably be limits to MT – excellent machine translations of poetry or novels are not on the horizon – and humans translate in a radically different way to SMT because they can understand semantic content, or meaning.
But he still believes SMT will continue improving. "I see no sign of any plateau. Even though SMT works differently to humans, that is no reason why it cannot match them in certain areas – we have built machines that fly, but not by copying how birds do it. Do we understand the cognitive processes that human translators use? No. But it does not matter if we can build machines that get the job done."
Today, the most sophisticated SMT systems need more computing power than current mobile devices can offer, but Marcu believes running them on a handset could be possible within as little as five years. "The most successful applications today are in the cloud, but I don't think it will have to stay that way." Another SMT development is, ironically, a move towards the use of a more rule based, linguistic oriented approach – the kind of technique with which SMT is often contrasted.
"It is a mistake to assume that statistical techniques cannot involve any kind of linguistics," Marcu says. "It is just that the field evolved with people being most concerned about pragmatic issues like scalability and speed, and the first systems built were just word for word, they had no linguistic knowledge whatsoever."
Since then, SMT systems have begun to incorporate linguistic knowledge relating to phrases, for example, and syntax. "We have nothing against linguistic techniques and it is perfectly possible to apply statistical techniques on to linguistic representations. We just want to embrace them without giving up any of the scalability or learning capabilities that we have in the current framework.
SMT is extremely pragmatic – it will use anything to improve translation quality. "So, the first generation of translation engines that Language Weaver commercialised were statistical phrase based systems – the unit of reasoning was a phrase. Since then, we have gradually developed them so that more linguistic expertise is incorporated – for example, in the form of lots of knowledge about syntax – so they handle bigger chunks of text, such as a noun phrase or prepositional phrase."
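What it means for the unit of reasoning to be a phrase can be shown with a toy decoder. This is a hedged sketch only and not Language Weaver's implementation: the phrase table, its probabilities and the greedy longest match strategy are invented for illustration, and real phrase based decoders also weigh a language model and word reordering.

```python
# Hypothetical phrase table: source phrase -> candidate translations with
# made-up probabilities. A real table would be learned from a parallel corpus.
PHRASE_TABLE = {
    ("guten", "morgen"): [("good morning", 0.9), ("good day", 0.1)],
    ("guten",): [("good", 0.8)],
    ("morgen",): [("tomorrow", 0.6), ("morning", 0.4)],
    ("wir", "sehen", "uns"): [("see you", 0.7), ("we see us", 0.05)],
    ("wir",): [("we", 0.9)],
    ("sehen",): [("see", 0.9)],
    ("uns",): [("us", 0.9)],
}

def translate(sentence, table=PHRASE_TABLE, max_len=3):
    """Greedy decoder: always consume the longest phrase the table knows."""
    words = sentence.lower().split()
    output = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(words) - i), 0, -1):
            phrase = tuple(words[i:i + n])
            if phrase in table:
                best, _ = max(table[phrase], key=lambda option: option[1])
                output.append(best)
                i += n
                break
        else:
            # No entry at any length: copy the unknown word through unchanged.
            output.append(words[i])
            i += 1
    return " ".join(output)

print(translate("Guten Morgen"))
print(translate("wir sehen uns morgen"))
```

Translated word for word with the same table, "guten morgen" would come out as "good tomorrow", because "tomorrow" is the likelier rendering of "morgen" in isolation. Treating the two words as one phrase avoids the error, which is the practical advantage over the earliest word for word systems.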
There is no question the use of MT is set to grow enormously over the next few years. "Only a small percentage of content is translated today," says Mark Lancaster, SDL's chairman and CEO. (Only 1% of translation is currently automated, according to IDC.) "The digital universe is set to rise tenfold in the next five years and this expansion will be across the globe. Internet users are significantly more likely to read and react to content in their own language.
However, there are simply not enough translators in the world to translate the text we need in local languages at the speed and quantities required. MT will become an integral part of companies' content creation and management strategy. Within the next five years, we expect more than 30% of all translated content to use MT." And what about the arrival of the Babel Fish? "For certain scenarios, it's virtually here," Marcu says. "With no noise to disrupt the SR and with a limited domain of discourse, like say the travel industry, systems do a pretty good job using an iPad or Android."
As Marcu says, the true Babel Fish application is a totally open requirement, with no constraints, which is not what will happen first. "Speech to speech translation is like the automotive industry at its very beginning. There are lots of ideas bubbling to the surface, but we are still waiting for someone to figure out which market segment has a big need that can be met."