The first in a series on translation services.

The machines created this mess; let them clean it up.

On the one hand, enterprises need to make ever more content available in multiple languages. As I noted in my last post on translation, the drivers include the flood of content generated online (much of it created by consumers), the growing importance of business in emerging markets, and the desire to enable global collaboration among employees. On the other hand, advances in machine translation and new approaches such as crowdsourcing are making translation ever faster and less expensive. This is no fortunate coincidence: The very computing dynamics that enabled the Web and especially Web 2.0 — rapid increases in processor speed, cheap storage, and high-speed networks, combined with social technologies — also empower the latest technology-based solutions to translation and localization. 

What it means (WIM): Computers have allowed us to create a problem that only computers can help solve.

This is the first of an irregular series of blog posts on how technical advances, new solution paradigms, and evolving client needs are changing translation services and translation service providers (TSPs). I'll begin by offering a select glossary of some of the unfamiliar terms end users encounter when they begin to investigate translation services.

MT: Machine Translation, which simply means the use of computing technologies and software to assist with the translation of content (usually text, but voice recognition is of growing importance) from one language ("the source") to another ("the target"). Machine translation takes two primary forms, namely:

RbMT: Rule-based Machine Translation involves training the machine with translation dictionaries (source-target language pairs) and linguistic rules about how sentences are formed and meaning is constructed in each of the languages. The machine breaks down the source content into its constituent elements. In the most prevalent form of RbMT, the result is a "syntactic tree." The tree is then "repopulated" with the target language according to the language pairs and the rules. In RbMT, the machine effectively mimics a human translator.
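To make the idea concrete, here is a toy sketch of the rule-based approach in Python. The mini-lexicon and the single reordering rule are hypothetical illustrations; a real RbMT system uses full morphological analyzers, parsers, and thousands of rules.

```python
# Toy sketch of rule-based MT (illustrative only; real RbMT systems
# use full parsers and far richer rule sets).

# Bilingual dictionary: English -> Spanish (hypothetical mini-lexicon).
LEXICON = {"the": "el", "red": "rojo", "car": "coche"}

# One linguistic rule: English places adjectives before nouns,
# Spanish places them after (ADJ NOUN -> NOUN ADJ).
ADJECTIVES = {"red"}
NOUNS = {"car"}

def translate(sentence):
    tokens = sentence.lower().split()
    # Step 1: apply the reordering rule to the source-side structure.
    reordered = []
    i = 0
    while i < len(tokens):
        if (i + 1 < len(tokens)
                and tokens[i] in ADJECTIVES and tokens[i + 1] in NOUNS):
            reordered += [tokens[i + 1], tokens[i]]  # swap ADJ NOUN
            i += 2
        else:
            reordered.append(tokens[i])
            i += 1
    # Step 2: "repopulate" with target-language words from the lexicon.
    return " ".join(LEXICON.get(t, t) for t in reordered)

print(translate("the red car"))  # -> el coche rojo
```

Even this three-word example shows the two steps the entry describes: break down the source by rule, then repopulate from the language pairs.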

SMT: Statistical Machine Translation systems are trained with a large corpus of existing translated material. (United Nations proceedings and the records of the European Parliament are popular sources). The machine proceeds to isolate words or phrases in the source content, locates multiple instances in which this content (or something similar) has been translated in the target language corpus, and selects the "best" — that is, the statistically most probable — translation in the target language. SMT is akin to someone who moves to a foreign country and "picks up" the language by experience and exposure rather than formally learning the syntax and linguistic rules.
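The statistical idea can be sketched just as simply: count how often each candidate translation appeared in the parallel corpus and pick the most probable one. The phrase counts below are invented for illustration; production SMT systems combine phrase, reordering, and language-model probabilities rather than a single lookup.

```python
# Toy sketch of SMT phrase selection (hypothetical counts; real SMT
# combines phrase, reordering, and language-model probabilities).
from collections import Counter

# How often each source phrase was observed translated a given way
# in the training corpus (e.g., parliamentary proceedings).
PHRASE_COUNTS = {
    "the meeting": Counter({"la reunión": 90, "la sesión": 10}),
    "is adjourned": Counter({"se levanta": 70, "queda aplazada": 30}),
}

def best_translation(phrase):
    counts = PHRASE_COUNTS[phrase]
    total = sum(counts.values())
    # Select the statistically most probable candidate.
    target, n = counts.most_common(1)[0]
    return target, n / total

print(best_translation("the meeting"))  # -> ('la reunión', 0.9)
```

No linguistic rules appear anywhere in the code; the system "picks up" its choices entirely from exposure to prior translations, as the analogy above suggests.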

Hybrid MT is, predictably, the combination of rule-based and statistical approaches to provide a higher-quality output than either could alone. (This may be supplemented with semantic tagging that further disambiguates the language for the machine.)

WIM: You don't care about RbMT versus SMT. Not long ago, a battle raged (or at least a tiff was performed) between proponents of the two approaches, but the fact is that virtually every provider of MT now offers some version of a hybrid approach. As one vendor said, "There's movement from both ends towards the middle. But there is no one middle point." In other words: Yes, there are differences between solution A and solution B. But all that matters is which one performs better for you. (Where "performs better" is determined by your success criteria, whether speed, cost, quality, etc.)

TM: (Confused yet?) Translation Memory is a database that stores previously translated content. Incoming source language content is run against the TM to isolate the portion of it that needs to be translated for the first time, whether by machines or humans. TM is obviously a great way to reduce the cost of ongoing translations. But a common problem is that organizations have numerous translation teams working with different TSPs, and the various TMs are not coordinated.
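Mechanically, a TM lookup sorts incoming segments into three buckets: exact matches that can be reused as-is, near-matches ("fuzzy matches") that a human can quickly post-edit, and genuinely new content. The sketch below uses Python's standard-library similarity ratio as a stand-in scorer; the stored segment and the 85% threshold are hypothetical, and real TM tools use more sophisticated match metrics.

```python
# Toy sketch of a translation memory lookup (hypothetical segments;
# real TM tools score fuzzy matches with richer metrics).
from difflib import SequenceMatcher

# Previously translated segments: English -> French.
TM = {
    "Click Save to store your changes.":
        "Cliquez sur Enregistrer pour stocker vos modifications.",
}

def lookup(segment, threshold=0.85):
    if segment in TM:                       # exact match: reuse as-is
        return ("exact", TM[segment])
    # Fuzzy match: best similarity score against stored segments.
    best = max(TM, key=lambda s: SequenceMatcher(None, segment, s).ratio())
    score = SequenceMatcher(None, segment, best).ratio()
    if score >= threshold:                  # close enough to post-edit
        return ("fuzzy", TM[best])
    return ("new", None)                    # must be translated fresh

print(lookup("Click Save to store your changes."))  # exact hit
print(lookup("Pricing varies by region."))          # new content
```

The coordination problem noted above falls straight out of this picture: if two teams maintain separate `TM` databases, each pays full price to translate segments the other has already stored.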

Raw machine output: The output from MT, with or without the assistance of TM.

Post editing (PE) is any work on raw machine output by a human with the aim of removing errors or refining the content. Post editing raises daily output to roughly 10,000 words, compared to around 3,000 for human translation alone. The tradeoff is that the MT + post editing content is typically of lower quality. Which brings us finally to . . .

FAHQT: Fully Automated High-Quality Translation was for decades the holy grail of MT. Computers can match or beat humans at chess and Jeopardy; why shouldn't they be able to match human-quality translation? Primarily because language does not lend itself to exhaustive description in rules that can be followed by computers. In fact, even humans don't provide "high-quality translation" as defined by one method for testing FAHQT: Human translators score around 80 versus 68 for MT. Tellingly, the opening paper at a 1988 conference on Translating and the Computer was titled "Ten years of machine translation design and application: From FAHQT to realism."

FAUT: The turn to Fully Automatic Useful Translation rather than FAHQT could be viewed as an admission of defeat. "Machine translation," said a VP of SYSTRAN some years ago, "is an imperfect science." (The cited essay is a great analysis of how the popular "fun with Babelfish" games I noted in my 2009 report on translation possibly obscure the inherent literary merits of "bad" translation.) But it is better viewed, as the conference paper noted, as an embrace of "realism." What really matters isn't whether a machine translation is perfect (or rather imperfect but equal to human quality) — it's whether the translated content is useful. When I read a raw machine output hotel review on tripadvisor.com, I can tell whether the traveler enjoyed or deplored his/her visit, despite the questionable "quality" of the translation.

So the equations for translation today could be: (RbMT + SMT) / TM = FAUT and FAUT + PE = F(airly) AHQT.

WIM: Enterprises now have access to a full range of options, with raw machine output and exclusively human translation at the extremes and MT plus post editing occupying the middle. How vendors are offering these services, and how users are taking advantage of them, will be the subject of another post in the series.

What are you doing about translation? And what aspects would you like to see addressed in this series?