Spaces:
Running
Running
| # This document contains text in Perl "POD" format. | |
| # Use a POD viewer like perldoc or perlman to render it. | |
| =head1 NAME | |
| Locale::Maketext::TPJ13 -- article about software localization | |
| =head1 SYNOPSIS | |
| # This an article, not a module. | |
| =head1 DESCRIPTION | |
| The following article by Sean M. Burke and Jordan Lachler | |
| first appeared in I<The Perl Journal> #13 | |
| and is copyright 1999 The Perl Journal. It appears | |
| courtesy of Jon Orwant and The Perl Journal. This document may be | |
| distributed under the same terms as Perl itself. | |
| =head1 Localization and Perl: gettext breaks, Maketext fixes | |
| by Sean M. Burke and Jordan Lachler | |
| This article points out cases where gettext (a common system for | |
| localizing software interfaces -- i.e., making them work in the user's | |
| language of choice) fails because of basic differences between human | |
| languages. This article then describes Maketext, a new system capable | |
| of correctly treating these differences. | |
| =head2 A Localization Horror Story: It Could Happen To You | |
| =over | |
| "There are a number of languages spoken by human beings in this | |
| world." | |
| -- Harald Tveit Alvestrand, in RFC 1766, "Tags for the | |
| Identification of Languages" | |
| =back | |
| Imagine that your task for the day is to localize a piece of software | |
| -- and luckily for you, the only output the program emits is two | |
| messages, like this: | |
| I scanned 12 directories. | |
| Your query matched 10 files in 4 directories. | |
| So how hard could that be? You look at the code that | |
| produces the first item, and it reads: | |
| printf("I scanned %g directories.", | |
| $directory_count); | |
| You think about that, and realize that it doesn't even work right for | |
| English, as it can produce this output: | |
| I scanned 1 directories. | |
| So you rewrite it to read: | |
| printf("I scanned %g %s.", | |
| $directory_count, | |
| $directory_count == 1 ? | |
| "directory" : "directories", | |
| ); | |
| ...which does the Right Thing. (In case you don't recall, "%g" is for | |
| locale-specific number interpolation, and "%s" is for string | |
| interpolation.) | |
| But you still have to localize it for all the languages you're | |
| producing this software for, so you pull Locale::gettext off of CPAN | |
| so you can access the C<gettext> C functions you've heard are standard | |
| for localization tasks. | |
| And you write: | |
| printf(gettext("I scanned %g %s."), | |
| $dir_scan_count, | |
| $dir_scan_count == 1 ? | |
| gettext("directory") : gettext("directories"), | |
| ); | |
| But you then read in the gettext manual (Drepper, Miller, and Pinard 1995) | |
| that this is not a good idea, since how a single word like "directory" | |
| or "directories" is translated may depend on context -- and this is | |
| true, since in a case language like German or Russian, you'd may need | |
| these words with a different case ending in the first instance (where the | |
| word is the object of a verb) than in the second instance, which you haven't even | |
| gotten to yet (where the word is the object of a preposition, "in %g | |
| directories") -- assuming these keep the same syntax when translated | |
| into those languages. | |
| So, on the advice of the gettext manual, you rewrite: | |
| printf( $dir_scan_count == 1 ? | |
| gettext("I scanned %g directory.") : | |
| gettext("I scanned %g directories."), | |
| $dir_scan_count ); | |
| So, you email your various translators (the boss decides that the | |
| languages du jour are Chinese, Arabic, Russian, and Italian, so you | |
| have one translator for each), asking for translations for "I scanned | |
| %g directory." and "I scanned %g directories.". When they reply, | |
| you'll put that in the lexicons for gettext to use when it localizes | |
| your software, so that when the user is running under the "zh" | |
| (Chinese) locale, gettext("I scanned %g directory.") will return the | |
| appropriate Chinese text, with a "%g" in there where printf can then | |
| interpolate $dir_scan. | |
| Your Chinese translator emails right back -- he says both of these | |
| phrases translate to the same thing in Chinese, because, in linguistic | |
| jargon, Chinese "doesn't have number as a grammatical category" -- | |
| whereas English does. That is, English has grammatical rules that | |
| refer to "number", i.e., whether something is grammatically singular | |
| or plural; and one of these rules is the one that forces nouns to take | |
| a plural suffix (generally "s") when in a plural context, as they are when | |
| they follow a number other than "one" (including, oddly enough, "zero"). | |
| Chinese has no such rules, and so has just the one phrase where English | |
| has two. But, no problem, you can have this one Chinese phrase appear | |
| as the translation for the two English phrases in the "zh" gettext | |
| lexicon for your program. | |
| Emboldened by this, you dive into the second phrase that your software | |
| needs to output: "Your query matched 10 files in 4 directories.". You notice | |
| that if you want to treat phrases as indivisible, as the gettext | |
| manual wisely advises, you need four cases now, instead of two, to | |
| cover the permutations of singular and plural on the two items, | |
| $dir_count and $file_count. So you try this: | |
| printf( $file_count == 1 ? | |
| ( $directory_count == 1 ? | |
| gettext("Your query matched %g file in %g directory.") : | |
| gettext("Your query matched %g file in %g directories.") ) : | |
| ( $directory_count == 1 ? | |
| gettext("Your query matched %g files in %g directory.") : | |
| gettext("Your query matched %g files in %g directories.") ), | |
| $file_count, $directory_count, | |
| ); | |
| (The case of "1 file in 2 [or more] directories" could, I suppose, | |
| occur in the case of symlinking or something of the sort.) | |
| It occurs to you that this is not the prettiest code you've ever | |
| written, but this seems the way to go. You mail off to the | |
| translators asking for translations for these four cases. The | |
| Chinese guy replies with the one phrase that these all translate to in | |
| Chinese, and that phrase has two "%g"s in it, as it should -- but | |
| there's a problem. He translates it word-for-word back: "In %g | |
| directories contains %g files match your query." The %g | |
| slots are in an order reverse to what they are in English. You wonder | |
| how you'll get gettext to handle that. | |
| But you put it aside for the moment, and optimistically hope that the | |
| other translators won't have this problem, and that their languages | |
| will be better behaved -- i.e., that they will be just like English. | |
| But the Arabic translator is the next to write back. First off, your | |
| code for "I scanned %g directory." or "I scanned %g directories." | |
| assumes there's only singular or plural. But, to use linguistic | |
| jargon again, Arabic has grammatical number, like English (but unlike | |
| Chinese), but it's a three-term category: singular, dual, and plural. | |
| In other words, the way you say "directory" depends on whether there's | |
| one directory, or I<two> of them, or I<more than two> of them. Your | |
| test of C<($directory == 1)> no longer does the job. And it means | |
| that where English's grammatical category of number necessitates | |
| only the two permutations of the first sentence based on "directory | |
| [singular]" and "directories [plural]", Arabic has three -- and, | |
| worse, in the second sentence ("Your query matched %g file in %g | |
| directory."), where English has four, Arabic has nine. You sense | |
| an unwelcome, exponential trend taking shape. | |
| Your Italian translator emails you back and says that "I searched 0 | |
| directories" (a possible English output of your program) is stilted, | |
| and if you think that's fine English, that's your problem, but that | |
| I<just will not do> in the language of Dante. He insists that where | |
| $directory_count is 0, your program should produce the Italian text | |
| for "I I<didn't> scan I<any> directories.". And ditto for "I didn't | |
| match any files in any directories", although he says the last part | |
| about "in any directories" should probably just be left off. | |
| You wonder how you'll get gettext to handle this; to accommodate the | |
| ways Arabic, Chinese, and Italian deal with numbers in just these few | |
| very simple phrases, you need to write code that will ask gettext for | |
| different queries depending on whether the numerical values in | |
| question are 1, 2, more than 2, or in some cases 0, and you still haven't | |
| figured out the problem with the different word order in Chinese. | |
| Then your Russian translator calls on the phone, to I<personally> tell | |
| you the bad news about how really unpleasant your life is about to | |
| become: | |
| Russian, like German or Latin, is an inflectional language; that is, nouns | |
| and adjectives have to take endings that depend on their case | |
| (i.e., nominative, accusative, genitive, etc...) -- which is roughly a matter of | |
| what role they have in syntax of the sentence -- | |
| as well as on the grammatical gender (i.e., masculine, feminine, neuter) | |
| and number (i.e., singular or plural) of the noun, as well as on the | |
| declension class of the noun. But unlike with most other inflected languages, | |
| putting a number-phrase (like "ten" or "forty-three", or their Arabic | |
| numeral equivalents) in front of noun in Russian can change the case and | |
| number that noun is, and therefore the endings you have to put on it. | |
| He elaborates: In "I scanned %g directories", you'd I<expect> | |
| "directories" to be in the accusative case (since it is the direct | |
| object in the sentence) and the plural number, | |
| except where $directory_count is 1, then you'd expect the singular, of | |
| course. Just like Latin or German. I<But!> Where $directory_count % | |
| 10 is 1 ("%" for modulo, remember), assuming $directory count is an | |
| integer, and except where $directory_count % 100 is 11, "directories" | |
| is forced to become grammatically singular, which means it gets the | |
| ending for the accusative singular... You begin to visualize the code | |
| it'd take to test for the problem so far, I<and still work for Chinese | |
| and Arabic and Italian>, and how many gettext items that'd take, but | |
| he keeps going... But where $directory_count % 10 is 2, 3, or 4 | |
| (except where $directory_count % 100 is 12, 13, or 14), the word for | |
| "directories" is forced to be genitive singular -- which means another | |
| ending... The room begins to spin around you, slowly at first... But | |
| with I<all other> integer values, since "directory" is an inanimate | |
| noun, when preceded by a number and in the nominative or accusative | |
| cases (as it is here, just your luck!), it does stay plural, but it is | |
| forced into the genitive case -- yet another ending... And | |
| you never hear him get to the part about how you're going to run into | |
| similar (but maybe subtly different) problems with other Slavic | |
| languages like Polish, because the floor comes up to meet you, and you | |
| fade into unconsciousness. | |
| The above cautionary tale relates how an attempt at localization can | |
| lead from programmer consternation, to program obfuscation, to a need | |
| for sedation. But careful evaluation shows that your choice of tools | |
| merely needed further consideration. | |
| =head2 The Linguistic View | |
| =over | |
| "It is more complicated than you think." | |
| -- The Eighth Networking Truth, from RFC 1925 | |
| =back | |
| The field of Linguistics has expended a great deal of effort over the | |
| past century trying to find grammatical patterns which hold across | |
| languages; it's been a constant process | |
| of people making generalizations that should apply to all languages, | |
| only to find out that, all too often, these generalizations fail -- | |
| sometimes failing for just a few languages, sometimes whole classes of | |
| languages, and sometimes nearly every language in the world except | |
| English. Broad statistical trends are evident in what the "average | |
| language" is like as far as what its rules can look like, must look | |
| like, and cannot look like. But the "average language" is just as | |
| unreal a concept as the "average person" -- it runs up against the | |
| fact no language (or person) is, in fact, average. The wisdom of past | |
| experience leads us to believe that any given language can do whatever | |
| it wants, in any order, with appeal to any kind of grammatical | |
| categories wants -- case, number, tense, real or metaphoric | |
| characteristics of the things that words refer to, arbitrary or | |
| predictable classifications of words based on what endings or prefixes | |
| they can take, degree or means of certainty about the truth of | |
| statements expressed, and so on, ad infinitum. | |
| Mercifully, most localization tasks are a matter of finding ways to | |
| translate whole phrases, generally sentences, where the context is | |
| relatively set, and where the only variation in content is I<usually> | |
| in a number being expressed -- as in the example sentences above. | |
| Translating specific, fully-formed sentences is, in practice, fairly | |
| foolproof -- which is good, because that's what's in the phrasebooks | |
| that so many tourists rely on. Now, a given phrase (whether in a | |
| phrasebook or in a gettext lexicon) in one language I<might> have a | |
| greater or lesser applicability than that phrase's translation into | |
| another language -- for example, strictly speaking, in Arabic, the | |
| "your" in "Your query matched..." would take a different form | |
| depending on whether the user is male or female; so the Arabic | |
| translation "your[feminine] query" is applicable in fewer cases than | |
| the corresponding English phrase, which doesn't distinguish the user's | |
| gender. (In practice, it's not feasible to have a program know the | |
| user's gender, so the masculine "you" in Arabic is usually used, by | |
| default.) | |
| But in general, such surprises are rare when entire sentences are | |
| being translated, especially when the functional context is restricted | |
| to that of a computer interacting with a user either to convey a fact | |
| or to prompt for a piece of information. So, for purposes of | |
| localization, translation by phrase (generally by sentence) is both the | |
| simplest and the least problematic. | |
| =head2 Breaking gettext | |
| =over | |
| "It Has To Work." | |
| -- First Networking Truth, RFC 1925 | |
| =back | |
| Consider that sentences in a tourist phrasebook are of two types: ones | |
| like "How do I get to the marketplace?" that don't have any blanks to | |
| fill in, and ones like "How much do these ___ cost?", where there's | |
| one or more blanks to fill in (and these are usually linked to a | |
| list of words that you can put in that blank: "fish", "potatoes", | |
| "tomatoes", etc.). The ones with no blanks are no problem, but the | |
| fill-in-the-blank ones may not be really straightforward. If it's a | |
| Swahili phrasebook, for example, the authors probably didn't bother to | |
| tell you the complicated ways that the verb "cost" changes its | |
| inflectional prefix depending on the noun you're putting in the blank. | |
| The trader in the marketplace will still understand what you're saying if | |
| you say "how much do these potatoes cost?" with the wrong | |
| inflectional prefix on "cost". After all, I<you> can't speak proper Swahili, | |
| I<you're> just a tourist. But while tourists can be stupid, computers | |
| are supposed to be smart; the computer should be able to fill in the | |
| blank, and still have the results be grammatical. | |
| In other words, a phrasebook entry takes some values as parameters | |
| (the things that you fill in the blank or blanks), and provides a value | |
| based on these parameters, where the way you get that final value from | |
| the given values can, properly speaking, involve an arbitrarily | |
| complex series of operations. (In the case of Chinese, it'd be not at | |
| all complex, at least in cases like the examples at the beginning of | |
| this article; whereas in the case of Russian it'd be a rather complex | |
| series of operations. And in some languages, the | |
| complexity could be spread around differently: while the act of | |
| putting a number-expression in front of a noun phrase might not be | |
| complex by itself, it may change how you have to, for example, inflect | |
| a verb elsewhere in the sentence. This is what in syntax is called | |
| "long-distance dependencies".) | |
| This talk of parameters and arbitrary complexity is just another way | |
| to say that an entry in a phrasebook is what in a programming language | |
| would be called a "function". Just so you don't miss it, this is the | |
| crux of this article: I<A phrase is a function; a phrasebook is a | |
| bunch of functions.> | |
| The reason that using gettext runs into walls (as in the above | |
| second-person horror story) is that you're trying to use a string (or | |
| worse, a choice among a bunch of strings) to do what you really need a | |
| function for -- which is futile. Preforming (s)printf interpolation | |
| on the strings which you get back from gettext does allow you to do I<some> | |
| common things passably well... sometimes... sort of; but, to paraphrase | |
| what some people say about C<csh> script programming, "it fools you | |
| into thinking you can use it for real things, but you can't, and you | |
| don't discover this until you've already spent too much time trying, | |
| and by then it's too late." | |
| =head2 Replacing gettext | |
| So, what needs to replace gettext is a system that supports lexicons | |
| of functions instead of lexicons of strings. An entry in a lexicon | |
| from such a system should I<not> look like this: | |
| "J'ai trouv\xE9 %g fichiers dans %g r\xE9pertoires" | |
| [\xE9 is e-acute in Latin-1. Some pod renderers would | |
| scream if I used the actual character here. -- SB] | |
| but instead like this, bearing in mind that this is just a first stab: | |
| sub I_found_X1_files_in_X2_directories { | |
| my( $files, $dirs ) = @_[0,1]; | |
| $files = sprintf("%g %s", $files, | |
| $files == 1 ? 'fichier' : 'fichiers'); | |
| $dirs = sprintf("%g %s", $dirs, | |
| $dirs == 1 ? "r\xE9pertoire" : "r\xE9pertoires"); | |
| return "J'ai trouv\xE9 $files dans $dirs."; | |
| } | |
| Now, there's no particularly obvious way to store anything but strings | |
| in a gettext lexicon; so it looks like we just have to start over and | |
| make something better, from scratch. I call my shot at a | |
| gettext-replacement system "Maketext", or, in CPAN terms, | |
| Locale::Maketext. | |
| When designing Maketext, I chose to plan its main features in terms of | |
| "buzzword compliance". And here are the buzzwords: | |
| =head2 Buzzwords: Abstraction and Encapsulation | |
| The complexity of the language you're trying to output a phrase in is | |
| entirely abstracted inside (and encapsulated within) the Maketext module | |
| for that interface. When you call: | |
| print $lang->maketext("You have [quant,_1,piece] of new mail.", | |
| scalar(@messages)); | |
| you don't know (and in fact can't easily find out) whether this will | |
| involve lots of figuring, as in Russian (if $lang is a handle to the | |
| Russian module), or relatively little, as in Chinese. That kind of | |
| abstraction and encapsulation may encourage other pleasant buzzwords | |
| like modularization and stratification, depending on what design | |
| decisions you make. | |
| =head2 Buzzword: Isomorphism | |
| "Isomorphism" means "having the same structure or form"; in discussions | |
| of program design, the word takes on the special, specific meaning that | |
| your implementation of a solution to a problem I<has the same | |
| structure> as, say, an informal verbal description of the solution, or | |
| maybe of the problem itself. Isomorphism is, all things considered, | |
| a good thing -- it's what problem-solving (and solution-implementing) | |
| should look like. | |
| What's wrong the with gettext-using code like this... | |
| printf( $file_count == 1 ? | |
| ( $directory_count == 1 ? | |
| "Your query matched %g file in %g directory." : | |
| "Your query matched %g file in %g directories." ) : | |
| ( $directory_count == 1 ? | |
| "Your query matched %g files in %g directory." : | |
| "Your query matched %g files in %g directories." ), | |
| $file_count, $directory_count, | |
| ); | |
| is first off that it's not well abstracted -- these ways of testing | |
| for grammatical number (as in the expressions like C<foo == 1 ? | |
| singular_form : plural_form>) should be abstracted to each language | |
| module, since how you get grammatical number is language-specific. | |
| But second off, it's not isomorphic -- the "solution" (i.e., the | |
| phrasebook entries) for Chinese maps from these four English phrases to | |
| the one Chinese phrase that fits for all of them. In other words, the | |
| informal solution would be "The way to say what you want in Chinese is | |
| with the one phrase 'For your question, in Y directories you would | |
| find X files'" -- and so the implemented solution should be, | |
| isomorphically, just a straightforward way to spit out that one | |
| phrase, with numerals properly interpolated. It shouldn't have to map | |
| from the complexity of other languages to the simplicity of this one. | |
| =head2 Buzzword: Inheritance | |
| There's a great deal of reuse possible for sharing of phrases between | |
| modules for related dialects, or for sharing of auxiliary functions | |
| between related languages. (By "auxiliary functions", I mean | |
| functions that don't produce phrase-text, but which, say, return an | |
| answer to "does this number require a plural noun after it?". Such | |
| auxiliary functions would be used in the internal logic of functions | |
| that actually do produce phrase-text.) | |
| In the case of sharing phrases, consider that you have an interface | |
| already localized for American English (probably by having been | |
| written with that as the native locale, but that's incidental). | |
| Localizing it for UK English should, in practical terms, be just a | |
| matter of running it past a British person with the instructions to | |
| indicate what few phrases would benefit from a change in spelling or | |
| possibly minor rewording. In that case, you should be able to put in | |
| the UK English localization module I<only> those phrases that are | |
| UK-specific, and for all the rest, I<inherit> from the American | |
| English module. (And I expect this same situation would apply with | |
| Brazilian and Continental Portugese, possibly with some I<very> | |
| closely related languages like Czech and Slovak, and possibly with the | |
| slightly different "versions" of written Mandarin Chinese, as I hear exist in | |
| Taiwan and mainland China.) | |
| As to sharing of auxiliary functions, consider the problem of Russian | |
| numbers from the beginning of this article; obviously, you'd want to | |
| write only once the hairy code that, given a numeric value, would | |
| return some specification of which case and number a given quantified | |
| noun should use. But suppose that you discover, while localizing an | |
| interface for, say, Ukrainian (a Slavic language related to Russian, | |
| spoken by several million people, many of whom would be relieved to | |
| find that your Web site's or software's interface is available in | |
| their language), that the rules in Ukrainian are the same as in Russian | |
| for quantification, and probably for many other grammatical functions. | |
| While there may well be no phrases in common between Russian and | |
| Ukrainian, you could still choose to have the Ukrainian module inherit | |
| from the Russian module, just for the sake of inheriting all the | |
| various grammatical methods. Or, probably better organizationally, | |
| you could move those functions to a module called C<_E_Slavic> or | |
| something, which Russian and Ukrainian could inherit useful functions | |
| from, but which would (presumably) provide no lexicon. | |
| =head2 Buzzword: Concision | |
| Okay, concision isn't a buzzword. But it should be, so I decree that | |
| as a new buzzword, "concision" means that simple common things should | |
| be expressible in very few lines (or maybe even just a few characters) | |
| of code -- call it a special case of "making simple things easy and | |
| hard things possible", and see also the role it played in the | |
| MIDI::Simple language, discussed elsewhere in this issue [TPJ#13]. | |
| Consider our first stab at an entry in our "phrasebook of functions": | |
| sub I_found_X1_files_in_X2_directories { | |
| my( $files, $dirs ) = @_[0,1]; | |
| $files = sprintf("%g %s", $files, | |
| $files == 1 ? 'fichier' : 'fichiers'); | |
| $dirs = sprintf("%g %s", $dirs, | |
| $dirs == 1 ? "r\xE9pertoire" : "r\xE9pertoires"); | |
| return "J'ai trouv\xE9 $files dans $dirs."; | |
| } | |
| You may sense that a lexicon (to use a non-committal catch-all term for a | |
| collection of things you know how to say, regardless of whether they're | |
| phrases or words) consisting of functions I<expressed> as above would | |
| make for rather long-winded and repetitive code -- even if you wisely | |
| rewrote this to have quantification (as we call adding a number | |
| expression to a noun phrase) be a function called like: | |
| sub I_found_X1_files_in_X2_directories { | |
| my( $files, $dirs ) = @_[0,1]; | |
| $files = quant($files, "fichier"); | |
| $dirs = quant($dirs, "r\xE9pertoire"); | |
| return "J'ai trouv\xE9 $files dans $dirs."; | |
| } | |
| And you may also sense that you do not want to bother your translators | |
| with having to write Perl code -- you'd much rather that they spend | |
| their I<very costly time> on just translation. And this is to say | |
| nothing of the near impossibility of finding a commercial translator | |
| who would know even simple Perl. | |
| In a first-hack implementation of Maketext, each language-module's | |
| lexicon looked like this: | |
| %Lexicon = ( | |
| "I found %g files in %g directories" | |
| => sub { | |
| my( $files, $dirs ) = @_[0,1]; | |
| $files = quant($files, "fichier"); | |
| $dirs = quant($dirs, "r\xE9pertoire"); | |
| return "J'ai trouv\xE9 $files dans $dirs."; | |
| }, | |
| ... and so on with other phrase => sub mappings ... | |
| ); | |
| but I immediately went looking for some more concise way to basically | |
| denote the same phrase-function -- a way that would also serve to | |
| concisely denote I<most> phrase-functions in the lexicon for I<most> | |
| languages. After much time and even some actual thought, I decided on | |
| this system: | |
| * Where a value in a %Lexicon hash is a contentful string instead of | |
| an anonymous sub (or, conceivably, a coderef), it would be interpreted | |
| as a sort of shorthand expression of what the sub does. When accessed | |
| for the first time in a session, it is parsed, turned into Perl code, | |
| and then eval'd into an anonymous sub; then that sub replaces the | |
| original string in that lexicon. (That way, the work of parsing and | |
| evaling the shorthand form for a given phrase is done no more than | |
| once per session.) | |
| * Calls to C<maketext> (as Maketext's main function is called) happen | |
| thru a "language session handle", notionally very much like an IO | |
| handle, in that you open one at the start of the session, and use it | |
| for "sending signals" to an object in order to have it return the text | |
| you want. | |
| So, this: | |
| $lang->maketext("You have [quant,_1,piece] of new mail.", | |
| scalar(@messages)); | |
| basically means this: look in the lexicon for $lang (which may inherit | |
| from any number of other lexicons), and find the function that we | |
| happen to associate with the string "You have [quant,_1,piece] of new | |
| mail" (which is, and should be, a functioning "shorthand" for this | |
| function in the native locale -- English in this case). If you find | |
| such a function, call it with $lang as its first parameter (as if it | |
| were a method), and then a copy of scalar(@messages) as its second, | |
| and then return that value. If that function was found, but was in | |
| string shorthand instead of being a fully specified function, parse it | |
| and make it into a function before calling it the first time. | |
| * The shorthand uses code in brackets to indicate method calls that | |
| should be performed. A full explanation is not in order here, but a | |
| few examples will suffice: | |
| "You have [quant,_1,piece] of new mail." | |
| The above code is shorthand for, and will be interpreted as, | |
| this: | |
| sub { | |
| my $handle = $_[0]; | |
| my(@params) = @_; | |
| return join '', | |
| "You have ", | |
| $handle->quant($params[1], 'piece'), | |
| "of new mail."; | |
| } | |
| where "quant" is the name of a method you're using to quantify the | |
| noun "piece" with the number $params[0]. | |
| A string with no brackety calls, like this: | |
| "Your search expression was malformed." | |
| is somewhat of a degenerate case, and just gets turned into: | |
| sub { return "Your search expression was malformed." } | |
| However, not everything you can write in Perl code can be written in | |
| the above shorthand system -- not by a long shot. For example, consider | |
| the Italian translator from the beginning of this article, who wanted | |
| the Italian for "I didn't find any files" as a special case, instead | |
| of "I found 0 files". That couldn't be specified (at least not easily | |
| or simply) in our shorthand system, and it would have to be written | |
| out in full, like this: | |
| sub { # pretend the English strings are in Italian | |
| my($handle, $files, $dirs) = @_[0,1,2]; | |
| return "I didn't find any files" unless $files; | |
| return join '', | |
| "I found ", | |
| $handle->quant($files, 'file'), | |
| " in ", | |
| $handle->quant($dirs, 'directory'), | |
| "."; | |
| } | |
| Next to a lexicon full of shorthand code, that sort of sticks out like a | |
| sore thumb -- but this I<is> a special case, after all; and at least | |
| it's possible, if not as concise as usual. | |
| As to how you'd implement the Russian example from the beginning of | |
| the article, well, There's More Than One Way To Do It, but it could be | |
| something like this (using English words for Russian, just so you know | |
| what's going on): | |
| "I [quant,_1,directory,accusative] scanned." | |
| This shifts the burden of complexity off to the quant method. That | |
| method's parameters are: the numeric value it's going to use to | |
| quantify something; the Russian word it's going to quantify; and the | |
| parameter "accusative", which you're using to mean that this | |
| sentence's syntax wants a noun in the accusative case there, although | |
| that quantification method may have to overrule, for grammatical | |
| reasons you may recall from the beginning of this article. | |
| Now, the Russian quant method here is responsible not only for | |
| implementing the strange logic necessary for figuring out how Russian | |
| number-phrases impose case and number on their noun-phrases, but also | |
| for inflecting the Russian word for "directory". How that inflection | |
| is to be carried out is no small issue, and among the solutions I've | |
| seen, some (like variations on a simple lookup in a hash where all | |
| possible forms are provided for all necessary words) are | |
| straightforward but I<can> become cumbersome when you need to inflect | |
| more than a few dozen words; and other solutions (like using | |
| algorithms to model the inflections, storing only root forms and | |
| irregularities) I<can> involve more overhead than is justifiable for | |
| all but the largest lexicons. | |
| Mercifully, this design decision becomes crucial only in the hairiest | |
| of inflected languages, of which Russian is by no means the I<worst> case | |
| scenario, but is worse than most. Most languages have simpler | |
| inflection systems; for example, in English or Swahili, there are | |
| generally no more than two possible inflected forms for a given noun | |
| ("error/errors"; "kosa/makosa"), and the | |
| rules for producing these forms are fairly simple -- or at least, | |
| simple rules can be formulated that work for most words, and you can | |
| then treat the exceptions as just "irregular", at least relative to | |
| your ad hoc rules. A simpler inflection system (simpler rules, fewer | |
| forms) means that design decisions are less crucial to maintaining | |
| sanity, whereas the same decisions could incur | |
| overhead-versus-scalability problems in languages like Russian. It | |
| may I<also> be likely that code (possibly in Perl, as with | |
| Lingua::EN::Inflect, for English nouns) has already | |
| been written for the language in question, whether simple or complex. | |
| Moreover, a third possibility may even be simpler than anything | |
| discussed above: "Just require that all possible (or at least | |
| applicable) forms be provided in the call to the given language's quant | |
| method, as in:" | |
| "I found [quant,_1,file,files]." | |
| That way, quant just has to chose which form it needs, without having | |
| to look up or generate anything. While possibly not optimal for | |
| Russian, this should work well for most other languages, where | |
| quantification is not as complicated an operation. | |
| =head2 The Devil in the Details | |
| There's plenty more to Maketext than described above -- for example, | |
| there's the details of how language tags ("en-US", "i-pwn", "fi", | |
| etc.) or locale IDs ("en_US") interact with actual module naming | |
| ("BogoQuery/Locale/en_us.pm"), and what magic can ensue; there's the | |
| details of how to record (and possibly negotiate) what character | |
| encoding Maketext will return text in (UTF8? Latin-1? KOI8?). There's | |
| the interesting fact that Maketext is for localization, but nowhere | |
| actually has a "C<use locale;>" anywhere in it. For the curious, | |
| there's the somewhat frightening details of how I actually | |
| implement something like data inheritance so that searches across | |
| modules' %Lexicon hashes can parallel how Perl implements method | |
| inheritance. | |
| And, most importantly, there's all the practical details of how to | |
| actually go about deriving from Maketext so you can use it for your | |
| interfaces, and the various tools and conventions for starting out and | |
| maintaining individual language modules. | |
| That is all covered in the documentation for Locale::Maketext and the | |
| modules that come with it, available in CPAN. After having read this | |
| article, which covers the why's of Maketext, the documentation, | |
| which covers the how's of it, should be quite straightforward. | |
| =head2 The Proof in the Pudding: Localizing Web Sites | |
| Maketext and gettext have a notable difference: gettext is in C, | |
| accessible thru C library calls, whereas Maketext is in Perl, and | |
| really can't work without a Perl interpreter (although I suppose | |
| something like it could be written for C). Accidents of history (and | |
| not necessarily lucky ones) have made C++ the most common language for | |
| the implementation of applications like word processors, Web browsers, | |
| and even many in-house applications like custom query systems. Current | |
| conditions make it somewhat unlikely that the next one of any of these | |
| kinds of applications will be written in Perl, albeit clearly more for | |
| reasons of custom and inertia than out of consideration of what is the | |
| right tool for the job. | |
| However, other accidents of history have made Perl a well-accepted | |
| language for design of server-side programs (generally in CGI form) | |
| for Web site interfaces. Localization of static pages in Web sites is | |
| trivial, feasible either with simple language-negotiation features in | |
| servers like Apache, or with some kind of server-side inclusions of | |
| language-appropriate text into layout templates. However, I think | |
| that the localization of Perl-based search systems (or other kinds of | |
| dynamic content) in Web sites, be they public or access-restricted, | |
| is where Maketext will see the greatest use. | |
| I presume that it would be only the exceptional Web site that gets | |
| localized for English I<and> Chinese I<and> Italian I<and> Arabic | |
| I<and> Russian, to recall the languages from the beginning of this | |
| article -- to say nothing of German, Spanish, French, Japanese, | |
| Finnish, and Hindi, to name a few languages that benefit from large | |
| numbers of programmers or Web viewers or both. | |
| However, the ever-increasing internationalization of the Web (whether | |
| measured in terms of amount of content, of numbers of content writers | |
| or programmers, or of size of content audiences) makes it increasingly | |
| likely that the interface to the average Web-based dynamic content | |
| service will be localized for two or maybe three languages. It is my | |
| hope that Maketext will make that task as simple as possible, and will | |
| remove previous barriers to localization for languages dissimilar to | |
| English. | |
| __END__ | |
| Sean M. Burke (sburkeE<64>cpan.org) has a Master's in linguistics | |
| from Northwestern University; he specializes in language technology. | |
| Jordan Lachler (lachlerE<64>unm.edu) is a PhD student in the Department of | |
| Linguistics at the University of New Mexico; he specializes in | |
| morphology and pedagogy of North American native languages. | |
| =head2 References | |
| Alvestrand, Harald Tveit. 1995. I<RFC 1766: Tags for the | |
| Identification of Languages.> | |
| C<L<http://www.ietf.org/rfc/rfc1766.txt>> | |
| [Now see RFC 3066.] | |
| Callon, Ross, editor. 1996. I<RFC 1925: The Twelve | |
| Networking Truths.> | |
| C<L<http://www.ietf.org/rfc/rfc1925.txt>> | |
| Drepper, Ulrich, Peter Miller, | |
| and FranE<ccedil>ois Pinard. 1995-2001. GNU | |
| C<gettext>. Available in C<L<ftp://prep.ai.mit.edu/pub/gnu/>>, with | |
| extensive docs in the distribution tarball. [Since | |
| I wrote this article in 1998, I now see that the | |
| gettext docs are now trying more to come to terms with | |
| plurality. Whether useful conclusions have come from it | |
| is another question altogether. -- SMB, May 2001] | |
| Forbes, Nevill. 1964. I<Russian Grammar.> Third Edition, revised | |
| by J. C. Dumbreck. Oxford University Press. | |
| =cut | |
| #End | |