PHP Hello l10n

This is a small tutorial on how to internationalize some PHP code and localize it. For simplicity, let’s consider our beloved “Hello world”. Here’s the original script:

index.php:

<?php
echo "Hello, world!";
?>

Its output is pretty straightforward:

Hello, world!

Your aim is to internationalize this little script so that your visitors/clients can enjoy your website in their native language. PHP offers 3 main ways to do so:

  1. PHP Array
  2. PHP DEFINE statements
  3. Gettext

PHP Array

In this first scenario, you need to maintain an associative array per language which will map keys or source strings to localized strings. To display those strings, all you need to do is to select the proper array and get the localized text by using the appropriate key:

locale/en.php:

    <?php
    $LANG = array(
        "hello_world" => "Hello, world!",
    );
    ?>

index.php:

    <?php
    $locale = 'en';

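    // Accept only simple language codes so that user input
    // cannot alter the include path.
    if (isset($_GET['lang']) && preg_match('/^[a-z]{2}$/', $_GET['lang']))
        $locale = $_GET['lang'];
    include('locale/'. $locale . '.php');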

    echo $LANG['hello_world'];
    ?>

You can now set the locale by assigning a language code to the ‘lang’ GET parameter when visiting your website e.g. http://l10n.hello.world.org/?lang=en.

As you may have noticed, the default locale is ‘en’, so you don’t need to set the ‘lang’ parameter explicitly to get the English version. It would be great though if you could support a Hindi version too, wouldn’t it?

locale/hi.php:

<?php
$LANG = array(
    "hello_world" => "नमस्ते, दुनिया!",
);
?>

Guess what the output of http://l10n.hello.world.org/?lang=hi will be:

नमस्ते, दुनिया!

PHP Define

Internationalizing your website using the define() method is pretty straightforward, too. You just need to edit the locale files and index.php as shown below:

locale/en.php:

    <?php
    define("hello_world", "Hello, world!");
    ?>

locale/hi.php:

    <?php
    define("hello_world", "नमस्ते, दुनिया!");
    ?>

index.php:

    <?php
    $locale = 'en';

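    // Accept only simple language codes so that user input
    // cannot alter the include path.
    if (isset($_GET['lang']) && preg_match('/^[a-z]{2}$/', $_GET['lang']))
        $locale = $_GET['lang'];
    include('locale/'. $locale . '.php');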

    echo hello_world;
    ?>

Gettext

Internationalization of PHP with arrays and define statements is pretty simple and straightforward, yet those methods share a major downside: as your website grows, it gets harder and harder to update the locale files. There’s no way to know which strings were added or whether the strings are present in all the language files.
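The drift described above can at least be detected mechanically. Here is a minimal sketch of such a check, written in Python rather than PHP purely for illustration; the `missing_keys` function and the sample dictionaries are hypothetical stand-ins for the locale arrays shown earlier.

```python
def missing_keys(source, translations):
    """Return, per language, the keys found in `source` but absent from it."""
    report = {}
    for lang, strings in translations.items():
        missing = sorted(set(source) - set(strings))
        if missing:
            report[lang] = missing
    return report


# Hypothetical locale data mirroring locale/en.php and locale/hi.php,
# with one string that was added to English only.
en = {"hello_world": "Hello, world!", "goodbye": "Goodbye!"}
hi = {"hello_world": "नमस्ते, दुनिया!"}

print(missing_keys(en, {"hi": hi}))
```

A check like this has to be written and run by hand for every project, which is the kind of bookkeeping gettext takes over.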

Gettext is one of the most popular internationalization and localization systems. It works very nicely with PHP as it does with a bunch of other programming languages like C, C++, Python, etc. With gettext, syncing the locale files with changes in the code base is extremely easy.

Let’s internationalize your website once more, using gettext this time.

First, you need to edit index.php as shown below and mark strings to be localized by enclosing them inside _() or gettext().

index.php:

<?php
$locale = 'en';

if (isset($_GET['lang']))
    $locale = $_GET['lang'];

putenv("LANGUAGE=".$locale);
setlocale(LC_ALL, $locale);

$domain = 'messages';
bindtextdomain($domain, "./locale");
textdomain($domain);

//Mark up text for localization
echo _('Hello, world!');
?>

Gettext expects a locale directory where all the translated strings will be kept.

locale/
    en/
        LC_MESSAGES/
            messages.po
            messages.mo
    hi/
        LC_MESSAGES/
            messages.po
            messages.mo

You can extract marked up strings from code in the following way:

$ xgettext -n *.php -o messages.pot

This generates a POT file named messages.pot:

# SOME DESCRIPTIVE TITLE.
# Copyright (C) YEAR THE PACKAGE'S COPYRIGHT HOLDER
# This file is distributed under the same license as the PACKAGE package.
# FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: PACKAGE VERSION\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2012-05-06 23:32+0530\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
"Language: \n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=CHARSET\n"
"Content-Transfer-Encoding: 8bit\n"

#: index.php:12
msgid "Hello, world!"
msgstr ""

At the bare minimum, you need to specify the charset in the messages.po files to compile them successfully. Set it to “UTF-8”, then generate translation files from messages.pot as follows:

$ msginit -l en -o locale/en/LC_MESSAGES/messages.po -i messages.pot
$ msginit -l hi -o locale/hi/LC_MESSAGES/messages.po -i messages.pot

The PO file in the source language, i.e., English (“en”), does not need to be translated. In this case, you translate the PO file for Hindi (“hi”) only. After the translation is done, the PO files must be compiled using msgfmt to generate the messages.mo files, which are used to show the localized text on your website.

$ msgfmt locale/en/LC_MESSAGES/messages.po -o locale/en/LC_MESSAGES/messages.mo
$ msgfmt locale/hi/LC_MESSAGES/messages.po -o locale/hi/LC_MESSAGES/messages.mo

As expected, when we visit http://l10n.hello.world.org/?lang=hi we see:

नमस्ते, दुनिया!

Many people have the opinion that using Gettext for localization is slow compared to localization using PHP arrays and PHP define statements. But, since Apache caches the localization data, the difference in speed is not that big. It finally comes down to a matter of personal taste.

You can find out more details on using gettext with PHP here.

Localization gotchas

That was a simple application with a single piece of text translated to a single language. Keep in mind though that there is an extremely high probability the framework you use to build your website provides one of the mentioned localization mechanisms. The real problem arises when the number of strings grows and you have to provide translated content in a larger number of languages. Then, it gets really hard to

  • maintain the locale files by hand,
  • hand them over to translators,
  • get them back from each translator, and
  • deploy.

Localization shouldn’t be that hard and Transifex has helped lots of project maintainers see their work getting easily localized and being accepted by a much wider user base. So, what are you waiting for? =)

Faster system tests in Django

There are countless posts out there evangelizing the importance of testing in the development process. This is not one of those posts. Just to make sure we are all on the same page though, as a team we strongly believe you should first write your tests, then (re)write the actual code again and again, until all tests pass and finally enjoy a (more) peaceful night. If you don’t do so, you’d better have Jack Sparrow‘s improvisation skills and love caffeine.

Now, time to get technical. Here’s how we managed to speed up our test suite by a 3x factor.

System tests

We are not talking about “Unit test vs System test” here. Unit tests are fast, granular and localized. They should be used to test as much code as possible. However, they are not a replacement for system tests or integration tests and vice versa. We need system tests to ensure that the separate units fit together nicely to make the entire application work. Since system tests tend to be slower, their count should be very low compared to unit tests. A reasonable ratio between unit and system tests would be 9:1.

I feel we are being too harsh on system tests, aren’t we? Wouldn’t it be wonderful if you could make system tests faster? The faster the better. Let’s see how we did it in Transifex.

Test setup in Transifex

  • Most test cases subclass transifex.txcommon.tests.base.BaseTestCase, which is itself a subclass of django.test.TestCase and a few helper classes.
  • BaseTestCase is responsible for loading fixtures and setting up the initial data needed by most tests in Transifex, like sample users, projects, resources, teams, permissions, clients, etc.
  • A test case contains only related test methods.
  • The tests are fixture based.
  • There are very few instances of TransactionTestCase; most test cases are subclasses of TestCase.

The way Django runs instances of TestCase

  • Load fixtures (if any) for each test method
  • Setup url map, test outbox and test client
  • Set up initial data for test method in setUp() method.
  • Run test method
  • Roll back changes made in the database if the database (like PostgreSQL) supports rollback, else truncate tables (in case of MySQL-like databases).
  • Reset url map, fixtures, test outbox and test client
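The per-method nature of this cycle is easy to observe even without Django. The following is a minimal sketch using only the standard `unittest` module; the `PerMethodSetup` class and the `count_setups` helper are hypothetical names, and the `setUp()` body merely stands in for expensive work such as loading fixtures.

```python
import unittest

setup_calls = []  # records every setUp() invocation


class PerMethodSetup(unittest.TestCase):
    def setUp(self):
        # Stand-in for expensive work such as loading fixtures.
        setup_calls.append(self.id())

    def test_one(self):
        self.assertTrue(True)

    def test_two(self):
        self.assertTrue(True)


def count_setups():
    """Run the test case; return (number of setUp() calls, tests run)."""
    del setup_calls[:]
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(PerMethodSetup)
    result = unittest.TestResult()
    suite.run(result)
    return len(setup_calls), result.testsRun


print(count_setups())
```

The number of `setUp()` calls matches the number of test methods, so any cost paid there is multiplied by the size of the test case.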

Causes of concern

  1. Setting up initial test data for each test method of a test case can add a lot of overhead if there’s a lot of initialization done in the setUp method of the test case (as in case of our test cases subclassed from BaseTestCase).
  2. That overhead gets even worse if there are fixtures included in the test case. Django loads them for each test method. Loading fixtures has a considerable overhead and makes the test suite a lot less maintainable. Small changes in model will break fixture importing.

You may be thinking: “Why the hell do I need to set up a lot of data for each test? I can just set up the data I need.”

Yes, you are correct in that. Done the usual way, the setup described in point 1 above adds a lot of latency. But there are other things to consider too. A shared setup helps a developer spend less time setting up the world while writing a test. It’s overkill to set up the world for each test case separately, and it also leads to redundant setup code. As for fixtures, we plan to get rid of them in due course.

It seems like a trade-off between the ease of writing tests and test speed. Well, we are kind of greedy in these cases and want to have both :D

All we needed was to find a way to do away with the latency of setting up the world for the BaseTestCase.

What did we need?

  • Load fixtures once during a run of the entire test suite
  • Set up initial test data once per test case (subclass of BaseTestCase or TestCase)
  • Initial test data setup should write to the database as little as possible

Solution

  1. Load fixtures in the test runner to ensure that this process runs once for the entire test suite run.

    from django.core import management
    from django.db import connections
    from django.test.simple import DjangoTestSuiteRunner

    # The same fixtures that BaseTestCase relies on
    fixtures = ["sample_users", "sample_site",
                "sample_languages", "sample_data"]

    class TxTestSuiteRunner(DjangoTestSuiteRunner):
        def setup_databases(self, **kwargs):
            return_val = super(TxTestSuiteRunner, self).setup_databases(
                              **kwargs)
            # Load the fixtures once, into every configured database
            for db in connections:
                management.call_command('loaddata', *fixtures,
                        **{'verbosity': 0, 'database': db})
            return return_val

  2. Initialize test data in the setUpClass() method of BaseTestCase. Data set up in setUpClass() persists throughout the run of the entire test case. Unless required, data initialization in the setUp() method of a test case can be skipped. For a simple TestCase, Django rolls back all changes made within a test method anyway.
  3. The setup code uses the Model.objects.get_or_create() method to fetch or initialize data, which minimizes database writes.
  4. Rolling back transactions or truncating tables resets the data before running a test method. But how do we reset the variables initialized in the setUpClass() method? Well, in the setUp() method, we copy the class-wide variables to temporary variables using copy.copy(). The test method works with these temporary variables, which leaves the original class-wide variables intact.

    from copy import copy
    from django.test import TestCase

    class BaseTestCase(Languages, NoticeTypes, Translations, TestCase):
        @classmethod
        def setUpClass(cls):
            # super() already binds cls, so no argument is passed
            super(BaseTestCase, cls).setUpClass()
            # Only showing a code snippet...

            # Create teams
            cls._team = Team.objects.get_or_create(language=cls._language,
                project=cls._project, creator=cls._user['maintainer'])[0]
            cls._team_private = Team.objects.get_or_create(
                language=cls._language, project=cls._project_private,
                creator=cls._user['maintainer'])[0]

            # ...

        def setUp(self):
            # super() already binds self, so no argument is passed
            super(BaseTestCase, self).setUp()
            # Only copy test case wide variables
            # to temporary ones to work with in a
            # test method.

            # Only showing a code snippet...

            # test methods operate on self.team instead of self._team
            # and similarly for other variables too
            self.team = copy(self._team)
            self.team_private = copy(self._team_private)

            # ...

  5. Don’t set the url map and fixtures in _pre_setup(), and don’t reset them in _post_teardown(). This needs a bit of tweaking in the _pre_setup() and _post_teardown() methods inherited from django.test.TestCase:
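    # Import paths as found in the Django releases that ship
    # DjangoTestSuiteRunner
    from django.core import mail
    from django.core.management import call_command
    from django.db import DEFAULT_DB_ALIAS, connections, transaction
    from django.test.testcases import (connections_support_transactions,
        disable_transaction_methods, restore_transaction_methods)

    class BaseTestCase(Languages, NoticeTypes, Translations, TestCase):
        # Only showing a code snippet...

        def _pre_setup(self):
            if not connections_support_transactions():
                # truncate tables, load initial data
                # in case database does not support
                # transactions. Hence, no optimization
                # in such cases.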
                fixtures = ["sample_users", "sample_site",
                               "sample_languages", "sample_data"]
                if getattr(self, 'multi_db', False):
                    databases = connections
                else:
                    databases = [DEFAULT_DB_ALIAS]
                for db in databases:
                    call_command('flush', verbosity=0, interactive=False,
                                  database=db)
                    call_command('loaddata', *fixtures, **{'verbosity': 0,
                                 'database': db})
    
            else:
                # Optimization achieved if database
                # supports transactions
                if getattr(self, 'multi_db', False):
                    databases = connections
                else:
                    databases = [DEFAULT_DB_ALIAS]
    
                for db in databases:
                    transaction.enter_transaction_management(using=db)
                    transaction.managed(True, using=db)
                disable_transaction_methods()
            mail.outbox = []
    
        def _post_teardown(self):
            if connections_support_transactions():
                # If the test case has a multi_db=True flag, teardown all
                # databases. Otherwise, just teardown default.
                if getattr(self, 'multi_db', False):
                    databases = connections
                else:
                    databases = [DEFAULT_DB_ALIAS]
    
                restore_transaction_methods()
                for db in databases:
                    transaction.rollback(using=db)
                    transaction.leave_transaction_management(using=db)
            for connection in connections.all():
                connection.close()
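The core of the approach, one-time setup in setUpClass() plus per-test copies in setUp(), can be tried out without Django. The sketch below uses only the standard `unittest` module; `SharedDataTestCase`, `run_shared_demo` and the team dictionary are hypothetical stand-ins for BaseTestCase and its model instances. It uses `copy.deepcopy()` because the stand-in data is a nested dictionary, whereas the snippet above uses `copy.copy()` on model instances.

```python
import copy
import unittest

class_setups = []  # records every setUpClass() invocation


class SharedDataTestCase(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Expensive, shared initialization: runs once per test case class.
        class_setups.append(cls.__name__)
        cls._team = {"name": "greek", "members": ["alice"]}

    def setUp(self):
        # Each test method works on its own copy of the shared data.
        self.team = copy.deepcopy(self._team)

    def test_add_member(self):
        self.team["members"].append("bob")
        self.assertEqual(len(self.team["members"]), 2)

    def test_original_untouched(self):
        # Runs after test_add_member and still sees the pristine data.
        self.assertEqual(self._team["members"], ["alice"])
        self.assertEqual(self.team["members"], ["alice"])


def run_shared_demo():
    """Return (setUpClass calls, tests run, all tests passed)."""
    del class_setups[:]
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SharedDataTestCase)
    result = unittest.TestResult()
    suite.run(result)
    return len(class_setups), result.testsRun, result.wasSuccessful()


print(run_shared_demo())
```

The shared data is built once for both test methods, and the modification made in one test does not leak into the other.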

Results

The results were quite satisfying. With the custom test runner and the new test suite, tests got around 2-3 times faster. Compared to its older counterpart, the new test suite’s speed-up factor is proportional to the number of test methods in a test case. The new test suite, although not yet perfect, is working quite well. As kbairak said here:

holy shit! @rtnpro ‘s modifications make @transifex ‘s test-suite run like a hamster on coffee !!!

The Hub and Child project types

Depending on the type of your project, you can use Transifex in many ways to get the best workflow. Very often companies have many products that are handled under a single umbrella, what we call a ‘Translation Hub’ on Transifex. The main components of a hub are usually the human resources and the release process, and child projects re-use these elements from the parent project.

Let’s take the Fedora Project for example. The Fedora project on Transifex is a hub that hosts the community’s resources, such as the people involved in the translation and the release process.

Hub projects structure

The maintainers of the child projects, like Anaconda and Firstboot, have full control of their projects and can update their translation resources as needed. The people working on the translations really belong to the hub. Ideally these resources could follow the hub’s release cycle and get shipped under specific release versions (F16, F17, devel). This gives more control over what needs to be translated at each point in time.

So, basically a hub on Transifex is a project that holds the logistics of access control, usually behind structured language teams, and makes it available to child projects. Now, the question is:

How can I actually set this on Transifex?

I would say it’s dead simple. If you maintain a project on Transifex, you probably already saw that your project can be categorized under 3 types:

  • Typical: A typical standalone project. It has its own access control rules and is not tied to any other project.

  • Hub: A project set as a Hub will aggregate information from other projects. The language table will include the translations of all its child projects.

  • Child: Projects which re-use the translation teams of a hub project.

Just a couple of check boxes! Straightforward, right? Here is some more information which can help:

  • You can only outsource access to a Hub; outsourcing access to a Typical project is not allowed. Kinda obvious, but worth mentioning.
  • A Hub can’t outsource its access to another hub.
  • Outsourcing team control to a Hub needs to be approved by one of the Hub maintainers, unless both are maintained by the same user.
  • Hubs can have their own sub-domains like https://fedora.transifex.net and https://opentranslators.transifex.net. Get in touch with us if you want to set one.
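The first three rules above can be read as a small decision procedure. The following is only an illustrative sketch, not how Transifex implements it: the `Project` class, the `outsource_request` function, the status strings and the maintainer names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Project:
    slug: str
    kind: str               # "typical", "hub" or "child"
    maintainers: frozenset  # usernames of the project maintainers


def outsource_request(project, target):
    """Return 'rejected', 'approved' or 'pending' for an outsourcing request."""
    if target.kind != "hub":
        return "rejected"   # access can only be outsourced to a hub
    if project.kind == "hub":
        return "rejected"   # a hub can't outsource its access
    if project.maintainers & target.maintainers:
        return "approved"   # both maintained by the same user
    return "pending"        # waits for approval by a hub maintainer


# Hypothetical projects and maintainers
fedora = Project("fedora", "hub", frozenset({"maintainer_a"}))
anaconda = Project("anaconda", "typical", frozenset({"maintainer_b"}))
firstboot = Project("firstboot", "typical", frozenset({"maintainer_a"}))

print(outsource_request(anaconda, fedora))
print(outsource_request(firstboot, fedora))
```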

Ilias Vrachnis is a Transifexian

Ilias ‘vrachil’ Vrachnis joins the Transifex team as a Systems and Security Engineer. A long-term sysadmin monkey, Ilias was managing his university’s most critical servers prior to joining our team. Ilias will be working on making sure Transifex’s availability is top-notch and will be responsible for our dev team support services such as build servers and continuous integration systems.

You can follow Ilias on Twitter and Google+.

PS: Yup, we also love GitHub‘s introductions of new team members. ;-)

The life cycle of a translation resource

One of the core features Transifex provides is handling files with translatable content (resources) in various localization formats, like XML or PO files.

Part of that functionality is to be able to import such files to its internal storage and export them whenever the user requests them, either to ship them with their software or to translate the file to another language on their local computer. Although both operations might use some customized code to handle certain formats (especially for importing resources), there are specific steps that are followed in each case.

Importing a file

Whenever you upload a file with strings in it, Transifex will try to parse it, extract the necessary information and then store that information in the database.

Parsing

Since each format is different, there are specialized parsers for each one. In some cases, Transifex uses a third-party parser, like polib for PO files. In other cases, we have developed custom parsers.

Extracting the information

The main responsibility of a parser is to extract the necessary information from the imported file.

In case the file is the source file (that is, it is the file with the strings in the source language), we are interested in three things:

  • The keys for the translatable strings (like the msgid entries in a PO file). The keys are used to uniquely match the strings in the source language with those in translations. We also generate a unique hash for each key as an identifier.
  • The translatable strings in the source language, if there are any (like the msgstr entries in a PO file). These are the actual strings of the source language.
  • The template of the file. The template is a skeleton of the source file: it is mostly the same, except that the translatable strings have been replaced with the hashes of the corresponding keys, acting as placeholders. This is necessary for the export operation.
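To make these three pieces concrete, here is a minimal sketch of an import step for an imaginary "key = value" file format. It is not the Transifex parser: the `import_source` function, the file format and the choice of MD5 as the hash are assumptions made for illustration only.

```python
import hashlib


def import_source(content):
    """Split a 'key = value' file into source strings and a template."""
    strings = {}
    template_lines = []
    for line in content.splitlines():
        if "=" not in line or line.lstrip().startswith("#"):
            # Comments and blank lines are kept verbatim in the template.
            template_lines.append(line)
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        # Assumption: an MD5 digest of the key serves as its unique hash.
        key_hash = hashlib.md5(key.encode("utf-8")).hexdigest()
        strings[key_hash] = {"key": key, "source": value}
        # In the template, the string is replaced by the hash placeholder.
        template_lines.append("%s = %s" % (key, key_hash))
    return strings, "\n".join(template_lines)


strings, template = import_source("# greeting\nhello = Hello, world!")
print(template)
```

The returned dictionary holds the keys and source strings, while the template keeps the layout of the original file with a placeholder in place of each string.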

In case the file is a translation of the resource in a language, we are only interested in the translations (this means that any changes in the file are ignored).

Storing

As soon as we have the necessary information from the previous step, we store it in the database as source entities, translations and templates.

Exporting a file

Whenever a user asks to download a translation file in a particular language, the file has to be exported from the database.

The procedure is quite standard for all formats. After fetching the template and the translation strings in the requested language, we do a search-&-replace in the template, replacing the hashes in it with the actual strings that correspond to each hash. Next, any format-specific operations are performed (like adding the translator copyrights in PO files) and the result is delivered to the user.
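The search-and-replace step can be sketched in a few lines. This is an illustration under assumptions, not the actual export code: the `export_translation` function is hypothetical, and the placeholders are assumed to be 32-character hexadecimal hashes.

```python
import re

# Assumption: placeholders are 32-character lowercase hex digests.
HASH_RE = re.compile(r"[0-9a-f]{32}")


def export_translation(template, translations):
    """Replace each hash placeholder in the template with its translation."""
    def substitute(match):
        # Placeholders without a translation are left untouched here.
        return translations.get(match.group(0), match.group(0))
    return HASH_RE.sub(substitute, template)


# Hypothetical template and Hindi translations keyed by hash
template = "# greeting\nhello = 5d41402abc4b2a76b9719d911017c592"
hindi = {"5d41402abc4b2a76b9719d911017c592": "नमस्ते, दुनिया!"}

print(export_translation(template, hindi))
```

Doing the substitution in a single pass means the text of one translation can never be mistaken for another placeholder.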

You can find more details for the storage engine of Transifex in the docs.

Collecting clothes for the less privileged

Tomorrow we’ll be collecting clothes and food for the less privileged people around us. If you’re close to one of our offices and want to participate, please feel free to drop by and give us any extra clothes you might have. We’ll be handing them over to the local church to share with families which really need them.

For more information, do not hesitate to contact us. Thanks!

Switching to Gravatar

One of the personalization features Transifex offers is the support for avatars; each user is able to associate a small picture with his account, which makes it easier for other users to identify him.

Currently, there are two ways that avatars are supported: by uploading your own image when editing your profile or by using Gravatar.

However, on Thursday, April 5th, we will drop support for user-uploaded avatars and switch completely to Gravatar.

Our goal behind this decision is to make things as simple and easy as possible. We think Gravatar is a very good service; it does one thing and does it well: providing a web-friendly, globally recognized avatar for you across all the websites you visit. So, we feel that there is no point in serving custom avatars for our users anymore.

Setting a Gravatar

If you do not already have one, here is how you can set your Gravatar:

  • Go to the Gravatar signup form at https://en.gravatar.com/site/signup and enter the e-mail address you use on Transifex.
  • If you have registered in the past, a red box will tell you so. Otherwise, continue with your registration.
  • After you have activated your account, you will be able to upload an image from your computer or a URL.

If you do not want to get into that (and that is totally cool with us), Gravatar will render an Identicon for you as a fallback (this is what happens right now as well). An Identicon is a visual representation of a hash value, in Gravatar’s case derived from your e-mail address: a kind of digital fingerprint.
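For the curious, here is a minimal sketch of how such an avatar URL can be built, based on Gravatar’s public image URL scheme (an MD5 hash of the trimmed, lowercased e-mail address, with `d=identicon` selecting the fallback). The `gravatar_url` function name and the default size are our own choices for illustration, not Transifex code.

```python
import hashlib


def gravatar_url(email, size=80):
    """Build a Gravatar image URL that falls back to an Identicon."""
    # Gravatar hashes the trimmed, lowercased e-mail address.
    digest = hashlib.md5(email.strip().lower().encode("utf-8")).hexdigest()
    return "https://www.gravatar.com/avatar/%s?s=%d&d=identicon" % (digest, size)


print(gravatar_url("user@example.com"))
```

Because the address is normalized before hashing, the same e-mail always maps to the same image, however it was typed.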

Translation between programming languages

You have probably already used Transifex successfully to translate apps from English to other languages, like Spanish, French and Chinese. Our technology allows developers to take content from one language and translate it to another one, by splitting the work in small tasks, which can be independently translated, thus, crowdsourced. It is working very well for spoken languages, but there are also other areas where it could be useful.

Every now and then, when we explain what Transifex does, we get the following question: “When you say ‘translating’, do you mean between spoken languages or programming languages, like C and Python?”

The answer has been the former, but the latter always sounded uber-cool. So, because we like making people happy, we decided to add support for translation between programming languages. Transifex users will now be able to translate not only from English to Spanish, but also from Python to C, Perl or PHP.

Here’s an example input and output of the initial run of the machine translation module from Python to Ruby:

Python:

def fib(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fib(n-1)+fib(n-2)

i = 0
while i < 35:
    print fib(i)
    i = i + 1

Ruby:

def fib(n)
    if n == 0
        return 0
    elsif n == 1
        return 1
    else
        fib(n-1)+fib(n-2)
    end
end

i = 0
while (i < 35)
    puts fib(i)
    i = i + 1
end

Technology

Using technologies like Natural Language Processing, which is already available in Transifex, and a combination of compiler technology, finite-state automata and genetic algorithms, Transifex offers a rough translation between the two languages. Then, the user can review and correct the translation using our web-based editor.

Here is a blueprint of the under-the-hood technologies used:

  • Lexical Analysis: The source language is defined using certain rules, which are fed to the lexer. These are mostly defined using BNF. So, the lexer can identify the tokens, delimiters and keywords. In order to support many languages as input, we have a different set of rules for each language. Once the lexer tokenizes the content, it passes the result to the parser, which combines the tokens together.

  • Syntax Analysis: The output from the lexer is parsed in order to build the Abstract Syntax Tree of the source code, which is a simple representation of the original source code. The parsed output is saved to the database.

When the output is requested in a particular programming language, we use the stored AST of the program and apply the reverse procedure to generate the source code in that language which would correspond to that AST. Custom functions have been developed for this reverse procedure that try to generate as simple and readable code as possible. However, since efficiency is often important, too, the generated code can be hand-edited with lotte, the web-based editor.

Language pairs supported

We are launching our first version with the following language pairs:

  • Python ↔ Ruby
  • Python ↔ Javascript
  • Ruby ↔ Javascript
  • Python → Perl
  • Python → PHP
  • PHP → C

We are rolling this in Beta for a selected group of users. If you would like to use it, drop us a note in the comments section of this post.

Note: This page was posted for April Fool's Day 2012. :D

Update on HTML parser, file types and suggestions

Transifex got another minor update yesterday. Here’s a list of some new nifty features:

HTML Parser – Upload translated files

You can now bootstrap the localization process of your project by uploading already translated HTML files. The HTML parser has received an overhaul and now supports importing translated HTML files, assuming the files are 100% translated.

Upload strings from translated HTML file

Keep an eye on the devil lying in the details: the new version of the HTML parser will only be used for newly created resources. Maintainers of resources that were handled with the old parser will be informed about the migration of their resources, until they all gradually take advantage of our new pet. =)

File types – Create and update

Carefully listening to what your needs are, we now support two more file types:

  1. Plaintext with .txt extension.
  2. Property list files with .plist extension, mostly used by Apple.

Furthermore, the Android parser now recognizes comments from developers and stores them. As a result, translators can see those comments, when translating the files through the web editor, which adds more context to the translation making their work easier. The convention Transifex follows for comments is that each comment is associated with the next translation string.
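The "comment belongs to the next string" convention can be illustrated with a short sketch. This is not the Transifex Android parser: the `strings_with_comments` function and the sample XML are hypothetical, and only plain `<string>` elements are handled.

```python
from xml.dom import Node, minidom


def strings_with_comments(xml_text):
    """Return (name, text, comment) tuples, pairing each <string> element
    with the comment that directly precedes it, if any."""
    root = minidom.parseString(xml_text).documentElement
    pending = None
    entries = []
    for node in root.childNodes:
        if node.nodeType == Node.COMMENT_NODE:
            # Remember the comment for the next translation string.
            pending = node.data.strip()
        elif node.nodeType == Node.ELEMENT_NODE and node.tagName == "string":
            text = "".join(child.data for child in node.childNodes
                           if child.nodeType == Node.TEXT_NODE)
            entries.append((node.getAttribute("name"), text, pending))
            pending = None  # a comment applies to one string only
    return entries


# Hypothetical Android resource file
SAMPLE = """<resources>
    <!-- Shown on the welcome screen -->
    <string name="hello">Hello, world!</string>
    <string name="bye">Goodbye!</string>
</resources>"""

for entry in strings_with_comments(SAMPLE):
    print(entry)
```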

Suggestions – Visible by default

No need for extra clicks or keystrokes. From now on, suggestions are visible by default, making the life of translators a little bit easier. Keep in mind that we are already working on making them look nicer.

Visible suggestions

More to come. Stay tuned!

API version 1 removed

One week from now, version 1 of the API will be completely removed from Transifex (it was deprecated 9 months ago). It has been replaced by version 2 of the API. Everyone who has used the old version of the API in the last few months has already been notified by email.

This also means that versions of the Transifex client lower than 0.6 will not work any more, since they use version 1 of the API. For the record, the current version of the client is 0.7.2.

Instructions on how to install or upgrade to the newest version of the client can be found in the docs section.