W3C Workshop Program:
The Multilingual Web: Where Are We?
26-27 October 2010, Madrid, Spain
The MultilingualWeb project is looking at best practices and standards related to all aspects of creating, localizing and deploying the Web multilingually. Coordinated by the W3C, the project aims to raise the visibility of existing best practices and standards and identify gaps. This first workshop in Madrid, Spain, was hosted by the Universidad Politécnica de Madrid.
Each main session began with a half-hour 'anchor' presentation, followed by a series of 15-minute talks. Questions and answers were saved for a (typically) half-hour discussion slot at the end of each session. All attendees participated in all sessions.
The IRC log is the raw scribe log; it has not undergone careful post-editing, may contain errors or omissions, and should be read with that in mind. It represents the scribes' best efforts to capture, in real time, the gist of the talks and the discussions that followed. IRC was used not only to capture notes on the talks but also to let remote participants, or participants with accessibility needs, follow the workshop in real time and add their own contributions to the flow of text.
Some video links are missing, either because the speaker requested it or in some cases due to technical problems. In one case an audio track is available, rather than a video. Thanks to the Universidad Politécnica de Madrid for recording and hosting the videos.
Related links: Workshop report • About W3C
26 October
Guillermo Cisneros
Director, Escuela Técnica Superior de Ingenieros de Telecomunicación (ETSIT- UPM)
Workshop opening and welcome
Kimmo Rossi
European Commission - DG INFSO E1
EC language programs and hopes for the future
Reinhard Schäler
LRC, University of Limerick
The Multilingual Web, Policy Making and Access to Digital Knowledge for All
abstract Access to digital knowledge is no longer just a "nice-to-have"; it is a fundamental human right, as important as access to food and water or to appropriate educational and health services. The World Health Organisation has reported that thousands of people die every day because they do not have access to appropriate health information. Content and languages currently ignored by mainstream localisation efforts – because there is no "business case" for them – can realistically only be tackled using leading-edge component technologies linked together in standardised and interoperable frameworks. Efforts under the umbrella of The Rosetta Foundation and the United Nations' Internet Governance Forum to create such an open framework will be outlined, and their potential to reach billions of users currently excluded from the digital world will be highlighted.
Richard Ishida
W3C (World Wide Web Consortium)
The Multilingual Web: Latest developments at the W3C/IETF
abstract The World Wide Web Consortium (W3C) develops base standards for the Web, such as HTML, CSS, SVG, XML, the Semantic Web and so on. Since the beginning, "Web for All" has been a fundamental goal of the W3C. Richard's talk will look at the work of the W3C and some other key organizations that are helping to develop standards and best practices that make the World Wide Web international — what has been done and what is currently in progress.
Axel Hecht
Mozilla
Localizing the web from the Mozilla perspective
abstract Axel will present the achievements and challenges Mozilla faces both as a browser vendor and as the operator of a variety of websites. Firefox is available in over 70 languages, with over 40 locales participating early in the beta program for Firefox 4. We host a variety of websites in a variety of languages, based on a variety of infrastructures. We'll present what works, and where we're still researching and developing. How does a multilingual web differ for static sites, for web applications, and for live multilingual documents? Technically, what is "localizing HTML"? Also, how can we serve our users the web in their language and respect their privacy at the same time? Can we improve locale choice?
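By way of illustration, here is a minimal sketch of server-side language negotiation against the HTTP Accept-Language header, which sits at the heart of the locale-choice question; the available-locales list and matching rules are assumptions for the example, not Mozilla's implementation.

```python
# Minimal negotiation of an HTTP Accept-Language header.
# Illustrative only; the locale inventory below is an assumption.

AVAILABLE = ["en-US", "de", "fr", "es-ES"]  # locales the site actually ships
DEFAULT = "en-US"

def parse_accept_language(header):
    """Parse 'de-DE,de;q=0.8,en;q=0.5' into [(tag, q), ...], best first."""
    tags = []
    for part in header.split(","):
        piece = part.strip()
        if not piece:
            continue
        if ";q=" in piece:
            tag, q = piece.split(";q=", 1)
            try:
                q = float(q)
            except ValueError:
                q = 0.0
        else:
            tag, q = piece, 1.0
        tags.append((tag.strip(), q))
    return sorted(tags, key=lambda item: item[1], reverse=True)

def negotiate(header):
    """Best available locale: exact match first, then primary-subtag match."""
    for tag, _q in parse_accept_language(header):
        for locale in AVAILABLE:
            if locale.lower() == tag.lower():
                return locale
        primary = tag.split("-")[0].lower()
        for locale in AVAILABLE:
            if locale.split("-")[0].lower() == primary:
                return locale
    return DEFAULT

print(negotiate("de-DE,de;q=0.8,en;q=0.5"))  # -> "de"
```

Note that the full Accept-Language header is also a browser-fingerprinting vector, which is what makes the privacy question in the abstract non-trivial.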
Charles McCathieNevile
Opera
The Web everywhere: Multilingualism at Opera
abstract This presentation covers approaches to the localization of the various Opera browser versions and other assets.
Jan Nelson, Peter Constable
Microsoft
Bridging languages, cultures, and technology
abstract Microsoft's products span a very wide range of applications that depend on advancing Web technologies. This talk will provide a high-level overview of how our currently shipping products reach across regions, languages and markets, and of the extent to which our "connected" products are available. We will also look at some areas of progress and some outstanding challenges in the globalization of Web applications.
[Chair, Adriane Rinsche • Scribe, Jirka Kosek]
Roberto Belo Rovella, David Vella
BBC World Service
Challenges for a multilingual news provider: pursuing best practices and standards for BBC World Service
abstract The BBC World Service operates in 32 languages, and many of them still present challenges for correct display on various web platforms, particularly on mobile devices. Roberto Belo-Rovella (Interactive Editor) and David Vella (Software Engineer) explore some of these issues and the solutions implemented, aimed at reaching the audience on whichever platform they use to access BBC content.
Paolo Baggia
Loquendo
Multilingual Aspects in Speech and Multimodal Interfaces
abstract How does multilinguality affect standards for voice and, more generally, multimodal applications? Is speech different from text as regards languages? The talk covers tricky issues, current standard solutions, best practices, and open questions that arise when you shift the focus from the written/visual to the spoken/auditory domain.
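For context, language switching in speech output is typically expressed with xml:lang, as in W3C's Speech Synthesis Markup Language (SSML). A minimal, illustrative document and the few lines needed to inspect it (the text content is invented; whether a synthesizer has a voice for each language is implementation-dependent):

```python
# A minimal SSML 1.0 document that switches language mid-utterance.
import xml.etree.ElementTree as ET

SSML = """<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  The French call this
  <voice xml:lang="fr-FR">la mondialisation</voice>,
  and the Germans
  <voice xml:lang="de-DE">die Globalisierung</voice>.
</speak>"""

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
root = ET.fromstring(SSML)
print(root.get(XML_LANG))                                   # en-US
for voice in root.iter("{http://www.w3.org/2001/10/synthesis}voice"):
    print(voice.get(XML_LANG), voice.text.strip())
```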
Luis Bellido
Universidad Politécnica de Madrid
Experiences in creating multilingual web sites
abstract Luis talked about the "Lingu@net Europa" web site, a multilingual center for language learning available in 32 languages. The content is created not by professional translators but by language-teaching professionals, for whom technologies like translation memories are not easy to learn. Hence, the site is maintained through a workflow they find easy to use. In terms of standards, the UTF-8 character encoding, HTML, CSS and XML play a crucial role, but other technologies such as MS Office and the text-indexing framework Lucene are applied too. One issue in creating multilingual sites is how to handle "multilingual links"; in the project, XSLT was used to create such links, and Luis suggested that having such facilities in a CMS would be helpful.
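As a purely hypothetical sketch (not the Lingu@net implementation, which used XSLT), such "multilingual links" can be generated from a simple table mapping each page to its equivalents in other languages; the page table and URLs below are invented.

```python
# Generate cross-language alternate links for a page from a pages table.

PAGES = {
    "search": {"en": "/en/search.html", "de": "/de/suche.html",
               "es": "/es/busqueda.html"},
}

def alternate_links(page_id, current_lang):
    """Emit <link rel="alternate" hreflang="..."> tags for the other languages."""
    links = []
    for lang, url in sorted(PAGES[page_id].items()):
        if lang != current_lang:
            links.append('<link rel="alternate" hreflang="%s" href="%s"/>'
                         % (lang, url))
    return "\n".join(links)

print(alternate_links("search", "en"))
```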
Pedro L. Díez Orzas, Giuseppe Deriard, Pablo Badía Mas
Linguaserve
Key Aspects of Multilingual Web Content Life Cycles: Present and Future
abstract Pedro emphasized that the creation of multilingual content is a process. One important piece of information in that process is, for example, what does not need to be translated. To express that information, the tDTD format ("translatability data type definition") has been developed. However, in a common CMS and content life cycle, multilinguality (i.e. this kind of information) is not regarded as important. Methodologies and workflows are becoming increasingly "hybrid": traditional human translation, machine translation (MT) combined with post-editing, or "MT only". To make the creation of such workflows easy, CMSs have to take the requirements of multilingual content management into account.
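A hypothetical sketch of routing content through such hybrid workflows; the content types and rules below are invented purely for illustration and are not tDTD or Linguaserve's system.

```python
# Route content to one of the hybrid workflows the talk mentions.

def choose_workflow(content_type, visibility):
    """Return one of 'human', 'mt+post-edit', 'mt-only' (invented rules)."""
    if content_type == "legal":
        return "human"              # no MT risk tolerated
    if visibility == "public":
        return "mt+post-edit"       # MT draft, human review
    return "mt-only"                # internal or ephemeral content

for item in [("legal", "public"), ("manual", "public"), ("forum", "internal")]:
    print(item, "->", choose_workflow(*item))
```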
Max Froumentin
World Wide Web Foundation
The Remaining Five Billion: Why is Most of The World's Population Not Online and What Mobile Phones Can Do About It
abstract This presentation explains how the huge penetration rate of mobile phones is making them the prime tool to bridge the digital divide, and shows that there is a lot to be done before it happens: information should be made available using voice or messaging, it should be relevant to its users (and in their language), and everybody should be able to contribute to it.
[Chair, Charles McCathieNevile • Scribe, Felix Sasaki]
Christian Lieske
SAP
Best Practices and Standards for Improving Globalization-related Processes
abstract Today's globalization-related processes, such as translation, can benefit from a number of best practices and standards developed by various standards bodies. Starting from a sketch of enterprise-scale globalization-related processes, the presentation will touch on the material developed by these organizations. In addition, relations and gaps in the ecosystem will be discussed.
Josef van Genabith
Centre for Next Generation Localisation (CNGL)
Next Generation Localisation
abstract Next Generation Localisation can be conceptualised in terms of a spatial metaphor: a cube with three axes: volume, access and personalisation. In the talk I will show which part of this "Localisation Cube" is addressed by current state-of-the-art technologies as used in the industry, and how Next Generation Localisation technologies will allow us to address each point in the cube at configurable quality and speed.
Daniel Grasmick
Lucy Software
Applying Standards – Benefits and Drawbacks
abstract Daniel briefly reviewed the history of standards such as TMX, TBX and XLIFF. He then asked whether standards are always the best approach, and argued that they may be overkill for some projects. For other projects, where large volumes of content have to be exchanged and data is accessed by many, a standard format like XLIFF is absolutely appropriate.
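For readers unfamiliar with the format, here is a minimal XLIFF 1.2 document and the few lines needed to read it; the strings are invented, but the element structure follows the XLIFF 1.2 specification.

```python
# A minimal XLIFF 1.2 file read with the standard library.
import xml.etree.ElementTree as ET

XLIFF = """<?xml version="1.0" encoding="UTF-8"?>
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="ui.properties" source-language="en" target-language="de"
        datatype="plaintext">
    <body>
      <trans-unit id="1">
        <source>Save file</source>
        <target>Datei speichern</target>
      </trans-unit>
    </body>
  </file>
</xliff>"""

NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}
root = ET.fromstring(XLIFF)
for unit in root.findall(".//x:trans-unit", NS):
    print(unit.get("id"),
          unit.find("x:source", NS).text, "->",
          unit.find("x:target", NS).text)
```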
Marko Grobelnik
Institut Jozef Stefan
Cross-lingual document similarity for Wikipedia languages
abstract Marko explained that cross-lingual information retrieval can be seen in a scenario where a user initiates a search in one language but expects results in more than one language. There are many areas involved in text-related search, and each represents data about text in a slightly different way. Marko focused on the correlated vector space model and described how the system they are currently building works using Wikipedia correlated texts.
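A simplified sketch of the comparison step: once two documents in different languages have been projected into a shared "correlated" space, topical similarity reduces to cosine similarity. The projections themselves (learned from aligned Wikipedia article pairs) are omitted here; the vectors are made up, and the snippet assumes numpy.

```python
# Cosine similarity between two documents in a shared cross-lingual space.
import numpy as np

def cosine(a, b):
    """Cosine similarity of two vectors in the shared space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_en = np.array([0.8, 0.1, 0.3])   # English doc, already projected
doc_es = np.array([0.7, 0.2, 0.4])   # Spanish doc, already projected

print(round(cosine(doc_en, doc_es), 3))  # high value -> likely same topic
```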
[Chair, Felix Sasaki • Scribe, Elliot Nedas]
27 October
Felix Sasaki
DFKI
Language resources, language technologies, text mining, the Semantic Web: How interoperability of machines can help humans in the multilingual web
abstract Felix introduced applications concerning summarization, machine translation (MT) and text mining, and showed what is needed in terms of resources. He identified different types of language resources, and distinguished between linguistic and statistical approaches. Machines need three types of data: input, resources and workflow; currently, three types of gaps exist in this data scenario: metadata, process and purpose. These gaps were exemplified with an MT application. The purpose gap specifically concerns the identification of metadata, process flows and the employed resources. Any identification must be facilitated across applications with a common understanding, and therefore different communities have to join in and share the information to be employed in the descriptive part of the identification task. A solution that can provide a machine-readable information foundation is offered by the semantic technologies of the Semantic Web (SW). A shallower approach for web pages than the fully-fledged SW is available through microformats or RDFa. A few examples gave insights into how the SW actually contributes to closing these gaps.
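To make the Semantic Web machinery concrete: in RDF, multilingual information is carried by language-tagged literals. A small sketch using the third-party rdflib package (the resource URI is invented):

```python
# Language-tagged literals as multilingual labels in RDF.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import RDFS

g = Graph()
thing = URIRef("http://example.org/id/house")
g.add((thing, RDFS.label, Literal("house", lang="en")))
g.add((thing, RDFS.label, Literal("Haus", lang="de")))
g.add((thing, RDFS.label, Literal("casa", lang="es")))

def label_in(graph, subject, lang):
    """Return the label in the requested language, if any."""
    for _s, _p, o in graph.triples((subject, RDFS.label, None)):
        if o.language == lang:
            return str(o)
    return None

print(label_in(g, thing, "de"))  # Haus
```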
Nicoletta Calzolari Zamorani
CNR-ILC
Language Resources: a pillar of Language Technology
abstract The traditional production process is too costly. It is urgent to create a framework that enables effective cooperation among many groups on common tasks, adopting the paradigm of accumulation of knowledge that has been so successful in more mature disciplines such as biology, astronomy and physics. This requires a change of paradigm and the design of a new generation of language resources, based on open content-interoperability standards. The Semantic Web notion may help in determining the shape of the language resources of the future, consistent with the vision of an open, distributed space of sharable knowledge available on the web for processing. This enables building on each other's achievements, integrating results, and making them accessible to various systems and applications. This is the only way to make a great leap forward.
Thierry Declerck
DFKI
lemon: An Ontology-Lexicon model for the Multilingual Semantic Web
abstract In order to allow ontologies to interact with multilingual text in both the analysis and the generation mode, it is necessary to model the relation that natural language expressions have with language-independent knowledge representation systems. Most of the latter use a label attribute to encode the natural language expressions that correspond to a concept, and often such labels exist only in English, or only in the language of the country for which a taxonomy or ontology has been designed. Such labels correspond in fact to terms, which are not explicitly linked to other terms/labels of other concepts; as such, a lot of information about possible linguistic realizations of concepts is left out. The Semantic Web, and in particular the Linked Data project, proposes solutions that allow for the re-use of lexical and terminological resources by semantic interlinking. However, there is currently no standard for describing the relationship between natural language expressions and ontology elements. Therefore a central aspect of the Monnet project on Multilingual Ontologies for Networked Knowledge (http://www.monnet-project.eu/) is the design and development of a model that associates linguistic information with domain semantics. This model, which we call lemon (lexicon model for ontologies), builds on existing work, in particular LMF, ISOcat, SKOS, LexInfo and LIR. Lemon is an RDF model that allows lexical data to be shared and interlinked on the Web, and is a central endeavor towards standardizing lexicalized, multilingual knowledge representation on the Semantic Web.
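A tiny, illustrative lemon-style entry in Turtle, parsed with rdflib; the namespace and property names follow the published lemon core model, but should be treated as an approximation rather than as normative Monnet output.

```python
# A lemon-style lexical entry linking the German word "Haus" to a concept.
from rdflib import Graph

TTL = """
@prefix lemon: <http://lemon-model.net/lemon#> .
@prefix ex:    <http://example.org/> .

ex:lex_Haus a lemon:LexicalEntry ;
    lemon:canonicalForm [ lemon:writtenRep "Haus"@de ] ;
    lemon:sense [ lemon:reference ex:House ] .
"""

g = Graph()
g.parse(data=TTL, format="turtle")
print(len(g), "triples")   # the entry expands to a handful of triples
```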
José Carlos González
Universidad Politécnica de Madrid / DAEDALUS
Turning multilingual resources into applications: a market perspective
abstract The talk will show how language resources and tools can evolve to satisfy present and future demands from clients across industries. The line of argumentation will be supported by 13 years of experience since the founding of a university spin-off around language technologies.
Jörg Schütz
bioloom group
Semantic Technologies in Multilingual Business Intelligence
abstract Business Intelligence (BI) has recently gained new momentum with the increased penetration of cloud computing and open-source software in the field of business analytics, forecasting and business process optimization. Even SMEs and micro companies are now in a position to employ BI tools that were previously reserved for large enterprises, with enormous license costs and additional human resources.
Piek Vossen
VU University Amsterdam
KYOTO: a platform for anchoring textual meaning across languages
abstract The KYOTO project builds a platform that can be used by social groups to model the meaning of their terms and to use this model to mine facts from the textual sources in their community. Since the KYOTO project uses a generic architecture that can be applied to any set of languages, the modeling of terms and the extraction of facts from text are interoperable across these languages as well. The core semantic technology of KYOTO thus enables social groups with a common interest to access their knowledge using formal systems, connecting Web 2.0 communities to Semantic Web 3.0 technology, but it also allows communities to be created across language borders. Likewise, knowledge that is implicit in these social communities can be exchanged across different language communities, creating a more global understanding and exchange.
Christian Lieske
SAP
W3C Internationalization Tag Set
abstract ITS is a W3C Recommendation that helps to internationalize XML-based content. Content that has been internationalized with ITS can be processed more easily by humans and machines. ITS also plays an important role in the W3C Note "Best Practices for XML Internationalization". Christian explained that seven so-called data categories are the heart of ITS. They cover topics such as marking a range of content that must not be translated. ITS thus helps humans and machines: ITS information can, for example, help to configure a spell checker or to communicate with a translator. ITS data categories are valuable in themselves – you do not need to work with the ITS namespace – and are therefore also useful for RDF or other non-XML data. Although ITS is a relatively new standard, Christian was able to point to existing implementations (e.g. the Okapi framework) that support ITS-based processing. In addition, he sketched first scenarios and visions for a possible pivotal role of ITS in the creation of multilingual, Web-based resources: clients (such as Web browsers) that interpret ITS and can thus feed more adequate content to machine translation systems.
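To make the Translate data category concrete, here is a minimal, invented document with a local its:translate="no" marker and the few lines needed to honour it (full ITS also defines inheritance and global rules, omitted here):

```python
# Reading the ITS "Translate" data category on local elements.
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0"?>
<doc xmlns:its="http://www.w3.org/2005/11/its">
  <p>Press the <cmd its:translate="no">--verbose</cmd> flag for details.</p>
</doc>"""

TRANSLATE = "{http://www.w3.org/2005/11/its}translate"
root = ET.fromstring(DOC)
for el in root.iter():
    if el.get(TRANSLATE) == "no":
        print("do not translate:", el.text)
```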
[Chair, Dan Tufis • Scribe, Jörg Schütz]
Ghassan Haddad
Facebook
Facebook Translation Technology and the Social Web
abstract The interactive nature of the social web presents unique challenges and opportunities unseen in traditional translation models. Current approaches to creating user interfaces and the translation technologies available are inadequate when it comes to providing fast, high-quality translations, and they do not take advantage of the capabilities inherent in the social web. This talk will describe the challenges and opportunities and show how Facebook's technology deals with both.
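For readers unfamiliar with community translation in general, here is a deliberately generic, hypothetical sketch of the propose-and-vote pattern (explicitly not Facebook's actual system): users propose candidate translations for a source string, vote, and the top-voted candidate wins.

```python
# Generic community-translation voting sketch.
from collections import defaultdict

votes = defaultdict(int)   # (source, candidate) -> vote count

def propose(source, candidate):
    votes[(source, candidate)] += 0   # register the candidate with zero votes

def vote(source, candidate):
    votes[(source, candidate)] += 1

def best(source):
    candidates = {c: n for (s, c), n in votes.items() if s == source}
    return max(candidates, key=candidates.get) if candidates else None

propose("Like", "Gefällt mir")
propose("Like", "Mögen")
for _ in range(3):
    vote("Like", "Gefällt mir")
vote("Like", "Mögen")
print(best("Like"))   # Gefällt mir
```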
Denis Gikunda
Google
Google's community Translation in Sub Saharan Africa
abstract Sub-Saharan Africa has around 14% of the world's population yet only 2% of the world's internet users. Low representation of African languages and a lack of relevant content remain among the biggest causes of this discrepancy. Google is serious about Africa, and our strategy is to get users online by developing an accessible, relevant and sustainable internet ecosystem. This talk will explore the relevance facet by sharing insights from two recent community translation initiatives aimed at increasing African-language web content.
Emmanuelle Gutiérrez y Restrepo, Loïc Martínez Normand
Sidar Foundation
Localization and web accessibility
abstract The localization of web content is a creative and evolving undertaking that can maintain and even increase the accessibility of the original content. In any case, the localized version should never be less accessible than the original version. For this reason it is essential for web content localization practitioners to understand and apply the web accessibility guidelines (WCAG 2.0) with the goal of providing high quality results that also respect the rights of all users.
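One localization-relevant accessibility rule from WCAG 2.0 is Success Criterion 3.1.1 ("Language of Page"), which requires the page language to be declared. A minimal check using only the Python standard library (a real checker covers far more, including SC 3.1.2, "Language of Parts"):

```python
# Check that a page declares its language on the <html> element.
from html.parser import HTMLParser

class LangCheck(HTMLParser):
    def __init__(self):
        super().__init__()
        self.html_lang = None

    def handle_starttag(self, tag, attrs):
        if tag == "html":
            self.html_lang = dict(attrs).get("lang")

checker = LangCheck()
checker.feed("<html lang='es'><body>Hola</body></html>")
print(checker.html_lang or "WARNING: no lang attribute on <html>")
```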
Swaran Lata
Department of Information Technology, Government of India
Challenges for the Multilingual Web in India: Technology development and Standardization perspective
abstract After setting the scene with an overview of the challenges in India and the complexity of Indian scripts, Swaran Lata talked about various technical challenges they are facing, and some key initiatives aimed at addressing those challenges. For example, there are e-government initiatives in the local languages of the various states, and a national ID project that brings together multilingual databases, online services and web interfaces. She then mentioned various standardization-related challenges, and initiatives that are in place to address those.
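One concrete example of the script-level complexity mentioned above: the same Devanagari letter can be encoded two ways. DEVANAGARI LETTER QA exists precomposed (U+0958) and as KA plus NUKTA (U+0915 U+093C); U+0958 is a Unicode composition exclusion, so normalization settles on the decomposed form, and searching or matching must normalize first.

```python
# Unicode normalization for a Devanagari nukta letter.
import unicodedata

precomposed = "\u0958"          # QA as one code point
decomposed = "\u0915\u093c"     # KA + NUKTA

print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", precomposed) == decomposed)  # True
```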
[Chair, Chiara Pacello • Scribe, Charles McCathieNevile]
Mark Davis
Google / Unicode Consortium
Software for the world: Latest developments in Unicode and CLDR (videocast)
abstract Mark started with information about the extent of Unicode use on the Web, and the recent release of Unicode 6.0. He then talked about International Domain Names and recent developments in that area, such as top level IDNs and new specifications related to IDNA and Unicode IDNA compatibility processing. For the remainder of his talk, Mark described CLDR (the Common Locale Data Repository), which aims to provide locale-based data formats for the whole world so that applications can be written in a language independent way. This included a description of BCP 47 language tag structure, and the new Unicode locale extension.
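For illustration, CLDR data can be exercised from Python via the third-party Babel package, which bundles CLDR: the same number and date render differently per locale without any per-language code.

```python
# Locale-aware formatting backed by CLDR data (via Babel).
from datetime import date
from babel.dates import format_date
from babel.numbers import format_decimal

d = date(2010, 10, 26)
print(format_decimal(1234567.89, locale="en_US"))   # 1,234,567.89
print(format_decimal(1234567.89, locale="de_DE"))   # 1.234.567,89
print(format_date(d, format="long", locale="es"))   # 26 de octubre de 2010
```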
[Chair, Richard Ishida • Scribe, Charles McCathieNevile]