Having previously suggested that Google might be doing some interesting things with regard to minority languages, I was delighted to receive the following press release about the Apertium Welsh-English translator.
I know that they have had a few setbacks in the past, and they seem like nice guys, so it is great to see them getting a bit of a boost.
Fran’s comment about not getting any applications from Welsh students is interesting. Many people have said that Wales should be well placed to lead in bilingual software design, localisation, translation technology and so on, and we have some great people doing excellent work around the country. Maybe we need to think about how that work could feed more directly into computing and other university curricula, to build a knowledge and skills base and develop an industry.
Press Release
12 May 2009
Automatic translation from Welsh gets a boost from France!
High-quality Welsh-English machine translation will come a step closer when a
new initiative gets underway this month.
The multinational Apertium team, which released their Welsh-English translator
(http://www.cymraeg.org.uk [1]) in August 2008, has been accepted into the
fifth Google Summer of Code™ [2], and one of the projects to be funded will
be an improvement to that translator.
Apertium (http://www.apertium.org) is a Free Software [3] machine translation
platform. It was first developed to handle translation between related
languages in Spain, but over the last few years it has been extended to deal
with other languages. To date, translators for 17 language pairs have been
released, covering languages spoken by 1.1bn people, from English (est. 500m
speakers) to Aranese (est. 4,000 speakers). A similar number of other
language pairs are in development – these include Indian languages like Hindi
and Bengali, and Scandinavian languages like Norwegian and Sami.
Google Summer of Code offers student developers stipends to write code for
open-source projects, advised by mentors already working on the projects, and
has helped create millions of lines of code for dozens of projects. This was
the first year that Apertium applied for the program, and 9 Apertium projects
are being supported.
The Apertium Welsh-English translator works by applying grammatical rules to a
Welsh sentence to turn it into an English sentence. An alternative approach
(adopted by software like Moses [4]) is to use a large body of text to work
out what the likely translation of a given phrase is.
The Summer of Code student, Gabriel Synnaeve from Grenoble, France [5], will
be working on combining these two approaches, using techniques developed at
Carnegie Mellon University in the USA [6]. The aim is to improve the quality
of the translation – in effect, the Apertium and Moses translations will be
compared, and the best bits of each will be used in the final translation.
For instance, take the Welsh sentence:
“Mae Heddlu’r De yn ymchwilio i farwolaeth dyn 41 oed o Abertawe.”
(South Wales Police are investigating the death of a 41-year-old man from
Swansea.)
Apertium currently produces:
“South Wales Police is investigating death man 41 years old from Swansea.”
Moses currently produces:
“the south wales police investigation into the death of a man 41 years
of age of abertawe.”
The aim is to combine the best chunks from each program, so that we get
something like:
*[is investigating] +[the death of a man] *[41 years old] *[from Swansea]
Here, the chunks marked * come from Apertium, and the one marked + from
Moses, and combining both improves the quality of the translation.
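As a rough illustration of the idea only, the chunk selection described above can be sketched in a few lines of Python. This is not the Apertium, Moses or Carnegie Mellon code: the `Chunk` type, the `combine` function, the hand-made alignment of chunks into slots and the quality scores are all assumptions invented for this example. A real multi-engine system would have to work out the alignment and the scores for itself.

```python
from typing import NamedTuple


class Chunk(NamedTuple):
    engine: str   # which translator produced the chunk
    text: str     # the chunk as it appears in that translator's output
    score: float  # hypothetical quality score, invented for illustration


def combine(slots):
    """For each aligned slot, keep the candidate chunk with the highest score."""
    return [max(candidates, key=lambda c: c.score) for candidates in slots]


# Chunks cut by hand from the two outputs quoted above.
# The alignment and the scores are assumptions, not system output.
slots = [
    [Chunk("apertium", "South Wales Police", 0.9),
     Chunk("moses", "the south wales police", 0.6)],
    [Chunk("apertium", "is investigating", 0.9),
     Chunk("moses", "investigation into", 0.4)],
    [Chunk("apertium", "death man", 0.3),
     Chunk("moses", "the death of a man", 0.9)],
    [Chunk("apertium", "41 years old", 0.8),
     Chunk("moses", "41 years of age", 0.7)],
    [Chunk("apertium", "from Swansea", 0.9),
     Chunk("moses", "of abertawe", 0.2)],
]

# Same notation as the press release: * for Apertium, + for Moses.
markers = {"apertium": "*", "moses": "+"}
best = combine(slots)
rendered = " ".join(f"{markers[c.engine]}[{c.text}]" for c in best)
print(rendered)
```

The hard part in practice is everything this sketch takes for granted: deciding which chunks from the two engines correspond to each other, and scoring them without a human in the loop.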
This is cutting-edge stuff, and has rarely been tried before. Prof Harold
Somers, in a 2004 report for the Welsh Language Board [7], suggested that a
medium-term goal for machine translation in Welsh would be “to integrate …
different [machine translation] engines into a single system”. Nothing has
been done on that to date, and Gabriel’s work will be the first attempt to
bring this vision of “multi-engine machine translation” for Welsh closer to
reality.
Francis Tyers [8], who will be mentoring Gabriel, said, “I was quite surprised
that we didn’t get any Welsh students applying, but this is a fantastic
opportunity to improve Welsh language technology. I have no doubt we’ll see
some real gains in the translation quality.”
Gabriel has already started work. “At the minute I’m fine-tuning the Moses
Welsh-English translator to make it as efficient as possible. The Apertium
community is very friendly, and I wanted to participate in a big open
source project, so I’m glad I went for it.”
Kevin Donnelly [9], who co-developed the Apertium Welsh-English translator
with Francis, noted that this was a big step forward for Welsh. “It is
wonderful that so many talented people are working on Apertium, and that they
are giving Welsh such a high priority. What we need now is for bodies
promoting Welsh here in Wales to step up to the plate and give whatever
encouragement and other support they can.”
Notes
[1] http://ufal.mff.cuni.cz/pbml-91-100.html. Francis Tyers and Kevin
Donnelly (2009): “apertium-cy – a collaboratively-developed free RBMT system
for Welsh to English”, Prague Bulletin of Mathematical Linguistics, 91.
[2] http://code.google.com/soc
[3] http://www.fsf.org/about/what-is-free-software. The Free Software
Foundation’s definition of “Free Software” is software that the user is free
to use, copy, change, and distribute.
[4] http://www.statmt.org/moses. Moses is an open-source statistical machine
translation system.
[5] Gabriel Synnaeve is a student at the École Nationale Supérieure
d’Informatique et de Mathématiques (http://ensimag.grenoble-inp.fr), a
leading informatics and mathematics centre. He will graduate in September
2009 and will then begin work on a doctorate on Bayesian machine learning.
[6] Alon Lavie (http://www.cs.cmu.edu/alavie) is leading this work. See
also: http://www.cs.cmu.edu/alavie/papers/EAMT-2005-MEMT.pdf. S. Jayaraman
and A. Lavie (2005): “Multi-Engine Machine Translation Guided by Explicit
Word Matching”, Proceedings of EAMT-2005.
[7] http://www.byig-wlb.org.uk/english/publications/publications/2302.doc.
Harold Somers (2004): “Machine translation and Welsh: the way forward.”,
Report for the WLB.
[8] Francis Tyers studied computer science at Aberystwyth, and is now a
language engineer for Prompsit Language Engineering, S.L. and a PhD student
at the Universitat d’Alacant. He is a key Apertium developer, with a special
interest in extending it to handle the Celtic languages.
[9] Kevin Donnelly has been working on Free Software in Welsh since 2003, and
developed the online Welsh dictionary Eurfa (http://www.eurfa.org.uk).
Contact:
Kevin Donnelly, 01248-715925, kevin@dotmon.com
=====
Datganiad i’r Wasg
12 Mai 2009
Cyfieithu awtomatig o’r Gymraeg yn cael hwb o Ffrainc!
Bydd cyfieithu peirianyddol o ansawdd da o Gymraeg i Saesneg yn dod yn agosach
pan gychwynnir ar broject newydd y mis yma.
Mae’r tîm rhyngwladol Apertium, a ryddhaodd eu cyfieithydd Cymraeg-Saesneg
(http://www.cymraeg.org.uk [1]) ym mis Awst 2008, wedi cael ei dderbyn i mewn
i’r pumed Google Summer of Code™ [2], a bydd gwelliannau i’r cyfieithydd hwn
yn cael ei ariannu fel un o’r projectau.
Platfform cyfieithu peirianyddol yw Apertium (http://www.apertium.org), sy’n
Feddalwedd Rhydd [3]. Datblygwyd yn y dechrau i gyfieithu rhwng ieithoedd
sy’n perthyn i’w gilydd yn Sbaen, ond dros y blynyddoedd diweddar estynnwyd y
rhaglen i drin ieithoedd eraill.
Hyd yn hyn, mae cyfieithyddion ar gyfer 17 pâr o ieithoedd wedi eu rhyddhau,
yn cynrychioli 1.1bn o bobl, o Saesneg (tua 500m o lefarwyr) i Araneg (tua
4,000 o lefarwyr). Mae nifer tebyg o barau eraill yn cael eu datblygu, sy’n
cynnwys ieithoedd Indeg megis Hindi a Bengaleg, ac ieithoedd Scandinafaidd
megis Norwyeg a Sami.
Mae Google Summer of Code yn cynnig lwfans i fyfyrwyr i ysgrifennu cod ar
gyfer projectau cod-agored, gyda chyngor gan fentoriaid sy’n gweithio eisoes
ar y projectau, ac mae o wedi helpu i greu miliynau o linellau o god ar gyfer
dwsinau o brojectau. Dyma’r flwyddyn gyntaf i Apertium wneud cais i’r
rhaglen, ac ariannir 9 o brojectau Apertium.
Mae’r cyfieithydd Cymraeg-Saesneg Apertium yn gweithio gan weithredu rheolau
gramadegol i frawddeg Gymraeg i’w throi hi’n frawddeg Saesneg. Ffordd arall
o wneud hyn (a ddefnyddir gan feddalwedd megis Moses [4]) yw defnyddio corff
mawr o destun i weithio allan beth yw’r cyfieithiad tebygol am unrhyw
ymadrodd.
Bydd y myfyriwr, Gabriel Synnaeve o Grenoble, Ffrainc [5], yn ceisio cyfuno’r
ddwy ffordd yma o weithio, gan ddefnyddio technegau a ddatblygwyd ym
Mhrifysgol Carnegie Mellon yn yr UDA [6]. Yr amcan yw gwella ansawdd y
cyfieithiad – bydd y cyfieithiadau Apertium a Moses yn cael eu cymharu, a’r
darnau gorau o bob un yn cael eu defnyddio yn y cyfieithiad terfynol.
Er enghraifft, gweler y frawddeg Gymraeg:
“Mae Heddlu’r De yn ymchwilio i farwolaeth dyn 41 oed o Abertawe.”
Mae Apertium ar hyn o bryd yn cynhyrchu:
“South Wales Police is investigating death man 41 years old Swansea.”
Mae Moses ar hyn o bryd yn cynhyrchu:
“the south wales police investigation into the death of a man 41 years
of age of abertawe.”
Y bwriad yw cyfuno’r darnau gorau o bob rhaglen, i gynhyrchu rhywbeth fel:
*[is investigating] +[the death of a man] *[41 years old] +[of] *[Swansea]
Yma, mae’r darnau a nodir gan * yn dod o Apertium, a’r rhai a nodir gan + o
Moses, ac mae cyfuno’r ddau yn gwella ansawdd y cyfieithiad.
Dyma waith arloesol, heb ei wneud o’r blaen. Awgrymodd yr Athro Harold
Somers, mewn adroddiad ym 2004 ar gyfer Bwrdd yr Iaith [7], y dylai amcan
tymor-canol ar gyfer cyfieithu peirianyddol yn Gymraeg fod “to integrate …
different [machine translation] engines into a single system”. Nid oes unrhyw
beth wedi ei wneud hyd yn hyn, a gwaith Gabriel fydd y cais cyntaf i ddod â’r
syniad yma o “multi-engine machine translation” ar gyfer y Gymraeg yn agosach
i fodolaeth.
Dywedodd Francis Tyers [8], fydd yn rhoi cyngor i Gabriel, “Dipyn o siom oedd
hi nad oedden ni’n cael cais gan fyfyriwr Cymreig, ond mae hyn yn gyfle gwych
i wella technoleg iaith yn Gymraeg. Rydym ni’n siŵr o weld cynnydd o
safbwynt ansawdd y cyfieithu.”
Mae Gabriel wedi cychwyn ar y gwaith eisoes. “Ar hyn o bryd dwi’n gwneud
newidiadau mân i’r cyfieithydd Moses i’w wneud mor effeithlon â phosib.
Mae’r gymuned Apertium yn gyfeillgar iawn, ac roeddwn i eisiau cyfrannu i
broject mawr cod-agored, felly dwi’n falch nes i’r cais.”
Dywedodd Kevin Donnelly [9], a weithiodd gyda Francis i greu’r cyfieithydd
Cymraeg-Saesneg Apertium, fod hwn yn gam mawr i’r Gymraeg. “Mae’n
ardderchog cael cymaint o bobl dalentog yn gweithio ar Apertium, a braf yw hi
gweld eu bod nhw’n ystyried Cymraeg fel blaenoriaeth. Yr hyn sydd angen rŵan
yw ymdrech gan y mudiadau sy’n hybu Cymraeg yma yng Nghymru i annog a rhoi
cefnogaeth i’r gwaith yma.”
Notes
[1] http://ufal.mff.cuni.cz/pbml-91-100.html. Francis Tyers and Kevin
Donnelly (2009): “apertium-cy – a collaboratively-developed free RBMT system
for Welsh to English”, Prague Bulletin of Mathematical Linguistics, 91.
[2] http://code.google.com/soc
[3] http://www.fsf.org/about/what-is-free-software. Mae’r Free Software
Foundation yn diffinio “Meddalwedd Rhydd” fel meddalwedd y gellir ei
ddefnyddio, copïo, newid a dosbarthu gan y defnyddiwr.
[4] http://www.statmt.org/moses. System cyfieithu peirianyddol ystadegol yw
Moses – mae’n god-agored.
[5] Gabriel Synnaeve yw myfyriwr yn yr École Nationale Supérieure
d’Informatique et de Mathématiques (http://ensimag.grenoble-inp.fr), canolfan
bwysig ar gyfer mathemateg a thechnoleg gwybodaeth. Bydd o’n graddio ym mis
Medi 2009, ac yn cychwyn gwaith wedyn ar ddoethuriaeth ar ddysgu peirianyddol
Bayesaidd.
[6] Alon Lavie (http://www.cs.cmu.edu/alavie) is leading this work. See
also: http://www.cs.cmu.edu/alavie/papers/EAMT-2005-MEMT.pdf. S. Jayaraman
and A. Lavie (2005): “Multi-Engine Machine Translation Guided by Explicit
Word Matching”, Proceedings of EAMT-2005.
[7] http://www.byig-wlb.org.uk/english/publications/publications/2302.doc.
Harold Somers (2004): “Machine translation and Welsh: the way forward.”,
Report for the WLB.
[8] Astudiodd Francis Tyers wyddoniaeth cyfrifiadurol yn Aberystwyth, ac ar
hyn o bryd mae’n beiriannwr iaith gyda Prompsit Language Engineering, S.L. ac
yn fyfyriwr PhD ym Mhrifysgol Alacant. Mae’n un o’r datblygwyr blaenorol
Apertium, gyda diddordeb arbennig yn ei estyn i drin yr ieithoedd Celtaidd.
[9] Mae Kevin Donnelly wedi bod yn gweithio ar Feddalwedd Rhydd yn Gymraeg ers
2003, a datblygodd Eurfa, geiriadur arlein Cymraeg (http://www.eurfa.org.uk).
Cysylltwch â:
Kevin Donnelly, 01248-715925, kevin@dotmon.com