Programme
Monday, May 20, 2024
9:00-09:10 Welcome and Introduction
9:10-10:30 Session 1: Corpus Building
- 9:10-9:30 (online) Italian-Ligurian Machine Translation in its Cultural Context (paper | slides)
- Christopher R. Haberland, Stefano Lusito and Jean Maillard
- 9:30-9:50 Labadain-30k+: A Monolingual Tetun and Document-Level Audited Dataset (paper | slides)
- Gabriel de Jesus and Sérgio Nunes
- 9:50-10:10 (online) Philippine Languages Database: A Multilingual Speech Corpora for Developing Systems for Low-Resource Languages (paper | slides)
- Rhandley D. Cajote, Rowena Cristina L. Guevara, Michael Gringo Angelo R. Bayona and Crisron Rudolf G. Lucas
- 10:10-10:30 (online) A Novel Corpus for Automated Sexism Identification on Social Media (paper | slides)
- Lutfiye Seda Mut Altin and Horacio Saggion
10:30-11:00 Coffee Break
11:00-12:00 Session 2: Language Tools
- 11:00-11:20 TELP — Text Extraction with Linguistic Patterns (paper | slides)
- João Cordeiro, Purificação Moura Silvano, António Leal and Sebastião Pais
- 11:20-11:40 (online) Improving Language Coverage on HeLI-OTS (paper | slides)
- Tommi Jauhiainen and Krister Lindén
- 11:40-12:00 (online) Nepal Script Text Recognition using CRNN CTC Architecture (paper | slides)
- Swornim Nakarmi, Sarin Sthapit, Arya Shakya, Rajani Chulyadyo and Bal Krishna Bal
12:00-13:00 Session 3: Regional Languages of Europe
- 12:00-12:20 Managing Fine-grained Metadata for Text Bases in Extremely Low Resource Languages: the Cases of Two Regional Languages of France (paper | slides)
- Marianne Vergez-Couret, Delphine Bernhard, Michael Nauge, Myriam Bras, Pablo Ruiz Fabo and Carole Werner
- 12:20-12:40 NLP for Arbresh: How an Endangered Language Learns to Write in the 21st Century (paper | slides)
- Giulio Cusenza and Çağrı Çöltekin
- 12:40-13:00 CorpusArija: Building an Annotated Corpus with Variation in Occitan (paper | slides)
- Clamença Poujade, Myriam Bras and Assaf Urieli
13:00-13:10 Online Poster Session I
- Bidirectional English-Nepali Machine Translation(MT) System for Legal Domain (paper | poster)
- Shabdapurush Poudel, Bal Krishna Bal and Praveen Acharya
- Multilingual Self-supervised Visually Grounded Speech Models (paper | poster)
- Huynh Phuong Thanh Nguyen and Sakriani Sakti
13:10-14:10 Lunch Break & Poster Session I
- A Bit of a Problem: Measurement Disparities in Dataset Sizes Across Languages (paper | poster)
- Catherine Arnett, Tyler A. Chang and Benjamin Bergen
- Beyond Error Categories: A Contextual Approach of Evaluating Emerging Spell and Grammar Checkers (paper | poster)
- Þórunn Arnardóttir, Svanhvít Lilja Ingólfsdóttir, Haukur Barri Símonarson, Hafsteinn Einarsson, Anton Karl Ingason and Vilhjálmur Þorsteinsson
- Forget NLI, Use a Dictionary: Zero-Shot Topic Classification for Low-Resource Languages with Application to Luxembourgish (paper | poster)
- Fred Philippy, Shohreh Haddadan and Siwen Guo
- Mixat: A Data Set of Bilingual Emirati-English Speech (paper | poster)
- Maryam Khalifa Al Ali and Hanan Aldarmaki
- Resource Acquisition for Understudied Languages: Extracting Wordlists from Dictionaries for Computer-assisted Language Comparison (paper | poster)
- Frederic Blum, Johannes Englisch, Alba Hermida Rodriguez, Rik van Gijn and Johann-Mattis List
- Tracing Linguistic Heritage: Constructing a Somali-Italian Terminological Resource Through Explorers’ Notebooks and Contemporary Corpus Analysis (paper | poster)
- Silvia Piccini, Giuliana Elizabeth Vilela Ruiz, Andrea Bellandi and Enrico Carniani
- ViHealthNLI: A Dataset for Vietnamese Natural Language Inference in Healthcare (paper | poster)
- Huyen Nguyen, Quyen The Ngo, Thanh-Ha Do and Tuan-Anh Hoang
14:10-14:50 Keynote Speech (online)
- Co-creating a Road Map for Indigenous Language Digital Activism
- Eddie Avila (Director of Global Voices)
14:50-15:50 Session 4: Machine Translation
- 14:50-15:10 Investigating Neural Machine Translation for Low-Resource Languages: Using Bavarian as a Case Study (paper | slides)
- Wan-hua Her and Udo Kruschwitz
- 15:10-15:30 The First Parallel Corpus and Neural Machine Translation Model of Western Armenian and English
- Ari Nubar Boyacıoğlu and Jan Niehues (paper | slides)
- 15:30-15:50 (online) Robust Guidance for Unsupervised Data Selection: Capturing Perplexing Named Entities for Domain-Specific Machine Translation (paper | slides)
- Seunghyun Ji, Hagai Raja Sinulingga and Darongsae Kwon
15:50-16:30 Coffee Break
16:30-17:50 Session 5: Large Language Models
- 16:30-16:50 Advancing Generative AI for Portuguese with Open Decoder Gervásio PT* (paper | slides)
- Rodrigo Santos, João Ricardo Silva, Luís Gomes, João Rodrigues and António Branco
- 16:50-17:10 Fostering the Ecosystem of Open Neural Encoders for Portuguese with Albertina PT* Family (paper | slides)
- Rodrigo Santos, João Rodrigues, Luís Gomes, João Ricardo Silva, António Branco, Henrique Lopes Cardoso, Tomás Freitas Osório and Bernardo Leite
- 17:10-17:30 (online) BERTbek: A Pretrained Language Model for Uzbek (paper | slides)
- Elmurod Kuriyozov, David Vilares and Carlos Gómez-Rodríguez
- 17:30-17:50 (online) Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining (paper | slides)
- Nikola Ljubešič, Vít Suchomel, Peter Rupnik, Taja Kuzman and Rik van Noord
Tuesday, May 21, 2024
9:00-10:00 Session 6: Quality and Evaluation
- 9:00-9:20 Man or Machine: Evaluating Spelling Error Detection in Danish Newspaper Corpora (paper | slides)
- Eckhard Bick, Jonas Nygaard Blom, Marianne Rathje and Jørgen Schack
- 9:20-9:40 (online) Unsupervised Outlier Detection for Language-Independent Text Quality Filtering (paper | slides)
- Jón Daðason and Hrafn Loftsson
- 9:40-10:00 (online) Evaluating Icelandic Sentiment Analysis Models Trained on Translated Data (paper | slides)
- Ólafur A. Jóhannsson, Birkir H. Arndal, Eysteinn Ö. Jónsson, Stefán Ólafsson and Hrafn Loftsson
10:00-10:30 Session 7: Position Papers
- 10:00-10:15 Seeding Alignment Between Language Technology and Indigenous Methodologies: A Decolonizing Framework for Endangered Language Revitalization (paper | slides)
- Craig John Carpenter, John Lyon, Miles Thorogood and Jeannette C. Armstrong
- 10:15-10:30 Solving Failure Modes in the Creation of Trustworthy Language Technologies (paper | slides)
- Gianna Leoni, Lee Steven, Tūreiti Keith, Keoni Mahelona, Peter-Lucas Jones and Suzanne Duncan
10:30-11:00 Coffee Break
11:00-11:40 Keynote Speech
- Massively Multilingual Language Technologies
- Jean Maillard (AI Researcher (FAIR team) at META)
11:40-12:40 Session 8: Language Resources
- 11:40-12:00 Uncovering Social Changes of the Basque Speaking Twitter Community During COVID-19 Pandemic (paper | slides)
- Joseba Fernandez de Landa, Iker García-Ferrero, Ander Salaberria and Jon Ander Campos
- 12:00-12:20 (online) BK3AT: Bangsamoro K-3 Children’s Speech Corpus for Developing Assessment Tools in the Bangsamoro Languages (paper | slides)
- Kiel Gonzales, Jazzmin Maranan, Nissan Macale, Edsel Jedd Renovalles, Nicole Anne Palafox, Francis Paolo Santelices and Jose Marie Mendoza
- 12:20-12:40 UzABSA: Aspect-Based Sentiment Analysis for the Uzbek Language (paper | slides)
- Sanatbek Matlatipov, Jaloliddin Rajabov, Elmurod Kuriyozov and Mersaid Aripov
12:40-14:00 Lunch Break & Poster Session II
- Assessing Pre-Built Speaker Recognition Models for Endangered Language Data (paper | poster)
- Gina-Anne Levow
- Improving Legal Judgement Prediction in Romanian with Long Text Encoders (paper | poster)
- Mihai Masala, Traian Rebedea and Horia Velicu
- Inter-language Transfer Learning for Visual Speech Recognition toward Under-resourced Environments (paper | poster)
- Fumiya Kondo and Satoshi Tamura
- Residual Dropout: A Simple Approach to Improve Transformer’s Data Efficiency (paper | poster)
- Carlos Escolano, Francesca De Luca Fornaciari and Maite Melero
- Exploring Text Classification for Enhancing Digital Game-Based Language Learning for Irish (paper | poster)
- Leona Mc Cahill, Thomas Baltazar, Sally Bruen, Liang Xu, Monica Ward, Elaine Uí Dhonnchadha and Jennifer Foster
- UniDive: A COST Action on Universality, Diversity and Idiosyncrasy in Language Technology (paper | poster)
- Agata Savary, Daniel Zeman, Verginica Barbu Mititelu, Anabela Barreiro, Olesea Caftanatov, Marie-Catherine de Marneffe, Kaja Dobrovoljc, Gülşen Eryiğit, Voula Giouli, Bruno Guillaume, Stella Markantonatou, Nurit Melnik, Joakim Nivre, Atul Kr. Ojha, Carlos Ramisch, Abigail Walsh, Beata Wójtowicz and Alina Wróblewska
- Work in Progress: Text-to-speech on Edge Devices for te Reo Maōri and ‘Ōlelo Hawai’i (paper | poster)
- Tūreiti Keith, Gianna Leoni, Keoni Mahelona, Hina Puamohala Kneubuhl, Stephanie Huriana Fong and Peter-Lucas Jones
- Developing Infrastructure for Low-Resource Language Corpus Building (paper | poster)
- Hedwig Sekeres, Wilbert Heeringa, Wietse de Vries, Oscar Yde Zwagers, Martijn Wieling and Goffe Th. Jensma
14:00-15:20 Session 9: Speech Technologies
- 14:00-14:20 Tandem Long-Short Duration-based Modeling for Automatic Speech Recognition (paper | slides)
- Dalai Mengke, Yan Meng and Péter Mihajlik
- 14:20-14:40 Bi-dialectal ASR of Armenian from Naturalistic and Read Speech (paper | slides)
- Arthur Malajyan, Victoria Khurshudyan, Karen Avetisyan,Hossep Dolatian and Damien Nouvel
- 14:40-15:00 (online) Indonesian-English Code-Switching Speech Recognition Using the Machine Speech Chain Based Semi-Supervised Learning (paper | slides)
- Rais Vaza Man Tazakka, Dessi Lestari, Ayu Purwarianti, Dipta Tanaya, Kurniawati Azizah and Sakriani Sakti
- 15:00-15:20 (online) Improving Noisy Student Training for Low-resource Languages in End-to-End ASR Using CycleGAN and Inter-domain Losses (paper | slides)
- Chia-Yu Li and Ngoc Thang Vu
15:20-16:00 Panel Discussion
- “In a post-ChatGPT world, what are the Challenges and Opportunities for Under-resourced Languages?”
16:00-16:30 Coffee Break
16:30-16:40 Online Poster Session II
- PersianEmo: Enhancing Farsi-Dari Emotion Analysis with a Hybrid Transformer and Recurrent Neural Network Model (paper | poster)
- Mohammad Ali Hussiny, Mohammad Arif Payenda and Lilja Øvrelid
- Why the Unexpected? Dissecting the Political and Economic Bias in Persian Small and Large Language Models (paper | poster)
- Ehsan Barkhordar, Surendrabikram Thapa, Ashwarya Maratha and Usman Naseem
16:40-17:20 Session 10: Data Scarcity-related Issues
- 16:40-17:00 Quantifying the Ethical Dilemma of Using Culturally Toxic Training Data in AI Tools for Indigenous Languages (paper | slides)
- Pedro Henrique Domingues, Claudio Santos Pinhanez, Paulo Cavalin and Julio Nogima
- 17:00-17:20 (online) Prompting Towards Alleviating Code-Switched Data Scarcity in Under-Resourced Languages with GPT as a Pivot (paper | slides)
- Michelle Terblanche, Kayode Olaleye and Vukosi Marivate
17:20 -17:30 Closing