Programme
Monday, May 20, 2024
9:00-9:10 Welcome and Introduction
9:10-10:30 Session 1: Corpus Building
- 9:10-9:30 CorpusArija: Building an Annotated Corpus with Variation in Occitan
- Clamenca Poujade, Myriam Bras and Assaf Urieli
- 9:30-9:50 Labadain-30k+: A Monolingual Tetun and Document-Level Audited Dataset
- Gabriel de Jesus and Sérgio Nunes
- 9:50-10:10 (online) Philippine Languages Database: A Multilingual Speech Corpora for Developing Systems for Low-Resource Languages
- Rowena Cristina L. Guevara, Rhandley D. Cajote, Michael Gringo Angelo R. Bayona and Crisron Rudolf G. Lucas
- 10:10-10:30 (online) A Novel Corpus for Automated Sexism Identification on Social Media
- Lutfiye Seda Mut Altin and Horacio Saggion
10:30-11:00 Coffee Break
11:00-12:00 Session 2: Language Tools
- 11:00-11:20 TELP — Text Extraction with Linguistic Patterns
- João Cordeiro, Purificação Moura Silvano, António Leal and Sebastião Pais
- 11:20-11:40 (online) Improving Language Coverage on HeLI-OTS
- Tommi Jauhiainen and Krister Lindén
- 11:40-12:00 (online) Nepal Script Text Recognition using CRNN CTC Architecture
- Swornim Nakarmi, Sarin Sthapit, Arya Shakya, Rajani Chulyadyo and Bal Krishna Bal
12:00-13:00 Session 3: Regional Languages of Europe
- 12:00-12:20 Managing Fine-grained Metadata for Text Bases in Extremely Low Resource Languages: the Cases of Two Regional Languages of France
- Marianne Vergez-Couret, Delphine Bernhard, Michael Nauge, Myriam Bras, Pablo Ruiz Fabo and Carole Werner
- 12:20-12:40 NLP for Arbresh: How an Endangered Language Learns to Write in the 21st Century
- Giulio Cusenza and Çağrı Çöltekin
- 12:40-13:00 (online) Italian-Ligurian Machine Translation in its Cultural Context
- Christopher R. Haberland, Jean Maillard and Stefano Lusito
13:00-13:10 Online Poster Session I
- Bidirectional English-Nepali Machine Translation (MT) System for Legal Domain
- Shabdapurush Poudel, Bal Krishna Bal and Praveen Acharya
- Multilingual Self-supervised Visually Grounded Speech Models
- Huynh Phuong Thanh Nguyen and Sakriani Sakti
13:10-14:10 Lunch Break & Poster Session I
- A Bit of a Problem: Measurement Disparities in Dataset Sizes Across Languages
- Catherine Arnett, Tyler A. Chang and Benjamin Bergen
- Beyond Error Categories: A Contextual Approach of Evaluating Emerging Spell and Grammar Checkers
- Þórunn Arnardóttir, Svanhvít Lilja Ingólfsdóttir, Haukur Barri Símonarson, Hafsteinn Einarsson, Anton Karl Ingason and Vilhjálmur Þorsteinsson
- Forget NLI, Use a Dictionary: Zero-Shot Topic Classification for Low-Resource Languages with Application to Luxembourgish
- Fred Philippy, Shohreh Haddadan and Siwen Guo
- Mixat: A Data Set of Bilingual Emirati-English Speech
- Maryam Khalifa Al Ali and Hanan Aldarmaki
- Resource acquisition for understudied languages: Extracting wordlists from dictionaries for computer-assisted language comparison
- Frederic Blum, Johannes Englisch, Alba Hermida Rodriguez, Rik van Gijn and Johann-Mattis List
- Tracing Linguistic Heritage: Constructing a Somali-Italian Terminological Resource Through Explorers’ Notebooks and Contemporary Corpus Analysis
- Silvia Piccini, Giuliana Elizabeth Vilela Ruiz, Andrea Bellandi and Enrico Carniani
- ViHealthNLI: A Dataset for Vietnamese Natural Language Inference in Healthcare
- Huyen Nguyen, Quyen The Ngo, Thanh-Ha Do and Tuan-Anh Hoang
14:10-14:50 Keynote Speech (online)
- Eddie Avila (Director of Global Voices)
14:50-15:50 Session 4: Machine Translation
- 14:50-15:10 Investigating Neural Machine Translation for Low-Resource Languages: Using Bavarian as a Case Study
- Wan-hua Her and Udo Kruschwitz
- 15:10-15:30 The First Parallel Corpus and Neural Machine Translation Model of Western Armenian and English
- Ari Nubar Boyacıoğlu and Jan Niehues
- 15:30-15:50 (online) Robust Guidance for Unsupervised Data Selection: Capturing Perplexing Named Entities for Domain-Specific Machine Translation
- Seunghyun Ji, Hagai Raja Sinulingga and Darongsae Kwon
15:50-16:30 Coffee Break
16:30-17:50 Session 5: Large Language Models
- 16:30-16:50 Advancing Generative AI for Portuguese with Open Decoder Gervásio PT*
- Rodrigo Santos, João Ricardo Silva, Luís Gomes, João Rodrigues and António Branco
- 16:50-17:10 Fostering the Ecosystem of Open Neural Encoders for Portuguese with Albertina PT* Family
- Rodrigo Santos, João Rodrigues, Luís Gomes, João Ricardo Silva, António Branco, Henrique Lopes Cardoso, Tomás Freitas Osório and Bernardo Leite
- 17:10-17:30 (online) BERTbek: A pretrained language model for Uzbek
- Elmurod Kuriyozov, David Vilares and Carlos Gómez-Rodríguez
- 17:30-17:50 (online) Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining
- Nikola Ljubešić, Vít Suchomel, Peter Rupnik, Taja Kuzman and Rik van Noord
Tuesday, May 21, 2024
9:00-10:00 Session 6: Quality and Evaluation
- 9:00-9:20 Man or Machine: Evaluating Spelling Error Detection in Danish Newspaper Corpora
- Eckhard Bick, Jonas Nygaard Blom, Marianne Rathje and Jørgen Schack
- 9:20-9:40 (online) Unsupervised Outlier Detection for Language-Independent Text Quality Filtering
- Jón Daðason and Hrafn Loftsson
- 9:40-10:00 (online) Evaluating Icelandic Sentiment Analysis Models Trained on Translated Data
- Ólafur A. Jóhannsson, Birkir H. Arndal, Eysteinn Ö. Jónsson, Stefan Olafsson and Hrafn Loftsson
10:00-10:30 Session 7: Position Papers
- 10:00-10:15 Seeding Alignment Between Language Technology and Indigenous Methodologies: a decolonizing framework for endangered language revitalization
- Craig John Carpenter, John Lyon, Miles Thorogood and Jeannette C. Armstrong
- 10:15-10:30 Solving Failure Modes in the Creation of Trustworthy Language Technologies
- Gianna Leoni, Lee Steven, Tūreiti Keith, Keoni Mahelona, Peter-Lucas Jones and Suzanne Duncan
10:30-11:00 Coffee Break
11:00-11:40 Keynote Speech
- “Massively multilingual language technologies”
- Jean Maillard (AI Researcher (FAIR team) at META)
11:40-12:40 Session 8: Language Resources
- 11:40-12:00 Uncovering Social Changes of the Basque Speaking Twitter Community During COVID-19 Pandemic
- Joseba Fernandez de Landa, Iker García-Ferrero, Ander Salaberria and Jon Ander Campos
- 12:00-12:20 (online) BK3AT: Bangsamoro K-3 Children’s Speech Corpus for Developing Assessment Tools in the Bangsamoro Languages
- Kiel D. Gonzales, Jazzmin R. Maranan, Francis Paolo D. Santelices, Edsel Jedd M. Renovalles, Nissan D. Macale, Nicole Anne A. Palafox and Jose Marie A. Mendoza
- 12:20-12:40 (online) UzABSA: Aspect-Based Sentiment Analysis for the Uzbek Language
- Sanatbek Gayratovich Matlatipov, Jaloliddin Rajabov, Elmurod Kuriyozov and Mersaid Aripov
12:40-14:00 Lunch Break & Poster Session II
- Assessing Pre-Built Speaker Recognition Models for Endangered Language Data
- Gina-Anne Levow
- Improving Legal Judgement Prediction in Romanian with Long Text Encoders
- Mihai Masala, Traian Rebedea and Horia Velicu
- Inter-language Transfer Learning for Visual Speech Recognition toward Under-resourced Environments
- Fumiya Kondo and Satoshi Tamura
- Residual Dropout: A Simple Approach to Improve Transformer’s Data Efficiency
- Carlos Escolano, Francesca De Luca Fornaciari and Maite Melero
- Text Classification Tools for Enhancing Digital Game-Based Language Learning for Irish
- Leona Mc Cahill, Thomas Baltazar, Sally Bruen, Liang Xu, Monica Ward, Elaine Uí Dhonnchadha and Jennifer Foster
- UniDive: A COST Action on Universality, Diversity and Idiosyncrasy in Language Technology
- Agata Savary, Daniel Zeman, Verginica Barbu Mititelu, Anabela Barreiro, Olesea Caftanatov, Marie-Catherine de Marneffe, Kaja Dobrovoljc, Gülşen Eryiğit, Voula Giouli, Bruno Guillaume, Stella Markantonatou, Nurit Melnik, Joakim Nivre, Atul Kr. Ojha, Carlos Ramisch, Abigail Walsh, Beata Wójtowicz and Alina Wróblewska
- Work in Progress: Text-to-speech on Edge Devices for te Reo Māori and ʻŌlelo Hawaiʻi
- Tūreiti Keith
- Developing Infrastructure for Low-Resource Language Corpus Building
- Hedwig G. Sekeres, Wilbert Heeringa, Wietse de Vries, Oscar Yde Zwagers, Martijn Wieling and Goffe Th. Jensma
14:00-15:20 Session 9: Speech Technologies
- 14:00-14:20 Tandem Long-Short Duration-based Modeling for Automatic Speech Recognition
- Dalai Mengke, Yan Meng and Peter Mihajlik
- 14:20-14:40 Multi-dialectal ASR of Armenian from naturalistic and read speech
- Malajyan Arthur, Victoria Khurshudyan, Karen Avetisyan, Hossep Dolatian and Damien Nouvel
- 14:40-15:00 Indonesian-English Code-Switching Speech Recognition using the Machine Speech Chain based Semi-Supervised Learning
- Rais Vaza Man Tazakka, Dessi Lestari, Ayu Purwarianti, Dipta Tanaya, Kurniawati Azizah and Sakriani Sakti
- 15:00-15:20 (online) Improving noisy student training for low-resource languages in End-to-End ASR using CycleGAN and inter-domain losses
- Chia-Yu Li and Ngoc Thang Vu
15:20-16:00 Panel Discussion
16:00-16:30 Coffee Break
16:30-16:40 Online Poster Session II
- PersianEmo: Enhancing Farsi-Dari Emotion Analysis with a Hybrid Transformer and Recurrent Neural Network Model
- Mohammad Ali Hussiny, Mohammad Arif Payenda and Lilja Øvrelid
- Why the Unexpected? Dissecting the Political and Economic Bias in Persian Small and Large Language Models
- Ehsan Barkhordar, Surendrabikram Thapa, Ashwarya Maratha and Usman Naseem
16:40-17:20 Session 10: Data Scarcity-related Issues
- 16:40-17:00 Quantifying the Ethical Dilemma of Using Culturally Toxic Training Data in AI Tools for Indigenous Languages
- Pedro Henrique Domingues, Claudio Santos Pinhanez, Paulo Cavalin and Julio Nogima
- 17:00-17:20 (online) Prompting Towards Alleviating Code-Switched Data Scarcity in Under-Resourced Languages with GPT as a Pivot
- Michelle Terblanche, Kayode Olaleye and Vukosi Marivate
17:20-17:30 Closing