Programme

Monday, May 20, 2024

9:00-9:10 Welcome and Introduction

9:10-10:30 Session 1:  Corpus Building 

  • 9:10-9:30 CorpusArija: Building an Annotated Corpus with Variation in Occitan
    • Clamenca Poujade, Myriam Bras and Assaf Urieli
  • 9:30-9:50 Labadain-30k+: A Monolingual Tetun and Document-Level Audited Dataset
    • Gabriel de Jesus and Sérgio Nunes
  • 9:50-10:10  (online) Philippine Languages Database: A Multilingual Speech Corpora for Developing Systems for Low-Resource Languages
    • Rowena Cristina L. Guevara, Rhandley D. Cajote, Michael Gringo Angelo R. Bayona and Crisron Rudolf G. Lucas
  • 10:10-10:30 (online) A Novel Corpus for Automated Sexism Identification on Social Media
    • Lutfiye Seda Mut Altin and Horacio Saggion 

10:30-11:00 Coffee Break

11:00-12:00 Session 2:  Language Tools

  • 11:00-11:20 TELP — Text Extraction with Linguistic Patterns
    • João Cordeiro, Purificação Moura Silvano, António Leal and Sebastião Pais
  • 11:20-11:40 (online) Improving Language Coverage on HeLI-OTS
    • Tommi Jauhiainen and Krister Lindén
  • 11:40-12:00 (online) Nepal Script Text Recognition using CRNN CTC Architecture
    • Swornim Nakarmi, Sarin Sthapit, Arya Shakya, Rajani Chulyadyo and Bal Krishna Bal

12:00-13:00 Session 3: Regional Languages of Europe 

  • 12:00-12:20 Managing Fine-grained Metadata for Text Bases in Extremely Low Resource Languages: the Cases of Two Regional Languages of France
    • Marianne Vergez-Couret, Delphine Bernhard, Michael Nauge, Myriam Bras, Pablo Ruiz Fabo and Carole Werner
  • 12:20-12:40 NLP for Arbresh: How an Endangered Language Learns to Write in the 21st Century
    • Giulio Cusenza and Çağrı Çöltekin
  • 12:40-13:00  (online) Italian-Ligurian Machine Translation in its Cultural Context
    • Christopher R. Haberland, Jean Maillard and Stefano Lusito

13:00-13:10 Online Poster Session I

  • Bidirectional English-Nepali Machine Translation (MT) System for Legal Domain
    • Shabdapurush Poudel, Bal Krishna Bal and Praveen Acharya
  • Multilingual Self-supervised Visually Grounded Speech Models
    • Huynh Phuong Thanh Nguyen and Sakriani Sakti

13:10-14:10 Lunch Break & Poster Session I

  • A Bit of a Problem: Measurement Disparities in Dataset Sizes Across Languages
    • Catherine Arnett, Tyler A. Chang and Benjamin Bergen
  • Beyond Error Categories: A Contextual Approach of Evaluating Emerging Spell and Grammar Checkers
    • Þórunn Arnardóttir, Svanhvít Lilja Ingólfsdóttir, Haukur Barri Símonarson, Hafsteinn Einarsson, Anton Karl Ingason and Vilhjálmur Þorsteinsson
  • Forget NLI, Use a Dictionary: Zero-Shot Topic Classification for Low-Resource Languages with Application to Luxembourgish
    • Fred Philippy, Shohreh Haddadan and Siwen Guo
  • Mixat: A Data Set of Bilingual Emirati-English Speech
    • Maryam Khalifa Al Ali and Hanan Aldarmaki
  • Resource acquisition for understudied languages: Extracting wordlists from dictionaries for computer-assisted language comparison
    • Frederic Blum, Johannes Englisch, Alba Hermida Rodriguez, Rik van Gijn and Johann-Mattis List
  • Tracing Linguistic Heritage: Constructing a Somali-Italian Terminological Resource Through Explorers’ Notebooks and Contemporary Corpus Analysis
    • Silvia Piccini, Giuliana Elizabeth Vilela Ruiz, Andrea Bellandi and Enrico Carniani
  • ViHealthNLI: A Dataset for Vietnamese Natural Language Inference in Healthcare
    • Huyen Nguyen, Quyen The Ngo, Thanh-Ha Do and Tuan-Anh Hoang

14:10-14:50 Keynote Speech (online)

  • Eddie Avila (Director of Global Voices)

14:50-15:50 Session 4: Machine Translation 

  • 14:50-15:10  Investigating Neural Machine Translation for Low-Resource Languages: Using Bavarian as a Case Study
    • Wan-hua Her and Udo Kruschwitz
  • 15:10-15:30  The First Parallel Corpus and Neural Machine Translation Model of Western Armenian and English
    • Ari Nubar Boyacıoğlu and Jan Niehues
  • 15:30-15:50 (online) Robust Guidance for Unsupervised Data Selection: Capturing Perplexing Named Entities for Domain-Specific Machine Translation
    • Seunghyun Ji, Hagai Raja Sinulingga and Darongsae Kwon

15:50-16:30 Coffee Break

16:30-17:50 Session 5:  Large Language Models

  • 16:30-16:50  Advancing Generative AI for Portuguese with Open Decoder Gervásio PT*
    • Rodrigo Santos, João Ricardo Silva, Luís Gomes, João Rodrigues and António Branco
  • 16:50-17:10  Fostering the Ecosystem of Open Neural Encoders for Portuguese with Albertina PT* Family
    • Rodrigo Santos, João Rodrigues, Luís Gomes, João Ricardo Silva, António Branco, Henrique Lopes Cardoso, Tomás Freitas Osório and Bernardo Leite
  • 17:10-17:30 (online) BERTbek: A pretrained language model for Uzbek
    • Elmurod Kuriyozov, David Vilares and Carlos Gómez-Rodríguez
  • 17:30-17:50 (online) Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining
    • Nikola Ljubešić, Vít Suchomel, Peter Rupnik, Taja Kuzman and Rik van Noord

Tuesday, May 21, 2024 

9:00-10:00 Session 6: Quality and Evaluation

  • 9:00-9:20  Man or Machine: Evaluating Spelling Error Detection in Danish Newspaper Corpora
    • Eckhard Bick, Jonas Nygaard Blom, Marianne Rathje and Jørgen Schack
  • 9:20-9:40 (online) Unsupervised Outlier Detection for Language-Independent Text Quality Filtering
    • Jón Daðason and Hrafn Loftsson
  • 9:40-10:00  (online) Evaluating Icelandic Sentiment Analysis Models Trained on Translated Data
    • Ólafur A. Jóhannsson, Birkir H. Arndal, Eysteinn Ö. Jónsson, Stefan Olafsson and Hrafn Loftsson

10:00-10:30 Session 7: Position Papers 

  • 10:00-10:15  Seeding Alignment Between Language Technology and Indigenous Methodologies: a decolonizing framework for endangered language revitalization
    • Craig John Carpenter, John Lyon, Miles Thorogood and Jeannette C. Armstrong
  • 10:15-10:30 Solving Failure Modes in the Creation of Trustworthy Language Technologies
    • Gianna Leoni, Lee Steven, Tūreiti Keith, Keoni Mahelona, Peter-Lucas Jones and Suzanne Duncan

10:30-11:00 Coffee Break

11:00-11:40 Keynote Speech

  • “Massively multilingual language technologies”
    • Jean Maillard (AI Researcher (FAIR team) at META)  

11:40-12:40 Session 8: Language Resources 

  • 11:40-12:00  Uncovering Social Changes of the Basque Speaking Twitter Community During COVID-19 Pandemic
    • Joseba Fernandez de Landa, Iker García-Ferrero, Ander Salaberria and Jon Ander Campos
  • 12:00-12:20 (online) BK3AT: Bangsamoro K-3 Children’s Speech Corpus for Developing Assessment Tools in the Bangsamoro Languages
    • Kiel D. Gonzales, Jazzmin R. Maranan, Francis Paolo D. Santelices, Edsel Jedd M. Renovalles, Nissan D. Macale, Nicole Anne A. Palafox and Jose Marie A. Mendoza 
  • 12:20-12:40 (online) UzABSA: Aspect-Based Sentiment Analysis for the Uzbek Language
    • Sanatbek Gayratovich Matlatipov, Jaloliddin Rajabov, Elmurod Kuriyozov and Mersaid Aripov 

12:40-14:00 Lunch Break & Poster Session II

  • Assessing Pre-Built Speaker Recognition Models for Endangered Language Data
    • Gina-Anne Levow
  • Improving Legal Judgement Prediction in Romanian with Long Text Encoders
    • Mihai Masala, Traian Rebedea and Horia Velicu
  • Inter-language Transfer Learning for Visual Speech Recognition toward Under-resourced Environments
    • Fumiya Kondo and Satoshi Tamura
  • Residual Dropout: A Simple Approach to Improve Transformer’s Data Efficiency
    • Carlos Escolano, Francesca De Luca Fornaciari and Maite Melero
  • Text Classification Tools for Enhancing Digital Game-Based Language Learning for Irish
    • Leona Mc Cahill, Thomas Baltazar, Sally Bruen, Liang Xu, Monica Ward, Elaine Uí Dhonnchadha and Jennifer Foster
  • UniDive: A COST Action on Universality, Diversity and Idiosyncrasy in Language Technology
    • Agata Savary, Daniel Zeman, Verginica Barbu Mititelu, Anabela Barreiro, Olesea Caftanatov, Marie-Catherine de Marneffe, Kaja Dobrovoljc, Gülşen Eryiğit, Voula Giouli, Bruno Guillaume, Stella Markantonatou, Nurit Melnik, Joakim Nivre, Atul Kr. Ojha, Carlos Ramisch, Abigail Walsh, Beata Wójtowicz and Alina Wróblewska
  • Work in Progress: Text-to-speech on Edge Devices for te Reo Māori and ʻŌlelo Hawaiʻi
    • Tūreiti Keith
  • Developing Infrastructure for Low-Resource Language Corpus Building
    • Hedwig G. Sekeres, Wilbert Heeringa, Wietse de Vries, Oscar Yde Zwagers, Martijn Wieling and Goffe Th. Jensma

14:00-15:20 Session 9:  Speech Technologies 

  • 14:00-14:20 Tandem Long-Short Duration-based Modeling for Automatic Speech Recognition
    • Dalai Mengke, Yan Meng and Peter Mihajlik
  • 14:20-14:40 Multi-dialectal ASR of Armenian from naturalistic and read speech
    • Arthur Malajyan, Victoria Khurshudyan, Karen Avetisyan, Hossep Dolatian and Damien Nouvel
  • 14:40-15:00 Indonesian-English Code-Switching Speech Recognition using the Machine Speech Chain based Semi-Supervised Learning
    • Rais Vaza Man Tazakka, Dessi Lestari, Ayu Purwarianti, Dipta Tanaya, Kurniawati Azizah and Sakriani Sakti
  • 15:00-15:20 (online) Improving noisy student training for low-resource languages in End-to-End ASR using CycleGAN and inter-domain losses
    • Chia-Yu Li and Ngoc Thang Vu 

15:20-16:00 Panel Discussion  

16:00-16:30 Coffee Break

16:30-16:40 Online Poster Session II

  • PersianEmo: Enhancing Farsi-Dari Emotion Analysis with a Hybrid Transformer and Recurrent Neural Network Model
    • Mohammad Ali Hussiny, Mohammad Arif Payenda and Lilja Øvrelid
  • Why the Unexpected? Dissecting the Political and Economic Bias in Persian Small and Large Language Models
    • Ehsan Barkhordar, Surendrabikram Thapa, Ashwarya Maratha and Usman Naseem

16:40-17:20 Session 10:  Data Scarcity-related Issues 

  • 16:40-17:00 Quantifying the Ethical Dilemma of Using Culturally Toxic Training Data in AI Tools for Indigenous Languages
    • Pedro Henrique Domingues, Claudio Santos Pinhanez, Paulo Cavalin and Julio Nogima
  • 17:00-17:20 (online)  Prompting Towards Alleviating Code-Switched Data Scarcity in Under-Resourced Languages with GPT as a Pivot
    • Michelle Terblanche, Kayode Olaleye and Vukosi Marivate

17:20-17:30 Closing