In this work, we present a multimodal Small Language Model (SLM) architecture designed for multilingual Speech Emotion Recognition (SER). Our approach integrates a transformer-based audio encoder with an SLM through a linear projection layer that bridges audio inputs with textual comprehension. This integration enables the SLM to effectively process and understand spoken language, enhancing its capability to recognize emotional nuances. We experiment with several state-of-the-art (SoTA) SLMs and evaluate them across five datasets covering a range of European languages: German, Portuguese, Italian, Spanish, and English. By leveraging both audio signals and their corresponding transcriptions, our model achieves performance comparable to SoTA models on SER tasks in each language. Our results demonstrate the robustness of our architecture.
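The bridging mechanism described above can be sketched minimally: a single learned linear map takes frame-level embeddings from the audio encoder into the SLM's token-embedding space, so projected audio frames can be concatenated with text-token embeddings before entering the SLM. This is an illustrative sketch only; the dimensions, the plain NumPy linear layer, and the `project` helper are assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
audio_dim, slm_dim = 1024, 2048  # illustrative sizes, not the paper's values

# Learned parameters of the linear projection layer (randomly initialized here).
W = rng.standard_normal((audio_dim, slm_dim)) * 0.02
b = np.zeros(slm_dim)

def project(audio_feats: np.ndarray) -> np.ndarray:
    # Map (batch, frames, audio_dim) encoder outputs to (batch, frames, slm_dim).
    return audio_feats @ W + b

# Stand-ins for the audio encoder's output and the SLM's text-token embeddings.
audio_feats = rng.standard_normal((2, 50, audio_dim))
text_embeds = rng.standard_normal((2, 12, slm_dim))

# Projected audio frames are prepended to the text embeddings along the
# sequence axis, forming one multimodal input sequence for the SLM.
fused = np.concatenate([project(audio_feats), text_embeds], axis=1)
print(fused.shape)  # (2, 62, 2048)
```

In training, only the projection (and optionally the SLM) would be updated, which is what makes this bridging approach lightweight compared to retraining the audio encoder.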