Who hasn’t had a frustrating experience with a chatbot? If you haven’t, either you’ve never used one, or you’ve only ever interacted in American English or in one of the other languages preferred by the big tech companies.

It may not be obvious, but developing a conversational agent can be a powerful way to promote inclusion. This is precisely one of the goals of the Accelerat.ai consortium, created to advance the digital transformation of the public and private sectors in Portugal, with the active participation of INESC-ID. “We intend to develop technology in Portugal to support sectors that are particularly important, such as a conversational system for SNS24 (the Portuguese National Health Service’s helpline) and customer support solutions for businesses”, explains Alberto Abad, from the Human Language Technologies scientific area and leader of INESC-ID’s participation in the consortium, which is supported under the Recovery and Resilience Plan (RRP).

As in other areas of technology, the need to improve the quality and reliability of conversational agents became evident during the Covid-19 pandemic, when services that previously provided in-person assistance had to close. The result was a shortage of personnel to handle the requests redirected to contact centres, which were often not designed for that purpose. “Such circumstances created a need for automated assistance whenever possible,” notes Abad, a professor at Técnico, which is also a member of the consortium.

There are about 7,000 spoken languages in the world. According to Defined.ai, an AI marketplace for tools, data and models and the leader of the Accelerat.ai consortium, “29% of businesses have lost clients due to a lack of multilingual support, and 70% of end-users express greater loyalty to companies offering support in their native language.” The mission of this ambitious project, with a budget of 35 million euros, 2.18 million of which is allocated to INESC-ID, is therefore to develop a conversational assistant for languages outside the top-15 language roadmaps of the five biggest tech companies, starting with European Portuguese.

The solutions under development are based on Conversational Artificial Intelligence Agents and CCaaS (Contact Centre as a Service). At INESC-ID, the team will investigate the conversion between speech and text in both directions – the areas known as Automatic Speech Recognition and Speech Synthesis. “It is a technology that has existed for many years, with several components – speech-to-text, and the reverse, text-to-speech”, Abad notes. “Additionally, there is the ‘brain’ of the system, which involves dialogue management and task handling, and this has evolved significantly in recent times with the advent of large language models (LLMs) that have transformed the landscape.”
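In rough terms, each turn of such an agent chains together the components Abad describes: the caller’s speech is transcribed, a dialogue manager (today typically built around an LLM) decides how to respond, and the response is synthesized back into speech. The Python sketch below illustrates that flow; the function names and placeholder implementations are assumptions made for illustration, not Accelerat.ai code.

```python
# A minimal sketch of one conversational turn, assuming the three-stage design
# described above (ASR -> LLM-based dialogue management -> TTS). All function
# names and placeholder bodies are illustrative, not Accelerat.ai code.
from dataclasses import dataclass


@dataclass
class Turn:
    user_audio: bytes          # raw audio received from the caller
    transcript: str = ""       # output of automatic speech recognition
    reply_text: str = ""       # output of the dialogue manager (the "brain")
    reply_audio: bytes = b""   # synthesized speech sent back to the caller


def transcribe(audio: bytes) -> str:
    """Stand-in for an automatic speech recognition (speech-to-text) model."""
    return "queria marcar uma consulta"  # placeholder transcript


def generate_reply(transcript: str) -> str:
    """Stand-in for the LLM-based dialogue manager that decides what to answer."""
    return f"Compreendi: '{transcript}'. Em que unidade de saúde prefere ser atendido?"


def synthesize(text: str) -> bytes:
    """Stand-in for a text-to-speech (speech synthesis) model."""
    return text.encode("utf-8")  # placeholder for an audio waveform


def handle_turn(audio: bytes) -> Turn:
    turn = Turn(user_audio=audio)
    turn.transcript = transcribe(turn.user_audio)
    turn.reply_text = generate_reply(turn.transcript)
    turn.reply_audio = synthesize(turn.reply_text)
    return turn


if __name__ == "__main__":
    result = handle_turn(b"...caller audio bytes...")
    print(result.transcript, "->", result.reply_text)
```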

For public and private organizations

Some of this technology is already on the market, typically dominated by major tech companies such as Microsoft, Google, and Apple. These corporations follow a business-driven vision and often overlook niche languages, such as European Portuguese, spoken by ten million people. Consequently, the level of maturity and the amount of data available to develop these systems for minority languages are lower. “Therefore, the goal of the project is to provide Portuguese companies with technology specifically tailored to the Portuguese context, including its variants”, like dialects or regional forms of the language.

Together on this mission, Defined.ai states that the project represents a strategic effort to cater to the needs of the public and private sectors in Portugal and related markets, enhancing communication and accessibility on digital platforms. “If a developer intends to use voice systems in the Portuguese market, they must rely on generic models with a considerable margin of error, or on models in English and Brazilian Portuguese, languages that are not the preferred options for Portuguese speakers”, argued Daniela Braga, from Defined.ai, in a press release.

Over the past ten years, the errors of this kind of system have been drastically reduced. “The systems are improving so much that we may be approaching a situation where, in certain tasks, it will be difficult to distinguish a human from a machine. The components are the same, but they work much better”, explains Alberto Abad. Synthetic speech may thus end up being indistinguishable from recordings in terms of naturalness and fluidity. “Today, it is possible to have dialogue systems that solve many problems”, he adds. But there is still room for improvement, by pushing the boundaries of the state of the art.

Picking up on emotional cues

One of the research goals is to extract emotional cues from speech to create more empathetic, human-like responses, moving further away from a robotic reply. Another goal is to improve speech recognition in non-ideal conditions, such as low-resource languages (with relatively little data available for training) or atypical speech. “Systems are tested on normative speech – unlike, for example, children’s speech, where the lexicon is different, the pitch is different and there are more fluctuations”, says the researcher. Elderly people and people with any type of disability that affects speech are also within the scope of Accelerat.ai. The system may be trained to speak more slowly when interacting with elderly users, to use a more youthful register with youngsters, or to adapt its accent, creating a greater sense of closeness to the user.
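To make the idea concrete, the sketch below shows one possible way of mapping such automatically extracted cues to the style of the reply; the cue categories, values and thresholds are hypothetical, not the project’s actual design.

```python
# A hypothetical sketch (not the project's actual code) of how automatically
# extracted speaker cues could steer the style of the synthesized reply.
from dataclasses import dataclass


@dataclass
class SpeakerCues:
    age_group: str   # e.g. "child", "adult", "elderly" (from a classifier)
    emotion: str     # e.g. "neutral", "frustrated" (from emotion recognition)
    accent: str      # e.g. "norte", "lisboa" (from accent identification)


@dataclass
class ReplyStyle:
    speaking_rate: float  # 1.0 = default synthesis speed
    register: str         # lexical and phrasing style of the reply
    accent: str           # accent of the synthetic voice


def adapt_style(cues: SpeakerCues) -> ReplyStyle:
    """Slower speech for elderly users, a more youthful register for children,
    an empathetic tone when frustration is detected, and an accent that
    mirrors the caller's."""
    rate = 0.85 if cues.age_group == "elderly" else 1.0
    register = "youthful" if cues.age_group == "child" else "standard"
    if cues.emotion == "frustrated":
        rate = min(rate, 0.95)      # slow down slightly
        register = "empathetic"     # acknowledge the caller's frustration
    return ReplyStyle(speaking_rate=rate, register=register, accent=cues.accent)


if __name__ == "__main__":
    print(adapt_style(SpeakerCues(age_group="elderly", emotion="frustrated", accent="norte")))
```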

It is becoming increasingly evident that voice data can contain valuable information that may be used to detect health conditions, such as Parkinson’s disease. “Depending on the type of client and the circumstances, this information can be useful for characterizing the patient, as in a hospital setting”, Abad gives as an example.

However, the more valuable the data, the greater the concern about privacy. “Speech can be considered private information. If used maliciously, it is easy to create synthesis systems with our voice”, he warns. “There can be a set of automatically extracted information that we might want to protect.” To guarantee this, INESC-ID’s team is also working on ways to extract information while ensuring user privacy, and there are several approaches. “One idea is to use encryption. Another is to allow users to control which information they want to be leaked – such as being okay with their gender being known but not wanting anything else to be disclosed, or only allowing their speech to be used to understand what they are saying.”
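The second idea, user-controlled disclosure, can be pictured as a simple filter between what the system extracts and what it is allowed to pass on. The sketch below illustrates this under assumed attribute names; it is not INESC-ID’s implementation.

```python
# A minimal sketch of user-controlled disclosure, assuming a design in which
# the speaker decides which automatically extracted attributes may leave the
# speech-processing component. Attribute names are illustrative assumptions.
from typing import Dict, Set


def filter_disclosure(extracted: Dict[str, str], consented: Set[str]) -> Dict[str, str]:
    """Keep only the attributes the user has explicitly agreed to share."""
    return {key: value for key, value in extracted.items() if key in consented}


if __name__ == "__main__":
    extracted = {
        "transcript": "queria renovar a receita",   # what was said
        "gender": "female",                          # inferred speaker trait
        "age_group": "elderly",                      # inferred speaker trait
        "health_biomarkers": "voice tremor: low",    # sensitive inferred information
    }
    # The caller agreed to share only the transcript and their gender.
    consent = {"transcript", "gender"}
    print(filter_disclosure(extracted, consent))
```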

Presence at Interspeech

Privacy is also a significant part of the project, with PhD students working in this area. “Currently, when we use such a system, the speech is stored, and we don’t know what might be done with it. We are working on improving speech recognition, on extracting features and health biomarkers, and in the future, on making interactions more private, addressing security and ethical concerns.”

Six senior researchers, five post-docs, three PhD students and one master’s student are part of the project. The team had a notable presence at the latest Interspeech conference (the world’s largest conference on speech and language technologies), with the participation of several members: the presentation of three scientific works on improving automatic speech recognition in low-resource settings and on the use of LLMs as speech annotators to characterize speakers; the participation of a junior PhD student as an expert panellist in the special session “Connecting Speech-science and Speech-technology for Children’s Speech”; and, above all, the recognition of Professor Isabel Trancoso, who received the ISCA Medal for Scientific Achievement, an annual distinction that honors an individual who has made extraordinary contributions to the field of speech communication science and technology.

Projects like Accelerat.ai advance conversational AI while prioritizing inclusiveness and accessibility. With applications ranging from healthcare to customer support, innovations in speech recognition and synthesis will increasingly become part of our everyday lives, placing human-centered solutions at the core of AI development in Portugal and beyond.

Text by Sara Sá, Science Writer | Communications and Outreach Office, INESC-ID / © 2024 INESC-ID

Images | © 2024 INESC-ID, Accelerat.ai