| Emoji | Task Type | Flow Example |
|---|---|---|
| TTS | Text-to-Speech | text (EN) → audio (EN) |
| T2TT / MT | Text-to-Text-Translation / Machine Translation | text (FR) → text (EN) |
| 🎤 T2ST | Text-to-Speech-Translation | text (DE) → audio (EN) |
| 🗣️ STT / 🗣️ ASR | Speech-to-Text / Automatic-Speech-Recognition | audio (PT) → text (PT) |
| 🗣️ S2TT | Speech-to-Text-Translation | audio (FR) → text (EN) |
| 🗣️ S2ST | Speech-to-Speech-Translation | audio (ES) → audio (EN) |
| 🎧 SLI | Spoken-Language-Identification | audio (unknown) → lang: "pt" |
| 🧾 WLI | Written-Language-Identification | text (unknown) → lang: "en" |
| 🎯 WW | Wake-Word Detection | passive audio → hotword → trigger |
| 🎙️ VAD | Voice-Activity-Detection | audio stream → speech segmenting |
| 🤖 QA | Question-Answering | text (prompt) → text (generated) |
| ✍️ DT | Dialog-Transformer | text (generated) → text (modified) |
| ✍️ UT | Utterance-Transformer | text (prompt) → text (modified) |

| Task | Input type | Output type | Input language == output language |
|---|---|---|---|
| 🤖 Question Answering (QA) | text | text | yes |
| ✍️ Dialog Transformer (DT) | text | text | yes |
| Text-to-text-translation (MT) | text | text | no |
| Text-to-speech (TTS) | text | audio | yes |
| 🎤 Text-to-speech-translation (T2ST) | text | audio | no |
| 🗣️ Speech-to-text (STT) | audio | text | yes |
| 🗣️ Speech-to-speech-translation (S2ST) | audio | audio | no |
| 🗣️ Speech-to-text-translation (S2TT) | audio | text | no |

| Combined Plugins | Task Description | Emoji Task |
|---|---|---|
| MT + TTS | Text-to-Speech Translation | 🎤 T2ST |
| 🗣️ STT + MT | Speech-to-Text Translation | 🗣️ S2TT |
| 🗣️ STT + MT + TTS | Speech-to-Speech Translation | 🗣️ S2ST |
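
The combinations above can be derived mechanically from the input type, output type, and language columns of the task table. The following is a minimal sketch of that reasoning, not OVOS code: `Task`, `TASKS`, `combine`, and `equivalent_tasks` are hypothetical names introduced only for illustration, and the rule that a chain preserves the language only if every step does is an assumption read off the tables.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Task:
    name: str
    input_type: str      # "text" or "audio"
    output_type: str     # "text" or "audio"
    same_language: bool  # True when input language == output language


# Rows of the task table, keyed by abbreviation (hypothetical structure).
TASKS = {
    t.name: t
    for t in [
        Task("QA", "text", "text", True),
        Task("DT", "text", "text", True),
        Task("MT", "text", "text", False),
        Task("TTS", "text", "audio", True),
        Task("T2ST", "text", "audio", False),
        Task("STT", "audio", "text", True),
        Task("S2ST", "audio", "audio", False),
        Task("S2TT", "audio", "text", False),
    ]
}


def combine(*names: str) -> Task:
    """Chain plugins left to right and derive the signature of the result."""
    steps = [TASKS[n] for n in names]
    for prev, nxt in zip(steps, steps[1:]):
        # Each step must consume what the previous step produced.
        if prev.output_type != nxt.input_type:
            raise ValueError(
                f"{prev.name} outputs {prev.output_type}, "
                f"but {nxt.name} expects {nxt.input_type}"
            )
    return Task(
        name=" + ".join(names),
        input_type=steps[0].input_type,
        output_type=steps[-1].output_type,
        # Assumption: the language is preserved only if no step translates.
        same_language=all(s.same_language for s in steps),
    )


def equivalent_tasks(combined: Task) -> List[str]:
    """Return the single-plugin tasks with the same signature."""
    signature = (combined.input_type, combined.output_type, combined.same_language)
    return [
        t.name
        for t in TASKS.values()
        if (t.input_type, t.output_type, t.same_language) == signature
    ]
```

For example, `equivalent_tasks(combine("STT", "MT", "TTS"))` yields `['S2ST']`, while `combine("TTS", "MT")` raises a `ValueError` because TTS produces audio and MT expects text.
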

text (DE)
→ 🧾 detect written lang (DE)
→ translate (DE → EN)
→ TTS (EN)
= audio (EN)

audio (FR)
→ 🎧 detect spoken lang (FR)
→ 🗣️ STT (FR)
→ translate (FR → EN)
= text (EN)

audio (ES)
→ 🎧 detect spoken lang (ES)
→ 🗣️ STT (ES)
→ translate (ES → EN)
→ TTS (EN)
= audio (EN)

If the input language is not known before inference, it can be detected with the SLI and WLI plugins, which allows for dynamic-language, multi-user, and multilingual setups.

| Combined Plugins | Task Description | Emoji Task |
|---|---|---|
| 🧾 WLI + TTS | Text-to-Speech | TTS (multilingual) |
| 🧾 WLI + MT | Text Translation | MT (multilingual) |
| 🧾 WLI + MT + TTS | Text-to-Speech Translation | 🎤 T2ST (multilingual) |
| 🎧 SLI + 🗣️ STT | Speech-to-Text | 🗣️ STT (multilingual) |
| 🎧 SLI + 🗣️ STT + MT | Speech-to-Text Translation | 🗣️ S2TT (multilingual) |
| 🎧 SLI + 🗣️ STT + MT + TTS | Speech-to-Speech Translation | 🗣️ S2ST (multilingual) |
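
As a rough illustration of the WLI + MT row, the sketch below wraps a language detector and a translator into one multilingual translation step. Everything here is hypothetical: `toy_wli`, `toy_mt`, and `make_multilingual_mt` are stand-ins invented for this example and are not part of any OVOS API. Skipping translation when the detected language already matches the target is also an assumption.

```python
from typing import Callable

# Toy lookup tables standing in for real models (assumptions).
KNOWN_WORDS = {"bonjour": "fr", "hallo": "de", "hello": "en"}
TOY_DICTIONARY = {
    ("fr", "en", "bonjour"): "hello",
    ("de", "en", "hallo"): "hello",
}


def toy_wli(text: str) -> str:
    """Written-language identification: look up the first word, default to 'en'."""
    return KNOWN_WORDS.get(text.lower().split()[0], "en")


def toy_mt(text: str, source: str, target: str) -> str:
    """Machine translation: exact-match lookup in a tiny dictionary."""
    return TOY_DICTIONARY[(source, target, text.lower())]


def make_multilingual_mt(
    detect: Callable[[str], str],
    translate: Callable[[str, str, str], str],
    target: str,
) -> Callable[[str], str]:
    """Combine a detector and a translator into a single text-to-text step."""

    def run(text: str) -> str:
        source = detect(text)
        if source == target:
            # Assumption: nothing to translate when the languages match.
            return text
        return translate(text, source, target)

    return run


to_english = make_multilingual_mt(toy_wli, toy_mt, target="en")
```

Calling `to_english("Bonjour")` detects French first and only then translates, so the caller never has to pass a source language.
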

| Plugin | Purpose | Position |
|---|---|---|
| ✍️ UT | Normalize / rewrite user input (ex: “can u pls tell me?” → “please tell me”) | Before 🤖 QA |
| 🤖 QA | Core NLU + response generation (LLM / skill selection / intent) | Middle |
| ✍️ DT | Rewrite generated response (ex: dry → humorous, formal → friendly) | After 🤖 QA |

For OVOS purposes, consider 🤖 QA to be equivalent to ovos-core. In OVOS, this step uses intent classification to select a skill, which is then responsible for executing some action and generating a dialog.

✍️ DT runs after ovos-core has generated an answer, rewriting it to give it a personality.
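
The ordering UT → QA → DT can be sketched as three plain functions applied in sequence. This is only a toy under stated assumptions: the function names, the word replacements, and the canned answer are all invented for illustration, and the `question_answering` stub merely stands in for the intent matching and skill execution that ovos-core performs.

```python
def utterance_transformer(utterance: str) -> str:
    """UT: normalize the user input before it reaches QA (toy replacements)."""
    replacements = {"u": "you", "pls": "please"}
    words = utterance.lower().split()
    return " ".join(replacements.get(word, word) for word in words)


def question_answering(utterance: str) -> str:
    """QA: stand-in for intent matching plus skill response (canned answer)."""
    if "time" in utterance:
        return "It is noon."
    return "I do not know."


def dialog_transformer(response: str) -> str:
    """DT: rewrite the generated response to give it a friendlier tone."""
    return response.rstrip(".") + ", my friend!"


def handle(utterance: str) -> str:
    """Apply the three stages in the order given in the table: UT, QA, DT."""
    normalized = utterance_transformer(utterance)
    answer = question_answering(normalized)
    return dialog_transformer(answer)
```

With these stubs, `handle("can u pls tell me the time")` first becomes "can you please tell me the time", is answered with "It is noon.", and is finally restyled as "It is noon, my friend!".
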

| Pipeline | Description |
|---|---|
| audio → 🎯 WW + 🎙️ VAD + 🗣️ STT → 🤖 QA → TTS → audio | Direct spoken Q&A |
| audio → 🎯 WW + 🎙️ VAD + 🗣️ STT → ✍️ UT → 🤖 QA → TTS → audio | Input cleanup for better NLU |
| audio → 🎯 WW + 🎙️ VAD + 🗣️ STT → ✍️ UT → 🤖 QA → ✍️ DT → TTS → audio | Voice assistant with emotion/tone control |
| audio → 🎯 WW + 🎙️ VAD + 🎧 SLI + 🗣️ STT → 🤖 QA → TTS → audio | Multilingual support |
| audio → 🎯 WW + 🎙️ VAD + 🎧 SLI + 🗣️ STT + MT + ✍️ UT → 🤖 QA → ✍️ DT + MT + TTS → audio | Fully featured multilingual, polyglot (bidirectional translation), personalized voice assistant |

audio (EN)
→ 🎯 WW + 🎙️ VAD
→ 🗣️ STT (EN)
→ 🤖 QA (EN → EN)
→ TTS (EN)
= audio (EN)

audio (PT)
→ 🎯 WW + 🎙️ VAD
→ 🎧 SLI: lang="pt"
→ 🗣️ STT (PT)
→ ✍️ UT (normalize)
→ MT (PT → EN)
→ 🤖 QA (EN → EN)
→ ✍️ DT (personality/style)
→ MT (EN → PT)
→ TTS (PT)
= audio (PT)
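
The Portuguese flow above can be modeled as an ordered list of stages that each transform a small state record. The sketch below is a simulation only: it assumes WW and VAD have already produced one segmented utterance, represents audio as text with a language hint, and uses invented stage functions and lookup tables (`PT_TO_EN`, `EN_TO_PT`, `STAGES`, `run_pipeline`) rather than real plugins.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, str]
Stage = Tuple[str, Callable[[State], State]]

# Toy lookup tables standing in for real MT and QA models (assumptions).
PT_TO_EN = {"que horas sao": "what time is it"}
EN_TO_PT = {"it is noon, my friend": "e meio-dia, meu amigo"}
ANSWERS = {"what time is it": "it is noon"}


def sli(s: State) -> State:
    # Pretend to identify the spoken language from a hint in the fake audio.
    return {**s, "lang": s["hint"]}


def stt(s: State) -> State:
    # Pretend to transcribe: the fake audio payload is already its transcript.
    return {**s, "kind": "text"}


def ut(s: State) -> State:
    # Normalize the utterance: lowercase, drop surrounding punctuation.
    return {**s, "payload": s["payload"].lower().strip(" ?!")}


def mt_to_en(s: State) -> State:
    return {**s, "payload": PT_TO_EN[s["payload"]], "lang": "en"}


def qa(s: State) -> State:
    return {**s, "payload": ANSWERS.get(s["payload"], "i do not know")}


def dt(s: State) -> State:
    # Add a touch of personality to the generated answer.
    return {**s, "payload": s["payload"] + ", my friend"}


def mt_to_pt(s: State) -> State:
    return {**s, "payload": EN_TO_PT[s["payload"]], "lang": "pt"}


def tts(s: State) -> State:
    # Pretend to synthesize: only the kind of the payload changes.
    return {**s, "kind": "audio"}


STAGES: List[Stage] = [
    ("SLI", sli), ("STT", stt), ("UT", ut), ("MT", mt_to_en),
    ("QA", qa), ("DT", dt), ("MT", mt_to_pt), ("TTS", tts),
]


def run_pipeline(stages: List[Stage], state: State) -> Tuple[State, List[str]]:
    """Apply each stage in order and record what kind/language it produced."""
    trace = []
    for name, stage in stages:
        state = stage(state)
        trace.append(f"{name} -> {state['kind']} ({state['lang']})")
    return state, trace


utterance = {"kind": "audio", "lang": "unknown", "hint": "pt", "payload": "Que horas sao?"}
final_state, trace = run_pipeline(STAGES, utterance)
```

The recorded `trace` mirrors the flow line by line, which makes it easy to see where the language switches from PT to EN and back.
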