Contextual Speech Extraction: Leveraging Textual History as an Implicit Cue for Target Speech Extraction
Minsu Kim*, Rodrigo Mira*, Honglie Chen, Stavros Petridis, Maja Pantic
*Equal contribution
ICASSP 2025 [Paper] [Code]
ABSTRACT. In this paper, we investigate a novel approach for Target Speech Extraction (TSE), which relies solely on textual context to extract the target speech. We refer to this task as Contextual Speech Extraction (CSE). Unlike traditional TSE methods that rely on pre-recorded enrollment utterances, video of the target speaker's face, spatial information, or other explicit cues to identify the target stream, our proposed method requires only a few turns of previous dialogue (or monologue) history. This approach is naturally feasible in mobile messaging environments where voice recordings are typically preceded by textual dialogue that can be leveraged implicitly. We present three CSE models and analyze their performance on three datasets. Through our experiments, we demonstrate that even when the model relies purely on dialogue history, it can achieve over 90% accuracy in identifying the correct target stream with only two previous dialogue turns. Furthermore, we show that by leveraging both textual context and enrollment utterances as cues during training, we further enhance our model's flexibility and effectiveness, allowing us to use either cue during inference, or combine both for improved performance.
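To make the idea of using textual history as an implicit cue concrete, the toy sketch below picks, from several candidate transcripts, the one that shares the most vocabulary with the preceding dialogue turns. This is only an illustration of the intuition and not any of the three CSE models presented here: the word-overlap score, the function names (`tokenize`, `overlap_score`, `pick_target`), and the two candidate transcripts are all assumptions made up for this example. The history turns are taken from the first DailyTalk sample on this page.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def overlap_score(history, candidate):
    """Fraction of candidate tokens that also occur in the dialogue history.

    A crude, hypothetical stand-in for a learned context model.
    """
    history_vocab = set(tokenize(" ".join(history)))
    candidate_tokens = tokenize(candidate)
    if not candidate_tokens:
        return 0.0
    hits = sum(1 for tok in candidate_tokens if tok in history_vocab)
    return hits / len(candidate_tokens)


def pick_target(history, candidates):
    """Return the index of the candidate that best matches the history.

    Ties are resolved in favour of the earliest candidate.
    """
    scores = [overlap_score(history, c) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


# Two previous dialogue turns (from the first DailyTalk sample).
history = [
    "Yes, that's our regular flight to Shanghai. What's your name, please?",
    "I am Jenny Armstrong. My first initial is J.",
]

# Hypothetical transcripts of the two streams in a mixture (invented here).
candidates = [
    "i am flying to shanghai in economy class",
    "the train costs fourteen pounds",
]

target_index = pick_target(history, candidates)
print("Selected stream:", target_index)
```

With these inputs the first candidate is selected, since half of its words already appear in the history while the second candidate shares none. A real system would need a far stronger notion of contextual fit than word overlap, which is why this should be read as a sketch of the cue and nothing more.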
<CSE (Cascaded CSE, ContSep, ContExt) results on DailyTalk>
Mixed Speech (Input) | Ground Truth | Cascaded CSE | ContSep | ContExt | |
---|---|---|---|---|---|
Previous Textual History |
...
Spk1. Yes, that's our regular flight to Shanghai. What's your name, please? Spk2. I am Jenny Armstrong. My first initial is J. Spk1. One minute. Oh, yes, here we are. You are flying economy class, aren’t you? Spk2. Umm…. Yes, that's right. |
||||
Audio | |||||
Previous Textual History |
...
Spk1. I've made a tough decision, sir. Here's my resignation.
Spk2. Well, I have to tell you that I'm quite surprised. Is there any possible way to change your mind? ... Spk1. No, sir. Spk2. Have you been given a better offer? Spk1. Oh, no. I would never look for another job while working here. I think this is a fantastic place to work. Spk2. Well, what's the problem then? |
||||
Audio | |||||
Previous Textual History |
...
Spk1. Are you crying? Spk2. Yes, I always cry at weddings. Spk1. Harris and Anne are perfect for each other. Spk2. Yes, they are. |
||||
Audio | |||||
Previous Textual History |
...
Spk1. Yes, I always cry at weddings. Spk2. Harris and Anne are perfect for each other. Spk1. Yes, they are. Spk2. You and Tom also make a great couple. |
||||
Audio | |||||
Previous Textual History |
...
Spk1. I think Euros will be the best, as I'll only be in Mainland Europe. Spk2. Of course. Do you have your passport with you? How much would you like to purchase? Spk1. Five-thousand Euros will be sufficient, thanks. Spk2. Please fill in this form. How would you like it? In hundred Euro cheques? |
||||
Audio |
<CSE (Cascaded CSE, ContSep, ContExt) results on SpokenWOZ>
Mixed Speech (Input) | Ground Truth | Cascaded CSE | ContSep | ContExt | |
---|---|---|---|---|---|
Previous Textual History |
...
Spk1. and there is a restaurant called ask restaurant can meet your requirement Spk2. okay, great i want to book a table Spk1. of course, how many people |
||||
Audio | |||||
Previous Textual History |
...
Spk1. the train will arrive by 18:01
Spk2. okay. and what are the train fires Spk1. the train costs 14.32 pounds for each person Spk2. ah yes, thank you |
||||
Audio | |||||
Previous Textual History |
...
Spk1. yes, i'm also looking for a place to go in town and i really want to you know have the thing that we could all sit together on the river Spk2. okay. how about. the type of boat Spk1. yes, yes, great then make it in the west Spk2. okay. please hold on. i'm searching for |
||||
Audio | |||||
Previous Textual History |
...
Spk1. oh, yeah, i got a pop that wendell park. Spk2. okay, that is perfect. matthews, get the postcode. Spk1. so the post code cd223ae. |
||||
Audio | |||||
Previous Textual History |
...
Spk1. okay. can i pick h. id. number. Spk2. yeah. yes, i 3, 0, 3. Spk1. yes. Spk2. 4, 4, 9. Spk1. yes. Spk2. 4 9 2 Spk1. yes. |
||||
Audio |
<CSE (Cascaded CSE, ContSep, ContExt) results on TEDLIUM-3>
Mixed Speech (Input) | Ground Truth | Cascaded CSE | ContSep | ContExt | |
---|---|---|---|---|---|
Previous Textual History |
...
when michael bloomberg asked me to be his planning commissioner and put me in charge of shaping the entire city of new york he said to me on that very day he said that new york was projected to grow from eight to nine million people and he asked me so where are you going to put one million additional new yorkers well i didn't have any idea now you know that new york does place a high value on attracting immigrants so we were excited about the prospect of growth but |
||||
Audio | |||||
Previous Textual History |
...
alone until i let them have a piece but it was not a good one
well it occurred to me that i should invite dr robicsek to lecture at wofford college on what else i should invite him to meet my oldest trustee who had majored in french history at yale some largest privately owned textile empire with an iron hand his name is roger milliken |
||||
Audio | |||||
Previous Textual History |
...
i'm talking of course about living organisms living organisms are created by chemistry we are huge packages of chemicals so chemistry is dominated by the electromagnetic force that operates over smaller scales than gravity which explains why you and i are smaller than stars or planets now what are the ideal conditions for chemistry |
||||
Audio | |||||
Previous Textual History |
...
dna accumulates information through random errors some of which just happen to work but dna had actually generated a faster way of learning it had produced organisms with brains and those organisms can learn learn in real time they accumulate information they learn the sad thing is when they die the information dies with them now what makes humans different |
||||
Audio | |||||
Previous Textual History |
...
they got on a boat and yet another divergence the boat was either going to canada or to australia they got on and didn't know where they were going and ended up in to make a long story short they came to canada my grandmother was a chemist she worked at the banting institute in toronto and at forty four she died of stomach cancer i never met my grandmother but i carry on her name |
||||
Audio |
<H-ContExt results on DailyTalk, SpokenWOZ, TEDLIUM-3>
Mixed Speech (Input) | Ground Truth | Using only Context | Using only Voice | Using both Context & Voice | |
---|---|---|---|---|---|
Samples from DailyTalk | |||||
Previous Textual History |
...
Spk1. Yes, that's our regular flight to Shanghai. What's your name, please? Spk2. I am Jenny Armstrong. My first initial is J. Spk1. One minute. Oh, yes, here we are. You are flying economy class, aren’t you? Spk2. Umm…. Yes, that's right. |
||||
Audio | |||||
Previous Textual History |
...
Spk1. I've made a tough decision, sir. Here's my resignation.
Spk2. Well, I have to tell you that I'm quite surprised. Is there any possible way to change your mind? ... Spk1. No, sir. Spk2. Have you been given a better offer? Spk1. Oh, no. I would never look for another job while working here. I think this is a fantastic place to work. Spk2. Well, what's the problem then? |
||||
Audio | |||||
Previous Textual History |
...
Spk1. Are you crying? Spk2. Yes, I always cry at weddings. Spk1. Harris and Anne are perfect for each other. Spk2. Yes, they are. |
||||
Audio | |||||
Samples from SpokenWOZ | |||||
Previous Textual History |
...
Spk1. and there is a restaurant called ask restaurant can meet your requirement Spk2. okay, great i want to book a table Spk1. of course, how many people |
||||
Audio | |||||
Previous Textual History |
...
Spk1. the train will arrive by 18:01
Spk2. okay. and what are the train fires Spk1. the train costs 14.32 pounds for each person Spk2. ah yes, thank you |
||||
Audio | |||||
Previous Textual History |
...
Spk1. yes, i'm also looking for a place to go in town and i really want to you know have the thing that we could all sit together on the river Spk2. okay. how about. the type of boat Spk1. yes, yes, great then make it in the west Spk2. okay. please hold on. i'm searching for |
||||
Audio | |||||
Samples from TEDLIUM-3 | |||||
Previous Textual History |
...
when michael bloomberg asked me to be his planning commissioner and put me in charge of shaping the entire city of new york he said to me on that very day he said that new york was projected to grow from eight to nine million people and he asked me so where are you going to put one million additional new yorkers well i didn't have any idea now you know that new york does place a high value on attracting immigrants so we were excited about the prospect of growth but |
||||
Audio | |||||
Previous Textual History |
...
alone until i let them have a piece but it was not a good one
well it occurred to me that i should invite dr robicsek to lecture at wofford college on what else i should invite him to meet my oldest trustee who had majored in french history at yale some largest privately owned textile empire with an iron hand his name is roger milliken |
||||
Audio | |||||
Previous Textual History |
...
i'm talking of course about living organisms living organisms are created by chemistry we are huge packages of chemicals so chemistry is dominated by the electromagnetic force that operates over smaller scales than gravity which explains why you and i are smaller than stars or planets now what are the ideal conditions for chemistry |
||||
Audio |