
Note: this content has been automatically generated.
00:00:00
This talk is going to be about speech processing, and in particular an end-to-end
00:00:04
approach for recognizing speakers from audio. To give an idea of the problem we are
00:00:11
going to solve: it is a speaker verification task using audio data.
00:00:18
The problem is as follows: you are given an audio sample along with a claimed identity, and you have to decide whether the sample belongs to that speaker or
00:00:24
not, coming up with a decision, yes or no. As you can see in this example,
00:00:33
a person is trying to get access to a system;
00:00:37
it can be a bank or any other application, or some component
00:00:42
of a speech-based system, and he is going to claim, "Well, I'm Bob."
00:00:48
The system is then asked whether this voice actually belongs to Bob or not.
00:00:55
The system has data about Bob's voice stored, and it makes a comparison between
00:01:02
the incoming voice and the stored data to come up with a decision, yes or no.
00:01:08
In this talk I will focus on the text-dependent case, because, in addition
00:01:15
to the claim "I'm Bob", the user is also asked to say a phrase,
00:01:22
as in a bank application: "OK, tell me the password to gain access to the system."
00:01:30
Sometimes it is a fixed phrase; in this talk we will be concerned with
00:01:35
something like "my voice is my password". This is one of the example phrases we will consider, and it is a much easier
00:01:44
setup to think about. As you can see,
00:01:47
the speaker is the object we are interested in, and the phrase, in our setup,
00:01:53
also constrains the content of the signal.
00:02:01
To give a brief overview of speech processing and what we do with it:
00:02:07
given an audio recording, we segment it
00:02:10
into short, overlapping slices and extract
00:02:14
features, so an audio sample is converted into a sequence of feature vectors.
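As a rough sketch of this front end (the 25 ms frame length, 10 ms shift, and the crude log-spectrum "features" below are illustrative assumptions, not the actual extractor used in the talk):

```python
import numpy as np

def frame_signal(signal, frame_len=400, frame_shift=160):
    """Cut a 1-D signal into short overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    return np.stack([signal[i * frame_shift : i * frame_shift + frame_len]
                     for i in range(n_frames)])

def toy_features(frames, n_bins=20):
    """Very crude spectral features: log magnitude spectrum pooled into a few bins."""
    spec = np.abs(np.fft.rfft(frames, axis=1))      # magnitude spectrum per frame
    bins = np.array_split(spec, n_bins, axis=1)     # group frequencies into bins
    return np.log(np.stack([b.mean(axis=1) for b in bins], axis=1) + 1e-8)

audio = np.random.randn(16000)   # 1 second of fake audio at 16 kHz
frames = frame_signal(audio)     # (98, 400): short overlapping slices
feats = toy_features(frames)     # (98, 20): a sequence of feature vectors
print(frames.shape, feats.shape)
```

The point is only the shape of the output: one recording becomes a matrix with one feature vector per short slice.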
00:02:21
To give an idea of the conventional approaches that people used to follow,
00:02:27
I will briefly overview one popular approach, called the factor analysis model. What it does is that, given
00:02:35
a feature vector, it tries to decompose it into a
00:02:38
speaker component and other components. These other components can be
00:02:43
the channel, the language, and the phonetic content,
00:02:48
that is, information about
00:02:52
what sounds were spoken, and so on and so forth. So it tries to separate the speaker variability from the other variabilities,
00:03:00
and once you have the speaker variability, you can easily compare two audio recordings.
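A toy caricature of the factor-analysis idea, with invented matrices and dimensions (real systems estimate the subspaces from large training corpora; this only illustrates the "decompose, then remove the nuisance part" step):

```python
import numpy as np

rng = np.random.default_rng(0)
D, Ds, Dc = 60, 5, 3             # feature dim, speaker/channel subspace dims (made up)
m = rng.normal(size=D)           # speaker-independent mean
V = rng.normal(size=(D, Ds))     # speaker subspace
U = rng.normal(size=(D, Dc))     # channel subspace (nuisance directions)

def synthesize(y, x):
    """Observed vector = mean + speaker component + channel component."""
    return m + V @ y + U @ x

def speaker_factor(s):
    """Recover the speaker factor y by least squares, after projecting out
    the channel subspace -- the 'remove nuisance variability' step."""
    P = np.eye(D) - U @ np.linalg.pinv(U)   # projector orthogonal to channel space
    return np.linalg.lstsq(P @ V, P @ (s - m), rcond=None)[0]

y_bob = rng.normal(size=Ds)
s1 = synthesize(y_bob, rng.normal(size=Dc))   # Bob over one channel
s2 = synthesize(y_bob, rng.normal(size=Dc))   # Bob over another channel
print(np.allclose(speaker_factor(s1), speaker_factor(s2)))   # True: same speaker factor
```

Two recordings of the same speaker over different channels map to the same speaker factor, which is what makes the comparison easy.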
00:03:07
OK, but what this kind of system does is that
00:03:12
it trains an awful lot of components, each of them independently.
00:03:17
Instead of doing that, what we want is some kind of end-to-end approach,
00:03:25
so that we have one single system: we train one system, and the audio just goes through it, so that
00:03:33
it is able to discriminate, given a pair of recordings, whether the two recordings belong to the same speaker or not.
00:03:41
To do that, we employ a loss function which
00:03:47
is also popularly referred to as the triplet loss. In this
00:03:53
approach, you are given three recordings such that two recordings belong
00:03:59
to the same speaker and one does not belong to the same speaker,
00:04:05
as you can see in this slide. The three
00:04:10
audio recordings are also referred to as x_a,
00:04:14
x_p and x_n, also known as the anchor, positive and
00:04:20
negative instances. These x_a, x_p and
00:04:25
x_n are sequences, and the job of the network is
00:04:29
to convert each of these audio sequences of
00:04:33
features into one single vector, so that you can easily apply a distance function.
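Once each recording has been mapped to a single vector, the triplet criterion can be sketched as follows (the squared Euclidean distance and the margin value are assumptions; the talk does not specify them):

```python
import numpy as np

def triplet_loss(x_a, x_p, x_n, margin=0.2):
    """Push the anchor-positive distance below the anchor-negative distance
    by at least `margin`; zero loss once the gap is large enough."""
    d_ap = np.sum((x_a - x_p) ** 2)   # same-speaker distance
    d_an = np.sum((x_a - x_n) ** 2)   # different-speaker distance
    return max(0.0, d_ap - d_an + margin)

anchor   = np.array([1.0, 0.0])   # embedding of the anchor recording
positive = np.array([0.9, 0.1])   # another recording of the same speaker
negative = np.array([0.0, 1.0])   # a recording of a different speaker

print(triplet_loss(anchor, positive, negative))   # already well separated -> 0.0
```

Minimizing this over many triplets pulls same-speaker embeddings together and pushes different-speaker embeddings apart.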
00:04:38
As you can see in this diagram, initially an utterance can be closer
00:04:45
to a different class than to its own, and after training we expect the network
00:04:49
to have this kind of property: the same speaker is closer, different speakers are farther apart. OK, so to perform this
00:04:56
kind of thing, what we do is that we have this kind of network, where we have a bidirectional LSTM
00:05:04
which takes utterances as input. As
00:05:09
said before, each utterance consists of a sequence of feature vectors,
00:05:13
and the BLSTM gives you a sequence of vectors too; we then average these, doing average pooling,
00:05:20
and finally pass the result through an MLP. As you can see, the disadvantage of this approach is that when you combine vectors
00:05:28
coming from different sound sources, like the words in "my voice is my password",
00:05:33
each carries different phonetic information compared
00:05:37
to the others, so if you average them out, you lose the sequence information,
00:05:43
something like "this followed by this", and so on and so forth.
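The loss of sequence information is easy to demonstrate: two utterances containing the same frames in a different order get exactly the same average-pooled vector (toy frame vectors, purely illustrative):

```python
import numpy as np

# Two 'utterances': the same three frame vectors in different temporal orders,
# e.g. the sounds of "my voice" vs. "voice my".
utt1 = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [2.0, 2.0]])
utt2 = utt1[[2, 0, 1]]   # same frames, permuted in time

# Average pooling collapses each sequence to one vector; the order is gone.
print(np.allclose(utt1.mean(axis=0), utt2.mean(axis=0)))   # True
```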
00:05:47
To incorporate this information as well, we developed a mechanism
00:05:53
which we refer to as content matching, which tries
00:05:58
to match two audio recordings using similar sound units.
00:06:02
This is the formalism, or the basic idea, of content matching: given two
00:06:09
audio recordings, for each frame of the second recording we try to find
00:06:13
the closest match in the first recording.
00:06:18
OK, and then you finally average out the scores. So when you
00:06:25
have this kind of distance function, you can feed it
00:06:29
into the triplet loss function and train the whole network using the triplet loss criterion.
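A minimal sketch of the content-matching distance as described: for every frame of one recording, find the closest frame in the other, then average those per-frame scores (plain Euclidean frame distance is an assumption; the talk does not specify the frame-level score):

```python
import numpy as np

def content_match_distance(A, B):
    """A, B: (n_frames, dim) sequences of frame vectors.
    Each frame of B is compared only against its nearest frame in A,
    i.e. against a similar sound unit; the per-frame minima are averaged."""
    d = np.linalg.norm(B[:, None, :] - A[None, :, :], axis=2)  # (len(B), len(A))
    return d.min(axis=1).mean()

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 8))                    # frames of the first recording
B = A + rng.normal(scale=0.1, size=(50, 8))     # a noisy second recording

print(content_match_distance(A, A))   # identical recordings -> 0.0
print(content_match_distance(A, B))   # noisy copy -> small positive distance
```

Because the matching is done per frame, this score is differentiable in the frame vectors (via soft variants in practice), so it can be plugged into the triplet loss and trained end to end.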
00:06:36
Just before we go to the results, a few words on the experiments. We have
00:06:40
experimented on the text-dependent setup, on a standard
00:06:45
dataset that we have, and we have trained the end-to-end system
00:06:50
using a very simple architecture, with one BLSTM and one
00:06:55
MLP. The baselines we compare against are some of the standard systems
00:07:00
that we already have. Performance is measured in terms of a metric
00:07:04
that is based on false alarm and miss detection.
00:07:08
Miss detection, as you know, is the
00:07:12
probability of how often you miss the target speaker, and false alarm is how often
00:07:19
you falsely accept an impostor claim. So the lower, the better.
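One standard way to summarize this miss/false-alarm trade-off in a single number is the equal error rate: sweep a threshold over the trial scores and find where the two error rates cross. A simple sketch (the Gaussian toy scores are invented; real evaluations use actual trial scores and interpolate between thresholds):

```python
import numpy as np

def eer(target_scores, impostor_scores):
    """Equal error rate: the point where the miss rate equals the false-alarm rate."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    best_gap, best_eer = 2.0, 0.5
    for t in thresholds:
        p_miss = np.mean(target_scores < t)     # target speaker rejected
        p_fa = np.mean(impostor_scores >= t)    # impostor claim accepted
        if abs(p_miss - p_fa) < best_gap:
            best_gap, best_eer = abs(p_miss - p_fa), (p_miss + p_fa) / 2
    return best_eer

rng = np.random.default_rng(0)
targets = rng.normal(2.0, 1.0, 1000)    # genuine-trial scores (higher = more similar)
impostors = rng.normal(0.0, 1.0, 1000)  # impostor-trial scores
print(round(eer(targets, impostors), 3))   # around 0.16 for these two toy Gaussians
```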
00:07:26
To give a quick overview of the results: as you can see,
00:07:28
the proposed approach significantly outperforms the baseline approach.
00:07:36
And note that these are the normalized scores; this is a score
00:07:42
normalization that is applied to bring the scores into a proper range.
00:07:47
So, yes, that is the conclusion. I should say that we still have to
00:07:52
experiment with different kinds of architectures and different mechanisms to come up with better scores.
00:07:58
As for the scope of this work, we have a good model, but the thing that
00:08:04
we would like is to scale it to different applications so that it works for any kind of input.



Conference program

End-to-end approach for recognizing speakers from audio
Subhadeep Dey, Idiap Research Institute
19 April 2018 · 11:09 a.m.
