
Note: this content has been automatically generated.
00:00:00
This talk is going to be about speech processing, and in particular an end-to-end
00:00:04
approach for recognizing speakers from audio. To give an idea of the problem we are
00:00:11
going to solve: it is a speaker verification task using audio data.
00:00:18
The problem is as follows: you are given an audio sample along with a claimed identity, and you have to decide whether the sample belongs to that speaker or
00:00:24
not, coming up with a decision, yes or no. As you can see in this example,
00:00:33
a person is trying to get access to a system;
00:00:37
it can be a bank or any other application, or some component
00:00:42
of a speech-based system, and he is going to claim, "Well, I'm Bob."
00:00:48
The system is then asked whether this voice actually belongs to Bob or not.
00:00:55
The system has data about Bob's voice stored, and it makes a comparison between
00:01:02
the incoming voice and the stored data to come up with a decision, yes or no.
00:01:08
In this talk I will focus on the text-dependent case, because, in addition
00:01:15
to the claim "I'm Bob", the user is also asked to say a phrase,
00:01:22
as in a bank application: "OK, tell me the password to gain access to the system."
00:01:30
Sometimes it is a fixed phrase; in this talk we will be concerned with
00:01:35
something like "my voice is my password". This is one of the example phrases we will consider, and it is a much easier
00:01:44
setup to think about. As you can see,
00:01:47
the speaker is the object we are interested in, and the phrase, in our setup,
00:01:53
also constrains the content of the signal.
00:02:01
To give a brief overview of speech processing and what we do with it:
00:02:07
given an audio recording, we segment it
00:02:10
into short, overlapping slices and extract
00:02:14
features, so an audio sample is converted into a sequence of feature vectors.
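As a rough sketch of this front end (the 25 ms frame length, 10 ms shift, and the crude log-spectrum "features" below are illustrative assumptions, not the actual extractor used in the talk):

```python
import numpy as np

def frame_signal(signal, frame_len=400, frame_shift=160):
    """Cut a 1-D signal into short overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    return np.stack([signal[i * frame_shift : i * frame_shift + frame_len]
                     for i in range(n_frames)])

def toy_features(frames, n_bins=20):
    """Very crude spectral features: log magnitude spectrum pooled into a few bins."""
    spec = np.abs(np.fft.rfft(frames, axis=1))      # magnitude spectrum per frame
    bins = np.array_split(spec, n_bins, axis=1)     # group frequencies into bins
    return np.log(np.stack([b.mean(axis=1) for b in bins], axis=1) + 1e-8)

audio = np.random.randn(16000)   # 1 second of fake audio at 16 kHz
frames = frame_signal(audio)     # (98, 400): short overlapping slices
feats = toy_features(frames)     # (98, 20): a sequence of feature vectors
print(frames.shape, feats.shape)
```

The point is only the shape of the output: one recording becomes a matrix with one feature vector per short slice.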
00:02:21
To give an idea of the conventional approaches that people used to follow,
00:02:27
I will briefly overview one popular approach, called the factor analysis model. What it does is that, given
00:02:35
a feature vector, it tries to decompose it into a
00:02:38
speaker component and other components. These other components can be
00:02:43
the channel, the language, and the phonetic content,
00:02:48
that is, information about
00:02:52
what sounds were spoken, and so on and so forth. So it tries to separate the speaker variability from the other variabilities,
00:03:00
and once you have the speaker variability, you can easily compare two audio recordings.
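A toy caricature of the factor-analysis idea, with invented matrices and dimensions (real systems estimate the subspaces from large training corpora; this only illustrates the "decompose, then remove the nuisance part" step):

```python
import numpy as np

rng = np.random.default_rng(0)
D, Ds, Dc = 60, 5, 3             # feature dim, speaker/channel subspace dims (made up)
m = rng.normal(size=D)           # speaker-independent mean
V = rng.normal(size=(D, Ds))     # speaker subspace
U = rng.normal(size=(D, Dc))     # channel subspace (nuisance directions)

def synthesize(y, x):
    """Observed vector = mean + speaker component + channel component."""
    return m + V @ y + U @ x

def speaker_factor(s):
    """Recover the speaker factor y by least squares, after projecting out
    the channel subspace -- the 'remove nuisance variability' step."""
    P = np.eye(D) - U @ np.linalg.pinv(U)   # projector orthogonal to channel space
    return np.linalg.lstsq(P @ V, P @ (s - m), rcond=None)[0]

y_bob = rng.normal(size=Ds)
s1 = synthesize(y_bob, rng.normal(size=Dc))   # Bob over one channel
s2 = synthesize(y_bob, rng.normal(size=Dc))   # Bob over another channel
print(np.allclose(speaker_factor(s1), speaker_factor(s2)))   # True: same speaker factor
```

Two recordings of the same speaker over different channels map to the same speaker factor, which is what makes the comparison easy.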
00:03:07
OK, but what this kind of system does is that
00:03:12
it trains an awful lot of components, each of them independently.
00:03:17
Instead of doing that, what we want is some kind of end-to-end approach,
00:03:25
so that we have one single system: we train one system, and the audio just goes through it, so that
00:03:33
it is able to discriminate, given a pair of recordings, whether the two recordings belong to the same speaker or not.
00:03:41
To do that, we employ a loss function which
00:03:47
is also popularly referred to as the triplet loss. In this
00:03:53
approach, you are given three recordings such that two recordings belong
00:03:59
to the same speaker and one does not belong to the same speaker,
00:04:05
as you can see in this slide. The three
00:04:10
audio recordings are also referred to as x_a,
00:04:14
x_p and x_n, also known as the anchor, positive and
00:04:20
negative instances. These x_a, x_p and
00:04:25
x_n are sequences, and the job of the network is
00:04:29
to convert each of these audio sequences of
00:04:33
features into one single vector, so that you can easily apply a distance function.
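Once each recording has been mapped to a single vector, the triplet criterion can be sketched as follows (the squared Euclidean distance and the margin value are assumptions; the talk does not specify them):

```python
import numpy as np

def triplet_loss(x_a, x_p, x_n, margin=0.2):
    """Push the anchor-positive distance below the anchor-negative distance
    by at least `margin`; zero loss once the gap is large enough."""
    d_ap = np.sum((x_a - x_p) ** 2)   # same-speaker distance
    d_an = np.sum((x_a - x_n) ** 2)   # different-speaker distance
    return max(0.0, d_ap - d_an + margin)

anchor   = np.array([1.0, 0.0])   # embedding of the anchor recording
positive = np.array([0.9, 0.1])   # another recording of the same speaker
negative = np.array([0.0, 1.0])   # a recording of a different speaker

print(triplet_loss(anchor, positive, negative))   # already well separated -> 0.0
```

Minimizing this over many triplets pulls same-speaker embeddings together and pushes different-speaker embeddings apart.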
00:04:38
As you can see in this diagram, initially an utterance can be closer
00:04:45
to a different class than to its own, and after training we expect the network
00:04:49
to have this kind of property: the same speaker is closer, different speakers are farther apart. OK, so to perform this
00:04:56
kind of thing, what we do is that we have this kind of network, where we have a bidirectional LSTM
00:05:04
which takes utterances as input. As
00:05:09
said before, each utterance consists of a sequence of feature vectors,
00:05:13
and the BLSTM gives you a sequence of vectors too; we then average these, doing average pooling,
00:05:20
and finally pass the result through an MLP. As you can see, the disadvantage of this approach is that when you combine vectors
00:05:28
coming from different sound sources, like the words in "my voice is my password",
00:05:33
each carries different phonetic information compared
00:05:37
to the others, so if you average them out, you lose the sequence information,
00:05:43
something like "this followed by this", and so on and so forth.
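The loss of sequence information is easy to demonstrate: two utterances containing the same frames in a different order get exactly the same average-pooled vector (toy frame vectors, purely illustrative):

```python
import numpy as np

# Two 'utterances': the same three frame vectors in different temporal orders,
# e.g. the sounds of "my voice" vs. "voice my".
utt1 = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [2.0, 2.0]])
utt2 = utt1[[2, 0, 1]]   # same frames, permuted in time

# Average pooling collapses each sequence to one vector; the order is gone.
print(np.allclose(utt1.mean(axis=0), utt2.mean(axis=0)))   # True
```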
00:05:47
To incorporate this information as well, we developed a mechanism
00:05:53
which we refer to as content matching, which tries
00:05:58
to match two audio recordings using similar sound units.
00:06:02
This is the formalism, or the basic idea, of content matching: given two
00:06:09
audio recordings, for each frame of the second recording we try to find
00:06:13
the closest match in the first recording.
00:06:18
OK, and then you finally average out the scores. So when you
00:06:25
have this kind of distance function, you can feed it
00:06:29
into the triplet loss function and train the whole network using the triplet loss criterion.
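A minimal sketch of the content-matching distance as described: for every frame of one recording, find the closest frame in the other, then average those per-frame scores (plain Euclidean frame distance is an assumption; the talk does not specify the frame-level score):

```python
import numpy as np

def content_match_distance(A, B):
    """A, B: (n_frames, dim) sequences of frame vectors.
    Each frame of B is compared only against its nearest frame in A,
    i.e. against a similar sound unit; the per-frame minima are averaged."""
    d = np.linalg.norm(B[:, None, :] - A[None, :, :], axis=2)  # (len(B), len(A))
    return d.min(axis=1).mean()

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 8))                    # frames of the first recording
B = A + rng.normal(scale=0.1, size=(50, 8))     # a noisy second recording

print(content_match_distance(A, A))   # identical recordings -> 0.0
print(content_match_distance(A, B))   # noisy copy -> small positive distance
```

Because the matching is done per frame, this score is differentiable in the frame vectors (via soft variants in practice), so it can be plugged into the triplet loss and trained end to end.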
00:06:36
Just before we go to the results, a few words on the experiments. We have
00:06:40
experimented on the text-dependent setup, on a standard
00:06:45
dataset that we have, and we have trained the end-to-end system
00:06:50
using a very simple architecture, with one BLSTM and one
00:06:55
MLP. The baselines we compare against are some of the standard systems
00:07:00
that we already have. Performance is measured in terms of a metric
00:07:04
that is based on false alarm and miss detection.
00:07:08
Miss detection, as you know, is the
00:07:12
probability of how often you miss the target speaker, and false alarm is how often
00:07:19
you falsely accept an impostor claim. So the lower, the better.
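One standard way to summarize this miss/false-alarm trade-off in a single number is the equal error rate: sweep a threshold over the trial scores and find where the two error rates cross. A simple sketch (the Gaussian toy scores are invented; real evaluations use actual trial scores and interpolate between thresholds):

```python
import numpy as np

def eer(target_scores, impostor_scores):
    """Equal error rate: the point where the miss rate equals the false-alarm rate."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    best_gap, best_eer = 2.0, 0.5
    for t in thresholds:
        p_miss = np.mean(target_scores < t)     # target speaker rejected
        p_fa = np.mean(impostor_scores >= t)    # impostor claim accepted
        if abs(p_miss - p_fa) < best_gap:
            best_gap, best_eer = abs(p_miss - p_fa), (p_miss + p_fa) / 2
    return best_eer

rng = np.random.default_rng(0)
targets = rng.normal(2.0, 1.0, 1000)    # genuine-trial scores (higher = more similar)
impostors = rng.normal(0.0, 1.0, 1000)  # impostor-trial scores
print(round(eer(targets, impostors), 3))   # around 0.16 for these two toy Gaussians
```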
00:07:26
To give a quick overview of the results: as you can see,
00:07:28
the proposed approach significantly outperforms the baseline approach.
00:07:36
And note that these are the normalized scores; this is a score
00:07:42
normalization that is applied to bring the scores into a proper range.
00:07:47
So, yes, that is the conclusion. I should say that we still have to
00:07:52
experiment with different kinds of architectures and different mechanisms to come up with better scores.
00:07:58
As for the scope of this work, we have a good model, but the thing that
00:08:04
we would like is to scale it to different applications so that it works for any kind of input.



Conference program

End-to-end approach for recognizing speakers from audio
Subhadeep Dey, Idiap Research Institute
19 April 2018 · 11:09 a.m.
