Transcriptions

Note: this content has been automatically generated.
00:00:10
Good morning.
00:00:14
The main research theme of my group, which is Perception and Activity Understanding,
00:00:21
is about how to perceive and build an
00:00:24
interpretation of the world. Many of the systems
00:00:29
that are going to be around us are artificial intelligence
00:00:32
systems, and one important task for them is to
00:00:37
understand what their surroundings are; in our team we are more concerned with
00:00:41
people and human activity. So let's say we have a situation like this one:
00:00:46
the question for a robot or a system is to understand that there are people around, where
00:00:51
they are, who is a visitor, whether they are children or adults (they want different things),
00:00:58
and also why they are around and what they want, given the
00:01:03
task of the system. So it's about how to keep track of what's going on,
00:01:08
what's happening at a given moment and so on. But if we think in longer terms for
00:01:13
these systems, it would be more about understanding the situation, remembering what happened in the past,
00:01:18
and taking advantage of that, for instance for the robot to take new decisions, to ask the right questions, and so on.
00:01:24
In our group we are really interested in
00:01:29
how to perceive and extract such representations from multimodal sensors.
00:01:34
One example I can show is taken from work at
00:01:39
Microsoft by Dan Bohus, about a system interacting with people in lobbies. The
00:01:45
goal is to understand whether people want to visit
00:01:48
someone, potentially to ask that person to come
00:01:52
and get them in the lobby, or to
00:01:56
ask for a shuttle to come and bring them outside.
00:02:01
That system highlights a number of functionalities that you need:
00:02:05
as soon as there are new people, you would like to understand where they are,
00:02:09
what their poses are, and potentially what they are doing. But you also want to
00:02:14
analyse some social aspects, for instance whether the visitor is from Microsoft or not,
00:02:18
whether they are dressed casually or more formally, because this may have some implications for
00:02:23
the dialogue. And because the system has to interact, there are also lots of variables
00:02:28
related to communication: for instance, is the person speaking,
00:02:32
does the person have the floor, has the person talked
00:02:36
in the last few minutes or not. And this is not enough, because you might
00:02:41
be speaking but not interacting with the system, so there is information about
00:02:45
whether the person is engaged with the system or not. So here you
00:02:49
have a few variables related to that aspect. And because
00:02:53
the interface is task-oriented, you of course have to understand what the goal of the person is: is it
00:02:59
to reach a person, to ask for a shuttle? And in that case, has the
00:03:04
person already asked for something, and what is the status: is the person waiting
00:03:08
to ask for a service, or is the person
00:03:13
waiting, for instance, for the shuttle to come?
00:03:15
So essentially this means that our goal is to perceive the state
00:03:20
and activity of people, and there can be several dimensions: physical, social, communication;
00:03:25
if we go longer-term it can be mood or personality, depending on the context in which we are applying our work,
00:03:31
and of course also the status of the interaction or of the task.
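To make this list of variables concrete, here is a minimal sketch of what such a per-person state record could look like, covering the physical, social, communication and task dimensions mentioned above. This is purely an illustration, not the actual schema of the Microsoft system; all field names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Goal(Enum):
    UNKNOWN = auto()
    VISIT_PERSON = auto()      # wants someone to come and get them
    REQUEST_SHUTTLE = auto()   # wants a shuttle to bring them outside

@dataclass
class PersonState:
    """Hypothetical per-person state for a situated interaction system."""
    person_id: int
    # physical dimension
    position_xy: tuple                 # location in the lobby
    pose: str                          # e.g. "standing", "sitting"
    # social dimension
    is_employee: bool = False          # from the company or a visitor?
    dressed_formally: bool = False     # may condition the dialogue
    # communication dimension
    is_speaking: bool = False
    has_floor: bool = False
    spoke_recently: bool = False       # talked in the last few minutes?
    engaged_with_system: bool = False  # speaking is not the same as engaged
    # task dimension
    goal: Goal = Goal.UNKNOWN
    request_submitted: bool = False    # already asked for something?
    waiting_for_service: bool = False  # e.g. waiting for the shuttle
```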
00:03:36
So that is the main research theme and objective: sensing, interpreting and
00:03:41
understanding scenes. In terms of tasks, this may be about designing
00:03:45
algorithms for low-level processing: detecting people, tracking
00:03:49
them over time to keep their identity, extracting information about the pose and orientation of the body or the head,
00:03:56
and, based on all this information, understanding the activities
00:04:00
in terms of gestures and behaviours. This
00:04:02
might be for individuals, but it might also be about analysing groups, and often the context of
00:04:08
the application can be useful for this interpretation.
00:04:13
In terms of methods and models, we of course borrow from computer vision, signal processing and also
00:04:19
sociology, for instance when it relates to communication.
00:04:24
And like many groups at Idiap, we are machine learning
00:04:27
oriented, so we build on statistical models and nowadays
00:04:31
deep learning, which, as also shown in computer vision, has
00:04:35
allowed tremendous progress
00:04:40
in that field. The applications are around
00:04:43
surveillance, human-robot interaction, social sciences and multimedia content analysis.
00:04:48
To do this work, the group, depending
00:04:52
on funding, oscillates between four and twelve people, and
00:04:56
today we have three postdocs and seven research assistants in the team.
00:05:00
And I want to add one person to this, only because, while he is not
00:05:04
officially part of the team, he is actually spending a lot of time working
00:05:09
with the people in my team and is involved in many aspects of the work.
00:05:13
I would now like to illustrate several of the research directions we are pursuing.
00:05:18
In the past, in surveillance settings, we did quite some work
00:05:23
on analysing people's different activities, finding left luggage, counting people.
00:05:29
It was quite challenging: depending on the type of application, we often
00:05:33
work with only a single camera, so there may be quite
00:05:36
a large amount of cropping, and for tracking, for instance, people may actually wear similar clothes,
00:05:43
like dark suits and so on, which makes the task quite difficult.
00:05:49
But nowadays we are less into this topic.
00:05:53
Another piece of work we did in this domain, in the context of
00:05:57
a European project, VANAHEIM, was about how to automatically discover activities
00:06:02
from temporal data. So here we have, for instance, videos
00:06:07
which are given, and the main task is to automatically find what the main
00:06:12
activities in these videos are, and all we have is the frames of the videos.
00:06:16
Essentially we did some work using Bayesian models and Dirichlet processes, whereby the
00:06:22
data was transformed into a set of small, local activities clustered
00:06:26
over time; we then analysed when these were appearing, and based on these
00:06:32
temporal patterns extracted from the raw data, the goal was
00:06:37
to find the recurring patterns in that space, to find whether there were cycles,
00:06:41
whether some patterns triggered other patterns, and so on, which in the case of video corresponds to
00:06:47
finding the main trajectories of the main
00:06:50
moving elements in the videos.
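As a rough, simplified stand-in for the kind of model described here (the actual work used Bayesian non-parametric models such as Dirichlet processes), the sketch below runs a plain LDA topic model over hypothetical per-time-window counts of quantised local activities; each discovered topic is then a candidate recurring activity whose temporal occupancy can be analysed for cycles and co-occurrences. Data and sizes are invented for illustration.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical input: each row is one time window of video, each column the
# count of one quantised local-motion "word" (e.g. image cell + flow
# direction), as produced by an upstream low-level stage.
rng = np.random.default_rng(0)
doc_word_counts = rng.poisson(1.0, size=(500, 200))

# A (parametric) topic model: each topic is a recurring co-occurrence pattern
# of local activities, i.e. a candidate scene-level activity such as a
# dominant trajectory. A Dirichlet-process / HDP model, as hinted at in the
# talk, would additionally infer the number of topics from the data.
lda = LatentDirichletAllocation(n_components=10, random_state=0)
window_topic = lda.fit_transform(doc_word_counts)   # (windows, topics)

# Temporal analysis: when does each discovered activity occur? Thresholding
# the per-window topic weights gives occupancy sequences in which one can
# look for cycles or "pattern A triggers pattern B" regularities.
occupancy = window_topic > 0.2
print(occupancy.mean(axis=0))   # fraction of time each activity is active
```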
00:06:55
This was generic, and it was used, for instance, for anomaly detection
00:06:59
and to select streams: for instance, this was deployed
00:07:02
at the RATP in Paris, where, in one major station, it was used to automatically switch the streams presented to operators.
00:07:08
There was also a collaboration with EPFL where, with a microphone array
00:07:14
on the side of a road, it was automatically counting the cars that were driving by.
00:07:23
In the context of surveillance, right now we are running a project
00:07:28
funded by industry, with Fastcom, a company, and
00:07:32
the activity is about smart access and security access.
00:07:36
For instance, in some cases you have an airlock that people enter, and you have to decide whether there is a single person inside or not,
00:07:42
or you may also have other systems where several people badge in, and
00:07:47
you need to verify that there is the right number of people in the airlock.
00:07:53
The methodology that we use is based on deep learning,
00:07:58
so this is the airlock situation, where you have depth sensors which produce an image like this,
00:08:03
and you would like to train a neural network that predicts where the people are. This
00:08:08
is transformed into predicting where the body landmarks are, like the head and the shoulders
00:08:12
of each person, in order to count how many people are in the airlock.
00:08:19
One aspect of these neural networks is that you often need a lot of training data,
00:08:23
and in that case we resorted to generating simulated, synthetic data.
00:08:29
So we have a representation of a virtual airlock,
00:08:34
we have data from real people, and we
00:08:37
automatically generate different shapes, with different bodies, different clothing
00:08:41
and so on, and this generates quite a large amount of data that can be used for training.
00:08:46
And because there is some discrepancy between real data and simulated data, we also have to adapt,
00:08:52
so we explored using smaller amounts of
00:08:56
real data to adjust the parameters to the real world.
00:09:00
Then, for the detection results: the network predicts the landmarks,
00:09:03
from which we can count the people, and in the
00:09:07
case where there should be only one person, if more than one person is detected, access is not allowed.
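A minimal sketch of the landmark-based counting idea: a depth image goes through a small fully convolutional network that outputs a head-evidence heatmap whose peaks are counted. This is a toy illustration, not the project's actual architecture, and the peak-counting step is deliberately crude.

```python
import torch
import torch.nn as nn

class HeadHeatmapNet(nn.Module):
    """Toy fully convolutional net: depth image -> head-landmark heatmap."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1),            # per-pixel head evidence
        )

    def forward(self, depth):               # depth: (B, 1, H, W)
        return torch.sigmoid(self.net(depth))

def count_people(heatmap, thresh=0.5):
    """Count local maxima above a threshold; a real system would use proper
    peak grouping and blob analysis rather than this crude approximation."""
    pooled = torch.nn.functional.max_pool2d(heatmap, 5, stride=1, padding=2)
    peaks = (heatmap == pooled) & (heatmap > thresh)
    return int(peaks.sum())

# Access-control decision for a single-person airlock:
model = HeadHeatmapNet()
depth = torch.rand(1, 1, 120, 160)          # fake depth frame
n = count_people(model(depth))
print("alarm" if n != 1 else "grant access")
```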
00:09:13
Here it is Michael who was working mainly on this, doing a nice job, and here you see
00:09:19
examples where he is trying to test the system: here he is in his office early one morning,
00:09:24
here he is testing, in a different configuration, whether it is working, with
00:09:29
some other colleagues working here, and he is also trying whether it
00:09:33
counts the right number of people. The demo is actually in
00:09:37
the back, so at the pause you can go and see the demonstration.
00:09:41
As you can see, the system is tested thoroughly: people check whether it works if you hold chairs above you,
00:09:47
or whether it still works if you do push-ups, and so on.
00:09:53
Then we also have some work on interaction analysis.
00:09:58
One of the main projects in which we are developing our work is
00:10:02
MuMMER, a European project where the goal is to develop a humanoid robot
00:10:06
which is going to be active in a shopping mall; actually, one of the partners is a shopping mall, located in
00:10:12
Finland, with more than three hundred shops in the shopping
00:10:15
mall. The goal for the robot is to entertain people
00:10:20
and to give information about the shopping mall: opening hours, where the different shops are,
00:10:25
and, potentially, directions to reach those shops.
00:10:29
Obviously the robot needs to be autonomous
00:10:32
and to act as naturally as possible in this context.
00:10:36
In this project, at Idiap our goal is to develop
00:10:41
the scene and person management module, so we are doing all the perception part,
00:10:45
so that the robot understands its surroundings, mainly with respect to people.
00:10:50
This needs to be done with multimodal sensing,
00:10:53
combined with head gestures from the robot, to actually maintain the representation of the world.
00:10:59
Here is an illustration; at Idiap a Pepper robot is also in the
00:11:04
back, so you can see this as well at the pause.
00:11:08
Here what you see is what the robot sees (it is a small field of view), and here
00:11:14
we regularly sample and automatically extract the different body landmarks of the people,
00:11:20
and this is used to do the tracking over time. Importantly, for interaction systems we need to
00:11:26
be sure we are always talking to the same person.
00:11:37
So here is the robot, and this is just a schematic of what is going on.
00:11:43
One point you can see is that, because the sensor is in the head of the robot,
00:11:50
the sensing moves while the robot is gazing; it is not a static camera, so we have to take that into account.
00:11:57
And because the field of view is small, people tend to leave it, and it is important to keep
00:12:02
track of their identity over time. So here we have a visual
00:12:05
tracker that works over time, and here we maintain the same identity,
00:12:09
so as soon as people reappear we can recover it again.
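A minimal sketch of the re-identification logic this implies: keep an appearance embedding per known person (produced by some upstream re-identification network, not shown) and match each new detection to the gallery by cosine similarity, enrolling a new identity when nothing matches. Thresholds and the update rule are illustrative, not those of the actual system.

```python
import numpy as np

class ReidTracker:
    """Minimal identity store: match a detection's appearance embedding to
    known people, re-assigning the old id when someone re-enters the view."""
    def __init__(self, sim_thresh=0.7):
        self.gallery = {}            # id -> running mean embedding
        self.next_id = 0
        self.sim_thresh = sim_thresh

    @staticmethod
    def _cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def assign(self, embedding):
        # best match among known identities
        best_id, best_sim = None, -1.0
        for pid, ref in self.gallery.items():
            s = self._cos(embedding, ref)
            if s > best_sim:
                best_id, best_sim = pid, s
        if best_id is not None and best_sim >= self.sim_thresh:
            # update the stored appearance and keep the same identity
            self.gallery[best_id] = 0.9 * self.gallery[best_id] + 0.1 * embedding
            return best_id
        # unseen person: enrol a new identity
        self.gallery[self.next_id] = embedding
        self.next_id += 1
        return self.next_id - 1
```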
00:12:15
For instance, at the end of the interaction, you can see that when the robot needs to show the direction,
00:12:25
it looks away from the people, and people might leave at that time, so this is problematic.
00:12:30
And when there is sound, this means that people are talking, so we would like to pursue research around this,
00:12:37
and one of the parts that we are working on
00:12:42
is sound source localisation and discrimination. Essentially, given the robot
00:12:47
and multiple people, you would like to detect the sound sources and where they are coming from,
00:12:53
and you want to discriminate speech versus non-speech, because we are
00:12:56
interested in people in this task.
00:13:00
Here is an illustration: this is the view from Pepper,
00:13:03
and here you want to say whether there are
00:13:07
people talking, that is, sound sources in the directions shown by the bars. But
00:13:11
we also want to detect that here there is a loudspeaker making some noise,
00:13:14
which we are not interested in, whereas here there might be a person
00:13:18
speaking; that is the main task. We
00:13:22
would also like to be able to characterise the voice, if it is a person speaking, because we would like to keep track of the same identity.
00:13:29
There are quite a few challenges in that case: we do not know how many sound sources there are;
00:13:35
people, when they interact, talk together, and there is
00:13:38
overlapping speech; as you
00:13:42
have heard in the noise already, there is strong ego-noise from the Pepper;
00:13:48
and there are very short utterances during interactions.
00:13:53
To do that we rely on microphone arrays,
00:13:56
and we want to use a learning-based approach instead of signal processing,
00:14:03
because few assumptions are required and we can directly optimise for the task. But there are still quite a few questions
00:14:09
in this context: what should be used as input, how should we
00:14:12
encode the output, what are the architectures, and how do we get training data?
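To illustrate one possible set of answers (purely as an assumption, not the group's actual design): take time-difference features between microphone pairs, e.g. GCC-PHAT vectors, as input, and encode the output as per-azimuth-sector probabilities, with a second multi-label head for speech versus non-speech so that several simultaneous sources can be handled. All sizes are invented.

```python
import torch
import torch.nn as nn

class SoundSourceNet(nn.Module):
    """Sketch of a learning-based localiser: microphone-array features in,
    per-direction source likelihood and speech/non-speech out. Input is
    assumed to be stacked cross-correlation (e.g. GCC-PHAT) vectors, one
    common choice; the architecture is illustrative only."""
    def __init__(self, n_pairs=6, n_lags=51, n_directions=72):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_pairs * n_lags, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        # multi-label heads: several simultaneous sources are possible
        self.direction_head = nn.Linear(256, n_directions)
        self.speech_head = nn.Linear(256, n_directions)

    def forward(self, gcc):                    # gcc: (B, n_pairs, n_lags)
        h = self.trunk(gcc)
        return (torch.sigmoid(self.direction_head(h)),  # P(source at sector k)
                torch.sigmoid(self.speech_head(h)))     # P(speech | sector k)

model = SoundSourceNet()
src_prob, speech_prob = model(torch.randn(1, 6, 51))
active = (src_prob > 0.5) & (speech_prob > 0.5)  # directions with speech
```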
00:14:16
Here, this is done with weak supervision,
00:14:20
with input from the speech group. Here is an illustration.
00:14:39
[demonstration video playing]
00:15:04
Okay, one task we have been working on for
00:15:08
almost twelve or thirteen years now is attention and gaze.
00:15:13
Gaze is about finding the line of sight, as opposed to visual attention. This is a
00:15:18
video that is thirteen years old, where we were trying
00:15:23
to know where a person is looking: which targets, other people, screens and so on.
00:15:28
That is a different task, and actually more challenging, because in
00:15:32
order to do it you need to understand the scene:
00:15:35
where the attention targets are; and the context is important to solve it. For this
00:15:41
reason, actually, as humans we are better at solving that task than this one.
00:15:46
If you look at gaze estimation, which we have addressed more recently, we want to estimate the line of sight.
00:15:52
It is quite challenging because we work with low resolution; we do not want
00:15:57
people to have to wear glasses and so on; so the eyes are really small, corrupted,
00:16:02
and might not even be visible depending on the pose. For that we use RGB-D
00:16:08
sensors: the depth is important to find the distance and to get information about the head shape,
00:16:13
and the visual part is important to get the eyes and to understand where they are looking.
00:16:18
One part that is important in that context is 3D pose estimation: given the face,
00:16:24
we would like to find its orientation, and for that we can rely on 3D deformable models.
00:16:29
Given an observation as a point cloud, we want to automatically fit these models:
00:16:35
adjust the parameters to the shape of the person and find its orientation.
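Fitting such a model typically alternates between finding correspondences, rigidly aligning the model to the point cloud, and updating the shape coefficients. As a sketch, here is just the rigid alignment step (the Kabsch/Procrustes solution), assuming correspondences are already given; the deformable-shape update is omitted.

```python
import numpy as np

def rigid_align(model_pts, obs_pts):
    """Least-squares rotation R and translation t mapping model_pts onto
    obs_pts (Kabsch algorithm), i.e. the head-pose step of model fitting,
    assuming point correspondences are already established. Both inputs
    are (N, 3) arrays of corresponding 3D points."""
    mu_m, mu_o = model_pts.mean(axis=0), obs_pts.mean(axis=0)
    M, O = model_pts - mu_m, obs_pts - mu_o            # centred point sets
    U, _, Vt = np.linalg.svd(M.T @ O)                  # cross-covariance SVD
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # no reflection
    R = Vt.T @ D @ U.T
    t = mu_o - R @ mu_m
    return R, t   # R gives the head orientation, t its position

# In a full fitting loop one would alternate: find closest-point
# correspondences (ICP style), run rigid_align, then update the deformable
# shape coefficients, until convergence.
```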
00:16:39
But there are some issues: the model only covers part
00:16:44
of the face, and it may not fit all faces well,
00:16:47
so when we are tracking people under different poses this may fail.
00:16:52
One of the things we did, since we have this information, is to reconstruct
00:16:56
the face of people over time, so that we have information about the whole
00:17:01
face, not only the frontal part; we model the face representation better,
00:17:07
and then we get better tracking. Here is an example of results: an illustration
00:17:12
of how the faces are reconstructed, and some tracking examples in the context of the UBImpressed project
00:17:19
that we had with the Social Computing group, which involved hospitality students working here in this building.
00:17:25
As you can see, here we are able to track the person under really adverse
00:17:30
conditions: even if they move, even if there are some occlusions and so on, and we have quite good
00:17:36
results in this field. An extension of this is that we could add, for instance,
00:17:41
other basis functions, not only for the shape but also for the expressions, and here you can see that,
00:17:47
although it was not intended for that, we could model expressions; for instance, we were able to track the
00:17:53
nods of people, even if some of them are quite subdued in terms
00:17:57
of gestures. Obviously the ultimate goal is to do gaze tracking,
00:18:02
so here you have some examples: interacting with a
00:18:05
robot; analysing interactions between people, to see how
00:18:10
people look at each other in these situations; or here in an HRI context. And there is actually a
00:18:16
start-up here, Eyeware, which is doing business in this domain.
00:18:24
Finally, one application in media processing that I kept because I like it:
00:18:30
some time ago we were working on the stabilisation of old movies.
00:18:34
At that time there were not many stabilisers like that,
00:18:38
and the challenge here was that the data was of low quality;
00:18:42
of course there are lots of perturbations, and we do not know
00:18:47
what to stabilise: there are people moving against some background.
00:18:50
This is what we were doing: we use the visual motion and
00:18:53
try to stabilise everything so that it is quite stable, but
00:18:57
you do not get to see everything, so you have to maintain a still image while
00:19:01
actually keeping the content. Here is
00:19:06
the stabilised version, and this is the original.
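For the curious, here is a compact sketch of the classical stabilisation recipe this describes: track features between consecutive frames, accumulate the dominant motion into a camera trajectory, smooth that trajectory, and re-warp each frame by the difference. It is a bare-bones, translation-only illustration with invented parameters; the actual system also had to preserve content and handle the borders revealed by the warping.

```python
import cv2
import numpy as np

def stabilise(frames, smooth_win=15):
    """Translation-only stabilisation sketch for a list of BGR frames."""
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    shifts = [(0.0, 0.0)]
    for prev, cur in zip(grays, grays[1:]):
        pts = cv2.goodFeaturesToTrack(prev, maxCorners=200,
                                      qualityLevel=0.01, minDistance=10)
        nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev, cur, pts, None)
        ok = status.ravel() == 1
        # dominant image motion ~ median feature displacement
        d = np.median(nxt[ok] - pts[ok], axis=0).ravel()
        shifts.append((float(d[0]), float(d[1])))

    traj = np.cumsum(shifts, axis=0)                   # raw camera path
    kernel = np.ones(smooth_win) / smooth_win
    smooth = np.column_stack([np.convolve(traj[:, i], kernel, mode='same')
                              for i in (0, 1)])        # smoothed path
    h, w = grays[0].shape
    out = []
    for f, d in zip(frames, smooth - traj):            # per-frame correction
        m = np.float32([[1, 0, d[0]], [0, 1, d[1]]])
        out.append(cv2.warpAffine(f, m, (w, h)))
    return out
```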

